Analytics on Large Scale, Unstructured, Dynamic Data using Lambda Architecture
Submitted by Rajesh Muppalla (@codingnirvana) on Monday, 9 June 2014
In this talk, I will focus on our experience in using Lambda Architecture at Indix, to build a large scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
Indix is a product intelligence platform. Our catalog has several million products and billions of price points collected from thousands of e-commerce web sites and is constantly growing. We collect product data via crawling product pages from these web sites. Our parsers extract product attributes from these pages which are then run through a series of machine learning algorithms to classify and extract deeper product attributes. This data gets deduped between stores and then matched across stores and is finally fed into our analytics engine which provides insights to our customers.
Our first attempt at building this system around two years ago was chaotic. We were dealing with e-commerce sites whose pages were unstructured and were constanly changing. Our parsers and machine learning algorithms were also improving regularly. All this meant that we had to run our algorithms on the entire data set very often. It was not uncommon for our data refreshes to run for days which meant high latency for product and price data. In addition to that, our data systems were not human fault tolerant. We had issues where an incorrect algorithm would get accidentally deployed to production and corrupt the data we were serving. Since our data store was mutable and did not mantain these changes, it was not easy to fix these corruption issues.
We realized soon that we had to re-think about our data system from ground up. We needed a simpler approach that would scale, be tolerant to human errors and can evolve with our product.
Lambda architecture, coined by Nathan Marz, the creator of Storm and Cascalog, seemed like a step in the right direction for us.
The system has been in production for more than a year now, handling 3X more data than our older system and most importantly is more robust.
Lambda architecture, at its core, is a set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability, recomputation and human fault tolerance into the system.
It has three layers - batch, serving and speed.
The batch layer is responsible for computing arbitrary views on the master data. Our master data is an immutable store in HDFS and we compute views using a series of Map Reduce jobs using Scalding and Spark. Our batch system runs recomputation every day on our entire data set.
The serving layer indexes and exposes precomputed views to be queried ad-hoc with low latency. We use HBase, Solr and our own inhouse inmemory implementation for the serving layer.
The speed layer deals only with new data and compensates for the high latency updates of the serving layer by creating realtime views. Our real time latency requirements are in few hours and not in seconds, which allows us to use a micro-batch architecture that is a stripped down version of our batch layer and uses the same technologies.
To get the final result, the batch and realtime views must be queried and the results merged together.
Topics I will cover
- Why Lambda Architecture? What problems did it solve for us?
- Technical Challenges encountered in building the lambda architecture
- Schema Evolution
- HDFS Small Files Issue
- Code re-use between batch and real time systems
- Modeling the data pipelines for each layer
- Open problems
Rajesh Muppalla is a co-founder and Director of Engineering at Indix, where he leads the data platform team that is responsible for collecting, organizing and structuring all the product related data collected from the web.
He is passionate about big data, large scale distributed systems, continuous delivery and algorithms. He also likes mentoring and coaching developers in pursuit of building better software.
Prior to Indix, he was a technical lead on Go-CD, an agile and release management product, at Thoughtworks. The product has been recently open-sourced.
He is a gold medallist in Computer Science from Pune University. In his final year of graduation, his team represented India at Asia finals of Microsoft Imagine Cup (then called Microsoft .NET campus challenge).