In 2014, infrastructure components such as Hadoop, the Berkeley Data Stack and other commercial tools have stabilized and are thriving. The challenges have moved higher up the stack, from data collection and storage to data analysis and its presentation to users. The focus for this year’s conference is analytics – the infrastructure that powers analytics and how analytics is done.
Talks will cover various forms of analytics, including real-time and opportunity analytics, and the technologies and models used for analyzing data.
Proposals will be reviewed against six criteria:
- Domain diversity – proposals will be selected from different domains: medical, insurance, banking, online transactions, retail. If there is more than one proposal from a domain, the one that best meets the editorial criteria will be chosen.
- Novelty – what has been done beyond the obvious.
- Insights – what does the proposal tell the audience that they did not already know?
- Practical versus theoretical – we are looking for applied knowledge. If the proposal covers material that can be looked up online, it will not be considered.
- Conceptual versus tools-centric – tell us why, not how. Tell the audience the philosophy underlying your use of an application, not the mechanics of using it.
- Presentation skills – the proposer’s presentation skills will be reviewed carefully, and assistance will be provided to ensure the material is communicated to the audience as precisely and effectively as possible.
For queries about proposals / submissions, write to firstname.lastname@example.org
Data Collection and Transport – e.g., Opendatatoolkit, Scribe, Kafka, RabbitMQ.
Data Storage, Caching and Management – distributed storage (such as Gluster, HDFS), hardware-specific storage (such as SSDs or memory), databases (PostgreSQL, MySQL, Infobright) or caching/storage (Memcached, Cassandra, Redis, etc.).
Data Processing, Querying and Analysis – Oozie, Azkaban, scikit-learn, Mahout, Impala, Hive, Tez, etc.
Big data and security
Big data and internet of things
Data Usage and BI (Business Intelligence) in different sectors.
Please note: the technology stacks mentioned above indicate the latest technologies that will be of interest to the community. Talks should not be about these technologies per se, but about how they have been used and implemented in various sectors, enterprises and contexts.
Analytics on Large Scale, Unstructured, Dynamic Data using Lambda Architecture
In this talk, I will focus on our experience using the Lambda Architecture at Indix to build a large-scale analytics system over unstructured, dynamically changing data sources, using Hadoop, HBase, Scalding, Spark and Solr.
Indix is a product intelligence platform. Our catalog has several million products and billions of price points collected from thousands of e-commerce web sites and is constantly growing. We collect product data via crawling product pages from these web sites. Our parsers extract product attributes from these pages which are then run through a series of machine learning algorithms to classify and extract deeper product attributes. This data gets deduped between stores and then matched across stores and is finally fed into our analytics engine which provides insights to our customers.
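The flow above can be sketched as a composition of stages. This is a hypothetical sketch in plain Python – the stage names, data shapes and matching logic are illustrative only, not Indix's actual code, which runs on Hadoop-based tooling:

```python
# Hypothetical sketch of the ingestion flow: crawl output -> parse ->
# ML enrichment -> dedupe within a store -> match across stores.
# All names and data shapes here are illustrative, not Indix's real code.

def parse(pages):
    """Extract structured attributes from raw product pages."""
    return [{"store": p["store"], "title": p["title"].strip().lower()} for p in pages]

def enrich(products):
    """Stand-in for the ML algorithms that classify products and
    extract deeper attributes."""
    return [dict(p, category="unknown") for p in products]

def dedupe(products):
    """Collapse duplicate listings within the same store."""
    seen, out = set(), []
    for p in products:
        key = (p["store"], p["title"])
        if key not in seen:
            seen.add(key)
            out.append(p)
    return out

def match_across_stores(products):
    """Group the same product listed by different stores (naive title match)."""
    groups = {}
    for p in products:
        groups.setdefault(p["title"], []).append(p)
    return groups

def run_pipeline(pages):
    return match_across_stores(dedupe(enrich(parse(pages))))
```

The point of the sketch is the shape of the flow: each stage is a pure function over the previous stage's output, which is what makes it natural to re-run the whole pipeline when an algorithm improves.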
Our first attempt at building this system, around two years ago, was chaotic. We were dealing with e-commerce sites whose pages were unstructured and constantly changing. Our parsers and machine learning algorithms were also improving regularly. All this meant that we had to rerun our algorithms on the entire data set very often. It was not uncommon for our data refreshes to run for days, which meant high latency for product and price data. In addition, our data systems were not human fault tolerant: an incorrect algorithm would occasionally be deployed to production and corrupt the data we were serving. Since our data store was mutable and did not maintain a history of changes, these corruption issues were not easy to fix.
We soon realized that we had to rethink our data system from the ground up. We needed a simpler approach that would scale, tolerate human error and evolve with our product.
Lambda architecture, coined by Nathan Marz, the creator of Storm and Cascalog, seemed like a step in the right direction for us.
The system has been in production for more than a year now; it handles 3X more data than our older system and, most importantly, is more robust.
Lambda architecture, at its core, is a set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability, recomputation and human fault tolerance into the system.
It has three layers: batch, serving and speed.
The batch layer is responsible for computing arbitrary views on the master data. Our master data is an immutable store in HDFS, and we compute views with a series of MapReduce jobs written in Scalding and Spark. Our batch system recomputes over our entire data set every day.
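The shape of a batch-layer computation can be sketched as follows. Plain Python stands in for the Scalding/Spark jobs, and the "latest price per product" view is an illustrative example, not necessarily one of Indix's actual views:

```python
# Illustrative batch-layer recomputation over an immutable master data set.
# Each record is (product_id, price, timestamp). Because master data is
# append-only and never mutated, rerunning this after fixing a buggy
# algorithm fully repairs the view -- the human-fault-tolerance property
# the Lambda Architecture is built around.

def latest_price_view(master_data):
    """Recompute the 'latest price per product' view from the full history."""
    view = {}
    for product_id, price, timestamp in master_data:
        current = view.get(product_id)
        if current is None or timestamp > current[1]:
            view[product_id] = (price, timestamp)
    return {pid: price for pid, (price, _) in view.items()}
```

Note that the view is always derived from scratch; no state from a previous run is trusted, which is what makes daily full recomputation a recovery mechanism as well as a refresh.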
The serving layer indexes and exposes precomputed views so they can be queried ad hoc with low latency. We use HBase, Solr and our own in-house in-memory implementation for the serving layer.
The speed layer deals only with new data and compensates for the high-latency updates of the serving layer by creating realtime views. Our realtime latency requirements are on the order of a few hours, not seconds, which allows us to use a micro-batch architecture that is a stripped-down version of our batch layer and uses the same technologies.
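A micro-batch speed-layer update might look like this sketch. It is hypothetical – the real implementation reuses the batch layer's technologies – but it shows the key difference from the batch layer: it folds only newly arrived data into a small incremental view rather than recomputing from the full history:

```python
# Illustrative speed-layer micro-batch: folds only data that arrived since
# the last batch run into a small realtime view, compensating for the
# batch layer's latency. Record shape: (product_id, price, timestamp).

def update_realtime_view(realtime_view, new_data):
    """Merge a micro-batch of new records into the realtime view,
    keeping the most recent price seen per product."""
    view = dict(realtime_view)  # the realtime view is small; copying is cheap
    for product_id, price, timestamp in new_data:
        current = view.get(product_id)
        if current is None or timestamp > current[1]:
            view[product_id] = (price, timestamp)
    return view
```

Once a batch recomputation catches up to the data a realtime view covers, that realtime view can simply be discarded, so any error it accumulated is short-lived.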
To get the final result, the batch and realtime views must be queried and the results merged together.
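For a view like "latest value per key", the merge can be as simple as preferring the realtime view's entry whenever it has one. This is a sketch; the actual merge logic depends on each view's semantics:

```python
# Illustrative query-time merge of batch and realtime views. The realtime
# view covers only data newer than the last batch run, so its entries take
# precedence over the batch view on key collisions.

def query(batch_view, realtime_view, product_id):
    """Return the realtime value if present, else fall back to batch."""
    if product_id in realtime_view:
        return realtime_view[product_id]
    return batch_view.get(product_id)
```

For aggregate views (counts, sums), the merge would instead combine both results, but the principle is the same: the answer served to the client is a function of the batch view and the realtime view together.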
Topics I will cover:
- Why Lambda Architecture? What problems did it solve for us?
- Technical challenges encountered in building the Lambda Architecture:
  - Schema evolution
  - The HDFS small-files issue
  - Code re-use between batch and realtime systems
- Modeling the data pipelines for each layer
- Open problems
Rajesh Muppalla is a co-founder and Director of Engineering at Indix, where he leads the data platform team responsible for collecting, organizing and structuring all the product-related data gathered from the web.
He is passionate about big data, large scale distributed systems, continuous delivery and algorithms. He also likes mentoring and coaching developers in pursuit of building better software.
Prior to Indix, he was a technical lead on Go-CD, an agile release management product, at Thoughtworks. The product was recently open-sourced.
He is a gold medallist in Computer Science from Pune University. In his final undergraduate year, his team represented India at the Asia finals of the Microsoft Imagine Cup (then called the Microsoft .NET Campus Challenge).