Approximate algorithms for summarizing streaming data
Submitted by Himadri Sarkar (@himadri) on Sunday, 14 June 2015
Section: Full Talk
Technical level: Intermediate
1) Introduce two approximate algorithms that are considered cornerstones of big data infrastructure.
2) Show how these algorithms can be used to obtain a first-hand summary of a massive dataset in a streaming manner.
Approximate algorithms can process huge streams of incoming data in a single pass. They consume a bounded amount of memory and CPU cycles, and they let us maintain summaries that are sufficient to answer the expected queries about the data.
Two such algorithms, which find many applications in industry today, are:
1) Count-min sketch (CMS)
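To make the idea concrete, here is a minimal count-min sketch in Python. It is an illustrative sketch only, not the implementation used in the talk or at Sumo Logic. The class name `CountMinSketch`, its method names, and the choice of salted BLAKE2b as the per-row hash function are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: a depth x width table of counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the
        # row number (an assumption of this example, not part of the talk).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Single pass: each incoming item touches exactly one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate; it never falls below the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=1000, depth=5)
stream = ["error", "info", "error", "warn", "error", "info"]
for token in stream:
    sketch.add(token)
```

Memory use is fixed at `width * depth` counters no matter how long the stream is. Calling `sketch.estimate("error")` returns a value that is at least the true count of 3 and may be larger if other items collide with it in every row.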
This talk aims to:
1) Provide a brief introduction to the theoretical aspects behind these algorithms.
2) Show how they can be leveraged to summarize unstructured data for practical purposes.
3) Explain how to choose the tuning parameters pertinent to your needs.
4) Demonstrate how we have used them in the Sumo Logic service.
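As an illustration of point 3 for the count-min sketch, the standard analysis sizes the table from two targets: an error factor epsilon and a failure probability delta. With width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), an estimate exceeds the true count by more than epsilon times the total stream count with probability at most delta. The helper name `cms_dimensions` below is hypothetical, and the guidance given in the talk itself may differ.

```python
import math


def cms_dimensions(epsilon, delta):
    """Return (width, depth) for a count-min sketch with the given targets.

    epsilon: overestimate tolerated, as a fraction of the total stream count.
    delta:   probability that an estimate exceeds that tolerance.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


# Tolerate an overestimate of 0.1% of the stream total, 99% of the time.
width, depth = cms_dimensions(epsilon=0.001, delta=0.01)
counters = width * depth
```

For these targets the table is 2719 counters wide and 5 rows deep, or 13595 counters in total. Tightening epsilon grows the width linearly, while tightening delta grows the depth only logarithmically.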
Interest in approximate algorithms and streaming algorithms
Himadri Sarkar is a Software Engineer at Sumo Logic India, where he currently works in the area of search performance. Sumo Logic is a cloud-based log management and analytics service that leverages machine-generated big data to deliver real-time IT insights. The search performance team is responsible for delivering all the search-related capabilities of the system.