The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Approximate algorithms for summarizing streaming data

Submitted by Himadri Sarkar (@himadri) on Sunday, 14 June 2015


Section: Full Talk
Technical level: Intermediate

Abstract

1) Introduce two approximate algorithms that are considered cornerstones of big data infrastructure.
2) Show how these algorithms can be used to obtain an initial summary of a massive dataset in a streaming manner.

Outline

Approximate algorithms can process huge streams of incoming data in a single pass. They consume a bounded amount of memory and CPU cycles, and they let us maintain summaries that are sufficient to answer the expected queries about the data.

Two such algorithms, which find many applications in industry today, are:
1) Count-Min Sketch (CMS)
2) HyperLogLog (HLL)
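
To make the first of these concrete, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch only, not the implementation used at Sumo Logic: the class name `CountMinSketch`, the use of salted BLAKE2b digests as the per-row hash functions, and the example width and depth are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table (illustrative)."""

    def __init__(self, width, depth):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: derive one hash function per row by salting BLAKE2b
        # with the row number. Any family of independent hashes would do.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Single pass: each item touches exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical stream of log levels, used only to exercise the sketch.
stream = ["error"] * 50 + ["warn"] * 20 + ["info"] * 5
sketch = CountMinSketch(width=272, depth=5)
for token in stream:
    sketch.add(token)
print(sketch.estimate("error"))  # never less than the true count of 50
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items appear in the stream. The price is that estimates can overcount when items collide, but they never undercount.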

This talk aims to:
1) Give a brief introduction to the theoretical aspects behind these algorithms.
2) Show how they can be leveraged to summarize unstructured data for practical purposes.
3) Explain how to choose the tuning parameters pertinent to your needs.
4) Demonstrate how we have used them in the Sumo Logic service.
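
As a rough illustration of point 3, the standard analyses of both algorithms give simple sizing rules. For a Count-Min Sketch, a width of ceil(e / epsilon) and a depth of ceil(ln(1 / delta)) ensure that an estimate exceeds the true count by more than epsilon times the total stream count with probability at most delta. For HyperLogLog with m = 2^p registers, the relative standard error is roughly 1.04 / sqrt(m). The helper names `cms_dimensions` and `hll_precision` below are hypothetical, and the sketch covers only these textbook bounds, not whatever tuning procedure the talk itself presents.

```python
import math


def cms_dimensions(epsilon, delta):
    """Width and depth for a Count-Min Sketch with additive error
    epsilon * N (N = total count), exceeded with probability <= delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


def hll_precision(target_error):
    """Smallest p such that 1.04 / sqrt(2**p) <= target_error,
    where target_error is the desired relative standard error."""
    registers_needed = (1.04 / target_error) ** 2
    return math.ceil(math.log2(registers_needed))


print(cms_dimensions(0.01, 0.01))  # (272, 5)
print(hll_precision(0.02))         # 12, i.e. 4096 registers
```

Tightening epsilon grows the Count-Min table linearly, while tightening delta only adds rows logarithmically. For HyperLogLog, halving the error requires roughly four times as many registers.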

Requirements

Interest in approximate and streaming algorithms.

Speaker bio

Himadri Sarkar is a Software Engineer at Sumo Logic India, where he currently works in the area of search performance. Sumo Logic is a cloud-based log management and analytics service that leverages machine-generated big data to deliver real-time IT insights. The search performance team is responsible for delivering all the search-related capabilities of the system.

Links

Preview video

https://www.youtube.com/watch?v=awC3IJOKks8

Comments

  • Abhishek Porwal (@abhip1987) 4 years ago

    way to go :-)

  • Pratap Dessai (@pratapdessai) 4 years ago

    Good Going Himadri Sarkar

  • Suresh Prajapati (@samtoddler) 4 years ago

    Awesome Sir :)

  • Sumanth N 4 years ago

    Will you be covering results of the application of sketch data structures?

  • Himadri Sarkar (@himadri) Proposer 4 years ago

    Yes, I will demonstrate how we have used Bloom filters, CMS, HLL, reservoir sampling, etc., with tuning parameters, to perform analytics on machine-generated (time series) data.
