The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Approximate algorithms for summarizing streaming data

Submitted by Himadri Sarkar (@himadri) on Sunday, 14 June 2015

Technical level

Intermediate

Section

Full Talk

Status

Confirmed & Scheduled

Total votes:  +45

Objective

1) Introduce two approximate algorithms that are considered cornerstones of big data infrastructure.
2) Show how these algorithms can be used to obtain a first-hand summary of a massive dataset in a streaming manner.

Description

Approximate algorithms can process huge streams of incoming data in a single pass. They consume a small, bounded amount of memory and CPU cycles, and they let us maintain summaries that are sufficient to answer the expected queries about the data.
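To make the single-pass, bounded-memory idea concrete, here is a minimal sketch of a count-min sketch, a summary that answers "roughly how often did this item occur?". It is an illustration under stated assumptions, not the implementation used in the talk or at Sumo Logic: the class name `CountMinSketch`, the default `width` and `depth`, and the use of BLAKE2b with the row number mixed into the input as the per-row hash are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: a depth x width table of counters."""

    def __init__(self, width=2000, depth=5):
        # Memory use is fixed at width * depth counters, regardless of stream size.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption for this sketch: derive one hash per row by prefixing
        # the row number to the item before hashing.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # One counter per row is incremented; each item is seen only once.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical usage: count log levels as they stream past.
cms = CountMinSketch()
for level in ["error"] * 50 + ["warn"] * 20 + ["info"] * 5:
    cms.add(level)
print(cms.estimate("error"))
```

The estimate never falls below the true count; it can only overshoot when other items collide with the queried item in every row.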

Two such novel algorithms that find many applications in industry today are:
1) Count-min sketch (CMS)
2) HyperLogLog (HLL)
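For the second algorithm in the list, the following is a simplified sketch of a HyperLogLog-style distinct counter. It is meant only to show the shape of the technique: the class name `HyperLogLog`, the default precision `p=12`, the use of SHA-1 truncated to 64 bits as the hash, and the omission of the large-range correction are all assumptions of this example rather than details from the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with m = 2**p registers."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            # The alpha formula below is the one for m >= 128 registers.
            raise ValueError("this sketch supports 7 <= p <= 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash64(self, item):
        # Assumption: first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        bits = 64 - self.p
        idx = x >> bits                   # top p bits pick a register
        rest = x & ((1 << bits) - 1)      # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`
        # (bits + 1 when `rest` is all zeros).
        rank = bits - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic-mean estimate over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical usage: estimate distinct user ids in a stream with repeats.
hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i % 20000}")
print(round(hll.count()))
```

Each register stores only a small integer, so the memory footprint stays at `2**p` registers however many items are added, and re-adding an item already seen never changes the estimate.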

This talk aims to:
1) Provide a brief introduction to the theoretical aspects behind these algorithms.
2) Show how they can be leveraged to summarize unstructured data for practical purposes.
3) Explain how to choose the tuning parameters pertinent to your needs.
4) Demonstrate how we have used them in the Sumo Logic service.
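As a rough guide to the tuning parameters mentioned in the list above, the sketch below applies the standard textbook bounds: a count-min sketch with width ceil(e/epsilon) and depth ceil(ln(1/delta)) overestimates a count by at most epsilon times the total stream count with probability at least 1 - delta, and a HyperLogLog with m registers has a relative standard error of about 1.04/sqrt(m). The function names are hypothetical helpers for this example, and the talk may recommend different rules of thumb.

```python
import math


def cms_dimensions(epsilon, delta):
    """Width and depth for a count-min sketch with error epsilon * N
    (N = total count in the stream) at confidence 1 - delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth


def hll_precision(target_rel_error):
    """Smallest p such that 1.04 / sqrt(2**p) <= target_rel_error."""
    if not 0 < target_rel_error < 1:
        raise ValueError("target_rel_error must lie in (0, 1)")
    registers_needed = (1.04 / target_rel_error) ** 2
    return math.ceil(math.log2(registers_needed))


def hll_std_error(p):
    """Approximate relative standard error for 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


# Hypothetical targets: 0.1% of stream size at 99% confidence for CMS,
# and roughly 2% relative error for HyperLogLog.
print(cms_dimensions(0.001, 0.01))
print(hll_precision(0.02), hll_std_error(hll_precision(0.02)))
```

Tightening epsilon grows the count-min sketch linearly in 1/epsilon, while halving the HyperLogLog error requires four times as many registers, so the accuracy target drives the memory budget in both cases.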

Requirements

Interest in approximate and streaming algorithms

Speaker bio

Himadri Sarkar is a Software Engineer at Sumo Logic India, where he currently works in the area of search performance. Sumo Logic is a cloud-based log management and analytics service that leverages machine-generated big data to deliver real-time IT insights. The search performance team is responsible for delivering all the search-related capabilities of the system.

Links

Preview video

https://www.youtube.com/watch?v=awC3IJOKks8

Comments

  • 2
    Abhishek Porwal (@abhip1987) 3 years ago

    way to go :-)

  • 1
    Pratap Dessai (@pratapdessai) 3 years ago

    Good Going Himadri Sarkar

  • 1
    Suresh Prajapati (@samtoddler) 3 years ago

    Awesome Sir :)

  • 1
    Sumanth N 3 years ago

    Will you be covering results of applying sketch data structures?

  • 1
    Himadri Sarkar (@himadri) Proposer 3 years ago

    Yes, I will demonstrate how we have used Bloom filters, CMS, HLL, reservoir sampling, etc., with tuning parameters, to perform analytics on machine-generated (time-series) data.
