The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Himadri Sarkar

@himadri

Approximate algorithms for summarizing streaming data

Submitted Jun 15, 2015

  1. Introduce two approximate algorithms that are considered cornerstones of big data infrastructure.
  2. Show how these algorithms can be used to obtain a first-hand summary of a massive dataset in a streaming manner.

Outline

Approximate algorithms can process huge streams of incoming data in a single pass. They consume a bounded amount of memory and CPU cycles, and they let us maintain summaries that are sufficient to answer the expected queries about the data.

Two such algorithms, which find many applications in industry today, are:

  1. Count-Min Sketch (CMS)
  2. HyperLogLog
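
To make the two summaries concrete, here is a minimal Python sketch of both. It is an illustration under stated assumptions, not the implementation used in the talk or at Sumo Logic: the class names, the choice of BLAKE2b as the hash function, and the default sizes are all hypothetical. The Count-Min Sketch estimates how often an item occurred and can overestimate but never underestimate; the HyperLogLog estimates how many distinct items were seen.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate per-item frequency counts in fixed memory (illustrative)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


class HyperLogLog:
    """Approximate distinct count using 2**p small registers (illustrative)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest(), "big"
        )
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


cms = CountMinSketch(width=1000, depth=4)
for level in ["error", "error", "error", "warn", "info"]:
    cms.add(level)
print("approx count of 'error':", cms.estimate("error"))

hll = HyperLogLog(p=12)
for i in range(10000):
    hll.add(f"user-{i}")
print("approx distinct users:", round(hll.estimate()))
```

Both structures are updated one item at a time and never grow, which is what makes them suitable for a single pass over a stream.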

This talk aims to:

  1. Provide a brief introduction to the theoretical aspects behind these algorithms.
  2. Show how they can be leveraged to summarize unstructured data for practical purposes.
  3. Explain how to choose the tuning parameters pertinent to your needs.
  4. Demonstrate how we have used them in the Sumo Logic service.
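
As a rough guide to the tuning parameters, the standard textbook bounds can be turned into a small sizing helper. This is a sketch under assumptions: the function names are hypothetical, and the formulas are the usual ones (a Count-Min Sketch with width ceil(e/epsilon) and depth ceil(ln(1/delta)) overestimates any count by at most epsilon times the total stream count with probability at least 1 - delta; a HyperLogLog with m registers has a relative standard error of about 1.04/sqrt(m)). The talk's own recommendations may differ.

```python
import math


def cms_dimensions(epsilon, delta):
    """Width and depth for additive error <= epsilon * N with probability >= 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


def hll_precision(target_rel_error):
    """Smallest p such that m = 2**p registers give ~1.04/sqrt(m) <= target error."""
    m_needed = (1.04 / target_rel_error) ** 2
    return max(4, math.ceil(math.log2(m_needed)))


width, depth = cms_dimensions(epsilon=0.01, delta=0.01)
print("CMS width x depth:", width, "x", depth)

p = hll_precision(0.02)
print("HLL precision p:", p, "registers:", 1 << p)
```

Tighter error targets cost memory in both cases: halving epsilon doubles the Count-Min Sketch width, and halving the HyperLogLog error quadruples the number of registers.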

Requirements

Interest in approximate algorithms and streaming algorithms

Speaker bio

Himadri Sarkar is a Software Engineer at Sumo Logic India, where he currently works in the area of search performance. Sumo Logic is a cloud-based log management and analytics service that leverages machine-generated big data to deliver real-time IT insights. The search performance team is responsible for delivering all the search-related capabilities of the system.

