The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Using Probabilistic Data Structures to Build Real-Time Monitoring Dashboards


Rahul Ramesh


Performing basic operations like finding an element in a set or calculating the set’s cardinality is child’s play for a few thousand data points. However, these operations become complex and prohibitively expensive as the dataset grows into the millions and spans multiple dimensions.

One way to address this problem is to first index the data in a database, and then compute its cardinality or check whether an element is present. However, this approach is not optimized for streaming data. Is it possible to perform these operations in a fixed amount of time with an acceptable trade-off in accuracy?

At DataWeave, we have managed to crawl millions of URLs every day and analyze a large number of data points in real time, with low error rates.

This talk presents an innovative way to build a monitoring dashboard using two probabilistic data structures - Bloom Filters and HyperLogLog.
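To make the membership side of this concrete, here is a minimal Bloom filter sketch in Python. It is an illustration only, not DataWeave's implementation: the class name `BloomFilter`, the use of SHA-256 with double hashing, and the example URLs are all assumptions made for this sketch. The sizing uses the standard formulas for the number of bits and hash functions given an expected item count and a target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            1,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit slices of one SHA-256 digest
        # (double hashing). The hash choice is an assumption of this sketch.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical usage: track which URLs a crawler has already seen.
seen = BloomFilter(expected_items=100_000, false_positive_rate=0.01)
seen.add("https://example.com/product/1")
print("https://example.com/product/1" in seen)
```

Both `add` and the membership check touch a fixed number of bits regardless of how many items have been inserted, which is what gives the constant-time behaviour. The cost is that a lookup can occasionally report an item as present when it was never added.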


1) Sketching
2) Bloom Filters
3) HyperLogLog
4) Practical Use Cases
5) Real-time dashboard using Bloom Filters and HyperLogLog (HLL)
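For the HyperLogLog item in the outline above, the following is a minimal Python sketch of cardinality estimation. It is a simplified illustration rather than the implementation described in the talk: the class name `HyperLogLog`, the choice of SHA-256 as the hash, the default precision of 12, and the example keys are assumptions. It uses the common bias-correction constant for 128 or more registers and falls back to linear counting for small cardinalities, but omits the large-range correction.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**precision small registers."""

    def __init__(self, precision: int = 12):
        if precision < 7:
            raise ValueError("this sketch assumes precision >= 7 (m >= 128)")
        self.p = precision
        self.m = 1 << precision              # number of registers
        self.registers = bytearray(self.m)   # each holds a max rank seen
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, m >= 128

    def add(self, item: str) -> None:
        # Take a 64-bit hash value (SHA-256 is an assumption of this sketch).
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self) -> float:
        # Harmonic mean of 2**register across all registers, scaled by alpha.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical usage: estimate distinct URLs in a stream that has repeats.
hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"https://example.com/page/{i % 20_000}")
print(round(hll.count()))
```

With 4,096 registers the typical relative error is around 1.04 / sqrt(4096), or about 1.6%, while memory use stays fixed no matter how many items are added.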

Speaker bio

I work as a Software Engineer in the data platforms team at DataWeave, a provider of Competitive Intelligence as a Service for retailers and consumer brands. I design and manage dataflows to various ‘Datastores’ maintained by the company. I also ensure that all datastores are working at optimum capacity and that data consistency is maintained across them.

I have more than 10 years of experience in the software industry, including extensive work building core networks in the telecommunications domain. I hold a Master’s degree from IIIT-Bangalore.