The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Agam Jain


Learnings from building a TV viewership platform for 100 million users at Zapr

Submitted Apr 30, 2017

Zapr Media Labs has come a long way, from tracking the TV viewership of around 5 million users two years ago to around 100 million users today. We want to share what we learned while building a complex platform based on audio signal processing that has gone through this kind of hypergrowth: it processes more than a billion signals per day, produces terabytes of raw organic data, and processes petabytes of data on a daily basis.
The talk will focus on the technologies we used and why they worked better for us than the alternatives. It will also explain how the system evolved over this period, which should be useful to any data-driven company.


  • What we do at Zapr
  • offline media consumption of users

  • what our raw and final data looks like

  • from raw audio fingerprints generated by the mobile app to a user’s viewership record

  • what we need to process

  • outline of transformations required on the raw data
  • Data Sinks
  • Fingerprint Processing System
  • Data Enrichment/Aggregation System

  • how we moved from a vertical to horizontally scalable system

  • various technology choices
  • scaling out to worker-based sample processing
  • how to schedule jobs
  • immutable data approach
  • message processing pipeline

  • evolution of tech used in the Viewership Infrastructure

  • from a monolith using PHP and MongoDB
  • to Netty, Kafka (cornerstone), Aerospike, Samza, S3 (cornerstone), and Druid

Speaker bio

I'm Agam Jain. I've been at Zapr since its inception in early 2013. I joined as a college intern when the company had 5 people (including the 3 founders). Over the next 3 years I worked on many internal projects, one of which was the cloud-based Matching Infrastructure. There we built a system that worked well and was very cost-effective when we were processing data from a few thousand users. Over time we have worked and reworked this setup, from a monolith to a pipeline of events that handles the present scale of 100 million users.



