The Fifth Elephant 2017

On data engineering and application of ML in diverse domains


Learnings from building a TV viewership platform for 100 million users at Zapr

Submitted by Agam Jain (@agamjain) on Sunday, 30 April 2017

Preview video

Section: Full talk for data engineering track
Technical level: Intermediate


Zapr Media Labs has come a long way, from tracking the TV viewership of around 5 million users two years ago to around 100 million users today. We want to share what we learned while building a complex platform based on audio signal processing that has gone through this kind of hypergrowth: it processes more than a billion signals per day, produces terabytes of raw organic data, and processes petabytes of data daily.
The talk will focus on the technologies we used and why they worked better than the alternatives. It will also explain how the system evolved over this period, which all data-driven companies can benefit from.


  • What we do at Zapr
  • offline media consumption of users (

  • what our raw and final data looks like

  • from raw audio fingerprints generated by the mobile app to a user’s viewership record

  • what we need to process

  • outline of transformations required on the raw data
  • Data Sinks
  • Fingerprint Processing System
  • Data Enrichment/Aggregation System

  • how we moved from a vertical to horizontally scalable system

  • various technology choices
  • scaling out to worker-based sample processing
  • How to schedule jobs?
  • immutable data approach
  • message processing pipeline

  • evolution of tech used in the Viewership Infrastructure

  • from a monolith using PHP and MongoDB
  • to Netty, Kafka (cornerstone), Aerospike, Samza, S3 (cornerstone), and Druid
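
To make the "immutable data approach" and "message processing pipeline" items in the outline more concrete, here is a minimal Python sketch. It is an illustration only and not Zapr's implementation: the record types, field names, and the `KNOWN_FINGERPRINTS` lookup table are all hypothetical, and a plain in-memory list stands in for a message log such as Kafka. Each stage reads immutable events and emits new immutable records instead of updating anything in place.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)  # frozen=True makes instances immutable
class FingerprintEvent:
    """Hypothetical raw event sent by the mobile app."""
    user_id: str
    captured_at: int  # epoch seconds
    fingerprint: str


@dataclass(frozen=True)
class ViewershipRecord:
    """Hypothetical enriched record derived from a matched fingerprint."""
    user_id: str
    captured_at: int
    channel: str


# Hypothetical lookup table standing in for a real fingerprint matcher.
KNOWN_FINGERPRINTS: Dict[str, str] = {"fp-a": "channel-1", "fp-b": "channel-2"}


def match(events: Iterable[FingerprintEvent]) -> Iterator[ViewershipRecord]:
    """Stage 1: turn raw events into viewership records, dropping unmatched ones."""
    for event in events:
        channel = KNOWN_FINGERPRINTS.get(event.fingerprint)
        if channel is not None:
            yield ViewershipRecord(event.user_id, event.captured_at, channel)


def count_by_channel(records: Iterable[ViewershipRecord]) -> Dict[str, int]:
    """Stage 2: aggregate records into per-channel counts."""
    counts: Dict[str, int] = {}
    for record in records:
        counts[record.channel] = counts.get(record.channel, 0) + 1
    return counts


if __name__ == "__main__":
    raw_events = [
        FingerprintEvent("u1", 1, "fp-a"),
        FingerprintEvent("u2", 2, "fp-b"),
        FingerprintEvent("u3", 3, "fp-unknown"),
    ]
    print(count_by_channel(match(raw_events)))
```

Because no stage mutates its input, the raw events can be replayed through a changed matcher or a new aggregation without losing the original data, which is one common reason for pairing an immutable log with independent processing stages.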

Speaker bio

I’m Agam Jain, and I’ve been at Zapr since its inception in early 2013. I joined as a college intern when the company was 5 people (including the 3 founders), and over the next 3 years I worked on many internal projects, one of which was the cloud-based Matching Infrastructure. There we built a system that worked for us when we were processing data from a few thousand users and was very cost-effective as well. Over time we’ve worked and reworked this setup from a monolith into a pipeline of events that handles the present scale of 100 million users.





  • Abhishek Balaji (@booleanbalaji) Reviewer 2 years ago

    Hi Agam,

    Please upload the draft slides for the talk and a 2-minute preview video outlining what you would like to cover in your talk, the key takeaways for the audience, and how your approach worked better than others.

    • Zainab Bawa (@zainabbawa) Reviewer 2 years ago

      Agam, we are awaiting your preview video to close the decision on this talk.

  • Agam Jain (@agamjain) Proposer 2 years ago

    Uploaded the slides as well as the preview video. Also incorporated your suggestions.
    I’m also working on the architecture diagrams for every stage and will upload the slides by tomorrow.
