Learnings from building a TV viewership platform for 100 million users at Zapr
Submitted by Agam Jain (@agamjain) on Sunday, 30 April 2017
Section: Full talk for data engineering track
Technical level: Intermediate
Zapr Media Labs has come a long way, from tracking the TV viewership of around 5 million users two years ago to around 100 million users today. We want to share what we learned while building a complex audio-signal-processing platform through this kind of hyper-growth: a platform that now handles more than a billion signals per day, produces terabytes of raw organic data, and processes petabytes of data daily.
The talk will focus on the technologies we used and why they worked better for us than the alternatives. It will also cover how the system evolved over this period, which any data-driven company can learn from.
- What we do at Zapr
  - offline media consumption of users (http://zapr.in)
- What our raw and final data looks like
  - from the raw audio fingerprints generated by the mobile app to a user's viewership record (shapes sketched in the first example after this outline)
- What we need to process
  - outline of the transformations required on the raw data
  - data sinks
    - Fingerprint Processing System
    - Data Enrichment/Aggregation System
- How we moved from a vertically to a horizontally scalable system
  - various technology choices
  - scaling out to worker-based sample processing (see the worker sketch below)
  - how to schedule jobs?
  - immutable data approach (see the append-only sketch below)
  - message processing pipeline (see the Samza sketch below)
- Evolution of the tech used in the viewership infrastructure
  - from a monolith using PHP and MongoDB
  - to Netty, Kafka (a cornerstone), Aerospike, Samza, S3 (a cornerstone), and Druid
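To make the raw-versus-final distinction concrete, here is a minimal sketch in Java of the two data shapes the outline refers to. The field names (`deviceId`, `channelId`, and so on) are illustrative assumptions, not Zapr's actual schema:

```java
import java.time.Instant;

/** A raw audio fingerprint sample, as uploaded by the mobile app. */
record FingerprintSample(String deviceId, Instant capturedAt, byte[] fingerprint) {}

/** The final viewership record produced after matching the sample
 *  against a reference index of TV audio. Field names are hypothetical. */
record ViewershipRecord(String deviceId, String channelId, String programId,
                        Instant watchedFrom, Instant watchedTo) {}
```

A billion-plus signals per day arrive in the first shape; the enrichment and aggregation systems work with the second.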
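The worker-based sample processing item is worth a sketch: stateless workers pull fingerprint samples off a queue, so scaling out is simply a matter of adding consumers. Below is a minimal, hypothetical version built on a Kafka consumer group; the topic name, group id, and `match` placeholder are assumptions for illustration, not Zapr's actual code:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MatchingWorker {

    /** Placeholder for the audio-fingerprint lookup against reference TV audio. */
    static void match(String deviceId, byte[] fingerprint) {
        // compare the fingerprint against the reference index and emit a match event
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All workers share one group id, so Kafka spreads partitions across them;
        // scaling out is just starting another worker process.
        props.put("group.id", "fingerprint-matchers");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("fingerprint-samples"));
            while (true) {
                ConsumerRecords<String, byte[]> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : batch) {
                    match(record.key(), record.value());
                }
                consumer.commitSync(); // commit offsets only after the batch is processed
            }
        }
    }
}
```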
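The immutable data approach can be illustrated just as briefly: instead of updating viewership rows in place, every match result is appended as a new event, and a correction is simply a later event. A hypothetical producer-side sketch, with the topic name and payload shape assumed:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MatchEventWriter {

    private final KafkaProducer<String, String> producer;

    public MatchEventWriter() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Append-only: a correction is published as a new event, never as an update. */
    public void append(String deviceId, String matchEventJson) {
        producer.send(new ProducerRecord<>("viewership-events", deviceId, matchEventJson));
    }
}
```

Because events are never mutated, downstream jobs can be replayed from the log (or from raw data archived in S3), which is what makes reworking a live pipeline safe.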
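Finally, since Samza is named as part of the message processing pipeline, here is the general shape of a Samza low-level-API task for an enrichment step. The stream names and the enrichment logic are placeholders, not Zapr's implementation:

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class EnrichmentTask implements StreamTask {

    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "enriched-viewership");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Placeholder enrichment: a real job would join the match event with
        // programme metadata before forwarding it downstream.
        Object matchEvent = envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, matchEvent));
    }
}
```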
I'm Agam Jain and I've been at Zapr since its inception in early 2013. I joined as a college intern when the company strength was 5 people (including the 3 founders), and over the next 3 years I worked on many internal projects, one of which was the cloud-based matching infrastructure. We built a system that served us well when we were processing data from a few thousand users, and it was very cost-effective too. Over time we have worked and reworked this setup from a monolith into a pipeline of events that handles the present scale of 100 million users.