The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem


Analysing high throughput Data in Real Time

Submitted by Namit Mahuvakar (@nomnom) on Saturday, 15 June 2019

Session type: Short talk of 20 mins
Session type: Full talk of 40 mins


Abstract

Namit Mahuvakar
Data Engineering at Hotstar

At Hotstar, India’s largest premium streaming and entertainment platform, we generate more than 15 billion clickstream events per day. This data is generated from multiple sources and by multiple teams. We built Bifrost, our internal Data Management Platform, as a single platform that lets users ingest data of any kind and shape and query both streaming and stationary data with ease. The data ingestion API abstracts the underlying complexities of producing, consuming, and processing data. It is built to be highly available, durable, and resilient, since it is the single entry point for all data coming into Hotstar.

Kafka is the backbone of our real-time data platform. Data is ingested through in-house fault-tolerant solutions built around the ingestion API layer, which is written entirely in Go and reliably ingests TBs of validated data each day at a peak of a million messages per second.
In this presentation we will discuss the promise and use of stream processing with Kafka Streams and KSQL to analyse the ingested events and solve real-time use cases such as Playback Failure Rate, a fundamental metric for over-the-top (OTT) media streaming platforms.
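
To make the Playback Failure Rate use case concrete, here is a minimal sketch of a tumbling-window aggregation in plain Python. It is not Hotstar's implementation and does not use Kafka Streams or KSQL; it only illustrates the kind of windowed computation such a query would express. The event fields, the event type names, the 60-second window, and the definition of the rate as failures divided by all playback attempts are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 60  # hypothetical window size, chosen for illustration


@dataclass(frozen=True)
class PlaybackEvent:
    """Hypothetical clickstream event; the field names are assumptions."""
    timestamp: int   # event time, in seconds since the epoch
    event_type: str  # assumed values: "playback_start" or "playback_failure"


def playback_failure_rate(events, window_seconds=WINDOW_SECONDS):
    """Return {window_start: failures / attempts} for each tumbling window.

    An attempt is any start or failure event (an assumed definition).
    Other event types are ignored, and windows with no attempts are omitted.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        if event.event_type not in ("playback_start", "playback_failure"):
            continue
        # Align the event to the start of its fixed-size, non-overlapping window.
        window_start = event.timestamp - event.timestamp % window_seconds
        attempts[window_start] += 1
        if event.event_type == "playback_failure":
            failures[window_start] += 1
    return {w: failures[w] / attempts[w] for w in sorted(attempts)}


sample_events = [
    PlaybackEvent(0, "playback_start"),
    PlaybackEvent(10, "playback_failure"),
    PlaybackEvent(59, "playback_start"),
    PlaybackEvent(30, "playback_start"),   # arrival order does not matter here
    PlaybackEvent(60, "playback_failure"),
    PlaybackEvent(75, "heartbeat"),        # not a playback attempt, ignored
]
rates = playback_failure_rate(sample_events)
print(rates)  # {0: 0.25, 60: 1.0}
```

This batch-style function processes a finite list of events. A streaming engine would instead update the same per-window counters incrementally as events arrive and would also need a policy for late-arriving events, which this sketch leaves out.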

Outline

  • Introduction
  • About Hotstar
  • Stream Processing @Hotstar
    • What is stream processing and why was it required
    • Problems that led to its use
      • Video Player Metrics
      • Social Signals
      • User Targeting
  • Case Study - Video Player Metrics
    • What are the P1 metrics
    • How did we solve and compute them in real time
  • Case Study - Social Signals
    • What are the Social Signals
    • How did we solve engagement in real time
  • Key Takeaway Discussion
    • Why and when should we use Stream processing
  • Q&A

Requirements

Basic knowledge of:

  • Kafka
  • Stream Processing
  • HDFS
  • SQL

Speaker bio

Currently in Data Engineering at Hotstar. Previously at WebEngage and co-founder at CareODrive. Interested in spreading and sharing knowledge and in solving problems at a scale that matters. Previously gave talks at Golang Meet-Ups in Bangalore, India, and at the 21CF Global Data Summit. Big fan of Radiohead; hit me up for a jam session any time.

Links

Slides

https://docs.google.com/presentation/d/1LDjmMYOCFZckDvIZuioVkvVeHE0ORC2iv4tbPhVhXm0/edit?usp=sharing

Comments

  • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

    Hi Namit,

    Thank you for submitting a proposal. We’re moving this to evaluation. Please update your slides to make sure they cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/frameworks available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how it changed after implementing the solution you picked and built? Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    We need your updated slides and preview video by Jun 27, 2019 to evaluate your proposal. If we do not receive an update, we’d be moving your proposal for evaluation under a future event.

    • Namit Mahuvakar (@nomnom) Proposer 4 months ago

      Hey Abhishek,
      Can you have a look at the updated slides and context? I’m working on the video in parallel.
      Thanks :)

      • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

        Hi Namit,

        I did take a look at the updated slides. Here’s the feedback:

        • The problem is still not defined. The slides are sparse in defining what the problem is and what approach you followed.
        • Currently reads more like documentation of your architecture. How would this be useful for someone in the audience?
        • The audience at The Fifth Elephant are likely to already know about batch and stream processing. What is novel in this presentation that someone cannot find on the internet?

        Do rework your slides based on the questions posed in the previous comment and make sure to define the problem you’re trying to solve.

  • Namit Mahuvakar (@nomnom) Proposer 4 months ago

    Hi Abhishek,
    Thanks. I’m updating the outline and slides to reflect these pointers.

    • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago (edited 3 months ago)

      Hi Namit,

      We’d like to schedule a rehearsal and move your proposal for evaluation. You’ll get an email invitation for a rehearsal call along with other instructions.

      Here’s some feedback for your talk, which we expect to be incorporated before your rehearsal:

      • The topic is interesting, but the presentation looks like a tutorial on streaming.
      • Audience at The Fifth Elephant want to hear the challenges in implementation that Hotstar faced.
      • Add more Hotstar-specific details or a great story about Hotstar’s journey/battles.
      • Add more examples/metrics/case studies drawing from current sporting events like the Cricket World Cup as well as launching of new content.
      • Namit Mahuvakar (@nomnom) Proposer 3 months ago (edited 3 months ago)

        Hey Abhishek/Rajat,

        • Updated the slides to try to make the presentation more problem-oriented

        • Working on including a couple more slides on problems and case studies

        Trying to keep the main focus on how to calculate PO metrics during big-ticket events such as IPL/WC at a scale of 18+ million live concurrent users.

        • Namit Mahuvakar (@nomnom) Proposer 3 months ago

          Hey Abhishek,
          Made some more changes to the slides

      • Namit Mahuvakar (@nomnom) Proposer 3 months ago

        Hey Abhishek,

        • Updated slides 6-8 and 15-19 with better images and less, more consolidated text
  • Rajat Venkatesh (@vrajatblr) 3 months ago

    Hi Namit,
    I want to reiterate the feedback by Balaji. The talk is very architecture-heavy and lacks the context needed to really appreciate the pros of stream vs batch processing. For example, take a look at the blog by Uber about AthenaX (https://eng.uber.com/athenax/). The blog starts with a business requirement, then translates the requirements from internal departments into the architecture (and the need for a new project) to satisfy the business requirement. This part is missing. I feel Slide 13 addresses the business requirement but should come earlier in the talk.

    • Namit Mahuvakar (@nomnom) Proposer 3 months ago

      Hey Rajat,
      Abhishek also gave me feedback on similar grounds. I’ll update the slides by tomorrow, as we have a rehearsal the day after, and will ping once I’m done updating.

      • I am ready with a working demo with mocked data for the problem I’ve stated.
      • As you mentioned, I will make the slides more problem-oriented, followed by the details of how it’s done, addressing the pros and cons.

      • Abhishek Balaji (@booleanbalaji) Reviewer 3 months ago

        Hi Namit,
        Couple of changes -

        1. Your key takeaway slide is currently the summary slide, so rename it to a summary. Your actual key takeaways come much earlier, at Slide #5, so you can use that as your takeaways slide.

        2. Use some screenshots to show what exactly the segmentation is for. For instance a Swiggy ad shown during live sports would be a result of the targeted segmentation of users when watching the livestream. This needs to be made clear.

        3. More clarity on where your processing fits into the entire pipeline and where the components are used. Would be useful to relate to the full lifecycle of a live sporting event.
