The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Anatomy of a production ML feature engineering platform

Submitted by Venkata Pingali (@pingali) on Monday, 1 April 2019


Session type: Full talk of 40 mins

Abstract

This talk addresses the following questions:

  1. What should a production ML feature engineering platform have and why?
  2. When do I need one?
  3. What are my options if I have to build one?

This talk draws on Scribble’s experience building and evolving a production feature engineering platform, and on the many conversations we have had with the data scientists who are its users. It focuses on the learnings rather than on the Scribble product itself, and expands on the talk on reducing costs given at Fifth Elephant Mumbai in January 2019.

Outline

Rough Outline:

  • Objectives of a feature engineering platform (5 mins)

    • Reduce time to market
    • Enhance robustness of models
    • Enable explainability
  • Points of friction & required capabilities (20 mins)

    • What is in my data? (catalog)
    • Is my input data complete and correct? (health)
    • How do I link existing side information? (augment/enrich)
    • How do I capture tacit knowledge/signal? (labeling)
    • How do I reliably prepare my training datasets? (pipelines)
    • How do I check, audit & validate what has been computed? (audit)
    • How do I discover what is being computed and used? (marketplace)
    • How do I export discovered features for model dev and track those exports? (search)
    • How do I link the features to performance? (monitor)
    • How do I reuse the features in the streaming path? (library)
  • Economics of Feature Engineering (5 mins)

    • Feature computation is expensive, and each feature has a price
    • Amortization happens over time & across models
    • Process discipline required
    • Questions to ask:
      1. How many models will I have over time?
      2. How defensible should they be?
      3. How available should they be?
      4. How many features will they need?
  • Approaches to building one (5 mins)

    • FEAST (Go-JEK; well thought through but tied to GCP)
    • Combine standalone components (OSS exists but incurs integration costs)
    • Third-party (move fast but incur platform costs)
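
The amortization point in the economics part of the outline can be illustrated with a small sketch. It is only an illustration: the `Feature` class, the cost fields, the even split across models, and all numbers are assumptions made up for this example, not figures or methods from the talk.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """A hypothetical feature with illustrative cost fields (arbitrary units)."""
    name: str
    build_cost: float  # one-time cost to develop and validate the feature
    run_cost: float    # recurring cost per period to compute it


def amortized_cost_per_model(feature: Feature, periods: int, models: int) -> float:
    """Total cost of a feature, spread evenly across the models that reuse it.

    Assumes an even split; real cost attribution may be weighted differently.
    """
    if periods < 1 or models < 1:
        raise ValueError("periods and models must be at least 1")
    total = feature.build_cost + feature.run_cost * periods
    return total / models


if __name__ == "__main__":
    spend = Feature("customer_90d_spend", build_cost=100.0, run_cost=10.0)
    # The same feature gets cheaper per model as more models reuse it.
    for n_models in (1, 2, 5):
        print(n_models, amortized_cost_per_model(spend, periods=12, models=n_models))
```

Under these assumed numbers, the per-model cost falls as reuse grows, which is why the outline asks how many models and features are expected over time.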

Requirements

Familiarity with data science process

Speaker bio

Dr. Venkata Pingali is Co-Founder and CEO of Scribble Data, an ML Engineering company based in Bangalore and Denver. Scribble’s flagship enterprise product, Enrich, accelerates ML productionization in enterprises. Before starting Scribble Data, Dr. Pingali was VP of Analytics at a political data consulting firm. He has a BTech from IIT Bombay and a PhD from USC in Computer Science.

Links

Slides

https://docs.google.com/presentation/d/1PykTQ-qE8B-13RaAH9g-aGHrwsAr-8gDLyUJMjh-ayw/edit?usp=sharing

Preview video

https://youtu.be/KSLeIay_b4k

Comments

  • Anwesha Sarkar (@anweshaalt) Reviewer 7 months ago

    Thank you for submitting the proposal. Please submit your preview video by 20th April (latest); it helps us close the review process.

    • Venkata Pingali (@pingali) Proposer 7 months ago
  • Zainab Bawa (@zainabbawa) Reviewer 6 months ago

    This looks like an interesting presentation, one that may open a Pandora’s Box, if you will. :) One question I have is: why have you specifically chosen these examples: Feast, Scribble Enrich and examples of in-house builds? You may want to mention why you chose them, and point to references for other examples that people may want to look up.

    • Venkata Pingali (@pingali) Proposer 6 months ago

      Since the talk is about feature engineering, I picked ones that explicitly claim feature engineering as their purpose. Such components are also embedded in other systems (e.g., Uber’s Michelangelo) that are absolutely worth studying. In fact, they are strongly recommended.

      I will update the flow.

      • Zainab Bawa (@zainabbawa) Reviewer 6 months ago

        Noted.

  • Abhishek Balaji (@booleanbalaji) Reviewer 6 months ago

    Thanks for rehearsing today Venkat. Here’s the feedback from your talk:

    • Time taken: 27 mins
    • The talk is well paced, message is clear and the expectations were met.
    • Show examples of scale talked about in the presentation.
    • Add examples to help understand the issues around namespaces and versioning.
    • Explain the audit interface with more examples – add logos or graphics to help the audience grasp it better.
    • Try to add a story after the introduction. It gets very dry after an engaging first 10 mins.
    • Add visuals, flow charts and infographics where necessary.

    Please update your slides to incorporate the feedback by 31 May. We’ll update you on the status after evaluating your revised slides.

    • Venkata Pingali (@pingali) Proposer 6 months ago

      Will do. Thanks to all the attendees.

  • Venkata Pingali (@pingali) Proposer 6 months ago

    Posted updated slides. Summary of changes:

    1. Added more visuals to the second part of the presentation (screenshots/graphics)
    2. Added pic of the audit interface to make the whole thing real
