The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Anatomy of a production ML feature engineering platform

Submitted by Venkata Pingali (@pingali) on Apr 1, 2019

Session type: Full talk of 40 mins

Status: Confirmed & scheduled

Abstract

This talk addresses the following questions:

  1. What should a production ML feature engineering platform have and why?
  2. When do I need one?
  3. What are my options if I have to build one?

This talk draws on Scribble’s experience building and evolving a production feature engineering platform, and on the many conversations we have had with the data scientists who use it. It focuses on the learnings rather than on the Scribble product itself, and expands on the January 2019 Fifth Elephant Mumbai talk on reducing costs.

Outline

Rough Outline:

  • Objectives of a feature engineering platform (5 mins)

    • Reduce time to market
    • Enhance robustness of models
    • Enable explainability
  • Points of friction & required capabilities (20 mins)

    • What is in my data? (catalog)
    • Is my input data complete and correct? (health)
    • How do I link existing side information? (augment/enrich)
    • How do I capture tacit knowledge/signal? (labeling)
    • How do I reliably prepare my training datasets? (pipelines)
    • How do I check, audit & validate what has been computed? (audit)
    • How do I discover what is being computed and used? (marketplace)
    • How do I export discovered features for model dev and track those exports? (search)
    • How do I link the features to performance? (monitor)
    • How do I reuse the features in the streaming path? (library)
  • Economics of Feature Engineering (5 mins)

    • Feature computation is expensive, and each feature has a price
    • Amortization happens over time & across models
    • Process discipline is required
    • Questions to ask:
      1. How many models will I have over time?
      2. How defensible should they be?
      3. How available should they be?
      4. How many features will they need?
  • Approaches to building one (5 mins)

    • FEAST (Go-JEK; well thought through but tied to GCP)
    • Combine standalone components (OSS exists but incurs integration costs)
    • Third-party (move fast but incur platform costs)
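
The amortization point in the economics section can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the cost model (a one-time build cost plus a flat monthly run cost), and all numbers are assumptions for the sake of example, not figures from the talk.

```python
def amortized_monthly_cost_per_model(build_cost, monthly_run_cost, months, models_using):
    """Share of one feature's total cost carried by each model, per month.

    Hypothetical cost model: total = one-time build cost + flat monthly run cost.
    The total is spread evenly over the months in service and the models that
    consume the feature.
    """
    if months <= 0 or models_using <= 0:
        raise ValueError("months and models_using must be positive")
    total_cost = build_cost + monthly_run_cost * months
    return total_cost / (months * models_using)


# Assumed numbers: a feature costing 1200 to build and 100 per month to run.
single_model = amortized_monthly_cost_per_model(1200, 100, months=12, models_using=1)
four_models = amortized_monthly_cost_per_model(1200, 100, months=12, models_using=4)
print(single_model, four_models)
```

Under these assumptions, reusing the same feature across four models instead of one cuts the per-model monthly cost to a quarter, which is why the number of models and features expected over time matters when deciding whether to build a platform.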

Requirements

Familiarity with the data science process

Speaker bio

Dr. Venkata Pingali is Co-Founder and CEO of Scribble Data, an ML Engineering company based in Bangalore and Denver. Scribble’s flagship enterprise product, Enrich, accelerates ML productionization in enterprises. Before starting Scribble Data, Dr. Pingali was VP of Analytics at a political data consulting firm. He has a BTech from IIT Bombay and a PhD from USC in Computer Science.

Links

Slides

https://docs.google.com/presentation/d/1PykTQ-qE8B-13RaAH9g-aGHrwsAr-8gDLyUJMjh-ayw/edit?usp=sharing

Preview video

https://www.youtube.com/watch?v=KSLeIay_b4k
