Reducing Cost of Production AI: Feature Engineering Case Study
Submitted by Venkata Pingali (@venkatapingali) on Monday, 1 October 2018
The number and complexity of datasets, use cases, and models are growing rapidly. However, the number of ML/AI applications in production is growing much more slowly. AI in production suffers from multiple challenges that vary by domain. We focus on one common activity: machine learning feature engineering involving textual data. It accounts for 40-80% of the time and contributes significantly to the cost of ML applications using business data. We do not address other aspects of AI.
We identify the cost dimensions in feature engineering for business data and share ways to reduce the cost of this step.
The methods have been tested every day for more than a year. We enabled customer modeling in production across multiple deployments that covered 2M+ people, consumed 800GB of data, and computed up to 500 features for each person.
This approach draws on our experience building a less flexible and more expensive solution in 2016 using Hive and Pandas. That solution serves as an imperfect baseline, and our current approach is, conservatively, a 3x improvement over it.
- Feature Engineering Overview
- Typical Feature Engineering Cycle
- Detailed Cost Drivers
- Examples: Reconciliation & auditing, change management
- Indicative Quantitative Improvement
- Detailed discussion of each driver
Have an ML system in production or plan to have one.
Dr. Venkata Pingali is Co-Founder and CEO of Scribble Data, an ML Engineering company based in Bangalore and Denver. Scribble’s flagship enterprise product, Enrich, accelerates ML product development in enterprises. Before starting Scribble Data, Dr. Pingali was VP of Analytics at a political data consulting firm. He has a BTech from IIT Mumbai and a PhD from USC in Computer Science.