Building a large scale fully automatic machine learning platform from scratch
Submitted by Dipayan Maiti (@dipayanm) on Saturday, 30 April 2016
Data science is hard and expensive, and it needs a combination of math, statistics and software engineering skills. Mass adoption of data science is only possible if self-service machine learning platforms are built. We have built Insight Jedi, the first fully automatic machine learning platform that automates the complete data-to-decisions workflow: data cleanup, feature generation, feature filtering, model building, insights generation, and predictions and recommendations. In this talk we will show how Insight Jedi was built from scratch. Specifically, we will discuss the design considerations, the key problems involved and the algorithms used.
We define the multiple users of such a platform, i.e. business end-users, analysts and other API-driven platforms that need a predictive layer. We cover the requirements and the most important design considerations for each user group, and how balancing all of them makes this such a hard platform architecture problem. Our assumptions here drive the platform architecture and backend algorithms.
Three big problems
What are they? Why are they so tough to do automatically for any data and any business problem?
- Data cleanup
- Feature engineering + filtering
- Insights
Automatic data cleanup:
Road blocks: deciding when the end product of automatic data cleanup is usable, generalising to a consistent set of rules, and making the cleanup comprehensible to the user.
Solutions: figuring out datatypes; identifying junk variables, outliers and redundant variables; creating automatic visualisations; treating variables that are cumbersome for predictive models. All of this is done AUTOMATICALLY.
We show how this is done in Insight Jedi.
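To make the first two cleanup steps concrete, here is a minimal sketch of datatype inference and junk-variable detection on raw string columns. It is an illustration only, not Insight Jedi's implementation: the function names `infer_type` and `is_junk`, the single date format, and the 50% missing-value threshold are all assumptions chosen for the example.

```python
from datetime import datetime


def infer_type(values):
    """Guess a column type from its non-missing string values."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "empty"

    def all_parse(parser):
        # True only if every present value parses without error.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: only ISO-style dates are recognised in this sketch.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def is_junk(values, max_missing=0.5):
    """Flag columns that are mostly missing, constant, or row identifiers."""
    present = [v for v in values if v not in ("", None)]
    if len(present) < (1 - max_missing) * len(values):
        return True  # mostly missing
    distinct = len(set(present))
    if distinct <= 1:
        return True  # constant column carries no information
    # A text column where every value is unique is probably an identifier.
    return infer_type(values) == "text" and distinct == len(present)


columns = {
    "order_id": ["a1", "b2", "c3", "d4"],
    "amount": ["10.5", "3", "", "7.25"],
    "country": ["IN", "IN", "US", "IN"],
    "signup": ["2016-01-02", "2016-02-03", "2016-02-03", "2016-04-30"],
    "source": ["web", "web", "web", "web"],
}
report = {name: (infer_type(vals), is_junk(vals)) for name, vals in columns.items()}
for name, (kind, junk) in report.items():
    print(f"{name:10s} type={kind:8s} junk={junk}")
```

In this toy dataset `order_id` (unique text) and `source` (constant) are flagged as junk, while `amount`, `country` and `signup` are kept. A real system would need many more rules and a way to explain each decision to the user, which is the comprehensibility road block mentioned above.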
Automatic feature engineering:
Road blocks: speed and storage, metadata, and algorithms and architecture for repeated use.
Solutions: Hidden features can be abstracted by mathematical transforms over one or more columns in the dataset. A very large library of mathematical transforms can cover features appropriate for any business context. We give a brief description of the Insight Jedi transforms library (covering dates, data buckets, times, paths, text, geography, weather etc.), which creates 5000 features from just 30 columns.
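The idea of a transforms library applied over one or more columns can be sketched as follows. This is a toy example under stated assumptions, not the Insight Jedi library: `DATE_TRANSFORMS`, `expand`, the `__` naming scheme and the choice of a ratio as the two-column transform are all hypothetical.

```python
from datetime import datetime
from itertools import combinations

# Hypothetical single-column date transforms, keyed by feature suffix.
DATE_TRANSFORMS = {
    "year": lambda d: d.year,
    "month": lambda d: d.month,
    "weekday": lambda d: d.weekday(),  # Monday = 0
    "is_weekend": lambda d: int(d.weekday() >= 5),
}


def expand(rows, date_cols, numeric_cols):
    """Return new rows with engineered features added next to the originals."""
    out = []
    for row in rows:
        new = dict(row)
        # One-column transforms: every date transform on every date column.
        for col in date_cols:
            d = datetime.strptime(row[col], "%Y-%m-%d")
            for name, fn in DATE_TRANSFORMS.items():
                new[f"{col}__{name}"] = fn(d)
        # Two-column transforms: a ratio for every pair of numeric columns.
        for a, b in combinations(numeric_cols, 2):
            new[f"{a}__per__{b}"] = row[a] / row[b] if row[b] else None
        out.append(new)
    return out


rows = [
    {"signup": "2016-04-30", "spend": 120.0, "visits": 4, "items": 6},
    {"signup": "2016-05-02", "spend": 45.0, "visits": 0, "items": 3},
]
features = expand(rows, date_cols=["signup"], numeric_cols=["spend", "visits", "items"])
print(len(rows[0]), "columns in,", len(features[0]), "columns out")
```

Even this tiny library turns 4 columns into 11. Because pairwise transforms grow quadratically with the number of columns, a larger library over 30 columns can plausibly reach thousands of features, which is why speed, storage and metadata appear as road blocks.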
Automatic feature filtering:
Road blocks: scale (5000 features mean more than 10^(250000000) cases to search), stopping criteria, speed.
Solutions: A hybrid method with automatic stopping criteria, based partly on feature relationships alone (i.e. model agnostic and hence fast) and partly on an underlying predictive model. Data structures and metadata are designed strictly for feature filtering.
We show how a dataset of 30 columns is transformed automatically into a dataset of 5000 columns, where the 5000 columns correspond to 5000 engineered features. From there, we show how very few (about 30) hidden features are extracted to build a very accurate predictive model.
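The model-agnostic half of such a hybrid filter can be illustrated with a greedy correlation-based selection. This is a generic textbook-style sketch, not the algorithm used in Insight Jedi: the function names, the use of Pearson correlation, and the thresholds `min_relevance` and `max_redundancy` are assumptions for the example. The model-based half (scoring candidate subsets with a predictive model) is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def filter_features(features, target, min_relevance=0.3, max_redundancy=0.9):
    """Greedy, model-free selection that uses only correlations."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    for name in sorted(relevance, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            break  # stopping criterion: all remaining features are weaker
        redundant = any(
            abs(pearson(features[name], features[kept])) > max_redundancy
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


target = [1, 2, 3, 4, 5, 6]
features = {
    "a": [1, 2, 3, 4, 5, 6],                    # perfectly related to target
    "a_noisy": [1.0, 2.1, 2.9, 4.2, 5.1, 5.8],  # near-duplicate of "a"
    "b": [3, 1, 2, 3, 1, 2],                    # weakly related
    "c": [1, 1, 1, 2, 2, 2],                    # related, but not a duplicate
    "const": [7, 7, 7, 7, 7, 7],                # carries no information
}
selected = filter_features(features, target)
print(selected)
```

Here `a_noisy` is dropped as redundant with `a`, and the loop stops as soon as it reaches `b`, whose relevance is below the threshold. The appeal of this kind of step is that it never trains a model, so it can cheaply shrink thousands of candidates before any model-based search begins.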
Automatic insights:
The difference between a model-based insight and a model-free insight.
Road blocks: defining the concept of an automated insight, building model-free and model-based insights, handling robustness (i.e. consistent results when the data changes slightly), and balancing statistical rigor with business rules.
Solutions: An automated platform where each insight is defined by a generic question, and the answer automatically bundles the required data, statistical tests, and visualisations.
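One way to picture an insight that bundles a generic question with its data, a statistical test and a visualisation is the following sketch of a model-free insight. Everything here is an assumption made for illustration: the `Insight` record, the `compare_groups` question template, the permutation test, and the declarative chart specification are hypothetical and not taken from Insight Jedi.

```python
import random
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Insight:
    question: str
    data: dict        # group name -> list of values backing the answer
    p_value: float    # result of the bundled statistical test
    chart: dict = field(default_factory=dict)  # chart spec, not a rendered plot


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        hits += diff >= observed
    return (hits + 1) / (n_iter + 1)


def compare_groups(metric, groups):
    """Generic question: does `metric` differ between two groups?"""
    (name_a, a), (name_b, b) = groups.items()
    return Insight(
        question=f"Does {metric} differ between {name_a} and {name_b}?",
        data=groups,
        p_value=permutation_p_value(a, b),
        chart={"type": "boxplot", "x": "group", "y": metric},
    )


insight = compare_groups(
    "spend", {"web": [10, 12, 11, 13, 12], "app": [20, 22, 19, 23, 21]}
)
print(insight.question, "p =", round(insight.p_value, 4))
```

This insight is model free because it relies only on the data and a test, with no fitted predictive model. A model-based insight would fill the same kind of record from model outputs instead, for example from the features a fitted model relies on most.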
Swarna is the principal architect of the platform. He has built the automated data pre-processing, feature engineering and insights generation modules. Dipayan is the principal statistician and is responsible for the algorithms in the feature filtering and model building modules. The two of us have built Insight Jedi, which sits at the intersection of maths, statistics, engineering, design, and business.