Building a large scale fully automatic machine learning platform from scratch

Jul 2016

28 Thu 08:30 AM – 06:25 PM IST
29 Fri 08:30 AM – 06:15 PM IST
30 Sat 08:45 AM – 05:00 PM IST
31 Sun 08:15 AM – 06:00 PM IST

NIMHANS Convention Centre

The Fifth Elephant is India’s most renowned data science conference. It is a space for discussing some of the most cutting-edge developments in machine learning, data science and the technology that powers data collection and analysis.

Machine Learning, Distributed and Parallel Computing, and High-performance Computing continue to be the themes for this year’s edition of Fifth Elephant.

We are now accepting submissions for our next edition, which will take place in Bangalore on 28-29 July 2016.

Tracks

We are looking for application-level and tool-centric talks and tutorials on the following topics:

Deep Learning
Text Mining
Computer Vision
Social Network Analysis
Large-scale Machine Learning (ML)
Internet of Things (IoT)
Computational Biology
ML in healthcare
ML in education
ML in energy and ecology
ML in agriculture
Analytics for emerging markets
ML in e-governance
ML in smart cities
ML in defense

The deadline for submitting proposals is 30th April 2016

Format

This year’s edition spans two days of hands-on workshops and conference. We are inviting proposals for:

Full-length 40-minute talks.
Crisp 15-minute talks.
Sponsored sessions of 15 minutes (limited slots available; subject to editorial scrutiny and approval).
Hands-on workshop sessions of 3 or 6 hours.

Selection process

Proposals will be filtered and shortlisted by an Editorial Panel. We urge you to add links to videos / slide decks when submitting proposals. This will help us understand your past speaking experience. Blurbs or blog posts covering the relevance of a particular problem statement and how it is tackled will help the Editorial Panel better judge your proposals.

We expect you to submit an outline of your proposed talk – either in the form of a mind map or a text document or draft slides within two weeks of submitting your proposal.

We will notify you about the status of your proposal within three weeks of submission.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers. As our budget is limited, we will prefer speakers from locations closer to home, but will do our best to cover for anyone exceptional. HasGeek will provide a grant to cover part of your travel and accommodation in Bangalore. Grants are limited and made available to speakers delivering full sessions (40 minutes or longer).

Commitment to open source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), please consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Key dates and deadlines

Revised paper submission deadline: 17 June 2016
Confirmed talks announcement (in batches): 13 June 2016
Schedule announcement: 30 June 2016
Conference dates: 28-29 July 2016

Venue

The Fifth Elephant will be held at the NIMHANS Convention Centre, Dairy Circle, Bangalore.

Contact

For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Hosted by

The Fifth Elephant

The Fifth Elephant - known as one of the best data science and machine learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Building a large scale fully automatic machine learning platform from scratch

Submitted Apr 30, 2016

Section: Full talk Technical level: Advanced

Data science is hard and expensive, and it needs a combination of math, statistics and software engineering skills. Mass adoption of data science is only possible if self-service machine learning platforms are built. We have built Insight Jedi, the first fully automatic machine learning platform that automates the complete data-to-decisions workflow covering data cleanup, feature generation, feature filtering, model building, insights generation, and predictions and recommendations. In this talk we will show how Insight Jedi was built from scratch. Specifically, we will discuss the design considerations, the key problems involved and the algorithms used.

Outline

Design Considerations

We define the multiple users of such a platform, i.e. business end-users, analysts and other API-driven platforms that need a predictive layer. We cover the requirements and the most important design considerations for each user base, and how balancing all of them makes this such a hard platform architecture problem. Our assumptions here will drive the platform architecture and backend algorithms.

Three big problems

What are they? Why are they so tough to do automatically for any data and any business problem?

Data cleanup
Feature engineering + filtering
Insights

Automatic data cleanup:

Road blocks: deciding when the end product of automatic data cleanup is usable, generalising to a consistent set of rules, making the cleanup comprehensible to the user.

Solutions: Figuring out datatypes, identifying junk variables, outliers and redundant variables, creating automatic visualisations, treating variables that are cumbersome for predictive models - all AUTOMATICALLY.
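As a rough illustration of the cleanup steps listed above, the sketch below profiles a small table held as a dict of string columns. It is not Insight Jedi's implementation: the function names, the 0.95 uniqueness ratio for ID-like columns and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import statistics

MISSING = (None, "")

def infer_type(values):
    """Guess a column type from its non-missing string values."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "empty"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "categorical"
    return "numeric"

def is_junk(values, id_ratio=0.95):
    """Flag constant columns and ID-like text columns (nearly all values unique)."""
    present = [v for v in values if v not in MISSING]
    distinct = len(set(present))
    if distinct <= 1:
        return True
    return infer_type(values) == "categorical" and distinct / len(present) >= id_ratio

def iqr_outliers(numbers, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [x for x in numbers if x < low or x > high]

def profile(table):
    """Build a per-column report: inferred type, junk flag, numeric outliers."""
    report = {}
    for name, values in table.items():
        kind = infer_type(values)
        entry = {"type": kind, "junk": is_junk(values)}
        if kind == "numeric":
            nums = [float(v) for v in values if v not in MISSING]
            entry["outliers"] = iqr_outliers(nums) if len(nums) >= 4 else []
        report[name] = entry
    return report

# Made-up example data.
table = {
    "customer_id": ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
    "country": ["IN"] * 8,
    "age": ["31", "29", "35", "33", "30", "32", "34", "250"],
    "segment": ["x", "y", "x", "y", "x", "y", "x", "y"],
}
report = profile(table)
for column, entry in report.items():
    print(column, entry)
```

Returning a plain report rather than silently altering the data is one way to keep the cleanup comprehensible to the user, which the road blocks above call out.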

Showing how it is done in Insight Jedi.

Automatic feature engineering:

Road Blocks: Speed & storage, metadata, algorithms and architecture for repeated use.

Solutions: Hidden features can be abstracted by mathematical transforms over one or more columns in the dataset. A very large library of mathematical transforms will cover features appropriate for any business context. We give a brief description of the Insight Jedi transforms library (covering dates, data buckets, times, paths, texts, geography, weather, etc.), which creates 5000 features from just 30 columns.
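A transforms library of this kind can be pictured as a registry of small functions applied per column and per pair of columns. The sketch below is a toy version under that assumption; the registry layout, the transform names and the `column__transform` naming scheme are hypothetical and not taken from Insight Jedi.

```python
from datetime import date
from itertools import combinations

# Hypothetical registry: column kind -> {transform name: function of one value}.
UNARY = {
    "date": {
        "weekday": lambda d: d.weekday(),
        "month": lambda d: d.month,
        "is_weekend": lambda d: int(d.weekday() >= 5),
    },
    "number": {
        "square": lambda x: x * x,
        "is_zero": lambda x: int(x == 0),
    },
}

# Hypothetical transforms over pairs of numeric columns.
BINARY_NUMBER = {
    "sum": lambda a, b: a + b,
    "diff": lambda a, b: a - b,
    "product": lambda a, b: a * b,
}

def engineer(rows, kinds):
    """Apply every registered transform to every matching column or column pair.

    rows  -- list of dicts, one per record
    kinds -- dict mapping column name to "date" or "number"
    """
    numeric = sorted(c for c, k in kinds.items() if k == "number")
    engineered = []
    for row in rows:
        feats = {}
        for col, kind in kinds.items():
            for name, fn in UNARY[kind].items():
                feats[f"{col}__{name}"] = fn(row[col])
        for a, b in combinations(numeric, 2):
            for name, fn in BINARY_NUMBER.items():
                feats[f"{a}__{name}__{b}"] = fn(row[a], row[b])
        engineered.append(feats)
    return engineered

# Made-up example record; 30 July 2016 was a Saturday.
rows = [{"order_date": date(2016, 7, 30), "price": 10.0, "qty": 3.0}]
kinds = {"order_date": "date", "price": "number", "qty": "number"}
features = engineer(rows, kinds)[0]
print(len(features), "features from", len(kinds), "columns")
```

Pairwise transforms are what drive the growth: with 30 numeric columns, the three pairwise transforms in this toy registry alone would yield 3 × C(30, 2) = 1305 features.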

Automatic feature filtering:

Road Blocks: Scale (choosing a subset of 5000 features means 2^5000, i.e. more than 10^1505, candidate subsets to search), stopping criteria, speed.
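To put the scale in perspective, the size of the search space can be checked with exact integer arithmetic, assuming a "case" means one possible subset of the engineered features (an assumption about how cases are counted). The function names are illustrative only.

```python
def subset_count_exponent(n_features):
    """Largest e such that 10**e <= 2**n_features (the number of subsets)."""
    return len(str(2 ** n_features)) - 1

def greedy_evaluations(n_features, k_selected):
    """Candidates scored by a greedy forward search that picks k features."""
    return sum(n_features - i for i in range(k_selected))

print(subset_count_exponent(5000))    # exponent of the subset count
print(greedy_evaluations(5000, 30))   # candidates scored by a greedy search
```

Under the same assumption, a greedy forward search that stops after about 30 features scores fewer than 150,000 candidates, which is why a good stopping criterion matters far more than exhaustive search.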

Solutions:
A hybrid method with automatic stopping criteria, based partly on feature relationships alone (i.e. model-agnostic and hence fast) and partly on an underlying predictive model. Data structures and metadata are designed specifically for feature filtering.

We show how a dataset of 30 columns is transformed automatically into a dataset of 5000 columns, where the 5000 columns correspond to 5000 engineered features. From there, we show how very few (about 30) hidden features are extracted to build a very accurate predictive model.
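One way to picture a hybrid filter is as two stages: a cheap pass that looks only at relationships between features, followed by a model-driven pass that stops on its own. The sketch below is a minimal stand-in under that assumption, not the algorithm used in Insight Jedi: it uses Pearson correlation for the first stage and forward stagewise least squares for the second, and the thresholds `max_corr` and `min_gain` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def relationship_filter(features, max_corr=0.95):
    """Model-agnostic stage: drop constant features and near-duplicates."""
    kept = []
    for name, col in features.items():
        if len(set(col)) <= 1:
            continue
        if any(abs(pearson(col, features[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept

def forward_select(features, target, candidates, min_gain=0.01):
    """Model-based stage: add the feature that best explains the residual,
    and stop once the gain falls below min_gain of the total variation."""
    mean_y = sum(target) / len(target)
    residual = [y - mean_y for y in target]
    total = sum(r * r for r in residual)
    chosen, remaining = [], list(candidates)
    while remaining and total > 0:
        rss = sum(r * r for r in residual)
        best = max(remaining, key=lambda c: abs(pearson(features[c], residual)))
        col = features[best]
        mean_x = sum(col) / len(col)
        centred = [v - mean_x for v in col]
        denom = sum(c * c for c in centred)
        if denom == 0:
            break
        slope = sum(c * r for c, r in zip(centred, residual)) / denom
        new_residual = [r - slope * c for r, c in zip(residual, centred)]
        new_rss = sum(r * r for r in new_residual)
        if (rss - new_rss) / total < min_gain:
            break  # automatic stopping criterion
        chosen.append(best)
        remaining.remove(best)
        residual = new_residual
    return chosen

# Made-up example: the target depends on x1 and x2 only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
features = {
    "x1": x1,
    "x1_doubled": [2 * v for v in x1],   # redundant copy of x1
    "const": [5] * 8,                    # junk: constant
    "x2": x2,
    "noise": [3, 1, 4, 1, 5, 9, 2, 6],
}
target = [3 * a + 2 * b for a, b in zip(x1, x2)]
survivors = relationship_filter(features)
selected = forward_select(features, target, survivors)
print(survivors, selected)
```

The first stage never touches the target or a model, so it can run over thousands of columns cheaply; the model-based stage then only has to search among the survivors.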

Automatic Insights:

The difference between a model-based insight and a model-free insight.

Road Blocks: Defining the concept of an automated insight, building model-free and model-based insights, handling robustness (i.e. consistent results when the data changes slightly), balancing statistical rigor with business rules.

Solutions:
An automated platform where each insight is defined by a generic question, and the answer automatically bundles the required data, statistical tests and visualisations.
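The idea of an insight as a generic question whose answer bundles data, a statistical test and a visualisation can be sketched as a small record type. Everything below is illustrative rather than Insight Jedi's design: the `Insight` fields, the chart spec format and the choice of a permutation test for a model-free "do two groups differ?" question are assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Insight:
    question: str   # the generic question, filled in for this dataset
    data: dict      # group label -> values used to answer it
    test: dict      # test name, statistic and p-value
    chart: dict     # renderer-agnostic chart spec (hypothetical format)

def mean(values):
    return sum(values) / len(values)

def group_difference_insight(metric, groups, n_perm=999, seed=0):
    """Model-free insight: does `metric` differ between exactly two groups?

    Uses a two-sided permutation test on the difference of means.
    """
    (label_a, a), (label_b, b) = groups.items()
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return Insight(
        question=f"Does {metric} differ between {label_a} and {label_b}?",
        data=dict(groups),
        test={"name": "permutation test (difference of means)",
              "statistic": observed, "p_value": p_value},
        chart={"kind": "bar", "x": [label_a, label_b], "y": [mean(a), mean(b)]},
    )

# Made-up example data.
insight = group_difference_insight(
    "order value",
    {"weekend": [10, 11, 12, 13, 14, 15], "weekday": [1, 2, 3, 4, 5, 6]},
)
print(insight.question)
print(insight.test)
```

Fixing the random seed is a small nod to the robustness concern raised above: the same data should produce the same answer every time the insight is generated.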

Speaker bio

Swarna is the principal architect of the platform. He has built the automated data pre-processing, feature engineering and insights generation modules. Dipayan is the principal statistician and is responsible for the algorithms in feature filtering and model building. The two of us have built Insight Jedi, which sits at the intersection of maths, statistics, engineering, design, and business.