The Fifth Elephant 2014

A conference on big data and analytics


Real world machine learning


Harshad Saykhedkar


We will become familiar with real world machine learning in a hands-on, intuitive way. Rather than treating an algorithm and its results as a black box provided by a library and learning in a cookbook style, we will try to understand the why of the problem. Participants will also come to appreciate the importance of each phase of the machine learning pipeline (data exploration, data cleaning and extraction, modeling, evaluation). By the end of the workshop, participants will:

  • Be familiar with all phases of the ML pipeline.

  • Get a sense of how real business problems can be tackled using machine learning.

  • Get hands-on training with the scikit-learn machine learning library for Python.

  • Understand the why and how of many machine learning algorithms.

  • Understand how to interpret the results produced by machine learning libraries.

  • Have fun on the way!
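The four phases named above (data exploration, data cleaning, modeling, evaluation) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the toy data, the function names, and the choice of a one-variable least-squares line are assumptions made for this example, not material from the workshop. In the session itself a library such as scikit-learn would presumably take the place of the hand-written fit; writing it out by hand here simply keeps the "why" visible.

```python
# Toy records of (ad_spend, clicks). The values are invented for illustration
# and contain two deliberately missing entries.
RAW = [(1.0, 3.1), (2.0, 4.9), (None, 7.0), (3.0, 7.2), (4.0, 8.8),
       (5.0, 11.1), (6.0, None), (7.0, 15.0), (8.0, 17.1)]


def explore(rows):
    """Data exploration: count rows and rows with a missing value."""
    missing = sum(1 for x, y in rows if x is None or y is None)
    return {"rows": len(rows), "rows_with_missing": missing}


def clean(rows):
    """Data cleaning: drop any row that has a missing value."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_line(rows):
    """Modeling: ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(rows, slope, intercept):
    """Evaluation: average squared prediction error on the given rows."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


summary = explore(RAW)
data = clean(RAW)
train, test = data[:5], data[5:]      # hold out the last rows for evaluation
slope, intercept = fit_line(train)
mse = mean_squared_error(test, slope, intercept)

print(summary)
print("slope=%.2f intercept=%.2f held-out MSE=%.4f" % (slope, intercept, mse))
```

Evaluating on rows the model never saw during fitting is the point of the last step: an error measured on the training rows alone would say little about how the line behaves on new data.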

If time permits, we will also discuss topics such as:

  • Trade-offs in choosing a machine learning stack for real world production applications (e.g., when should we worry about big data, what about distributed computation, is X suitable for billions of rows?)

  • How much knowledge of statistics and maths is absolutely essential for ML? What resources can help you on this front?


Machine learning has evolved into a very popular, rapidly changing, and sometimes over-hyped domain with an extremely diverse set of ideas. Discussion about machine learning often gets lost in the jargon of tools, market buzzwords, and libraries, and is diverted from its real purpose, which is insights! This workshop will focus on insights and practical applications.

What background is essential to understand this workshop?

This is an introductory session, and participants are not expected to know 100 machine learning algorithms or have a PhD in Maths. That said, the following are the bare minimum requirements:

  • Reasonable knowledge of at least one programming language (C, C++, Python, Ruby, Java, R, Matlab, Julia and so on).

  • Some familiarity with machine learning (for example, it should be enough to vaguely understand what linear regression is).

  • Curiosity!

Those who are not familiar with Python are advised to go through the tutorial given here. Python is an extremely elegant and simple language; you’d get started in no time!


We will be using the Python scientific stack for this workshop, so installing the following tools is an absolute must.

  1. Python (version 2.7; avoid version 3, as migration of the SciPy tools is still a work in progress). Most Linux installations come with Python 2.7 installed by default.

  2. NumPy and scikit-learn

  3. Pandas

The installation steps are detailed on the respective websites:

Installation of Python

Installation of scikit-learn

Installation of Pandas
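Once the tools are installed, a short script along the following lines can confirm that each one is importable. It is a minimal sketch rather than an official check: the helper name `check_modules` is made up for this example, and the only library-specific knowledge it relies on is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib
import sys

# Import names of the required packages (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "sklearn", "pandas"]


def check_modules(names):
    """Return {module_name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Fall back to "unknown" if the module exposes no __version__.
            found[name] = getattr(module, "__version__", "unknown")
    return found


status = check_modules(REQUIRED)
print("Python %d.%d" % sys.version_info[:2])
for name in REQUIRED:
    print("%-8s %s" % (name, status[name] or "NOT INSTALLED"))
```

Any line reporting NOT INSTALLED points to a package whose installation steps need to be revisited before the workshop.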

A real world dataset to be used for the workshop will be supplied about a week before the actual date.

Speaker bio

Harshad is a senior data scientist at Sokrati, a digital advertising startup based out of Pune, where he works closely with the engineering team to extract meaning out of millions of data points from the advertising world. He has been applying machine learning to real world problems in telecom, banking and advertising for the last 4 years. He has mostly worked with tools like R, Python and SAS, and has lately fallen in love with the Clojure ecosystem too. He conducted a similar session at Fifth Elephant 2013, focussed on R and text mining. Harshad holds a master’s degree in Operations Research from IIT, Mumbai.