Understanding supervised machine learning hands on!
Submitted by Harshad Saykhedkar (@harshss) on Monday, 25 May 2015
Abstract
If you have ever been in a “black box” operating mode where you are throwing more data/complex models at a machine learning problem without a clue about why it is working or not working, this workshop is for you! The workshop will primarily focus on understanding supervised machine learning.
Outline
What will participants gain?
Here’s a mind map showing the overall picture of what will be covered in the workshop.
 Designing large software systems is the study and practice of making tradeoffs (e.g. CAP theorem, time vs. space complexity, time to build vs. maintainability). The same is true for machine learning applications. This workshop will help you understand what those tradeoffs are and how to make them.
 The black-box way of building ML applications (use X because company F/G/H uses it) can only get us so far. The workshop will instead help you understand the core ideas of ML in a clear, intuitive fashion.
 There are multiple problems in an ML application: modelling, information representation, nature of costs, etc. The workshop will give you the big picture and practical advice on tackling them.
 Understand and apply some or all of the following models:
 simple neighbourhood based models
 regression models
 decision trees / random forests / ensemble methods
 support vector machines
 neural networks / (if time permits) basics of deep learning
 Gain a sound understanding of:
 training, testing, cross validation, evaluation.
 feature engineering practices for various domains.
 how to debug models and decide next steps.
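To make the ideas of training, testing and cross validation concrete, here is a minimal sketch that scores a simple neighbourhood based model (1-nearest-neighbour) with k-fold cross validation. It is not taken from the workshop material: the function names, the toy data and the choice of 5 folds are all assumptions made for illustration, and it uses only the Python standard library.

```python
import random


def nearest_neighbour_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point`."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(range(len(train_x)),
               key=lambda i: squared_distance(train_x[i], point))
    return train_y[best]


def k_fold_accuracy(xs, ys, k=5, seed=0):
    """Estimate accuracy by k-fold cross validation (hypothetical helper)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        # Train on everything except the held-out fold.
        train_x = [xs[i] for i in indices if i not in held_out]
        train_y = [ys[i] for i in indices if i not in held_out]
        correct = sum(
            nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i]
            for i in fold
        )
        scores.append(correct / float(len(fold)))
    return sum(scores) / len(scores)


# Toy data (an assumption): two well-separated clusters in 2D.
rng = random.Random(42)
xs = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
xs += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
ys = [0] * 30 + [1] * 30

cv_accuracy = k_fold_accuracy(xs, ys, k=5)
print(cv_accuracy)
```

Each point is used for testing exactly once and is never part of the training set that predicts it, which is what separates a cross-validated score from accuracy measured on the training data.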
Workshop schedule / plan?
This will be a 4-hour workshop with a short break in the middle. The broad outline is as follows:
 Introduction : 10 minutes
 Core ideas, cost functions, likelihoods, optimizations and best fit : 20 minutes
 Information representation, simple representations, linear and generalized linear models : 30 minutes
 Complex and nonlinear representations, feature engineering : 20 minutes
 More models : SVMs, tree based models, neural networks, introduction to deep learning : 50 minutes
 Domain understanding, asymmetric costs, evaluation methods and metrics : 30 minutes
 Tradeoffs : model complexity vs. representation complexity, interpretability, cost of gathering data, model selection : 30 minutes
 Summary, big picture, questions and answers : 30 minutes
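As a small illustration of the "asymmetric costs" item in the schedule, the sketch below compares two hypothetical models that have the same accuracy but very different total cost once a missed positive is priced higher than a false alarm. The labels, the predictions and the 1:10 cost ratio are invented for this example and do not come from the workshop data.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 is the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(y_true, y_pred):
    c = confusion_counts(y_true, y_pred)
    return (c["tp"] + c["tn"]) / float(len(y_true))


def total_cost(y_true, y_pred, fp_cost=1.0, fn_cost=10.0):
    """Weigh errors by an assumed cost: a miss is 10x a false alarm."""
    c = confusion_counts(y_true, y_pred)
    return c["fp"] * fp_cost + c["fn"] * fn_cost


y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # misses one positive
model_b = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # raises one false alarm

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(y_true, preds), total_cost(y_true, preds))
```

Both models make exactly one mistake, so accuracy cannot tell them apart. The cost-weighted view can, which is why the choice of evaluation metric depends on the domain.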
FAQ

How much machine learning should I know already?
We expect you to know the bare minimum, such as the difference between supervised and unsupervised machine learning. If you know what linear regression is, that should be good enough.

I don’t know Python. Is this workshop for me?
Yes, as long as you know the basics of programming and have written at least some code in any language.

How much programming should I know to attend?
You should know basic programming (loops, conditional expressions, variable assignments, reading files, performing some data manipulation on them).

Why not cover unsupervised learning/semisupervised learning/some other fancy model X?
We will focus on depth and try to cover a few topics well.

Will the workshop cover Apache Spark/Hadoop/Mahout or X library/ecosystem?
No. This is an ideas/algorithms talk, and libraries will just serve as a means for understanding. Different libraries/ecosystems are likely to be covered in depth by other speakers.
Requirements

What about the data and the code to be used at the time of workshop?
We will be using this GitHub repository to share code and data. Please make sure that you clone this repository or download the data folder beforehand. You can also download the data from the UCI ML repository page here.

What will be the choice of libraries and language?
 Python (2.7.x), NumPy, SciPy, scikit-learn and pandas are the required libraries for the workshop.
 SciPy stack (NumPy, SciPy, matplotlib, pandas): installation instructions are given here.
 scikit-learn: installation instructions are given here. If the website doesn’t work, you should be able to install it through pip (the Python package manager) using pip install scikit-learn
 Gensim installation is optional. Installation instructions are given here.
 The code will be tested against scikit-learn 0.16 and pandas 0.16.1. Make sure that you have the latest versions installed, especially for pandas.
 Once pip (the Python package manager) is installed, running pip install numpy, pip install scipy, pip install scikit-learn and pip install pandas should get you ready for the workshop.
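One way to confirm the setup before the workshop is a short script that tries to import each library and prints its version. This is only a sketch and is not part of the workshop repository. The grouping into required and optional follows the list above, with matplotlib treated as optional as an assumption. Note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Import names, which can differ from pip package names
# (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "pandas"]
OPTIONAL = ["matplotlib", "gensim"]  # assumption: treated as optional here


def check_libraries(names):
    """Map each module name to its version string, or None if missing."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report


report = check_libraries(REQUIRED + OPTIONAL)
for name in REQUIRED + OPTIONAL:
    version = report[name]
    status = version if version is not None else "NOT INSTALLED"
    print("%-12s %s" % (name, status))
```

If any required library shows up as NOT INSTALLED, install it with pip and run the script again.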

Can I install some of the dependencies at the time of workshop?
A big no. Internet connectivity might be shaky. Also, these libraries are pretty heavy, so it will not be possible to download and install them during the workshop. Make sure that all dependencies are installed beforehand.

Can I use a Windows based machine?
Sure, as long as you get all the dependencies installed before the workshop. Given time limitations, we won’t have any installation support at the time of the workshop.

Build dependencies for scikit-learn
scikit-learn depends on some C libraries. The installation instructions on the page listed above cover the installation of these dependencies very well. Please refer to those.
Speaker bio
Harshad leads the machine learning and data team at Sokrati, an advertising technology and analytics company based out of Pune. He has spent 6 years applying statistical models in a variety of domains such as insurance, banking, telecom and advertising. He has experience with many tools in the data ecosystem, including Python, R, Clojure, Hadoop and Spark. He spends time learning the theory and applications of machine learning models, from simple regression to deep learning. Harshad holds a master’s degree from the Indian Institute of Technology, Mumbai.
Links
 Link to a similar but more breadth-focused workshop from Fifth Elephant, 2014: https://hasgeek.tv/fifthelephant/2014workshops/959realworldmachinelearning
 Link to a workshop from the run-up event held in Mumbai in 2014: https://hasgeek.tv/fifthelephant/2014machinelearningworkshop
Do we need to have a working Xcode for this session?