Understanding supervised machine learning hands on!

Jul 2015

16 Thu 08:30 AM – 06:35 PM IST

17 Fri 08:30 AM – 06:30 PM IST

18 Sat 09:00 AM – 06:30 PM IST

NIMHANS Convention center

Track 1: Discovering Insights and Driving Decisions

This track is about general, novel, fundamental, and advanced techniques for making sense of data and driving decisions from data. This could encompass applications of the following ML paradigms:

Statistical Visualizations

Unsupervised Learning

Supervised Learning

Semi-Supervised Learning

Active Learning

Reinforcement Learning

Monte Carlo techniques and probabilistic programming

Deep Learning

Across various data modalities including multi-variate, text, speech, time series, images, video, transactions, etc.

Track 2: Speed at Scale

This track is about tools and processes for collecting, indexing, and processing vast amounts of data. The theme includes:

Distributed and Parallel Computing

Real Time Analytics and Stream Processing

MapReduce and Graph Computing frameworks

Kafka, Spark, Hadoop, MPI

Stories of parallelizing sequential programs

Cost/Security/Disaster Management of Data

Commitment to Open Source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source license. If your software is commercially licensed or available under a combination of commercial and restrictive open source licenses (such as the various forms of the GPL), please consider picking up a sponsorship. We recognize that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Understanding supervised machine learning hands on!

Submitted May 25, 2015

Section: Workshop

Technical level: Beginner

If you have ever been in a “black box” operating mode where you are throwing more data/complex models at a machine learning problem without a clue about why it is working or not working, this workshop is for you! The workshop will primarily focus on understanding supervised machine learning.

Outline

What will participants gain?

Here’s a mind map showing the overall picture of what will be covered in the workshop.

Designing large software systems is the study and practice of making trade-offs (e.g. the CAP theorem, time vs. space complexity, time to build vs. maintainability). The same is true for machine learning applications. This workshop will help you clearly understand what those trade-offs are and how to make them.
The black-box way of building ML applications (use X because company F/G/H uses it) can only get us so far. The workshop will instead help you understand the core ideas of ML in a clear, intuitive fashion.
There are multiple problems in an ML application: modelling, information representation, the nature of costs, etc. The workshop will give you the big picture and practical advice on tackling these problems.
Understand and apply some or all of the following models:
- simple neighbourhood based models
- regression models
- decision trees / random forests / ensemble methods
- support vector machines
- neural networks / (if time permits) basics of deep learning
Gain a sound understanding of:
- training, testing, cross validation, evaluation.
- feature engineering practices for various domains.
- how to debug models and decide next steps.
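
To make the ideas of training, testing and cross validation concrete, here is a minimal sketch in plain Python of a 1-nearest-neighbour classifier (one of the simple neighbourhood based models) scored with k-fold cross validation. It is not taken from the workshop material: the function names and the toy data are made up for illustration, and in practice a library such as scikit-learn would do this work.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_1nn(train_X, train_y, point):
    """Return the label of the training point closest to `point`."""
    best = min(range(len(train_X)),
               key=lambda i: squared_distance(train_X[i], point))
    return train_y[best]


def k_fold_accuracy(X, y, k=5, seed=0):
    """Estimate the accuracy of 1-NN by k-fold cross validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Split the shuffled indices into k roughly equal held-out folds.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on everything that is not in the held-out fold.
        train = [i for i in indices if i not in held_out_set]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i]
                      for i in held_out)
        scores.append(correct / float(len(held_out)))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): two well-separated clusters.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0), (5.3, 5.2)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = k_fold_accuracy(X, y, k=5)
print("mean cross-validated accuracy: %.2f" % accuracy)
```

Each point is predicted only by a model that never saw it during training, which is what makes the averaged score an honest estimate of how the model behaves on new data.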

Workshop schedule / plan?

This will be a 4-hour workshop with a short break in the middle. The broad outline is as follows:

Introduction: 10 minutes
Core ideas, cost functions, likelihoods, optimization and best fit: 20 minutes
Information representation, simple representations, linear and generalized linear models: 30 minutes
Complex and non-linear representations, feature engineering: 20 minutes
More models (SVMs, tree-based models, neural networks, introduction to deep learning): 50 minutes
Domain understanding, asymmetric costs, evaluation methods and metrics: 30 minutes
Trade-offs (model complexity vs. representation complexity, interpretability, cost of gathering data, model selection): 30 minutes
Summary, big picture, questions and answers: 30 minutes
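
As a small illustration of the "cost functions, optimization and best fit" item, the sketch below defines a mean squared error cost for a one-feature linear model and computes the line that minimises it in closed form. It is an assumption about the flavour of the session, not the workshop's own code; the function names and the data are made up.

```python
def squared_error_cost(xs, ys, slope, intercept):
    """Mean squared error of the line y = slope * x + intercept on the data."""
    n = float(len(xs))
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n


def fit_line(xs, ys):
    """Closed-form least-squares estimates for a one-feature linear model."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
best_cost = squared_error_cost(xs, ys, slope, intercept)
worse_cost = squared_error_cost(xs, ys, 1.5, 1.0)

print("slope=%.3f intercept=%.3f" % (slope, intercept))
print("cost at best fit: %.6f" % best_cost)
print("cost of a worse line: %.6f" % worse_cost)
```

On this data the fit recovers slope 2 and intercept 1 with zero cost, while any other line, such as the one with slope 1.5, has a strictly higher cost. "Best fit" here simply means the parameters that minimise the chosen cost function.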

FAQ

How much machine learning should I know already?

We expect you to know the bare minimum, such as the difference between supervised and unsupervised machine learning. If you know what linear regression is, that should be good enough.
I don’t know Python. Is this workshop for me?

Yes, as long as you know the basics of programming and have written at least some code in any language.
How much programming should I know to attend?

You should know basic programming (loops, conditional expressions, variable assignments, reading files, performing some data manipulation on them).
Why not cover unsupervised learning/semi-supervised learning/some other fancy model X?

We will focus on depth and try to cover a few topics well.
Will the workshop cover Apache Spark/Hadoop/Mahout or X library/ecosystem?

No. This is an ideas/algorithms talk, and libraries will just serve as a means for understanding. Different libraries/ecosystems are likely to be covered in depth by other speakers.

Requirements

What about the data and the code to be used at the time of workshop?

We will be using this GitHub repository to share code and data. Please make sure that you clone this repository or download the data folder beforehand. You can also download the data from the UCI ML repository page here.
What will be the choice of libraries and language?
- Python (2.7.x), NumPy, SciPy, scikit-learn and pandas are the required libraries for the workshop.
- SciPy stack (NumPy, SciPy, matplotlib, pandas): installation instructions are given here.
- scikit-learn: installation instructions are given here. If the website doesn’t work, you should be able to install it through pip (the Python package manager) using pip install scikit-learn.
- Gensim installation is optional. Installation instructions are given here.
- The code will be tested against scikit-learn 0.16 and pandas 0.16.1. Make sure that you have the latest versions installed, especially for pandas.
- Alternatively, install pip (the Python package manager) and then run pip install numpy, pip install scipy, pip install scikit-learn and pip install pandas to get ready for the workshop.
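
One way to confirm that the libraries listed above are in place before the workshop is a short check script like the sketch below. It is not part of the workshop repository; the function names are made up for illustration. It assumes the usual import names, with scikit-learn imported as `sklearn`, and treats gensim as optional.

```python
import importlib

# Import names of the libraries listed above; gensim is optional.
REQUIRED = ["numpy", "scipy", "sklearn", "pandas"]
OPTIONAL = ["gensim"]


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A missing package raises ImportError; a broken install may raise
        # something else. Either way the library is not usable.
        return None
    return getattr(module, "__version__", "unknown")


def check_environment(required=REQUIRED, optional=OPTIONAL):
    """Map each library name to its version (None means not usable)."""
    names = list(required) + list(optional)
    return {name: installed_version(name) for name in names}


report = check_environment()
for name in REQUIRED + OPTIONAL:
    print("%-8s %s" % (name, report[name] or "NOT INSTALLED"))

missing = [name for name in REQUIRED if report[name] is None]
if missing:
    print("Install before the workshop: " + ", ".join(missing))
```

Running it once at home and comparing the printed versions with the ones mentioned above avoids surprises on the day.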
Can I install some of the dependencies at the time of workshop?

A big no. Internet connectivity might be shaky, and these libraries are pretty heavy, so it will not be possible to download and install them at the time of the workshop. Make sure that all dependencies are installed beforehand.
Can I use a Windows based machine?

Sure, as long as you get all the dependencies installed before the workshop. Given time limitations, we won’t have any installation support at the time of workshop.
Build dependencies for scikit-learn

Scikit-learn depends on some C libraries. The installation instructions on the page listed above cover the installation of these dependencies very well. Please refer to those.

Speaker bio

Harshad leads the machine learning and data team at Sokrati, an advertising technology and analytics company based out of Pune. He has spent 6 years applying statistical models in a variety of domains such as insurance, banking, telecom and advertising. He has experience with many tools in the data ecosystem, including Python, R, Clojure, Hadoop and Spark. He spends time learning the theory and applications of machine learning models, from simple regression to deep learning. Harshad holds a master’s degree from Indian Institute of Technology, Mumbai.

Slides

http://www.slideshare.net/HarshadSaykhedkar/ml-workshop-jul2015mm

The Fifth Elephant 2015
