Understanding supervised machine learning hands on!
Submitted by Harshad Saykhedkar (@harshss) on Monday, 25 May 2015
If you have ever been in a “black box” operating mode where you are throwing more data/complex models at a machine learning problem without a clue about why it is working or not working, this workshop is for you! The workshop will primarily focus on understanding supervised machine learning.
What will participants gain?
- Designing large software systems is the study and practice of making trade-offs (e.g. the CAP theorem, time vs. space complexity, time to build vs. maintainability). The same is true for machine learning applications. This workshop will help you understand clearly what those trade-offs are and how to make them.
- The black-box way of building ML applications (use X because company F/G/H uses it) can only take us so far. The workshop will instead help you understand the core ideas of ML in a clear, intuitive fashion.
- An ML application involves several distinct problems: modelling, information representation, the nature of costs, etc. The workshop will give you the big picture and practical advice on tackling each of them.
- Understand and apply some or all of the following models:
  - simple neighbourhood-based models
  - regression models
  - decision trees / random forests / ensemble methods
  - support vector machines
  - neural networks / (if time permits) basics of deep learning
- Gain a sound understanding of:
  - training, testing, cross-validation and evaluation
  - feature engineering practices for various domains
  - how to debug models and decide next steps
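To give a flavour of what "training, testing, cross-validation" means in practice, here is a minimal sketch in plain Python (no libraries needed). It is not the workshop's code: the function names `nearest_neighbour_predict` and `k_fold_accuracy` and the toy data are made up for illustration. It evaluates a simple neighbourhood-based model (1-nearest-neighbour) with k-fold cross-validation, i.e. every point is held out exactly once and predicted from the remaining points.

```python
import random


def nearest_neighbour_predict(train_X, train_y, x):
    """Predict the label of x as the label of its closest training point."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)),
               key=lambda i: squared_distance(train_X[i], x))
    return train_y[best]


def k_fold_accuracy(X, y, k=5, seed=0):
    """Average held-out accuracy of the 1-nearest-neighbour rule over k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Split the shuffled indices into k disjoint folds of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Score only on points the model has not seen during "training".
        correct = sum(
            nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]
            for i in test_idx
        )
        scores.append(correct / float(len(test_idx)))
    return sum(scores) / len(scores)


# Toy data, made up for illustration: two well separated clusters.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0), (5.3, 5.2)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(k_fold_accuracy(X, y, k=5))
```

On data this cleanly separated the held-out accuracy is perfect. On real data, the gap between accuracy on the training points and accuracy on the held-out points is one of the first things to look at when debugging a model.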
Workshop schedule / plan?
This will be a 4-hour workshop with a short break in the middle. The broad outline is as follows:
- Introduction: 10 minutes
- Core ideas, cost functions, likelihoods, optimization and best fit: 20 minutes
- Information representation, simple representations, linear and generalized linear models: 30 minutes
- Complex and non-linear representations, feature engineering: 20 minutes
- More models (SVMs, tree-based models, neural networks, introduction to deep learning): 50 minutes
- Domain understanding, asymmetric costs, evaluation methods and metrics: 30 minutes
- Trade-offs (model complexity vs. representation complexity, interpretability, cost of gathering data, model selection): 30 minutes
- Summary, big picture, questions and answers: 30 minutes
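As a small preview of the "asymmetric costs" item in the outline, the sketch below shows why plain accuracy can mislead when the two kinds of mistake are not equally expensive. Everything in it is an assumption made for illustration: the labels, the two hypothetical models, and the costs (a missed positive is taken to be ten times as costly as a false alarm). It is not taken from the workshop material.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    counts = confusion_counts(y_true, y_pred)
    return (counts["tp"] + counts["tn"]) / float(len(y_true))


def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    """Total cost of the mistakes; the default costs are illustrative assumptions."""
    counts = confusion_counts(y_true, y_pred)
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Made-up example: 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never flags anything: 2 false negatives
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # catches both positives: 3 false positives

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print("%s accuracy=%.1f cost=%.1f"
          % (name, accuracy(y_true, pred), total_cost(y_true, pred)))
```

Under these assumed costs, the model with the lower accuracy (0.7 against 0.8) is the cheaper one (a cost of 3 against 20), which is why the choice of evaluation metric has to follow from the domain.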
How much machine learning should I know already?
We expect you to know the bare-minimum basics, such as the difference between supervised and unsupervised machine learning. If you know what linear regression is, that should be good enough.
I don’t know Python. Is this workshop for me?
Yes, as long as you know the basics of programming and have written at least some code in any language.
How much programming should I know to attend?
You should know basic programming (loops, conditional expressions, variable assignments, reading files and performing some data manipulation on them).
Why not cover unsupervised learning/semi-supervised learning/some other fancy model X?
We will focus on depth and try to cover a few topics well.
Will the workshop cover Apache Spark/Hadoop/Mahout or X library/ecosystem?
No. This is an ideas/algorithms talk, and libraries will just serve as a means to understanding. Different libraries/ecosystems are likely to be covered in depth by other speakers.
What about the data and the code to be used during the workshop?
We will be using this GitHub repository to share code and data. Please make sure that you clone this repository or download the data folder beforehand. You can also download the data from the UCI ML repository page here.
What will be the choice of libraries and language?
- Python (2.7.x), NumPy, SciPy, scikit-learn and pandas are the required libraries for the workshop.
- SciPy stack (NumPy, SciPy, matplotlib, pandas): installation instructions are given here.
- Scikit-learn: installation instructions are given here. If the website doesn’t work, you should be able to install it through pip (the Python package manager) using pip install scikit-learn
- Gensim installation is optional. Installation instructions are given here.
- The code will be tested with scikit-learn 0.16 and pandas 0.16.1. Make sure that you have the latest versions installed, especially for pandas.
- Alternatively, install pip (the Python package manager) and run pip install numpy, pip install scipy, pip install scikit-learn and pip install pandas. That should get you ready for the workshop.
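Once everything is installed, a quick way to confirm that the required libraries can actually be imported is a small check script like the sketch below. The names `REQUIRED` and `check_modules` are made up for this example; the only library-specific fact it relies on is that scikit-learn is imported under the name `sklearn`. It is written to run on Python 2.7 as well as Python 3.

```python
import importlib

# Import names of the libraries listed as required above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "pandas"]


def check_modules(names):
    """Return {module name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for name, version in sorted(check_modules(REQUIRED).items()):
        print("%-10s %s" % (name, version if version else "MISSING"))
```

Any line that prints MISSING means that library still needs to be installed. Compare the printed versions against the ones mentioned above.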
Can I install some of the dependencies during the workshop?
A big no. The internet connection might be shaky. Also, these libraries are pretty heavy, so it will not be possible to download and install them during the workshop. Make sure that all dependencies are installed beforehand.
Can I use a Windows based machine?
Sure, as long as you get all the dependencies installed before the workshop. Given the time limitations, we won’t be able to offer any installation support during the workshop.
Build dependencies for scikit-learn
Scikit-learn depends on some C libraries. The installation instructions on the page listed above cover the installation of these dependencies very well. Please refer to those.
Harshad leads the machine learning and data team at Sokrati, an advertising technology and analytics company based out of Pune. He has spent 6 years applying statistical models in a variety of domains such as insurance, banking, telecom and advertising. He has experience with many tools in the data ecosystem, including Python, R, Clojure, Hadoop and Spark. He spends time learning the theory and applications of machine learning models, from simple regression to deep learning. Harshad holds a master’s degree from Indian Institute of Technology, Mumbai.