The Fifth Elephant 2015

A conference on data, machine learning, and distributed and parallel computing

Machine Learning, Distributed and Parallel Computing, and High-performance Computing are the themes for this year’s edition of Fifth Elephant.

The deadline for submitting a proposal is 15th June 2015.

We are looking for talks and workshops from academics and practitioners who are in the business of making sense of data, big and small.

Track 1: Discovering Insights and Driving Decisions

This track is about general, novel, fundamental, and advanced techniques for making sense of data and driving decisions from data. This could encompass applications of the following ML paradigms:

  • Statistical Visualizations
  • Unsupervised Learning
  • Supervised Learning
  • Semi-Supervised Learning
  • Active Learning
  • Reinforcement Learning
  • Monte Carlo techniques and probabilistic programming
  • Deep Learning

These paradigms may be applied across various data modalities, including multivariate data, text, speech, time series, images, video, transactions, etc.

Track 2: Speed at Scale

This track is about tools and processes for collecting, indexing, and processing vast amounts of data. The theme includes:

  • Distributed and Parallel Computing
  • Real Time Analytics and Stream Processing
  • MapReduce and Graph Computing frameworks
  • Kafka, Spark, Hadoop, MPI
  • Stories of parallelizing sequential programs
  • Cost/Security/Disaster Management of Data

Commitment to Open Source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source license. If your software is commercially licensed or available under a combination of commercial and restrictive open source licenses (such as the various forms of the GPL), please consider picking up a sponsorship. We recognize that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Workshops

If you are interested in conducting a hands-on session on any of the topics falling under the themes of the two tracks described above, please submit a proposal under the workshops section. We also need you to tell us about your past experience in teaching and/or conducting workshops.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Harshad Saykhedkar

@harshss

Understanding supervised machine learning hands on!

Submitted May 25, 2015

If you have ever been in a “black box” operating mode where you are throwing more data/complex models at a machine learning problem without a clue about why it is working or not working, this workshop is for you! The workshop will primarily focus on understanding supervised machine learning.

Outline

What will participants gain?

Here’s a mind map showing the overall picture of what will be covered in the workshop.

  • The design of large software systems is the study and practice of making trade-offs (e.g. the CAP theorem, time vs. space complexity, time to build vs. maintainability). The same is true for machine learning applications. This workshop will help you clearly understand what those trade-offs are and how to make them.
  • The black-box way of building ML applications (use X because company F/G/H uses it) can only get us so far. The workshop will instead help you understand the core ideas of ML in a clear, intuitive fashion.
  • An ML application involves multiple problems: modelling, information representation, the nature of costs, etc. The workshop will give you the big picture and practical advice on tackling these problems.
  • Understand and apply some or all of the following models:
    • simple neighbourhood based models
    • regression models
    • decision trees / random forests / ensemble methods
    • support vector machines
    • neural networks / (if time permits) basics of deep learning
  • Gain a sound understanding of
    • training, testing, cross validation, evaluation.
    • feature engineering practices for various domains.
    • how to debug models and decide next steps.
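
To make the training / testing / cross-validation idea concrete, here is a minimal sketch that uses only the Python standard library. It is not taken from the workshop material (which relies on scikit-learn); the function names, the 1-nearest-neighbour model and the toy data are assumptions chosen purely for illustration.

```python
import random


def nearest_neighbour_predict(train_rows, train_labels, row):
    """Return the label of the training row closest to `row` (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_index = min(range(len(train_rows)),
                     key=lambda i: squared_distance(train_rows[i], row))
    return train_labels[best_index]


def cross_validate(rows, labels, k=5, seed=0):
    """Return the accuracy measured on each of k held-out folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]
        # The model never sees the held-out rows while "training".
        correct = sum(
            nearest_neighbour_predict(train_rows, train_labels, rows[i]) == labels[i]
            for i in held_out)
        scores.append(correct / float(len(held_out)))
    return scores


# Toy data (made up for illustration): two well-separated clusters in 2-D.
rng = random.Random(42)
rows = ([[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)] +
        [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(20)])
labels = [0] * 20 + [1] * 20

scores = cross_validate(rows, labels, k=5)
print("fold accuracies: {0}".format(scores))
print("mean accuracy: {0}".format(sum(scores) / len(scores)))
```

Each row is scored by a model that was fitted without it, so the mean fold accuracy estimates performance on unseen data rather than on data the model has memorised.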

Workshop schedule / plan?

This will be a 4-hour workshop with a short break in the middle. The broad outline is as follows:

  • Introduction : 10 minutes
  • Core ideas, cost functions, likelihoods, optimizations and best fit : 20 minutes
  • Information representation, simple representations, linear and generalized linear models : 30 minutes
  • Complex and non-linear representations, feature engineering : 20 minutes
  • More models : SVMs, tree based models, neural networks, introduction to deep learning : 50 minutes
  • Domain understanding, asymmetric costs, evaluation methods and metrics : 30 minutes
  • Trade-offs : model complexity vs. representation complexity, interpretability, cost of gathering data, model selection : 30 minutes
  • Summary, big picture, question and answers : 30 minutes
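
As a rough illustration of the "cost functions, optimizations and best fit" item in the outline above, the sketch below fits a straight line by gradient descent on a mean-squared-error cost. It is an assumption about the kind of example such a session might use, not the workshop's own code; the function names, learning rate, step count and data are hypothetical. Under an assumption of Gaussian noise, minimising squared error is equivalent to maximising the likelihood, which is one way the two ideas in that item connect.

```python
def mean_squared_error(slope, intercept, xs, ys):
    """Cost function: average squared gap between prediction and target."""
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / float(len(xs))


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Minimise the mean squared error by plain gradient descent."""
    slope, intercept = 0.0, 0.0
    n = float(len(xs))
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad_slope = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2.0 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1 with no noise (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("slope: {0:.4f}, intercept: {1:.4f}".format(slope, intercept))
print("final cost: {0:.2e}".format(mean_squared_error(slope, intercept, xs, ys)))
```

The learning rate here was picked to suit this small data set; on other data a step that is too large makes the cost diverge, and one that is too small makes convergence slow.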

FAQ

  • How much machine learning should I know already?

    We expect you to know the bare-minimum basics, such as the difference between supervised and unsupervised machine learning models. If you know what linear regression is, that should be good enough.

  • I don’t know Python. Is this workshop for me?

    Yes, as long as you know the basics of programming and have written at least some code in any language.

  • How much programming should I know to attend?

    You should know basic programming (loops, conditional expressions, variable assignments, reading files, performing some data manipulation on them).

  • Why not cover unsupervised learning/semi-supervised learning/some other fancy model X?

    We will focus on depth and try to cover a few topics well.

  • Will the workshop cover Apache Spark/Hadoop/Mahout or X library/ecosystem?

    No. This is an ideas/algorithms talk, and libraries will just serve as a means for understanding. Different libraries/ecosystems are likely to be covered in depth by other speakers.

Requirements

  • What about the data and the code to be used at the time of workshop?

    We will be using this GitHub repository to share code and data. Please make sure that you clone this repository or download the data folder beforehand. You can also download the data from the UCI ML repository page here.

  • What will be the choice of libraries and language?

    • Python (2.7.x), numpy, scipy, scikit-learn and pandas are required for the workshop.
    • SciPy stack (NumPy, SciPy, matplotlib, pandas): installation instructions are given here.
    • Scikit-learn: installation instructions are given here. If the website doesn’t work, you should be able to install it through pip (the Python package manager) using pip install scikit-learn.
    • Gensim installation is optional. Installation instructions are given here.
    • The code will be tested against scikit-learn 0.16 and pandas 0.16.1. Make sure that you have the latest versions installed, especially for pandas.
    • Alternatively, install pip and then run pip install numpy, pip install scipy, pip install scikit-learn and pip install pandas; that should get you ready for the workshop.
  • Can I install some of the dependencies at the time of workshop?

    A big no. Internet connectivity might be shaky, and these libraries are pretty heavy, so it will not be possible to download and install them at the time of the workshop. Make sure that all dependencies are installed beforehand.

  • Can I use a Windows based machine?

    Sure, as long as you get all the dependencies installed before the workshop. Given time limitations, we won’t have any installation support at the time of the workshop.

  • Build dependencies for scikit-learn

    Scikit-learn depends on some C libraries. The installation instructions on the page listed above cover the installation of these dependencies very well. Please refer to those.
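
One way to confirm that the required libraries are in place before the workshop is a small check script along the lines below. This is only a sketch and not part of the workshop repository; the helper name `installed_versions` is hypothetical. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib

# Import names of the required libraries (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "pandas"]


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Most scientific Python packages expose __version__.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


versions = installed_versions(REQUIRED)
for name in REQUIRED:
    print("{0:10s} {1}".format(name, versions[name] or "NOT INSTALLED"))
```

Any line reporting NOT INSTALLED points to a dependency that still needs to be installed before the session.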

Speaker bio

Harshad leads the machine learning and data team at Sokrati, an advertising technology and analytics company based out of Pune. He has spent 6 years applying statistical models in a variety of domains such as insurance, banking, telecom and advertising. He has experience with many tools in the data ecosystem, including Python, R, Clojure, Hadoop and Spark. He spends time learning the theory and applications of machine learning models, from simple regression to deep learning. Harshad holds a master’s degree from the Indian Institute of Technology, Mumbai.

Slides

http://www.slideshare.net/HarshadSaykhedkar/ml-workshop-jul2015mm

