The Fifth Elephant 2018

The seventh edition of India's best data conference

Ankit Mahato

@ankitmahato

Machine Learning using Orange - It's Fruitful and Fun!

Submitted Mar 30, 2018

IPython/Jupyter notebooks are widely used for data analysis in the data science community. This notebook style of programming follows the imperative paradigm, in which code runs as a linear sequence of steps. Over the past decade, the visual programming paradigm has gained a lot of popularity because it is user-centric and driven by data streams.
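
To make the contrast concrete, here is a minimal sketch of the dataflow idea in plain Python. It is not Orange's API: the `Node` class and the node names are hypothetical, chosen only to show how data pushed into one node streams through every node connected downstream, much like signals passing between widgets on a canvas.

```python
class Node:
    """A hypothetical pipeline node: wraps a function and forwards its output."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []  # nodes that receive this node's output
        self.output = None    # last value this node produced

    def connect(self, other):
        """Link this node's output to another node's input."""
        self.downstream.append(other)
        return other

    def send(self, value):
        """Process an incoming value and push the result to connected nodes."""
        self.output = self.func(value)
        for node in self.downstream:
            node.send(self.output)


# Build the graph first; nothing runs until data arrives.
load = Node("load", lambda rows: list(rows))
clean = Node("clean", lambda rows: [r for r in rows if r is not None])
mean = Node("mean", lambda rows: sum(rows) / len(rows))
count = Node("count", len)

load.connect(clean)
clean.connect(mean)   # one output can feed several consumers
clean.connect(count)

# Pushing data into the source node updates the whole graph.
load.send([4.0, None, 6.0, 8.0])
print(mean.output, count.output)
```

Sending a new list to `load` recomputes every downstream node, which is the behaviour a visual workflow gives you when an upstream widget changes.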

In this workshop, we will visually uncover the various aspects of an analytics pipeline using Orange 3, a Python-based open source workbench for interactive data analysis, machine learning and data visualization. Its simple “drag-and-drop” workflow design interface makes it ideal for novices, while its modular design, extensibility and Python integration make it powerful for advanced users.

The workshop will take a case-based approach and explore real-life problems and datasets. We will begin by building a basic analytics pipeline using built-in Orange widgets, and then evolve it into a complex pipeline covering advanced topics such as advanced data preparation, unsupervised and supervised learning, GPU computing (PyCUDA), in-database analytics, using external ML toolkits, exporting developed models for scoring, and custom widget development. For these advanced topics, the audience will be introduced to the GUI and computational concepts involved in developing add-on (custom-built) widgets for Orange.

Outline

Using real-life analytics use cases (Internet of Things, Sales Forecasting, Sentiment Analysis), this workshop will provide hands-on experience of the various aspects of a data analytics pipeline:

  • Data Access (files and external data sources)
  • Data Exploration
  • Data Transformation/Filtering
  • Data Analysis
  • Model development using supervised/unsupervised machine learning algorithms (in-built, scikit-learn, in-database, nltk)
  • Basic and advanced Visualization (in-built, matplotlib)
  • Exporting developed model (PMML, PFA)
  • Champion/Challenger model experiments
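
As an illustration of the last item, here is a minimal champion/challenger sketch in plain Python. It makes several assumptions that are not from the workshop material: the data is synthetic, the champion is a mean-value baseline, the challenger is a least-squares line, and the helper names (`fit_mean`, `fit_line`, `mae`, `champion_challenger`) are hypothetical. In Orange the same comparison would be done with widgets rather than code.

```python
def fit_mean(xs, ys):
    """Champion (assumed): always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Challenger (assumed): ordinary least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mae(model, xs, ys):
    """Mean absolute error of a fitted model on held-out data."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def champion_challenger(train, test):
    """Fit both models on the same training data, score both on the same test data."""
    champion = fit_mean(*train)
    challenger = fit_line(*train)
    scores = {"champion": mae(champion, *test), "challenger": mae(challenger, *test)}
    winner = min(scores, key=scores.get)  # lower error wins
    return winner, scores


# Synthetic data: a linear trend with a small alternating offset.
xs = list(range(10))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]
train = (xs[:7], ys[:7])
test = (xs[7:], ys[7:])

winner, scores = champion_challenger(train, test)
print(winner, scores)
```

The key design point is that both models are scored on the same held-out data with the same metric, so the challenger replaces the champion only when it measurably does better.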

Workshop Breakup (3 Hrs)-

  • Introduction to Dataflow Programming & Orange - 40 mins
  • Exercise 1 (Data Access, Cleanup & Exploration) - 20 mins
  • Exercise 2 (IoT) - 30 mins
  • Exercise 3 (Supervised Learning) - 30 mins
  • Exercise 4 (Forecasting) - 30 mins
  • Exercise 5 (Text Mining) - 30 mins

These timings are tentative. Depending on the interest of the audience, the workshop can go deeper into some topics, such as supervised learning, advanced visualization or GPU computing.

Number of workshop attendees -
I have experience conducting a similar workshop at SciPy India 2017 with approximately 80 attendees.
More attendees, more enthusiasm from my side. :)

Requirements

Knowledge prerequisites:

This workshop has no prerequisites, but basic Python programming knowledge (writing simple functions and classes) will be helpful for the widget development section.

Software prerequisites:

Install Python 3.5 or 3.6
Install the following packages:
pip install Orange3 matplotlib Orange3-Text twython PyQt5

Make sure Orange Canvas is up and running:
python -m Orange.canvas
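
If the canvas does not start, a quick check like the following sketch can show which package is missing. It uses only the standard library; the mapping from pip package names to import names is an assumption based on how these packages are usually imported, and `check_packages` is a hypothetical helper, not part of Orange.

```python
import importlib.util

# Assumed import names for the packages in the pip command above.
REQUIRED = {
    "Orange3": "Orange",
    "matplotlib": "matplotlib",
    "Orange3-Text": "orangecontrib.text",
    "twython": "twython",
    "PyQt5": "PyQt5",
}


def check_packages(modules=REQUIRED):
    """Return {package name: True/False} depending on whether its module is importable."""
    status = {}
    for package, module in modules.items():
        try:
            status[package] = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. orangecontrib) is absent
            # or a module has no usable spec.
            status[package] = False
    return status


if __name__ == "__main__":
    for package, ok in check_packages().items():
        print(f"{package:<14}{'OK' if ok else 'MISSING'}")
```

Any package reported as MISSING can be reinstalled with pip before the workshop.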

Optional Setup:
pycuda - Python library for GPU computing.
This requires installing the CUDA toolkit and, on Windows, the Microsoft Visual C++ 2015 Build Tools.

Speaker bio

Ankit is a Product Manager with 4 years of industrial experience in machine learning, quantitative modelling, data analytics and visualization. Over the years, he has developed an expertise in handling the entire data analytics pipeline comprising – ingestion, exploration, transformation, modeling and deployment. He is a polyglot programmer with an extensive knowledge of algorithms, statistics and parallel programming. He has shipped multiple releases of DB Lytix, a comprehensive library of over 800 mathematical and statistical functions used widely in data mining, machine learning and analytics applications, including “big data analytics”.

A die-hard Pythonista, Ankit is an open source contributor and a former Google Summer of Code 2013 scholar under the Python Software Foundation (Link 1). Currently, he is contributing to the following open source projects:

  1. opendatagroup/hadrian - Implementations of the Portable Format for Analytics (PFA) (Link 2)
  2. Fuzzy-Logix/AdapteR - Advanced analytics package that enables R users to perform in-database analytics (Link 3)

An IIT Kanpur alumnus, Ankit is also an active researcher with publications in international journals and conferences. He is actively working in the domain of IoT Analytics and recently presented his work, “In-database Analytics in the Age of Smart Meters”, at the 5th IIMA International Conference on Advanced Data Analysis, Business Analytics and Intelligence, 2017. He also presented his paper, “Smart Meter Data Analytics using Orange”, at SciPy India 2017, Mumbai.

Previous Workshop Experience:

  • Scientific Computing using Orange in SciPy India 2017, Mumbai.
  • Making Machine Learning Fruitful and Fun using Orange in PyCon India 2017, New Delhi.
  • High Performance Computing, IIT Kanpur, 2013.

LinkedIn - https://www.linkedin.com/in/ankitmahato

Link 1 - https://www.google-melange.com/archive/gsoc/2013/orgs/python/projects/ankitmahato.html
Link 2 - https://github.com/opendatagroup/hadrian
Link 3 - https://github.com/Fuzzy-Logix/AdapteR

Slides

https://www.slideshare.net/ankitmahato/machine-learning-using-orange-its-fruitful-and-fun/ankitmahato/machine-learning-using-orange-its-fruitful-and-fun
