By 2014, infrastructure components such as Hadoop, the Berkeley Data Stack and other commercial tools have stabilized and are thriving. The challenges have moved higher up the stack, from data collection and storage to data analysis and its presentation to users. The focus for this year’s conference is analytics: the infrastructure that powers analytics and how analytics is done.
Talks will cover various forms of analytics, including real-time and opportunity analytics, and the technologies and models used for analyzing data.
Proposals will be reviewed against six criteria:
- Domain diversity – proposals will be selected from different domains: medical, insurance, banking, online transactions, retail. If there is more than one proposal from a domain, the one that best meets the editorial criteria will be chosen.
- Novelty – what has been done beyond the obvious.
- Insights – what insights does the proposal share with the audience that they did not already know?
- Practical versus theoretical – we are looking for applied knowledge. If the proposal covers material that can be looked up online, it will not be considered.
- Conceptual versus tools-centric – tell us why, not how. Tell the audience the philosophy underlying your use of an application, not how the application was used.
- Presentation skills – the proposer’s presentation skills will be reviewed carefully, and assistance provided to ensure that the material is communicated to the audience as precisely and effectively as possible.
For queries about proposals / submissions, write to email@example.com
Data Collection and Transport – e.g. Opendatatoolkit, Scribe, Kafka, RabbitMQ, etc.
Data Storage, Caching and Management – distributed storage (such as Gluster, HDFS), hardware-specific storage (such as SSDs or memory), databases (PostgreSQL, MySQL, Infobright) or caching/storage (Memcache, Cassandra, Redis, etc).
Data Processing, Querying and Analysis – Oozie, Azkaban, scikit-learn, Mahout, Impala, Hive, Tez, etc.
Big data and security
Big data and internet of things
Data Usage and BI (Business Intelligence) in different sectors.
Please note: the technology stacks mentioned above indicate the latest technologies that will be of interest to the community. Talks should not be about these technologies per se, but about how they have been used and implemented in various sectors, enterprises and contexts.
Real world machine learning
We will become familiar with real-world machine learning in a hands-on, intuitive way. Rather than taking an algorithm and its results as a black box provided by a library and learning in cookbook style, we will try to understand the why of the problem. Participants will also come to appreciate the importance of each phase of the machine learning pipeline (data exploration, data cleaning and extraction, modeling, evaluation). By the end of the workshop, participants will:
Be familiar with all phases of the ML pipeline.
Get a sense of how real business problems can be tackled using machine learning.
Get hands-on training with the scikit-learn machine learning library for Python.
Understand the why and how of many machine learning algorithms.
Understand how to interpret the results produced by machine learning libraries.
Have fun along the way!
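To give a taste of what these pipeline phases look like in practice, here is a minimal sketch using scikit-learn’s bundled iris dataset as a stand-in for the workshop data (the dataset choice and API calls are illustrative, reflecting a current scikit-learn release):

```python
# Minimal sketch of the ML pipeline phases with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data exploration: load the data and inspect its shape
X, y = load_iris(return_X_y=True)

# 2. Cleaning/extraction: hold out a test set and scale the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)

# 3. Modeling: fit a simple classifier on the training split
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# 4. Evaluation: measure accuracy on the held-out test split
acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(acc)
```

The point of the workshop is precisely the reasoning behind each of these four steps, not the six lines of library calls themselves.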
If time permits, we will also discuss topics like:
Trade-offs in choosing a machine learning stack for real-world production applications (e.g. when should we worry about big data, what about distributed computation, is X suitable for billions of rows?)
How much knowledge of statistics and maths is absolutely essential for ML? What resources can help you on this front?
Machine learning has evolved into a very popular, rapidly changing, sometimes over-hyped domain with an extremely diverse set of ideas. Discussion about machine learning often gets lost in the jargon of tools, market buzzwords and libraries, and diverted from its real purpose, which is insight! This workshop will focus on insights and practical applications.
What background is essential to understand this workshop?
This is an introductory session, and participants are not expected to know 100 machine learning algorithms or have a PhD in maths. That said, the following are the bare minimum requirements:
Reasonable knowledge of at least one programming language (C, C++, Python, Ruby, Java, R, MATLAB, Julia and so on).
Some familiarity with machine learning (for example, it is enough to vaguely understand what linear regression is).
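If “linear regression” needs a refresher: it simply finds the best-fitting straight line through data. A tiny NumPy illustration (the numbers here are made up for demonstration, not workshop material):

```python
import numpy as np

# Generate points along the line y = 2x + 1 with a little noise,
# then recover the slope and intercept with a degree-1 polynomial fit.
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # close to 2.0 and 1.0
```

If that fitting step makes intuitive sense, you have all the ML background the session assumes.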
Those who are not familiar with Python are advised to go through the tutorial given here. Python is an extremely elegant and simple language; you’ll get started in no time!
We will be using the Python scientific stack for this workshop, so installing the following tools is an absolute must:
Python (version 2.7; avoid version 3, as the migration of the SciPy tools is still a work in progress). Most Linux installations come with Python 2.7 installed by default.
NumPy and scikit-learn
The installation steps are detailed on the respective websites.
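A quick way to confirm the stack is ready before the workshop (the package names are the standard ones; the versions printed will vary by machine):

```shell
# Print the installed versions; an ImportError means the package is missing
python -c "import numpy; print(numpy.__version__)"
python -c "import sklearn; print(sklearn.__version__)"
```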
A real-world dataset to be used for the workshop will be supplied about a week before the actual date.
Harshad is a senior data scientist at Sokrati, a digital advertising startup based out of Pune, where he works closely with the engineering team to extract meaning from millions of data points from the advertising world. He has been applying machine learning to real-world problems in telecom, banking and advertising for the last 4 years. He has mostly worked with tools like R, Python and SAS, and has lately fallen in love with the Clojure ecosystem too. He conducted a similar session at Fifth Elephant 2013, focussed on R and text mining. Harshad holds a master’s degree in Operations Research from IIT, Mumbai.
- The GitHub repo for the workshop content is here. It will include code, presentation slides and all the sample data files required for the workshop. Participants need to download the sample data file ‘bank-full.csv’. https://github.com/harshadss/my-presentations/tree/master/fifth_elephant_2014/ml_workshop
- This is a video of a similar session, conducted as part of the run-up to Fifth Elephant 2014 in Mumbai. https://hasgeek.tv/fifthelephant/2014-machine-learning-workshop