The Fifth Elephant 2016

India's most renowned data science conference

The Fifth Elephant is India’s most renowned data science conference. It is a space for discussing some of the most cutting-edge developments in machine learning, data science and the technology that powers data collection and analysis.

Machine Learning, Distributed and Parallel Computing, and High-performance Computing continue to be the themes for this year’s edition of Fifth Elephant.

We are now accepting submissions for our next edition, which will take place in Bangalore on 28-29 July 2016.

Tracks

We are looking for application-level and tool-centric talks and tutorials on the following topics:

  1. Deep Learning
  2. Text Mining
  3. Computer Vision
  4. Social Network Analysis
  5. Large-scale Machine Learning (ML)
  6. Internet of Things (IoT)
  7. Computational Biology
  8. ML in healthcare
  9. ML in education
  10. ML in energy and ecology
  11. ML in agriculture
  12. Analytics for emerging markets
  13. ML in e-governance
  14. ML in smart cities
  15. ML in defense

The deadline for submitting proposals is 30th April 2016.

Format

This year’s edition spans two days of hands-on workshops and conference. We are inviting proposals for:

  • Full-length 40-minute talks.
  • Crisp 15-minute talks.
  • Sponsored sessions of 15 minutes (limited slots available; subject to editorial scrutiny and approval).
  • Hands-on workshop sessions of 3 or 6 hours.

Selection process

Proposals will be filtered and shortlisted by an Editorial Panel. We urge you to add links to videos / slide decks when submitting proposals. This will help us understand your past speaking experience. Blurbs or blog posts covering the relevance of a particular problem statement and how it is tackled will help the Editorial Panel better judge your proposals.

We expect you to submit an outline of your proposed talk, as a mind map, a text document or draft slides, within two weeks of submitting your proposal.

We will notify you about the status of your proposal within three weeks of submission.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers. As our budget is limited, we will prefer speakers from locations closer to home, but will do our best to cover costs for anyone exceptional. HasGeek will provide a grant to cover part of your travel and accommodation in Bangalore. Grants are limited and are made available to speakers delivering full sessions (40 minutes or longer).

Commitment to open source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), please consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Key dates and deadlines

  • Revised paper submission deadline: 17 June 2016
  • Confirmed talks announcement (in batches): 13 June 2016
  • Schedule announcement: 30 June 2016
  • Conference dates: 28-29 July 2016

Venue

The Fifth Elephant will be held at the NIMHANS Convention Centre, Dairy Circle, Bangalore.

Contact

For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Nischal HP

@nischalhp

Building a scalable Data Science Platform (Luigi, Apache Spark, Pandas, Flask)

Submitted Jun 14, 2016

“In theory, there is no difference between theory and practice. But in practice, there is.” - Yogi Berra

Once a data science solution has been prototyped on a local machine, the real challenge begins: making it work in production. Ensuring that the plumbing of the data pipeline works in production at scale is both an art and a science. The science involves understanding the different tools and technologies needed to make the data pipeline connect, while the art involves making the trade-offs needed to tune the data pipeline so that it flows.

In this workshop, you will learn how to build a scalable data science platform: set up and conduct data engineering using Pandas and Luigi, build a machine learning model with Apache Spark, and deploy it as a predictive API with Flask.

Outline

The biggest challenge in building a data science platform is to glue all the moving pieces together. Typically, a data science platform consists of:

  • Data engineering - involves a lot of ETL and feature engineering.
  • Machine learning - involves writing machine learning models and persisting them.
  • API - involves exposing endpoints to the outside world to invoke the predictive capabilities of the model.

Over time, the amount of stored data that needs to be processed grows, so the data science process has to run more frequently. But different technologies and stacks solve different parts of the data science problem, and leaving each part to its respective team introduces lag into the system. What is needed is an automated pipeline: one that can be invoked based on business logic (real time, near-real time, etc.) and that guarantees data integrity.
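
One way to picture such an automated pipeline is as a set of steps with declared dependencies, run in an order that respects them. The sketch below uses Python's standard-library graphlib as a stand-in for a real workflow tool such as Luigi; it is not Luigi's API. The step names (extract, transform, train, serve) and the run_pipeline helper are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
    "serve": {"train"},
}


def run_pipeline(steps, actions):
    """Run every step after its dependencies; stop at the first failure."""
    completed = []
    for name in TopologicalSorter(steps).static_order():
        # An exception here halts the run, so a downstream step never
        # sees the output of an upstream step that failed.
        actions[name]()
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions that only report which step is running.
    actions = {name: (lambda n=name: print("running", n)) for name in PIPELINE}
    print(run_pipeline(PIPELINE, actions))
```

Stopping at the first failed step is the simplest way to keep later stages from consuming partial data; a production tool adds retries, scheduling and persistent state on top of the same ordering idea.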

Details of the workshop

Data Engineering

It is often said that about 80% of the effort goes into data engineering, with the rest spent building the actual machine learning models. Data engineering starts with identifying the right data sources. Data sources can be databases, third-party APIs, HTML documents that need to be scraped, and so on. Acquiring data from databases is a straightforward job, while acquiring data from third-party APIs and scraping comes with its own complexities, such as page visit limits and API rate limiting. Once we manage to acquire data from all these sources, the next job is to clean the data.
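
Rate limits are commonly handled by retrying with an increasing delay. The sketch below shows one such pattern, exponential backoff, using only the standard library. RateLimited, fetch_with_backoff and the fetch callable are hypothetical names; a real client would raise its own error (for example on an HTTP 429 response), and the delays would be tuned to the provider's published limits.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a data source raises when its rate limit is hit."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff while rate limited."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Passing the sleep function in as a parameter keeps the helper easy to exercise without real waiting.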

We will be covering the following topics for data engineering:

  • Identifying and working with two data sources
  • Writing ETL (Extraction, Transformation and Loading) with Pandas
  • Building dependency management with Luigi
  • Logging the process
  • Adding notifications on success and failure
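
Several of these topics can be sketched in a few lines of Pandas. This is a minimal illustration, not the workshop's code: the two sources (an orders table and a customers table), their column names and the notify callback are all assumptions, and the load step writes a CSV file where a real pipeline would more likely write to a database such as PostgreSQL. Dependency management with Luigi is not shown here.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def extract():
    """Stand-ins for two data sources (hypothetical tables and columns)."""
    orders = pd.DataFrame(
        {"customer_id": [1, 1, 2, 3], "amount": [10.0, None, 5.0, 7.5]}
    )
    customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Bangalore", "Pune"]})
    return orders, customers


def transform(orders, customers):
    """Clean the orders, then build one feature row per known customer."""
    clean = orders.dropna(subset=["amount"])  # drop orders with no amount
    totals = clean.groupby("customer_id", as_index=False)["amount"].sum()
    # An inner join keeps only customers present in both sources.
    return totals.merge(customers, on="customer_id", how="inner")


def run_etl(path, notify=print):
    """Run extract -> transform -> load, logging and notifying the outcome."""
    try:
        orders, customers = extract()
        features = transform(orders, customers)
        features.to_csv(path, index=False)  # "load" step
    except Exception:
        log.exception("ETL failed")
        notify("ETL failed")
        raise
    log.info("ETL wrote %d rows to %s", len(features), path)
    notify("ETL succeeded")
    return features
```

The notify argument defaults to print; swapping in a function that sends an email or a chat message is one way to get success and failure notifications without touching the ETL logic.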

Machine Learning

Building a robust and scalable machine learning platform is hard. As the data size increases, so does the need for computational capacity. So how do you build a system that can scale by just adding more hardware, without changing the code much every time? One answer is Apache Spark ML: Apache Spark lets us build machine learning platforms by providing distributed computing capabilities out of the box.

We will be covering the following topics for Machine Learning:

  • Feature Engineering
  • Hypothesis to solve
  • Configuration of environment variables for PySpark
  • Build the Machine Learning code with Apache Spark
  • Persisting the model
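
For the environment-variable step, a pre-built Spark download is typically made importable from Python by pointing SPARK_HOME at the unpacked directory and adding its bundled python folder and Py4J archive to sys.path. The helper below is a sketch of that convention; configure_pyspark is a hypothetical name, and the install path in the note that follows is a placeholder, so adjust both to your setup.

```python
import glob
import os
import sys


def configure_pyspark(spark_home, python_executable=sys.executable):
    """Point this process at a pre-built Spark directory (sketch)."""
    os.environ["SPARK_HOME"] = spark_home
    # Interpreter Spark should use for its Python workers.
    os.environ["PYSPARK_PYTHON"] = python_executable

    # PySpark and its Py4J dependency ship inside the Spark download.
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    added = [python_dir] + sorted(py4j_zips)
    for path in added:
        if path not in sys.path:
            sys.path.insert(0, path)
    return added
```

After a call such as configure_pyspark("/opt/spark") (a placeholder path), import pyspark should resolve against that installation.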

API

It ain’t over until the fat lady sings. Making a system API-driven is essential: it ensures that the machine learning model actually gets used, and it lets other systems integrate its capabilities with ease.

We will be covering the following topics for API:

  • Building a REST API with Flask
  • Based on the input parameters, building the respective methods to extract the features to be fed into the model
  • Sending responses as JSON
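
These three steps might look like the following in Flask. This is a sketch under assumptions: the /predict route, the age and income parameters, and the predict function (a hard-coded stand-in for a persisted model) are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_features(params):
    """Turn raw request parameters into the model's feature vector."""
    return [float(params["age"]), float(params["income"])]


def predict(features):
    """Stand-in for a loaded model: a fixed linear score, not a real model."""
    age, income = features
    return 0.5 * age + 0.25 * income


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    params = request.get_json(silent=True) or {}
    try:
        features = extract_features(params)
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric input: report a client error as JSON.
        return jsonify(error="age and income must be numbers"), 400
    return jsonify(prediction=predict(features))
```

Keeping feature extraction in its own function means the same code can be reused when training the model, which helps avoid a mismatch between training-time and serving-time features. Calling app.run() starts Flask's development server for local experiments.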

Prerequisites

  • Python - Knowledge of writing classes
  • Knowledge of data science:
    • What is data science?
    • What are the practical use cases for data science?
  • Knowledge of machine learning:
    • Familiarity with linear regression and logistic regression
  • Knowledge of software engineering:
    • Understanding of scalability and highly available systems

Requirements

  • Laptop with python3 installed
  • virtualenv with python3
  • luigi
  • pandas
  • apache-spark pre-built for hadoop
  • flask
  • requests
  • postgresql
  • Recommended OS - Linux/OS X
  • Recommended memory - 8 GB (at least 4 GB)
  • Lots of enthusiasm

Speaker bio

Nischal is a co-founder and Data Engineer at Unnati Data Labs, where he enables the data scientists to work in peace. He makes sure that they get the data they need, in the way they need it. Previously, during his tenure at Redmart, he built various e-commerce systems from scratch, including catalog management, recommendation engines and market basket analysis.

Raghotham is a co-founder and Data Scientist at Unnati Data Labs who can work across the complete stack. Previously, at Touchpoints Inc., he single-handedly built a data analytics platform for a fitness wearable company. At Redmart, he worked on the CRM system and built a sentiment analyzer for Redmart’s social media. Prior to Redmart and Touchpoints, Raghotham worked at SAP Labs, where he was a core part of what is currently SAP’s framework for building web and mobile products. He took part in multiple SAP-wide events, helping to spread knowledge both internally and to customers.

They have conducted workshops on deep learning across the world. They are strong believers in open source and love to architect big, fast and reliable systems.
