Distributed Computing Abstractions for Big Data Science

Jul 2016

28 Thu 08:30 AM – 06:25 PM IST

29 Fri 08:30 AM – 06:15 PM IST

30 Sat 08:45 AM – 05:00 PM IST

31 Sun 08:15 AM – 06:00 PM IST

NIMHANS Convention Centre

Format

This year’s edition spans two days of hands-on workshops and conference. We are inviting proposals for:

Full-length 40-minute talks.

Crisp 15-minute talks.

Sponsored sessions, 15-minute duration (limited slots available; subject to editorial scrutiny and approval).

Hands-on workshop sessions, 3- and 6-hour duration.

Selection process

Proposals will be filtered and shortlisted by an Editorial Panel. We urge you to add links to videos / slide decks when submitting proposals. This will help us understand your past speaking experience. Blurbs or blog posts covering the relevance of a particular problem statement and how it is tackled will help the Editorial Panel better judge your proposals.

We expect you to submit an outline of your proposed talk – either in the form of a mind map or a text document or draft slides within two weeks of submitting your proposal.

We will notify you about the status of your proposal within three weeks of submission.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers. As our budget is limited, we will prefer speakers from locations closer to home, but will do our best to cover anyone exceptional. HasGeek will provide a grant to cover part of your travel and accommodation in Bangalore. Grants are limited and are made available to speakers delivering full sessions (40 minutes or longer).

Commitment to open source

HasGeek believes in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), please consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support us in return for giving you an audience. Your session will be marked on the schedule as a sponsored session.

Key dates and deadlines

Revised paper submission deadline: 17 June 2016

Confirmed talks announcement (in batches): 13 June 2016

Schedule announcement: 30 June 2016

Conference dates: 28-29 July 2016

Venue

The Fifth Elephant will be held at the NIMHANS Convention Centre, Dairy Circle, Bangalore.

Contact

For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Distributed Computing Abstractions for Big Data Science

Submitted Jun 9, 2016

Section: Full talk Technical level: Intermediate

Data science has advanced significantly in the last few years, with a renewed focus on making it work at scale. This talk outlines the distributed computing abstractions required to realize data science at scale.

The Resilient Distributed Dataset (RDD) abstraction provided by Spark is becoming the de facto approach for big data science. However, Apache Flink and, more recently, Concord have emerged as interesting alternatives to Spark that provide streaming dataflow abstractions. While Spark achieves real-time analytics by mini-batching, Flink treats event streaming as a first-class abstraction and provides exactly-once guarantees.

TensorFlow also provides a dataflow abstraction, in its case for deep learning networks. TensorFlow recently released a distributed version that works by using gRPC or by integrating with cluster management systems such as Kubernetes.

Graph processing abstractions are useful for realizing complex algorithms on large, real-life, natural power-law graphs such as the Twitter or LinkedIn graphs. GraphLab and Titan are the prominent graph processing systems. GraphLab provides an efficient partitioning mechanism to split a large graph across a cluster of nodes and run algorithms at scale. Notably, common machine learning algorithms such as clustering and classification, as well as deep learning, can be realized on top of graph processing abstractions. The Titan graph database integrates well with several NoSQL stores as data sources, including Cassandra and HBase, and with processing engines for machine learning, including Spark, Giraph and Hadoop.

We also outline our experience of implementing machine learning and deep learning algorithms over many of these abstractions.
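
To make the RDD idea concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the class name `ToyRDD` and its methods are hypothetical, chosen only to illustrate two properties usually associated with RDDs, namely lazy transformations and rebuilding a partition by replaying its lineage.

```python
from functools import reduce


class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists: the source data
        self._lineage = lineage        # tuple of (kind, function) steps

    def map(self, fn):
        # Transformations are lazy: record the step, compute nothing yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute_partition(self, index):
        # Replay the recorded lineage on one partition. In this sketch, a
        # lost partition would be rebuilt the same way from its source data.
        data = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        # Actions trigger the actual computation, partition by partition.
        out = []
        for i in range(len(self._partitions)):
            out.extend(self._compute_partition(i))
        return out

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())                   # [4, 16, 36]
print(squares_of_evens.reduce(lambda a, b: a + b))  # 56
```

Because each `ToyRDD` keeps only the source partitions and the list of steps, the original dataset is never modified, and any derived partition can be recomputed on demand.
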
The key audience takeaways include:
Implementation details of machine learning algorithms over several distributed computing frameworks such as Spark, GraphLab, Flink and TensorFlow.
A state-of-the-art review of big data science: from distributed TensorFlow to Dato to Flink, the audience gets a feel for cutting-edge technology in the field.
A discussion of the pros and cons of similar frameworks and when to use them. For instance, the trade-offs between Apache Spark and Flink: use Flink if you need low-latency, event-specific processing, or Spark Streaming when you need high-throughput processing that does not require complex event processing (CEP). Similarly, the trade-offs between GraphLab and Titan and when to use one over the other.
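
The latency side of the Spark versus Flink trade-off can be sketched without either framework. The plain-Python example below is an illustration under assumptions, not code from the talk: the function names, the `(timestamp, value)` event format and the one-second batch interval are all hypothetical. It contrasts mini-batching, where a result appears only when a batch interval closes, with event-at-a-time processing, where a result appears as each event arrives.

```python
def micro_batch_sums(events, batch_interval):
    """Group (timestamp, value) events into fixed intervals; emit one sum per batch."""
    batches = {}
    for ts, value in events:
        batch_id = int(ts // batch_interval)
        batches[batch_id] = batches.get(batch_id, 0) + value
    # A batch's result becomes visible only when its interval closes.
    return [((batch_id + 1) * batch_interval, total)
            for batch_id, total in sorted(batches.items())]


def per_event_sums(events):
    """Update a running sum and emit a result as each event arrives."""
    total = 0
    out = []
    for ts, value in events:
        total += value
        out.append((ts, total))
    return out


# Hypothetical events: (arrival time in seconds, value)
events = [(0.2, 5), (0.7, 1), (1.1, 4), (2.6, 2)]
batched = micro_batch_sums(events, batch_interval=1.0)
streamed = per_event_sums(events)
print(batched)   # [(1.0, 6), (2.0, 4), (3.0, 2)]
print(streamed)  # [(0.2, 5), (0.7, 6), (1.1, 10), (2.6, 12)]
```

In this sketch the event at 0.2 s is reflected in a batched result only at 1.0 s, while the per-event version reflects it immediately. The batched version does less per-event bookkeeping, which is one intuition for why batching tends to favor throughput.
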

Outline

Introduction to Apache Spark and Flink. ML/deep learning on top of Spark/Flink, with code.
Introduction to TensorFlow: distributed deep learning.
Introduction to GraphLab/Titan. ML/deep learning on top of GraphLab/Titan, with code.
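
As a flavor of the graph processing portion, the following plain-Python sketch runs a vertex-centric computation over a partitioned graph. It rests on several assumptions not taken from the talk: PageRank is used as the example algorithm, vertices are split by simple hash partitioning, and everything runs in one process. None of the function names are GraphLab or Titan APIs.

```python
def partition_vertices(vertices, num_parts):
    """Hash-partition integer vertex IDs across workers (an assumed scheme)."""
    parts = [[] for _ in range(num_parts)]
    for v in vertices:
        parts[hash(v) % num_parts].append(v)
    return parts


def pagerank(edges, num_parts=2, damping=0.85, iterations=50):
    """Toy vertex-centric PageRank over a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    # In a real cluster each partition would live on a different node;
    # here the partitions only structure the loop.
    parts = partition_vertices(vertices, num_parts)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        # Scatter: every vertex sends an equal share of its rank along
        # each out-edge (vertices with no out-edges spread rank evenly).
        for part in parts:
            for v in part:
                targets = out_links[v] or vertices
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
        # Apply: every vertex combines the shares it received.
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in vertices}
    return rank


# Hypothetical four-vertex graph; vertex 2 has the most in-links.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
rank = pagerank(edges)
for vertex in sorted(rank, key=rank.get, reverse=True):
    print(vertex, round(rank[vertex], 3))
```

The scatter and apply steps touch only a vertex and its neighbors, which is what allows this style of algorithm to be distributed once the graph has been split across nodes.
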

Requirements

Nothing specific.

Speaker bio

Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras (2001) and a PhD from IIT Madras (2008), and held a post-doctoral research fellowship at the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL). He has joined SapientNitro as Director of Technology in the data sciences team. He has spent the last ten years creating intellectual property and building products in the big data area at Oracle, Cognizant and Impetus. He has built PMML support into Spark/Storm and realized several machine learning algorithms, such as LDA and Random Forests, over Spark. He led a team that designed and implemented a big data governance product for role-based, fine-grained access control inside Hadoop YARN. He and his team have also built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and the IEEE (Senior) for more than 10 years. He has four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems and data sciences, as well as big data and other emerging technologies. He has been an invited speaker at several national and international conferences, such as O’Reilly’s Strata big data conference series. He lives in Bangalore with his wife, son and daughter and enjoys researching the history and philosophy of Egypt, Babylonia, Greece and India.

Slides

http://www.slideshare.net/VijayAgneeswaran/distributed-computing-abstractionsdatascience6june2016ver04
