The Fifth Elephant 2018

The seventh edition of India's best data conference

Distributed Deep Learning

Submitted by Somya Kumar (@somyak) on Monday, 26 March 2018


Technical level

Intermediate

Section

Crisp talk

Status

Submitted

Total votes: +16

Abstract

There are various open-source frameworks, such as TensorFlow, CNTK, MXNet and PyTorch, that allow data scientists to develop deep learning models. Traditionally, data scientists train models on a single machine; however, as datasets and models grow, training on a single node becomes inefficient. Some frameworks, such as TensorFlow, support model training across multiple machines using data and model parallelism. However, running these jobs across machines requires explicitly configuring each job with cluster-specific information, which is cumbersome and time-consuming for data scientists.
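To give a sense of that overhead, here is a minimal sketch of the explicit cluster configuration that TF 1.x-era distributed TensorFlow expects; the hostnames, ports, roles and toy model are hypothetical, not code from the talk:

    import tensorflow as tf

    # Every process must be started with the full cluster layout and its
    # own role spelled out by hand (hypothetical hosts and ports).
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables are placed on the parameter server, ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        logits = tf.layers.dense(x, 10)

Each worker and parameter server needs this boilerplate repeated with its own job name and task index, which is exactly the per-cluster bookkeeping that the frameworks below try to remove.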

We (at Qubole) are trying to solve this problem by integrating the available solutions for distributed training into our clusters. There are interesting open-source projects, such as Horovod from Uber and TensorFlowOnSpark from Yahoo, that make distribution across machines easier. Horovod brings HPC techniques to deep learning with a ring-allreduce approach that speeds up model training, and it exposes APIs for converting non-distributed TensorFlow code into distributed code. TensorFlowOnSpark removes the complexity of specifying a cluster spec by using Spark as the orchestration layer. In this talk, I will discuss the various architectures, followed by a comparative analysis and benchmarking of the techniques available for distributed deep learning.
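As an illustration of that conversion, here is a minimal sketch using Horovod's TF 1.x API; the toy model and learning rate are placeholders, not the examples from the talk:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # typically launched as one process per GPU via mpirun

    # Pin each process to its own GPU.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy model (placeholder): the existing single-node graph stays as-is.
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.int64, [None])
    logits = tf.layers.dense(x, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(y, logits)

    # The Horovod-specific changes: scale the learning rate by the number
    # of workers and wrap the optimizer so that gradients are averaged
    # across workers with ring-allreduce.
    opt = tf.train.AdamOptimizer(0.001 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    train_op = opt.minimize(loss)

    # Rank 0 broadcasts the initial weights so all workers start in sync.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]

    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        pass  # run train_op on input batches here

Note that no cluster spec appears anywhere; the MPI launcher supplies the process topology.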

Outline

  1. A brief comparison of machine learning and deep learning.
  2. The motivation behind distributed deep learning and the challenges in transitioning to it.
  3. Challenges in DDL: model and data parallelism for training models on large datasets efficiently.
  4. How distributed deep learning is implemented in frameworks like TensorFlow and MXNet.
  5. Comparison and benchmarking of open-source frameworks built on top of TensorFlow, such as TensorFlowOnSpark (TFOS) and Horovod (a short TFOS sketch follows this outline).
  6. Open challenges in the currently available frameworks.
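As referenced from item 5 above, here is a minimal sketch of how TensorFlowOnSpark hides the cluster spec behind a Spark driver; main_fun, the executor counts and the RDD are hypothetical stand-ins:

    from pyspark import SparkContext
    from tensorflowonspark import TFCluster

    def main_fun(args, ctx):
        # Per-executor entry point: the usual single-node TensorFlow
        # training code goes here; ctx carries the job name and task index
        # that TensorFlowOnSpark derives from Spark, so no hand-written
        # ClusterSpec is needed.
        pass

    sc = SparkContext(appName="tfos-sketch")
    cluster = TFCluster.run(sc, main_fun, [], num_executors=4, num_ps=1,
                            tensorboard=False,
                            input_mode=TFCluster.InputMode.SPARK)
    data_rdd = sc.parallelize([])  # stand-in for an RDD of training examples
    cluster.train(data_rdd, num_epochs=1)
    cluster.shutdown()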

Speaker bio

https://www.linkedin.com/in/somya-kumar-b4b62092

Slides

https://docs.google.com/presentation/d/14IV8Hmf8_loRk1ZxSKIhFIunI-7caCCWn46A2MjD7AQ/edit?usp=sharing

Preview video

https://youtu.be/d0HnUKjRDwU

Comments

  • 1
    Zainab Bawa (@zainabbawa) Reviewer 6 months ago

This proposal only describes the options available for doing distributed deep learning. It does not explain how you have used distributed deep learning for your use case, how you migrated to this system, and how you run it.

    • 1
      Somya Kumar (@somyak) Proposer 6 months ago (edited 6 months ago)

We have worked on understanding and providing distributed deep learning as a service to our users. In the process, we have explored the architectures of the different available options and run our own benchmarks. The aim of the talk is to present our findings and learnings. We also have examples showing how to migrate non-distributed TensorFlow code to distributed code.

  • 0
    Zainab Bawa (@zainabbawa) Reviewer 7 months ago

Somya, we need your preview video to complete the evaluation for this talk.

    • 1
      Somya Kumar (@somyak) Proposer 7 months ago

      Hi Zainab, I have uploaded the video now.
