The Fifth Elephant 2018

The seventh edition of India's best data conference

Distributed Deep Learning

Submitted by Somya Kumar (@somyak) on Monday, 26 March 2018

Section: Crisp talk Technical level: Intermediate Status: Rejected


There are various open-source frameworks, such as TensorFlow, CNTK, MXNet, and PyTorch, that allow data scientists to develop deep learning models. Traditionally, data scientists train models on a single machine; however, as datasets and models grow, training on a single node becomes inefficient. A few frameworks, such as TensorFlow, support model training across multiple machines using data and model parallelism. However, running these jobs across machines requires explicitly configuring each job with cluster-specific information, which is cumbersome and time-consuming for data scientists.
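To illustrate the configuration overhead described above, here is a rough sketch of the explicit cluster spec that distributed TensorFlow expects via the `TF_CONFIG` environment variable. The hostnames and ports are placeholder assumptions; the point is that every machine must carry the full cluster listing plus its own role and index:

```python
import json
import os

# Hypothetical hosts: every parameter server and worker in the cluster
# must be enumerated explicitly on every machine.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# The "task" block differs on each machine (its role and index), which is
# what makes launching these jobs by hand cumbersome.
tf_config = {"cluster": cluster, "task": {"type": "worker", "index": 0}}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

Each process reads this variable at startup, so a change in cluster size means regenerating and redistributing the spec to every node.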

We (at Qubole) are trying to solve this problem by integrating the available solutions for distributed training into our clusters. There are interesting open-source projects, such as Horovod by Uber and TensorFlowOnSpark by Yahoo, that make distribution across machines easier. Horovod brings HPC techniques to deep learning with its ring-allreduce approach, which speeds up model training, and it exposes APIs to convert non-distributed TensorFlow code into distributed code. TensorFlowOnSpark removes the complexity of specifying a cluster spec by using Spark as the orchestration layer. In this talk, I will discuss the various architectures, followed by a comparative analysis and benchmarking of the techniques available for distributed deep learning.
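The ring-allreduce idea mentioned above can be simulated in plain Python: each worker's gradient vector is split into N chunks, a reduce-scatter phase accumulates each chunk around the ring, and an allgather phase circulates the reduced chunks back. This is only an illustrative sketch of the algorithm, not Horovod's actual implementation (which runs the exchanges over MPI/NCCL):

```python
def split(v, n):
    """Split a list into n nearly equal contiguous chunks."""
    k, r = divmod(len(v), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(v[start:end])
        start = end
    return out


def ring_allreduce(vectors):
    """Simulate ring-allreduce: given one vector per worker, return the
    elementwise sum as held by every worker after the exchange."""
    n = len(vectors)
    chunks = [split(list(v), n) for v in vectors]  # chunks[worker][chunk]

    # Phase 1 -- reduce-scatter: in each of n-1 steps, worker w sends
    # chunk (w - step) mod n to its right neighbour, which adds it in.
    # Afterwards worker w holds the fully reduced chunk (w + 1) mod n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n
            dst = (w + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[w][c])]

    # Phase 2 -- allgather: circulate the reduced chunks so every worker
    # ends up with all of them.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][c] = list(chunks[w][c])

    return [[x for chunk in worker for x in chunk] for worker in chunks]
```

Each worker sends and receives only 2·(N−1)/N of the data per allreduce regardless of worker count, which is why the ring approach scales better than having every worker push full gradients to a central parameter server.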


  1. A brief comparison of machine learning and deep learning.
  2. Motivation behind distributed deep learning and challenges in the transition.
  3. Challenges in DDL: model and data parallelism for training models on large datasets efficiently.
  4. Implementation of distributed deep learning in frameworks like TensorFlow, MXNet, etc.
  5. Comparison and benchmarking of open-source frameworks built on top of TensorFlow, such as TensorFlowOnSpark (TFOS) and Horovod.
  6. Open challenges in the currently available frameworks.

Speaker bio


Preview video


  • Zainab Bawa (@zainabbawa) Crew 2 years ago

    This proposal only describes the options that are available for doing distributed deep learning. It does not explain how you have used distributed deep learning for your use case, how you have migrated to this system and how you run it.

    • Somya Kumar (@somyak) Proposer 2 years ago (edited 2 years ago)

      We have worked on understanding and providing distributed deep learning as a service to our users. In the process, we have explored the architecture of the different available options and run our own benchmarks. The aim of the talk is to present our findings and learnings. We also have examples showing how to migrate non-distributed TensorFlow code to distributed code.

  • Zainab Bawa (@zainabbawa) Crew 2 years ago

    Somya, we need your preview video to complete the evaluation for this talk.

    • Somya Kumar (@somyak) Proposer 2 years ago

      Hi Zainab, I have uploaded the video now.
