Distributed Deep Learning

26 Jul 2018 (Thu), 07:45 AM – 06:15 PM IST

27 Jul 2018 (Fri), 07:45 AM – 05:35 PM IST

NIMHANS Convention Centre, Bengaluru

Submitted Mar 26, 2018

Section: Crisp talk

Technical level: Intermediate

Open source frameworks such as TensorFlow, CNTK, MXNet and PyTorch allow data scientists to develop deep learning models. Traditionally, data scientists train models on a single machine; however, as datasets and models grow, training on a single node becomes inefficient. A few frameworks, such as TensorFlow, support training across multiple machines using data and model parallelism. However, running these jobs across machines requires explicitly configuring each job with cluster-specific information, which is cumbersome and time-consuming for data scientists.
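To give a sense of the configuration burden described above, here is a minimal sketch of the per-process cluster description that distributed TensorFlow reads from the `TF_CONFIG` environment variable. The host names, port and the helper `build_tf_config` are hypothetical and only illustrate the shape of the data; they are not taken from the talk.

```python
import json

# Hypothetical hosts for illustration only; a real cluster supplies its own.
WORKERS = ["node1.example.com:2222", "node2.example.com:2222"]
PS = ["node0.example.com:2222"]


def build_tf_config(workers, ps, task_type, task_index):
    """Return the TF_CONFIG JSON string for one process in the cluster.

    Every process gets the same "cluster" section but a different "task"
    section, which is what makes manual setup tedious and error-prone.
    """
    cluster = {"worker": list(workers)}
    if ps:
        cluster["ps"] = list(ps)
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range" % task_index)
    config = {
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    }
    return json.dumps(config, sort_keys=True)


if __name__ == "__main__":
    # One distinct configuration per process that must be started.
    for role, hosts in (("ps", PS), ("worker", WORKERS)):
        for index, host in enumerate(hosts):
            print(host, build_tf_config(WORKERS, PS, role, index))
```

Each of the three processes in this sketch needs its own value exported before training starts, and the host list has to be kept in sync on every machine whenever the cluster changes.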

We (at Qubole) are trying to solve this problem by integrating the available solutions for distributed training into our clusters. Open source projects such as Horovod by Uber and TensorFlowOnSpark by Yahoo make distribution across machines easier. Horovod brings HPC techniques to deep learning by using the ring-allreduce approach, which speeds up model training. It also exposes APIs to convert non-distributed TensorFlow code into distributed code. TensorFlowOnSpark removes the complexity of specifying a cluster spec by using Spark as the orchestration layer. In this talk, I will discuss various architectures, followed by a comparative analysis and benchmarking of the techniques available for distributed deep learning.
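The ring-allreduce idea mentioned above can be illustrated with a small single-process simulation. This is a sketch of the general algorithm under simplifying assumptions (plain Python lists stand in for gradient tensors, and "sending" is a list copy); it is not Horovod's implementation, and the function name `ring_allreduce` is made up for this example.

```python
def ring_allreduce(vectors):
    """Simulate ring-allreduce (sum) over equal-length vectors.

    vectors[i] is the local gradient of worker i. Each worker only ever
    sends to its right-hand neighbour, (i + 1) % n, one chunk per step.
    Returns one reduced vector per worker; all of them end up equal.
    """
    n = len(vectors)
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all workers must hold vectors of equal length")

    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    data = [list(v) for v in vectors]

    # Phase 1, scatter-reduce: after n - 1 steps worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the step behaves as simultaneous.
        messages = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, _ = bounds[chunk]
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for worker, reduced in enumerate(ring_allreduce(grads)):
        print(worker, reduced)
```

In this scheme each worker transmits roughly 2 × (n − 1) / n times the vector size in total, regardless of how many workers participate, which is why the approach scales better than funnelling every gradient through a central node. For training, the summed gradients would typically be divided by the number of workers to obtain an average.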

Outline

A brief comparison of machine learning and deep learning.
Motivation behind distributed deep learning and challenges in the transition.
Challenges in distributed deep learning (DDL): model and data parallelism to train models on large datasets efficiently.
Implementation of distributed deep learning in frameworks like TensorFlow and MXNet.
Comparison and benchmarking of open source frameworks built on top of TensorFlow, such as TensorFlowOnSpark (TFOS) and Horovod.
Open challenges in the currently available frameworks.

Speaker bio

https://www.linkedin.com/in/somya-kumar-b4b62092

Slides

https://docs.google.com/presentation/d/14IV8Hmf8_loRk1ZxSKIhFIunI-7caCCWn46A2MjD7AQ/edit?usp=sharing

The Fifth Elephant 2018