Distributed Deep Learning
Submitted by Somya Kumar (@somyak) on Monday, 26 March 2018
There are various open source frameworks such as TensorFlow, CNTK, MXNet, and PyTorch that allow data scientists to develop deep learning models. Traditionally, data scientists train models on a single machine; however, as datasets and models grow, training on a single node becomes inefficient. A few frameworks, such as TensorFlow, support model training across multiple machines using data and model parallelism. However, running these jobs across machines requires explicitly configuring each job with cluster-specific information, which is cumbersome and time-consuming for data scientists.
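To give a feel for the boilerplate involved: in TensorFlow's distributed mode, every process must be launched knowing the full cluster membership plus its own role and index, typically via the `TF_CONFIG` environment variable. A minimal sketch (the host names and ports below are placeholders, not real addresses):

```python
import json
import os

# Every parameter server and worker must be told the full cluster
# membership up front; host:port pairs have to be chosen and wired in
# ahead of time. The hosts here are illustrative placeholders.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

def tf_config_for(job_name, task_index):
    """Build the TF_CONFIG value for one process in the cluster."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": job_name, "index": task_index},
    })

# e.g. the second worker would be launched with:
os.environ["TF_CONFIG"] = tf_config_for("worker", 1)
```

Repeating this for every process on every cluster (and updating it whenever the cluster changes) is exactly the manual step the projects below try to eliminate.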
We (at Qubole) are trying to solve this problem by integrating the available solutions for distributed training into our clusters. There are interesting projects in the open source community, such as Horovod by Uber and TensorFlowOnSpark by Yahoo, that make distribution across machines easier. Horovod brings HPC techniques to deep learning by using a ring-allreduce approach, which speeds up model training, and it exposes APIs to convert non-distributed TensorFlow code into distributed code. TensorFlowOnSpark takes away the complexity of specifying a cluster spec by using Spark as the orchestration layer. In this talk, I will discuss the various architectures, followed by a comparative analysis and benchmarking of the techniques available for distributed deep learning.
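The ring-allreduce idea can be sketched without any framework: each of N workers splits its gradient into N chunks, accumulates partial sums around a ring (reduce-scatter), then circulates the finished chunks (allgather), so each worker sends and receives only about 2x the gradient size regardless of N. A toy single-process simulation of the algorithm, purely for illustration (Horovod implements this natively on top of MPI/NCCL):

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: grads is a list of per-worker gradient
    lists of equal length; returns each worker's copy of the sum."""
    n = len(grads)
    length = len(grads[0])
    # Split each worker's gradient into n contiguous chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    chunks = [[g[s:e] for (s, e) in bounds] for g in grads]

    # Reduce-scatter: for n-1 steps, each worker sends one chunk to its
    # right neighbour, which adds it into its own copy. Afterwards,
    # worker w holds the fully reduced chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so updates within a step don't interfere.
        sends = [list(chunks[w][(w - step) % n]) for w in range(n)]
        for w in range(n):
            c = (w - 1 - step) % n          # chunk worker w accumulates
            incoming = sends[(w - 1) % n]   # from its left neighbour
            chunks[w][c] = [a + b for a, b in zip(chunks[w][c], incoming)]

    # Allgather: circulate the reduced chunks for n-1 more steps so
    # every worker ends up with the complete summed gradient.
    for step in range(n - 1):
        sends = [list(chunks[w][(w + 1 - step) % n]) for w in range(n)]
        for w in range(n):
            chunks[w][(w - step) % n] = sends[(w - 1) % n]

    # Reassemble each worker's chunks into a flat gradient.
    return [[x for ch in chunks[w] for x in ch] for w in range(n)]

# Three simulated workers, each with a local gradient of four values:
grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
# every worker ends with the elementwise sum [111, 222, 333, 444]
```

The point of the ring topology is that bandwidth per worker stays constant as workers are added, instead of the parameter servers becoming a bottleneck.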
- A brief comparison of machine learning and deep learning.
- Motivation behind distributed deep learning (DDL) and challenges in the transition.
- Challenges in DDL: model and data parallelism for training models on large datasets efficiently.
- Implementation of distributed deep learning in frameworks such as TensorFlow and MXNet.
- Comparison and benchmarking of open source frameworks built on top of TensorFlow, such as TensorFlowOnSpark (TFoS) and Horovod.
- Open challenges in the currently available frameworks.