Distributed Machine Learning - Challenges and Opportunities
Submitted by Anand Chitipothu (@anandology) on Saturday, 10 June 2017
Section: Crisp talk for data engineering track
Technical level: Intermediate
Traditional machine learning libraries like scikit-learn in Python are written to work on a single computer. While that is good enough for small datasets, training ML models on large datasets often takes a very long time.
While tools like Apache Spark allow users to run ML algorithms on a cluster, they require a completely new way of doing things and come with the complexity of managing the infrastructure.
Tools like scikit-learn can’t be easily adapted to platforms like Spark because these platforms exploit data parallelism. That requires all the algorithms to be implemented with data parallelism in mind.
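To illustrate what "implemented with data parallelism in mind" can mean, here is a minimal sketch in plain Python. It assumes a simple least-squares line fit. The function names (`partial_stats`, `merge_stats`, `fit_line`) and the toy data are hypothetical and are not taken from Spark or scikit-learn. The point is that the algorithm has to be restated as per-partition summaries that can be merged, which a single-machine implementation does not need to do.

```python
from functools import reduce

def partial_stats(chunk):
    """Map step: summarise one partition of (x, y) pairs as sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def merge_stats(a, b):
    """Reduce step: statistics from two partitions combine by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept using only the merged per-partition statistics."""
    n, sx, sy, sxx, sxy = reduce(merge_stats, map(partial_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical toy data, split into two partitions as if stored on two nodes.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
slope, intercept = fit_line(partitions)
```

In this sketch only the small tuples of statistics would need to move between nodes, not the data itself. Many algorithms have no such convenient decomposition, which is part of why porting an existing library is hard.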
The other opportunity is to exploit the task parallelism inherently present in many ML workflows like hyperparameter optimization. Hyperparameter optimization involves training multiple models with different parameters and picking the best. Each of these models can be trained in parallel, and the training can be distributed across multiple nodes.
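A minimal sketch of this kind of task parallelism, using only the Python standard library, might look like the following. The toy model (a one-parameter ridge fit), the data, and the names `train_and_score` and `parallel_search` are assumptions made for illustration. A local thread pool stands in for the multiple nodes, and this is not how rorocloud, dask, or spark-sklearn implement it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (x, y) pairs for training and validation.
TRAIN = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
VALID = [(4.0, 4.0), (5.0, 5.0)]

def train_and_score(alpha):
    """Fit y = w * x with ridge penalty alpha; return (validation error, alpha, w)."""
    w = sum(x * y for x, y in TRAIN) / (sum(x * x for x, _ in TRAIN) + alpha)
    error = sum((y - w * x) ** 2 for x, y in VALID) / len(VALID)
    return error, alpha, w

def parallel_search(alphas, max_workers=4):
    """Train one model per candidate value concurrently and keep the best one."""
    # Each call is independent of the others, so the candidates can be
    # handed to any pool of workers without coordination between them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, alphas))
    return min(results)  # tuples compare by validation error first

best_error, best_alpha, best_w = parallel_search([0.0, 0.1, 1.0, 10.0])
```

Because every training run is independent, the existing single-machine training code can be reused unchanged. Only the step that hands out the candidates needs to know about workers, and in a real setup the pool would be replaced by processes or remote machines.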
There is some interesting work in this area, like dask from Continuum and spark-sklearn from Databricks. The dask project even provides tools to spawn a cluster on AWS to distribute the load. While these approaches look promising, they are not very accessible to data scientists as they come with the burden of managing infrastructure.
We at rorodata have been trying to find a way to simplify this and let data scientists run their ML training faster without worrying about infrastructure. We’ve built a tool called rorocloud, a serverless platform to run data science experiments, and built simple tools to achieve task parallelism on rorocloud.
In this talk I’m going to explore the opportunities to parallelize ML training workflows, discuss the challenges with the currently available options, and share the learnings from our experiments.
- Traditional machine learning and opportunities for parallelism
- Available solutions for distributed training and their challenges
- Our approach
- Learnings from our experiments
Anand has been crafting beautiful software for a decade and a half. He’s now building a data science platform, rorodata, which he recently co-founded. He regularly conducts advanced programming courses through Pipal Academy. He is a co-author of web.py, a micro web framework in Python. He has worked at Strand Life Sciences and the Internet Archive.