Distributed Computing Abstractions for Big Data Science
Vijay Srinivas Agneeswaran, Ph.D
The data science field has made significant advances in the last few years, with a renewed focus on getting data science to work at scale. This talk outlines the distributed computing abstractions required to realize data science at scale.
The Resilient Distributed Dataset (RDD) abstraction provided by Spark is becoming a de facto approach for big data science. However, Apache Flink and, more recently, Concord have emerged as interesting alternatives to Spark and provide streaming dataflow abstractions: while Spark can achieve real-time analytics by mini-batching, Flink treats event streaming as a first-class abstraction and provides exactly-once guarantees. TensorFlow also provides a dataflow abstraction for deep learning networks, and has recently released a distributed version that uses gRPC or integrates with cluster management systems such as Kubernetes.
Graph processing abstractions are useful in realizing complex algorithms on large, real-life, natural power-law graphs such as the Twitter or LinkedIn graphs. GraphLab and Titan are the prominent graph processing systems. GraphLab provides an efficient partitioning mechanism to split a large graph across a cluster of nodes and run algorithms at scale. Note that common machine learning algorithms such as clustering or classification, as well as deep learning, can be realized on top of graph processing abstractions. The Titan graph database integrates well with several NoSQL stores as data sources, including Cassandra and HBase, as well as with processing engines for machine learning, including Spark, Giraph and Hadoop.
We also outline our experience of implementing machine learning and deep learning algorithms over many of these abstractions.
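As a rough illustration of what splitting a large graph across a cluster involves, the sketch below assigns each edge to a partition by hashing it and then reports which partitions each vertex ends up on. It is a generic edge-hashing scheme written in plain Python and is not GraphLab's API or its actual partitioning algorithm; the names `partition_edges` and `vertex_replicas` are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source vertex, destination vertex)


def partition_edges(edges: List[Edge], num_parts: int) -> Dict[int, List[Edge]]:
    """Place each edge on exactly one partition using a deterministic hash."""
    parts: Dict[int, List[Edge]] = defaultdict(list)
    for src, dst in edges:
        # crc32 is used instead of hash() so the placement is stable across runs.
        key = f"{src}->{dst}".encode("utf-8")
        parts[zlib.crc32(key) % num_parts].append((src, dst))
    return dict(parts)


def vertex_replicas(parts: Dict[int, List[Edge]]) -> Dict[str, Set[int]]:
    """Map each vertex to the set of partitions holding at least one of its edges."""
    replicas: Dict[str, Set[int]] = defaultdict(set)
    for part_id, part_edges in parts.items():
        for src, dst in part_edges:
            replicas[src].add(part_id)
            replicas[dst].add(part_id)
    return dict(replicas)


if __name__ == "__main__":
    # A tiny star-like graph: "hub" stands in for a high-degree vertex.
    edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
    parts = partition_edges(edges, num_parts=2)
    for vertex, where in sorted(vertex_replicas(parts).items()):
        print(vertex, sorted(where))
```

In a power-law graph, a few vertices have very many edges, so a high-degree vertex such as `hub` may be present on several partitions; the number of partitions per vertex is one simple measure of how much cross-node coordination a given split implies.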
The key audience takeaways include:
Implementation details of machine learning algorithms over several distributed computing frameworks such as Spark, GraphLab, Flink and TensorFlow.
State-of-the-art review of big data science: from distributed TensorFlow to Dato to Flink, the audience gets a feel for cutting-edge technology in the field.
Discussion of the pros and cons of similar frameworks and when to use them. For instance, the trade-offs between Apache Spark and Flink and when to use one over the other: use Flink if you need low-latency, event-specific processing, or use Spark Streaming when you need high-throughput processing that does not require CEP (complex event processing). Similarly, the trade-offs between GraphLab and Titan and when to use one over the other.
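The latency side of the mini-batching versus event-at-a-time trade-off can be made concrete with a toy model. The sketch below is plain Python, not Spark or Flink code, and the function names and the `(arrival_time, payload)` event shape are assumptions made for illustration: in the micro-batch model an event waits until its batch window closes, while in the per-event model it is handled on arrival.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload); hypothetical shape


def per_event_latencies(events: Iterable[Event], cost: float = 0.0) -> List[float]:
    """Event-at-a-time model: each event is handled on arrival.

    `cost` is an assumed fixed per-event processing time.
    """
    return [cost for _ in events]


def micro_batch_latencies(events: Iterable[Event], interval: float) -> List[float]:
    """Micro-batch model: an event is handled when its batch window closes."""
    latencies = []
    for arrival, _payload in events:
        # The window containing `arrival` closes at the next multiple of `interval`.
        window_end = (int(arrival // interval) + 1) * interval
        latencies.append(window_end - arrival)
    return latencies


if __name__ == "__main__":
    events = [(0.0, "a"), (0.25, "b"), (0.75, "c"), (1.5, "d")]
    print("per-event  :", per_event_latencies(events))
    print("micro-batch:", micro_batch_latencies(events, interval=1.0))
```

The model ignores throughput, which is where batching tends to help: handling many events per scheduling step amortizes fixed overheads, at the price of the waiting time shown here.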
- Introduction to Apache Spark, Flink. ML/Deep Learning on top of Spark/Flink with code.
- Introduction to TensorFlow - distributed deep learning.
- Introduction to GraphLab/Titan - ML/deep learning on top of GraphLab/Titan with code.
Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras (2001), a PhD from IIT Madras (2008) and a post-doctoral research fellowship in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL). He has joined the data sciences team of SapientNitro as Director of Technology. He has spent the last ten years creating intellectual property and building products in the big data area at Oracle, Cognizant and Impetus. He has built PMML support into Spark/Storm and realized several machine learning algorithms, such as LDA and Random Forests, over Spark. He led a team that designed and implemented a big data governance product for role-based, fine-grained access control inside Hadoop YARN. He and his team have also built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and the IEEE (Senior) for more than 10 years. He has four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, data sciences, big data and other emerging technologies. He has been an invited speaker at several national and international conferences, such as O’Reilly’s Strata big data conference series. He lives in Bangalore with his wife, son and daughter and enjoys researching the history and philosophy of Egypt, Babylonia, Greece and India.
- ACM Distinguished Speaker
- Big Data Analytics Beyond Hadoop book
- O’Reilly Strata Conference presentation:
- Strata conf video:
- Video of big data beyond Hadoop Webinar
- LinkedIn Profile
- US Patents:
- Keynote speaker at the fifth elephant conference 2014 -