MapReduce and the "Art of Thinking Parallel"
Submitted by Shailesh Kumar (@shkumar) on Tuesday, 23 April 2013
Analytics and Visualization
The goal of the session is to take the audience from the "MECHANICS of using MapReduce" (to do simple slicing and dicing of BigData) to the "ART of using MapReduce" to solve more complex problems that at first glance look "unnatural" for MapReduce!
In this session we will:
Introduce the MapReduce framework from scratch (if needed)
Highlight some limitations of MapReduce through kinds of parallelization that cannot be done "naturally" in MapReduce (e.g. joins)
Develop insights into how to transform such problems so they can be solved using MapReduce
Solve some "beautiful" problems in MapReduce (e.g. finding all maximal cliques in a graph)
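As a rough illustration of the join case mentioned in the list above, here is a minimal sketch of a reduce-side join. It is not taken from the session itself: `run_mapreduce` is a hypothetical single-process stand-in for a real MapReduce runtime, and the `users` and `orders` tables are made-up sample data. The idea is to key both tables by the join column and tag each value with its source table, so the reducer can pair the two sides.

```python
from collections import defaultdict
from itertools import product


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical sample data: each record is tagged with the table it came from.
users = [("users", (1, "asha")), ("users", (2, "ravi"))]
orders = [("orders", (1, "book")), ("orders", (1, "pen")), ("orders", (2, "lamp"))]


def join_mapper(record):
    table, (user_id, payload) = record
    # Key by the join column; keep the table tag so the reducer can split the sides.
    yield user_id, (table, payload)


def join_reducer(user_id, tagged_values):
    names = [payload for table, payload in tagged_values if table == "users"]
    items = [payload for table, payload in tagged_values if table == "orders"]
    # Every user row pairs with every order row that shares the same key.
    for name, item in product(names, items):
        yield (user_id, name, item)


joined = run_mapreduce(users + orders, join_mapper, join_reducer)
for row in joined:
    print(row)
```

The join is "unnatural" in the sense that a mapper sees one record at a time and cannot look across tables; the tagging step is what moves the pairing work into the reducer.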
The objective is to explore deeper insights into using the MapReduce framework.
MapReduce is a widely used framework for large-scale number crunching in BigData analytics. While it is quite general, it is not universal. Many analytics problems cannot be ported to the MapReduce framework "naturally" (e.g. finding the similarity between all pairs of documents in their Bag-of-Words representation).
In this talk, through a series of such problems, we will highlight both the limitations of MapReduce and how to overcome them by being "smart" about transforming those problems to make them more amenable to MapReduce.
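To make the idea of "transforming a problem" concrete, here is a hedged sketch for the all-pairs document similarity example. It uses a commonly described inverted-index formulation, which is not necessarily the approach presented in the talk. The helper `run_mapreduce`, the sample `docs`, and the use of a plain dot product as the similarity measure are all assumptions made for illustration. Instead of comparing every pair of documents directly, the first job groups documents by term and emits one partial product per pair sharing that term; the second job sums the partial products per pair.

```python
from collections import Counter, defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical sample corpus: (document id, text).
docs = [("d1", "big data big ideas"), ("d2", "big data tools"), ("d3", "graph tools")]


def index_mapper(record):
    doc_id, text = record
    # Bag-of-Words: emit each term with the document id and its term count.
    for term, count in Counter(text.split()).items():
        yield term, (doc_id, count)


def index_reducer(term, postings):
    # Only documents that share this term contribute to each other's similarity.
    for (doc_a, count_a), (doc_b, count_b) in combinations(sorted(postings), 2):
        yield ((doc_a, doc_b), count_a * count_b)


def sum_mapper(record):
    pair, partial = record
    yield pair, partial


def sum_reducer(pair, partials):
    # Summing the per-term products gives the dot product of the two count vectors.
    yield (pair, sum(partials))


partials = run_mapreduce(docs, index_mapper, index_reducer)
similarity = dict(run_mapreduce(partials, sum_mapper, sum_reducer))
print(similarity)
```

Pairs of documents with no term in common never appear in the output, so the work scales with the number of co-occurrences rather than with the square of the corpus size. Very frequent terms still produce large reducer groups, which a real job would have to handle separately.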
As a concrete example, we will develop an end-to-end solution in MapReduce for a very important and NP-hard graph theory problem: finding all maximal cliques in a graph.
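The abstract does not say how the maximal clique solution is built, so the following is only one possible sketch, not the speaker's method. It assumes an undirected graph given as an edge list with comparable vertex ids, and uses a hypothetical in-memory `run_mapreduce` helper. The first job builds adjacency lists. The second job sends each vertex's adjacency list to the vertex itself and to all its neighbours, so that each reducer holds a full one-hop neighbourhood and can run the Bron-Kerbosch recursion locally. Each clique is reported only by its smallest vertex, which avoids duplicates.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical sample graph: a triangle 1-2-3 with a tail 3-4-5.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def adjacency_mapper(edge):
    u, v = edge
    yield u, v
    yield v, u


def adjacency_reducer(vertex, neighbours):
    yield (vertex, frozenset(neighbours))


def neighbourhood_mapper(record):
    vertex, neighbours = record
    # Each reducer needs its own adjacency list and those of all its neighbours.
    yield vertex, (vertex, neighbours)
    for other in neighbours:
        yield other, (vertex, neighbours)


def bron_kerbosch(clique, candidates, excluded, adjacency):
    # A clique is maximal when no vertex is left that could extend it.
    if not candidates and not excluded:
        yield tuple(sorted(clique))
        return
    for vertex in sorted(candidates):
        yield from bron_kerbosch(
            clique | {vertex},
            candidates & adjacency[vertex],
            excluded & adjacency[vertex],
            adjacency,
        )
        candidates = candidates - {vertex}
        excluded = excluded | {vertex}


def clique_reducer(vertex, neighbourhood):
    adjacency = dict(neighbourhood)
    # Extend only with larger ids; smaller ids go to the excluded set so that
    # cliques owned by a smaller vertex are not reported again here.
    later = frozenset(n for n in adjacency[vertex] if n > vertex)
    earlier = frozenset(n for n in adjacency[vertex] if n < vertex)
    yield from bron_kerbosch(frozenset([vertex]), later, earlier, adjacency)


adjacency = run_mapreduce(edges, adjacency_mapper, adjacency_reducer)
cliques = run_mapreduce(adjacency, neighbourhood_mapper, clique_reducer)
print(cliques)
```

The per-reducer recursion can still take exponential time on dense neighbourhoods, and high-degree vertices receive large inputs, so this sketch shows the shape of the decomposition rather than a scalable implementation. Isolated vertices do not appear in an edge list and are therefore not reported.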
A basic understanding of MapReduce and the complexity of algorithms would be helpful but is not required.
Dr. Shailesh Kumar is a Member of Technical Staff at Google, Hyderabad, where he works on large-scale data mining problems for various Google products. Prior to joining Google, he worked as a Principal Dev. Manager at Microsoft (Bing) Hyderabad, Sr. Scientist at Yahoo! Labs Bangalore, and Principal Scientist at Fair Isaac Research in San Diego, USA.
Dr. Kumar has over fifteen years of experience in applying and innovating machine learning, statistical pattern recognition, and data mining algorithms to hard prediction problems in a wide variety of domains including information retrieval, web analytics, text mining, computer vision, retail data mining, risk and fraud analytics, remote sensing, and bioinformatics. He has published over 20 conference papers, journal papers, and book chapters and holds over a dozen patents in these areas.
He has two keen passions: first, creating "magic from data", and second, understanding functionally how the brain works!
Dr. Kumar received his PhD in Computer Engineering in 2000 (with a specialization in statistical pattern recognition and data mining) and his Master's in Computer Science in 1997 (with a specialization in artificial intelligence and machine learning), both from the University of Texas at Austin, USA. He received his B.Tech. in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University in 1995.