The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

MapReduce and the "Art of Thinking Parallel"

Submitted by Shailesh Kumar (@shkumar) on Tuesday, 23 April 2013

Section: Analytics and Visualization Technical level: Advanced


The goal of the session is to take the audience from the "MECHANICS of using MapReduce" (to do simple slicing and dicing of BigData) to the "ART of using MapReduce" to solve more complex problems that at first glance look "unnatural" for MapReduce!

In this session we will:

  • Introduce MapReduce framework from scratch (if needed)

  • Highlight some limitations of MapReduce via types of parallelizations that cannot be done "naturally" in MapReduce (e.g. Joins)

  • Develop insights into how to transform such problems so they can be solved using MapReduce

  • Solve some "beautiful" problems in MapReduce (e.g. finding all maximal cliques in a graph)

The objective is to explore deeper insights in using MapReduce framework.


MapReduce is a ubiquitously used framework for largescale number crunching in BigData analytics. While it is quite general, it is not universal. There are a lot of analytics problems that cannot be ported to the MapReduce framework "naturally" (e.g. finding similarity between all pairs of documents in their Bag-of-Words representation).

In this talk, through a series of such problems, we will highlight both the limitations of MapReduce and how to overcome those limitations by being "smart" about "transforming those problems" to be more "amenable to MapReduce".

As a concrete example we will develop an end-to-end solution in MapReduce for a very important and NP-hard Graph Theory problem - finding all Maximal Cliques in a graph.


Basic understanding of MapReduce and Complexity of Algorithms would be helpful but not required.

Speaker bio

Dr. Shailesh Kumar is a Member of Technical Staff at Google, Hyderabad where he works on large scale data mining problems for various Google products. Prior to joining Google, he has worked as a Principal Dev. Manager at Microsoft (Bing) Hyderabad, Sr. Scientist at Yahoo! Labs Bangalore, and Principal Scientist at Fair Isaac Research in San Diego, USA.

Dr. Kumar has over fifteen years of experience in applying and innovating machine learning, statistical pattern recognition, and data mining algorithms to hard prediction problems in a wide variety of domains including information retrieval, web analytics, text mining, computer vision, retail data mining, risk and fraud analytics, remote sensing, and bioinformatics. He has published over 20 conference papers, journal papers, and book chapters and holds over a dozen patents in these areas.

He has two keen passions - first creating "magic from data" and second understanding functionally how the brain works!

Dr. Kumar received his PhD in Computer Engineering in 2000 (with a specialization in statistical pattern recognition and data mining) and Masters in Computer Science in 1997 (with a specialization in artificial intelligence and machine learning), both from the University of Texas at Austin, USA. He received his B.Tech. in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University in 1995.



  • Sugandha Naolekar (@techinsane) 6 years ago (edited 6 years ago)

    Really looking forward to it…!! And I am currently focussing on doing some geo-processing using hadoop-mapreduce! Thus, this session will be really helpful!

  • t3rmin4t0r (@t3rmin4t0r) 6 years ago

    Too often, I find people conflating “Thinking in parallel” to “Try to share-nothing”, which naturally confined to only a small slice of parallelizable problems. I wanted to know more about the talk here, to know whether the talk is slanted towards the algorithmically curious or about exploring the limitations of MR.

    I have often hit limitations of the share-nothing architecture in hadoop, but eventually it has lead to methods to work-around that limitation with concurrent k-v stores with atomic adds, atomic swaps or built-in atomic counters.

    It is certainly not infinitely scaleable to do so in systems with skews, but I have found it interesting to distill algorithms till they fit this hybrid model, which has often left the box of “true MR parallelism”, but certainly scaling the problem upto hundreds of nodes or more (say, adding absolute rank to sorted data).

    Are you trying to stay within the limitations of algorithmic rewrites for MR or are you going to spill over into the hacky territory surrounding MR.

  • Shailesh Kumar (@shkumar) Proposer 6 years ago

    This talk is really more slanted towards the algorithmically curious more than the hacky territory surrounding MR. We will stay with the traditional MR and algorithmic rewrites. Think about finding all maximal cliques in an iterative mapreduce without the hacks! What kind of problem transformations would make such a problem map-reduce friendly? essentially, what would be the map and what would be the reduce? For the purpose of this talk, the “Art of Thinking Parallel” is really the “Art of designing the Map and Reduce functions” for your not so MapReduce friendly problems :-)

Login with Twitter or Google to leave a comment