Co-occurrence Analytics: A versatile framework for finding interesting needles in crazy haystacks!
Submitted by Shailesh Kumar (@shkumar) on Tuesday, 9 April 2013
Section: Analytics and Visualization
Technical level: Advanced
In this session we will learn about a new way of thinking about data mining and big data analytics, "Co-occurrence Analytics" - a unified framework for mining latent insights in a wide variety of data of the form: "relationships between entities". We will show how the framework can be used to discover...
Logical Product bundles in retail market basket data - a significant departure from the traditional frequent item-set mining,
Meaningful multi-word units or phrases in text data - a significant departure from the traditional n-gram-based language models,
Semantic concepts in tag networks - a significant departure from the traditional modularity-based community detection algorithms, and
Hierarchies of visual objects in images - a significant departure from the traditional bag-of-visual-words approach to understanding images.
Most data around us can be thought of as "things co-occurring with other things in certain contexts": products co-occurring with other products in retail market baskets, words occurring before or after other words in unstructured text, tags co-occurring with other tags in social tagging systems, people co-occurring with other people in various social networking scenarios, or objects occurring in various 2-D geometrical juxtapositions with other objects in images.
While each research community - retail, text, social networking, vision, and so on - has worked in its own silo on "its" data, there has been no unifying framework to tame such a wide variety of co-occurrence data systematically. Such a framework is the theme of this session.
We will present a simple, intuitive, yet powerful co-occurrence analytics framework to deal with a wide variety of data of the form "things co-occurring with other things in some context". After describing the framework, we will demonstrate how to adapt and apply its core principles to a variety of large real-world datasets to find novel and actionable insights even in the presence of significant noise in the data.
What makes this approach attractive is that it is:
(1) Unsupervised: No cost of getting labeled data. Just point it to the data and crunch.
(2) Unbiased: No prior assumptions about data distributions, etc.
(3) High Precision: Generates very high quality insights.
(4) High Recall: Generates exhaustively many insights.
(5) Parameter Poor: Very few parameters to play with.
(6) Scalable: Highly parallelizable in the MapReduce sense.
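To make the idea of "things co-occurring with other things in some context" concrete, here is a minimal sketch in Python. It is an illustration only, not the framework presented in the session: it assumes the context is a retail market basket, and it scores each item pair with pointwise mutual information (PMI), a standard information-theoretic measure chosen here as a stand-in because the abstract does not say which measure the framework uses. The function name `cooccurrence_pmi` and the toy baskets are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(baskets):
    """Score every item pair seen together in at least one basket with PMI.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), where probabilities are
    estimated as the fraction of baskets containing the item(s).
    Assumption: PMI is an illustrative choice, not the session's method.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical (a, b) order.
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n
        p_a = item_counts[a] / n
        p_b = item_counts[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Toy data (hypothetical): four market baskets.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["milk", "cereal", "bread"],
]
scores = cooccurrence_pmi(baskets)

# List pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A positive score means the pair appears together more often than independence would predict, and a negative score means less often. Because the counting step is a per-basket sum, it splits naturally into a map phase (emit item and pair counts per basket) and a reduce phase (add the counts up), which is one way to read the "parallelizable in the MapReduce sense" property.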
While the session will use some basic concepts from probability theory, information theory, graph theory, visualization, and data mining, it will be self-contained, and no prior background in any of these areas is assumed.
Dr. Shailesh Kumar is a Member of Technical Staff at Google, Hyderabad, where he works on large-scale data mining problems for various Google products. Prior to joining Google, he worked as a Principal Dev. Manager at Microsoft (Bing) Hyderabad, Sr. Scientist at Yahoo! Labs Bangalore, and Principal Scientist at Fair Isaac Research in San Diego, USA.
Dr. Kumar has over fifteen years of experience in applying and innovating machine learning, statistical pattern recognition, and data mining algorithms to hard prediction problems in a wide variety of domains including information retrieval, web analytics, text mining, computer vision, retail data mining, risk and fraud analytics, remote sensing, and bioinformatics. He has published over 20 conference papers, journal papers, and book chapters and holds over a dozen patents in these areas.
He has two keen passions: first, creating "magic from data", and second, understanding functionally how the brain works!
Dr. Kumar received his PhD in Computer Engineering in 2000 (with a specialization in statistical pattern recognition and data mining) and his Master's in Computer Science in 1997 (with a specialization in artificial intelligence and machine learning), both from the University of Texas at Austin, USA. He received his B.Tech. in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University in 1995.