The Fifth Elephant 2014

A conference on big data and analytics

Viraj Paripatyadar

@virajparipatyadar

Machine learning + Interactive visualization: A pragmatic approach to fixing knowledge bases

Submitted Jun 1, 2014

We explore how recommenders and visualization can help fix problems inherent to knowledge bases. We tackle one such problem: incorrect or missing assignment of tags to articles. We also demonstrate how off-the-shelf software from the Hadoop ecosystem can be used to improve the richness of this data through processing and visualization, and we illustrate the concept with a couple of examples.

Outline

Making sense of knowledge bases

Many publicly available knowledge bases offer extremely rich data sets: they are almost entirely human-edited, written in natural language, and tagged using a system of tags that is likely not curated. We wanted to explore whether we could improve the richness of this tagging using software from the Hadoop ecosystem, and make the results easier to comprehend through visualization.

Relating tags

To begin with, we defined a relationship between tags and, through a series of operations, extracted numerical data about this relationship from the knowledge base. We then used a recommender to suggest further relations between tags. Finally, we visualized all of the results as an interactive graph to make them quicker to understand.
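As a rough illustration of these steps, the sketch below assumes the relationship between two tags is simple co-occurrence on the same article; the proposal does not say which relationship was actually used. The toy `ARTICLES` data, the function names, and the cosine-style scoring are all hypothetical, and the scoring is a plain-Python stand-in for an item-based recommender rather than a use of Apache Mahout.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy data: article title -> set of tags (not from the talk).
ARTICLES = {
    "Apache Hadoop": {"big data", "distributed computing", "java"},
    "MapReduce": {"big data", "distributed computing"},
    "Apache Pig": {"big data", "java"},
    "Graph database": {"databases", "graph theory"},
    "Neo4j": {"databases", "graph theory", "java"},
}


def cooccurrence(articles):
    """Count articles per tag and articles per unordered tag pair."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in articles.values():
        tag_counts.update(tags)
        # Sorting gives every pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(tags), 2):
            pair_counts[(a, b)] += 1
    return tag_counts, pair_counts


def similarity(tag_counts, pair_counts):
    """Cosine similarity, treating each tag as a binary vector over articles."""
    return {
        (a, b): n / sqrt(tag_counts[a] * tag_counts[b])
        for (a, b), n in pair_counts.items()
    }


def suggest_tags(article_tags, sims, top_n=3):
    """Rank tags the article lacks by summed similarity to the tags it has."""
    scores = Counter()
    for (a, b), s in sims.items():
        if a in article_tags and b not in article_tags:
            scores[b] += s
        elif b in article_tags and a not in article_tags:
            scores[a] += s
    return scores.most_common(top_n)


if __name__ == "__main__":
    tag_counts, pair_counts = cooccurrence(ARTICLES)
    sims = similarity(tag_counts, pair_counts)
    print(suggest_tags(ARTICLES["MapReduce"], sims))
```

On this toy data the only tag suggested for "MapReduce" is "java", because it co-occurs with both of the article's existing tags. On a real knowledge base the counting step is the part that would be distributed across a cluster; the logic stays the same.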

When we did this for Wikipedia, it helped us spot some interesting existing relationships, as well as some new ones that are missing and should be edited in. We are in the process of trying this out on another such knowledge base.
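To give a feel for how weighted tag relations could be handed to an interactive graph, the sketch below converts a mapping of tag pairs to weights into a node-link structure serialized as JSON. The function name `to_node_link`, the `min_weight` cut-off, and the field names (`nodes`, `links`, `source`, `target`, `value`) are assumptions modelled on a layout commonly seen in D3 force-graph examples, not the format used in the talk.

```python
import json


def to_node_link(weights, min_weight=0.0):
    """Convert {(tag_a, tag_b): weight} into a node-link dict for a graph view."""
    # Drop weak relations so the rendered graph stays readable.
    kept = {pair: w for pair, w in weights.items() if w >= min_weight}
    # Every tag that still appears in at least one relation becomes a node.
    names = sorted({tag for pair in kept for tag in pair})
    nodes = [{"id": name} for name in names]
    links = [
        {"source": a, "target": b, "value": w}
        for (a, b), w in sorted(kept.items())
    ]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    # Hypothetical weights, for illustration only.
    example = {("big data", "java"): 0.67, ("databases", "java"): 0.41}
    print(json.dumps(to_node_link(example, min_weight=0.5), indent=2))
```

Filtering by weight before export is a design choice: a browser-side graph with every weak relation drawn quickly becomes too dense to explore.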

The talk

In the session, we will cover how we came up with the problem and how we solved it. We will talk about the details of the examples and some interesting results we came across as we played around with the visualization. Here is a list of the tools we used, although we will discuss them only briefly during the talk:

  • Hadoop MapReduce
  • Pig
  • Apache Mahout
  • Neo4j
  • D3.js

Speaker bio

Viraj is a Software Architect at GS Lab. For the last 8 years at GS Lab, he has worked in the area of Web Applications. His current areas of focus are Data Analytics and Visualization, Design/Development of Scalable Web Applications, and exploring Data Analytics use cases in Web-based products. As part of this effort, he led a team of engineers to develop a social news reader application with a recommendation engine that suggests news and products based on users’ reading habits, social relations and likes.

Prior to Analytics, he worked in Web Application Security, developing Web-based attacks for an enterprise Web security assessment product. He holds an M.Tech. in Computer Science from IIT-Kharagpur and, before that, earned an M.Sc. in Mathematics from the University of Pune.

