The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Making Sense of Content in Domain-Intense QA/Discussion Forums: A Text Mining Problem


Pramod N Haritsa


StackOverflow: imagine a world without such user collaboration and moderation for maintaining and retrieving information from discussion forums. In this talk we’ll see how one can make sense of content in QA/discussion forums using a palette of text processing techniques.

The talk aims to showcase how one can use text processing (NLP or otherwise) and statistics to arrive at insights.


To identify how credible a user’s answer is, how difficult a discussion is, what a discussion thread is about, and so on, we currently rely on a lot of collaboration from the users and moderators who maintain such forums. At times, the information exchange happens through more unstructured and informal media such as mailing lists and groups.

The talk aims to answer parts of the following questions.

How can we associate a discussion thread with the key themes?

How can we associate a person with the themes he/she has interacted with?

How can we measure the domain difficulty level of a discussion thread?

How can we allocate StackOverflow-style user ratings by just looking at the content of a user’s response?

How can we collate this information for future queries, recommendations, etc.?

Challenges in analysing unstructured (speech-form) text.

Simple, yet effective statistical techniques to derive insights.

Simple visualisation.

iAdler - a prototype application for extracting insights from mailing lists.
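
As a rough illustration of the first question above (associating a discussion thread with its key themes), here is a minimal sketch that uses TF-IDF scoring, one common and simple statistical technique. It is not taken from the talk or from iAdler: the function names, the tiny stop-word list, and the sample threads are all hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "to", "in", "of", "and", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def key_themes(threads, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each thread."""
    docs = [Counter(tokenize(t)) for t in threads]
    n_docs = len(docs)
    # Document frequency: the number of threads in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    themes = []
    for doc in docs:
        total = sum(doc.values()) or 1
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        themes.append(ranked[:top_n])
    return themes


# Made-up sample threads, not real forum data.
threads = [
    "JVM garbage collection tuning: JVM heap grows until garbage collection stalls.",
    "Hadoop reducer fails: Hadoop cluster reports reducer heap errors.",
    "Python pandas merge is slow: pandas merge on large frames uses heap.",
]

for thread_themes in key_themes(threads):
    print(thread_themes)
```

In this toy data the word "heap" occurs in every thread, so its inverse document frequency is log(3/3) = 0 and it never surfaces as a theme, while terms concentrated in a single thread rank highest. Real forum or mailing-list text would need more careful tokenisation and a much larger corpus.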

The talk aims to highlight the idea of including text in any insight derivation, and not just the collaborative/user information. (Includes a section on how numbers can be misleading at times.)

Speaker bio

Pramod is currently working as an Application Developer in the Analytics Initiative at Thoughtworks Inc. He spent a major part of his academic career working as a research assistant under Dr Srinivasa K G and has experience in applying machine learning techniques to various computer science problems, including network data, speech processing, text processing and NLP. His research interests include text and data mining, machine learning, distributed systems and game theory.