The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Making Sense of content in domain intense QA/Discussion Forums- A Text Mining Problem

Submitted by Pramod N Haritsa (@machinelearner) on Wednesday, 29 May 2013

Section: Analytics and Visualization

Technical level: Beginner


Stack Overflow: imagine a world without such user collaboration and moderation for maintaining discussion forums and retrieving information from them. In this talk we'll see how one can make sense of content in QA/discussion forums using a palette of text processing techniques.

The talk aims to showcase how one can use text processing (NLP or otherwise) and statistics to arrive at insights.


To identify how credible a user's answer is, how difficult a discussion is, what a thread is about, and so on, we currently rely on a lot of collaboration from the users and moderators who maintain such forums. At times, the information exchange happens through more unstructured and informal media such as mailing lists and groups.

The talk aims to answer parts of the following questions.

How can we associate a discussion thread with the key themes?

How can we associate a person with the themes he/she has interacted with?

How can we measure the domain difficulty level of a discussion thread?

How can we allocate Stack Overflow-style user ratings just by looking at the content of a user's response?

How can we collate this information for future queries, recommendations, etc.?

Challenges in analysing unstructured (speech-form) text.

Simple, yet effective statistical techniques to derive insights.

Simple visualisation.

iAdler - a prototype application for extracting insights from mailing lists.

The talk aims to highlight the idea of including text in any insight derivation, rather than relying only on collaborative/user information. (It includes a section on how numbers can be misleading at times.)
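As a rough illustration of what a "simple statistical technique" for linking a thread to its key themes might look like, here is a minimal sketch that ranks the terms of each thread by TF-IDF. This is an assumption made for illustration only and is not the method used by iAdler; the function names, the stop-word list and the sample threads are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept deliberately tiny for illustration.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "in", "on", "and",
              "my", "i", "how", "do", "with", "it", "for"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(threads, k=3):
    """Return the k highest-scoring TF-IDF terms for each thread.

    threads: dict mapping a thread id to the concatenated text of that thread.
    """
    tokens = {tid: tokenize(text) for tid, text in threads.items()}
    n_threads = len(tokens)

    # Document frequency: in how many threads does each term occur?
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    result = {}
    for tid, toks in tokens.items():
        tf = Counter(toks)
        # Term frequency times inverse document frequency.
        scores = {t: (c / len(toks)) * math.log(n_threads / df[t])
                  for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        result[tid] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


# Made-up sample threads, for illustration only.
sample_threads = {
    "t1": "namenode fails to start, namenode log shows error",
    "t2": "reducer job is slow, reducer memory error",
}
print(top_terms(sample_threads, k=1))
```

Terms that appear in every thread (here "error") get a score of zero, so the ranking favours words that set one thread apart from the others.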

Speaker bio

Pramod is currently working as an Application Developer in the Analytics Initiative at Thoughtworks Inc. He spent a major part of his academic career working as a research assistant under Dr Srinivasa K G and has experience in applying machine learning techniques to various computer science problems, including network data, speech processing, text processing and NLP. His research interests include text and data mining, machine learning, distributed systems and game theory.


  • Joydeep Sen Sarma (@jsensarma) 6 years ago

    Is there a live demo on the work done somewhere? (Can I just point to a stackoverflow page and get some insights?)

    • Pramod N Haritsa (@machinelearner) Proposer 6 years ago

      We are currently in the final stages of development. The ability to point out important conversations in a thread is being built as we speak.
      The application currently works on public mailing lists (any mbox/threadable Atom source); support for Stack Overflow is something we could include in future.
      During the talk I will showcase the answers to the above questions (and more) using Hadoop-User mailing list data as an example.
      The emphasis is on showing that there is a lot of useful information in text and that it's possible to extract it using simple statistical methods.

      A few of the methods that the application uses include Correspondence Analysis for theme extraction and supervised learning (one-class SVM and Bayes) for difficulty classification, question identification, etc.
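To make the "Bayes for question identification" idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the Python standard library only. It is an assumption about what such a classifier could look like, not the application's actual code; the class name, the tokenizer and the toy training sentences are hypothetical and made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split into lower-case words, keeping '?' as its own token
    because it is a strong cue for questions."""
    return re.findall(r"[a-z']+|\?", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokenize(text):
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data, for illustration only.
train = [
    ("How do I configure the namenode?", "question"),
    ("Why does my job fail with this error?", "question"),
    ("You need to set the property in the config file.", "statement"),
    ("Restarting the datanode fixed it for me.", "statement"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("Why is the reducer slow?"))
```

A real classifier would need far more labelled messages than this; the sketch only shows the shape of the computation (class priors plus smoothed per-word likelihoods).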
