The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

Implementing a Named-Entity Recognizer on Twitter Data and Using It to Cluster Similar Tweets

Submitted by Abhishek Vaid (@vaidabhishek) on Tuesday, 4 June 2013

Technical level

Intermediate

Section

Analytics and Visualization

Status

Submitted

Total votes:  +8

Objective

  • To understand how well real-world entities can be extracted from tweets using freely available NER engines and POS taggers.
  • To apply NER output to cluster Twitter data, aggregating contextually similar tweets.
  • To discuss the challenges and limitations of current methods.

Description

This talk covers how we currently detect duplicate or contextually similar tweets, based on their content, for frrole.com. To reach respectable accuracy, we use POS taggers and NER modules published by research groups at the University of Washington and Carnegie Mellon University. By integrating these tools and applying a few additional algorithmic hacks, we achieve fairly good accuracy. The talk focuses on the design decisions we made and the challenges we solved along the way.
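To make the idea of clustering on NER output concrete, here is a minimal sketch in Python. It is not the frrole pipeline: it assumes each tweet has already been reduced to a set of entity strings by some NER module, and it swaps in a simple technique (Jaccard similarity over entity sets, followed by connected components via union-find). The function names, the sample data, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_entities(tweet_entities, threshold=0.5):
    """Group tweet ids whose entity sets are similar enough.

    tweet_entities: dict mapping tweet id -> set of entity strings
                    (assumed to come from an upstream NER step).
    threshold:      illustrative cut-off, not a tuned value.
    Returns a sorted list of clusters, each a sorted list of tweet ids.
    """
    # Union-find forest: every tweet starts in its own cluster.
    parent = {tid: tid for tid in tweet_entities}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Treat tweets as graph nodes and link any sufficiently similar pair.
    # This is O(n^2) in the number of tweets, fine only for a sketch.
    for a, b in combinations(tweet_entities, 2):
        if jaccard(tweet_entities[a], tweet_entities[b]) >= threshold:
            parent[find(a)] = find(b)

    # Each connected component is one cluster.
    clusters = {}
    for tid in tweet_entities:
        clusters.setdefault(find(tid), []).append(tid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical entity sets, as an NER step might produce them.
sample = {
    "t1": {"obama", "white house"},
    "t2": {"obama", "white house", "syria"},
    "t3": {"ipl", "mumbai indians"},
    "t4": {"mumbai indians", "ipl"},
    "t5": set(),
}
print(cluster_by_entities(sample))
```

With this sample, tweets sharing most of their entities end up in the same cluster, while a tweet with no recognized entities stays on its own. A pairwise comparison like this would not scale to a live Twitter stream; a real system would need some form of indexing or blocking on entities first.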

Requirements

This is not a workshop, but to appreciate the talk, participants should have a basic grasp of the following topics:
1.) Algorithms and data structures.
2.) MongoDB or any other JSON-based NoSQL database.
3.) Python, Java, or any other modern programming language.
4.) Some graph theory basics.
5.) Some idea of what NLP and text mining are.

Speaker bio

I am currently the technical lead at frrole.com. In the last 4 months, I have implemented a clustering pipeline for frrole's Twitter stream. In doing so, I solved some really interesting problems and made some interesting design decisions. The tools I used are mostly libraries and modules published by research groups at leading universities. I hold a bachelor's and a master's degree from IIITM, Gwalior, and have spent some time teaching graduate and undergraduate courses. I'm also an avid MOOCoholic and enjoy learning new technologies from time to time.

Slides

http://blog.frrole.com/post/43482047103/latest-from-technology-frrole-2-0
