Implementing a Named-Entity Recognizer on Twitter Data and Using It to Cluster Similar Tweets
Submitted by Abhishek Vaid (@vaidabhishek) on Tuesday, 4 June 2013
Section: Analytics and Visualization
Technical level: Intermediate
- To understand the scope for extracting real-world entities from tweets using freely available NER engines and POS tagging.
- To apply NER output to cluster Twitter data, aggregating contextually similar tweets.
- To discuss the challenges and limitations of current methods.
This talk covers how we currently detect duplicate or contextually similar tweets, based on their content, for frrole.com. To reach respectable levels of accuracy, we use POS taggers and NER modules published by research groups at the University of Washington and Carnegie Mellon University. By integrating these tools and applying a few additional algorithmic hacks, we achieve fairly good accuracy. The talk walks through the design decisions we made and the challenges we solved along the way.
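As a rough illustration of the idea, the sketch below groups tweets whose extracted entities overlap. It is not the frrole pipeline: the input dictionary stands in for NER output, and the function names, the Jaccard similarity measure, the 0.5 threshold, and the sample data are all assumptions made for this example. Tweets are treated as graph nodes, an edge joins two tweets whose entity sets are similar enough, and each connected component becomes a cluster.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_entities(tweet_entities, threshold=0.5):
    """Group tweet ids whose entity sets overlap by at least `threshold`.

    tweet_entities: dict mapping tweet id -> set of entity strings.
    This is a hypothetical input standing in for the output of an NER step.
    Returns a sorted list of sorted id lists, one per cluster.
    """
    # Union-find over tweet ids: each connected component is one cluster.
    parent = {tid: tid for tid in tweet_entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; fine for a sketch, not for a stream.
    for a, b in combinations(tweet_entities, 2):
        if jaccard(tweet_entities[a], tweet_entities[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for tid in tweet_entities:
        clusters.setdefault(find(tid), []).append(tid)
    return sorted(sorted(members) for members in clusters.values())


# Made-up entity sets, as if produced by an NER engine.
sample = {
    1: {"sachin tendulkar", "mumbai indians"},
    2: {"sachin tendulkar", "mumbai indians", "ipl"},
    3: {"mongodb"},
    4: {"mongodb", "nosql"},
    5: {"bangalore"},
}
print(cluster_by_entities(sample))
```

Because clusters are connected components, two tweets can end up together through a chain of intermediate tweets even when they share no entity directly; whether that is desirable depends on the threshold and the data.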
This is not a workshop, but to follow the talk, basic familiarity with the following topics will be sufficient:
1.) Algorithms and Data Structures.
2.) MongoDB or any other JSON-based NoSQL database.
3.) Python, Java or any other modern programming language.
4.) Some graph theory basics.
5.) Some idea of what NLP and text mining are.
I am currently the technical lead at frrole.com. Over the last 4 months, I have implemented a clustering pipeline for frrole's Twitter stream. In doing so, I solved some really interesting problems and made some interesting design decisions. The tools I used are mostly libraries and modules published by research groups at leading universities. I hold a bachelor's and a master's degree from IIITM, Gwalior, and have spent some time teaching graduate and undergraduate courses. I'm also an avid MOOCoholic and enjoy learning new technologies from time to time.