FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks
Nandan Thakur
@nthakur20
Data science starts with data cleaning. Developers working with text usually clean it first, sometimes by replacing keywords (“Javascript” with “JavaScript”) and sometimes by checking whether a keyword (“JavaScript”) is mentioned in a document. Datasets keep growing, from tens of thousands to millions of documents, and cleaning them with regular expressions can take days (about 5 days for 20K keywords and 3 million documents). FlashText, a much faster library, cuts that to minutes (about 15 minutes for the same 20K keywords and 3 million documents). It is efficient at both extracting keywords and replacing them in sentences, and its implementation is based on the Aho-Corasick algorithm and the trie data structure.
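To make the trie idea concrete, here is a minimal pure-Python sketch of keyword extraction and replacement in a single pass over the text. It is not FlashText's actual implementation: the names `build_trie`, `find_matches`, `extract_keywords`, and `replace_keywords`, the `END` marker, and the choice of word characters are all assumptions made for illustration.

```python
import string

# Characters treated as part of a word; anything else is a word boundary.
# This set is an assumption made for the sketch.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # multi-character key, so it cannot collide with a single character


def build_trie(keywords):
    """Build a lowercased character trie from a {keyword: clean_name} mapping."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def find_matches(text, trie):
    """Return (start, end, clean_name) for the longest keyword at each word start."""
    matches = []
    i, n = 0, len(text)
    while i < n:
        best = None
        at_word_start = i == 0 or text[i - 1] not in WORD_CHARS
        if at_word_start:
            node, j = trie, i
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                ends_on_boundary = j == n or text[j] not in WORD_CHARS
                if END in node and ends_on_boundary:
                    best = (i, j, node[END])  # remember the longest match so far
        if best:
            matches.append(best)
            i = best[1]  # continue scanning right after the match
        else:
            i += 1
    return matches


def extract_keywords(text, trie):
    """List the clean names of all keywords found, in order of appearance."""
    return [name for _, _, name in find_matches(text, trie)]


def replace_keywords(text, trie):
    """Return the text with every matched keyword swapped for its clean name."""
    parts, last = [], 0
    for start, end, name in find_matches(text, trie):
        parts.append(text[last:start])
        parts.append(name)
        last = end
    parts.append(text[last:])
    return "".join(parts)


trie = build_trie({"javascript": "JavaScript", "java": "Java", "big apple": "New York"})
sample = "I like Javascript, java and the Big Apple; javascripts do not count."
print(extract_keywords(sample, trie))  # ['JavaScript', 'Java', 'New York']
print(replace_keywords(sample, trie))
```

The scan walks the text once and follows trie edges character by character, so the work per character does not grow with the number of keywords. A regular expression built as one large alternation, by contrast, may have to try many alternatives at each position.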
Outline
[0-3mins]: Brief Introduction about Myself. Introduction to FlashText and a Performance Comparison of FlashText vs. Regular Expressions.
[3-8mins]: How is FlashText so blazingly fast?
[8-10mins]: When to Use FlashText?
[10-12mins]: Installing FlashText.
[12-15mins]: Use Case 1: Code – Searching for words in a text document
[15-18mins]: Use Case 2: Code – Replacing words in a text document
[18-20mins]: End Notes and Feedback for Future Talks.
Requirements
Not a workshop
Speaker bio
I am a perpetual, quick learner, keen to explore the realm of Data Analytics and Data Science. I am deeply excited about the times we live in and the rate at which data is being generated and transformed into an asset. I am well versed in domains such as Natural Language Processing, Machine Learning, and Signal Processing, and I have a keen interest in learning interdisciplinary concepts involving Machine Learning.
Links
- The repository has over 2,700 stars on GitHub, and the article has over 15,000 claps on Medium.
- Radim Rehurek, founder of RaRe Technologies (Gensim), tweeted about this repository: https://twitter.com/RadimRehurek/status/904989624589803520
- Medium Article: https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/ (over 15,000 claps)
- GitHub Repo: https://github.com/vi3k6i5/flashtext (over 2,700 stars)
- FlashText Documentation: https://buildmedia.readthedocs.org/media/pdf/flashtext/latest/flashtext.pdf
- FlashText Research Paper: https://arxiv.org/pdf/1711.00046.pdf
- LinkedIn: https://linkedin.com/in/nthakur20/
- Video Preview: https://youtu.be/s8WP79QU1zw
- Slides: https://drive.google.com/open?id=1WZ6MU80Qoz5znd89p9aSzTKxAor4Mo6zMvF2qPKqRyA