The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks

Submitted by Nandan Thakur (@nthakur20) on Jun 15, 2019

Session type: Short talk of 20 mins

Status: Rejected

Abstract

Data science starts with data cleaning. Developers working with text usually clean it first, sometimes by replacing keywords (“Javascript” with “JavaScript”) and sometimes by checking whether a keyword (“JavaScript”) is mentioned in a document. Datasets keep growing, from tens of thousands to millions of documents, and cleaning them with regular expressions can take days (about 5 days for 20K keywords and 3 million documents). FlashText, a Python library for keyword search and replacement, cuts that to minutes (about 15 minutes for the same 20K keywords and 3 million documents). It is efficient at both extracting keywords and replacing them in sentences, and is implemented using a trie data structure and an approach based on the Aho-Corasick algorithm.
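The trie idea in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not FlashText's actual implementation: the helper names (`build_trie`, `find_matches`, `extract_keywords`, `replace_keywords`), the `END` marker, and the word-boundary rule are all hypothetical choices made for this example.

```python
# A minimal trie-based keyword matcher: a sketch of the idea, not FlashText's code.
END = "_end_"  # hypothetical marker key; stores the clean name of a complete keyword


def build_trie(keywords):
    """Build a character trie from a {keyword: clean_name} mapping (case-insensitive)."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def is_word_char(ch):
    """Assumed boundary rule: letters, digits and underscore belong to a word."""
    return ch.isalnum() or ch == "_"


def find_matches(text, trie):
    """Return (start, end, clean_name) for the longest keyword found at each word start."""
    matches = []
    i, n = 0, len(text)
    while i < n:
        best = None
        if i == 0 or not is_word_char(text[i - 1]):
            node, j = trie, i
            # Walk the trie one character at a time; the walk length is bounded
            # by the longest keyword, not by the number of keywords.
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Accept only if the keyword ends at a word boundary.
                if END in node and (j == n or not is_word_char(text[j])):
                    best = (i, j, node[END])
        if best:
            matches.append(best)
            i = best[1]  # continue scanning after the matched keyword
        else:
            i += 1
    return matches


def extract_keywords(text, trie):
    """List the clean names of all keywords mentioned in the text."""
    return [clean for _, _, clean in find_matches(text, trie)]


def replace_keywords(text, trie):
    """Replace each matched keyword with its clean name in a single pass."""
    out, last = [], 0
    for start, end, clean in find_matches(text, trie):
        out.append(text[last:start])
        out.append(clean)
        last = end
    out.append(text[last:])
    return "".join(out)


trie = build_trie({"javascript": "JavaScript", "java": "Java"})
sample = "I like Javascript and java, not javascripts."
print(extract_keywords(sample, trie))
print(replace_keywords(sample, trie))
```

The point of the design is that the text is scanned once against the trie, so adding more keywords grows the trie but does not add another pass over each document.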

Outline

[0-3mins]: Brief introduction about myself. Introduction to FlashText and a performance comparison of FlashText vs. regular expressions.

[3-8mins]: How is FlashText so blazingly fast?

[8-10mins]: When to Use FlashText?

[10-12mins]: Installing FlashText.

[12-15mins]: Use case 1: Code – Searching for words in a text document

[15-18mins]: Use case 2: Code – Replacing words in a text document

[18-20mins]: End Notes and Feedback for Future Talks.
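For context on the two use cases in the outline, the regular-expression baseline that FlashText is compared against might look like the sketch below. It uses only Python's standard `re` module; the `KEYWORDS` mapping and the function names `regex_search` and `regex_replace` are hypothetical and chosen for illustration, not taken from the talk.

```python
import re

# Hypothetical keyword -> clean-name mapping, for illustration only.
KEYWORDS = {"javascript": "JavaScript", "java": "Java"}

# One compiled pattern per keyword, with word boundaries and case-insensitive matching.
PATTERNS = {
    clean: re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for keyword, clean in KEYWORDS.items()
}


def regex_search(text):
    """Use case 1: report which keywords are mentioned in the text."""
    return [clean for clean, pattern in PATTERNS.items() if pattern.search(text)]


def regex_replace(text):
    """Use case 2: replace every keyword occurrence with its clean name."""
    for clean, pattern in PATTERNS.items():
        # A function replacement avoids backslash interpretation in the clean name.
        text = pattern.sub(lambda _match, c=clean: c, text)
    return text


print(regex_search("I like Javascript and java."))
print(regex_replace("I like Javascript and java."))
```

In this baseline every document is scanned once per keyword, so the cost grows with the size of the keyword list. FlashText covers the same two tasks through its `KeywordProcessor` class (`add_keyword`, `extract_keywords`, `replace_keywords`) and is installed with `pip install flashtext`.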

Requirements

Not a workshop

Speaker bio

I am a perpetual, quick learner, keen to explore the realm of data analytics and data science. I am deeply excited about the times we live in and the rate at which data is being generated and transformed into an asset. I am well versed in domains such as Natural Language Processing, Machine Learning, and Signal Processing, and I have a keen interest in learning interdisciplinary concepts involving Machine Learning.

Links

Slides

https://drive.google.com/open?id=1WZ6MU80Qoz5znd89p9aSzTKxAor4Mo6zMvF2qPKqRyA

Preview video

https://www.youtube.com/watch?v=s8WP79QU1zw
