The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Anuj Gupta

@anujgupta82

Learning representations of text for NLP

Submitted Apr 19, 2017

Think of the NLP application you wish to build: sentiment analysis, named entity recognition, machine translation, information extraction, summarization, or a recommender system. A key step in building it is choosing the right technique to represent text in a form that a machine can understand. In this workshop, we will focus on the key concepts, maths, and code behind state-of-the-art techniques for text representation.

This workshop is meant for NLP enthusiasts, ML practitioners, and data science teams who work with text data and wish to gain a deeper understanding of learning representations of text for NLP. It will be a very hands-on workshop, with Jupyter notebooks to create various representations, coupled with the key concepts and maths that form the basis of their respective theory.

Outline

Machine learning on images has been a phenomenal success story. One of the key reasons is a rich representation of the data: the raw image in matrix form with RGB values.

For images, directly using the pixel values is a very natural representation. For text, however, there is no such natural representation. No matter how good your ML algorithm is, it can do only so much unless there is a richer way to represent the underlying text data. Thus, whatever NLP task or application you are building, it’s imperative to find a good representation for your text. Motivated by this, the subfield of representation learning of text for NLP has attracted a lot of interest in the past few years.

Various representation learning techniques have been discussed at length in the literature, but from a practitioner’s point of view there is a dearth of comprehensive tutorials that provide full coverage, with the mathematical explanation and implementation details of these algorithms. This workshop aims to bridge this gap by demystifying both the theory (key concepts, maths) and the practice (code) that go into these representation schemes. By the end of the workshop, participants will have gained a fundamental understanding of these schemes and will be able to implement embeddings on their own datasets.

Course Content:

  1. Old ways of representing text

  2. Introduction to Embedding spaces

  3. Word-Vectors

  4. Sentence2vec/Paragraph2vec/Doc2Vec

  5. Character2Vec

For each of the above representation schemes, we will understand and implement various evaluation and visualization techniques.
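As an illustration of the first item in the list, one of the older ways of representing text is a bag-of-words count vector. The sketch below is a minimal example under stated assumptions: the function names `build_vocabulary` and `bag_of_words` and the toy corpus are hypothetical and are not taken from the workshop notebooks, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector for one document; out-of-vocabulary tokens are ignored."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so positions follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy corpus, used only for illustration.
corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocab) for doc in corpus]
```

Such vectors discard word order and treat every pair of distinct words as equally unrelated, which is one motivation for the embedding-based schemes listed above.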

Requirements

A laptop and lots of enthusiasm.
We will provide a pre-installed virtual machine that will help you get started without fuss.

Speaker bio

  1. Anuj Gupta is a senior ML researcher at Freshdesk, working in the areas of NLP, machine learning, and deep learning. Earlier, he headed ML efforts at Airwoot (now acquired by Freshdesk). He dropped out of a PhD in ML to work with startups. He graduated from IIIT H with a specialization in theoretical computer science.

He has given tech talks at prestigious forums like PyData DC, Fifth Elephant, ICDCN, PODC, IIT Delhi, and IIIT Hyderabad, and at special interest groups like DLBLR. More about him: https://www.linkedin.com/in/anuj-gupta-15585792/

  2. Satyam Saxena is an ML researcher at Freshdesk. He is an IIT alumnus whose interests lie in NLP, machine learning, and deep learning. Prior to this, he was part of the ML group at Cisco. He was a visiting researcher at Vision Labs at IIIT Hyd, where he used computer vision and deep learning to build applications to assist visually impaired people. He presented some of this work at ICAT 2014, Turkey. https://www.linkedin.com/in/sam-iitj/

Slides

https://www.slideshare.net/anujgupta5095/representation-learning-for-nlp


Hosted by

Jump starting better data engineering and AI futures