How similar are two pieces of text? A moderately broad and deep dive into one of the fundamental topics in NLP.

This submission has been added to the schedule

Submitted Nov 12, 2017

Section: Full talk
Technical level: Intermediate

I will talk about the fundamental problem of measuring similarity between two pieces of text. This problem appears in many contexts, including search and information retrieval, natural language inference, plagiarism detection, answer scoring, machine translation and (near-)duplicate detection. I will give an overview of the fundamentals, key formulations and approaches present in the literature.

The talk will center on scenarios that call for application-dependent notions of similarity. I will use the example of automatically grading student answers against instructor-provided model answers (Automatic Short Answer Grading, or ASAG): given a question and a model answer, how can we automatically grade short answers that are a sentence to a paragraph long? I will show various nuances of formulating text similarity for such applications, and the associated challenges.
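
To make the grading setup concrete, here is a minimal sketch of a purely lexical baseline that scores a student answer against a model answer. It is an illustration under stated assumptions, not a method from the talk: the function names, the simple regex tokenizer and the example answers are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard overlap between the sets of tokens in two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between raw term-count vectors of two texts."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical model answer and student answer, for illustration only.
model_answer = "Photosynthesis converts light energy into chemical energy."
student_answer = "Plants turn light energy into chemical energy."

print(jaccard(model_answer, student_answer))
print(cosine(model_answer, student_answer))
```

A baseline like this rewards shared words only, so a correct answer phrased with synonyms scores low. That gap is one reason application-dependent and semantic formulations of similarity matter for grading.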
I will introduce a range of generic unsupervised, semi-supervised and supervised techniques for measuring similarity. We will then dive deep into a couple of state-of-the-art approaches: one based on classical pattern mining and word2vec, and the other based on Siamese LSTM networks with a new cost function inspired by Earth Mover's Distance (EMD).
The talk will be based on various papers published in 2016-17 at reputed conferences such as IJCAI, ECAI and COLING.
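
As a rough illustration of how word vectors can feed a similarity score, the sketch below aligns each word in one text with its closest word in the other and averages the cosine scores. This is a generic greedy-alignment idea, not the pattern-mining or Siamese LSTM approach from the papers. The three-dimensional vectors are hand-made and hypothetical; real word2vec vectors are learned from a corpus and are much larger.

```python
import math

# Hypothetical toy word vectors, hand-made for illustration only.
TOY_VECTORS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.05),
    "fast": (0.1, 0.9, 0.0),
    "quick": (0.15, 0.85, 0.1),
    "banana": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def directed_similarity(source, target, vectors):
    """Average, over source words, of the best cosine match among target words."""
    # Words without a vector are ignored rather than guessed at.
    source = [w for w in source if w in vectors]
    target = [w for w in target if w in vectors]
    if not source or not target:
        return 0.0
    best = [max(cosine(vectors[s], vectors[t]) for t in target) for s in source]
    return sum(best) / len(best)


def semantic_similarity(a, b, vectors=TOY_VECTORS):
    """Symmetric score: the mean of the two directed similarities."""
    words_a, words_b = a.lower().split(), b.lower().split()
    forward = directed_similarity(words_a, words_b, vectors)
    backward = directed_similarity(words_b, words_a, vectors)
    return 0.5 * (forward + backward)


print(semantic_similarity("fast car", "quick automobile"))
print(semantic_similarity("fast car", "banana"))
```

With these toy vectors, "fast car" and "quick automobile" share no words yet score high, while "fast car" and "banana" score low. A purely lexical measure cannot make that distinction.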

Outline

1. Text Similarity
   a. Definition and scope
2. Application Areas
   a. Information retrieval
   b. Paraphrase detection
   c. Natural language inference
   d. Plagiarism detection
3. Types of Similarity
4. Techniques
   a. Supervised
      i. Classical techniques
      ii. Deep neural network based techniques
   b. Unsupervised
      i. Lexical
      ii. Semantic
5. Automatic Short Answer Grading
   a. Context and motivation
   b. Word-similarity based techniques
      i. Wisdom of students
   c. Siamese LSTM-based supervised ASAG technique
6. Conclusion

Speaker bio

Shourya Roy is the Head and Vice President of American Express Big Data Labs (BDL), a role he took up in December 2016. In this role, he is responsible for establishing and executing the technical agenda for BDL, working closely with the broader Decision Science community and business units. Shourya leads a team of scientists and engineers in the areas of machine learning, artificial intelligence, deep learning and cloud computing.

Prior to joining American Express, Shourya spent nearly fifteen years in the labs of IBM and Xerox, holding several leadership roles in technical research, research and strategic management, and customer-facing business development. Shourya has a proven track record of conceptualizing and initiating (by influencing business group leaders), designing and developing (by participating in and leading research teams), and transferring (with software development partners) innovation from research labs to real-life operations and business.
Shourya’s technical expertise spans Text and Data Mining, Natural Language Processing, Machine Learning, and Big Data, areas in which he is a well-known thought leader in several communities. His work has led to more than 60 publications in premier journals and conferences. He has been granted about 15 patents, and tens of patent applications are currently at different stages of the patent lifecycle. He is an active member of the ACM and ACL communities, through which he has been involved in organising multiple conferences and workshops.
Shourya holds Ph.D., Master's and Bachelor's degrees in Computer Science from IISc Bangalore, IIT Bombay and Jadavpur University, respectively. He also has an MBA from the Faculty of Management Studies (FMS), Delhi University.
Beyond work, Shourya is passionate about meeting and getting to know people, as well as following and playing multiple sports.

Anthill Inside Miniconf – Pune