Anthill Inside 2019

A conference on AI and Deep Learning

Rigorous Evaluation of NLP Models for Real World Deployment

Submitted Sep 3, 2019

Rapid progress in NLP research has seen a swift translation to real-world commercial deployment. While a number of success stories of NLP applications have emerged, failures in translating scientific progress in NLP to real-world software have also been considerable (some of these issues are covered in my IJCAI paper). Specifically, the challenges and gaps in the areas of testing and rigorous evaluation of NLP applications have largely remained unaddressed. Of late, there has been considerable debate and research into understanding what NLP models have really learnt when they are trained for a specific task. Instead of just reporting a few metrics such as accuracy or F1-score on a handful of datasets, a deeper understanding of NLP models in terms of their robustness across the input space and their generalization capabilities is essential. One of the reasons why many NLP models don’t generalize and fail in the real world is the lack of detailed evaluation of the model over a comprehensive set of inputs (both adversarial and non-adversarial), along with a lack of understanding of the biases and weaknesses the model encodes. This talk will cover the need for rigorous evaluation of NLP models, current research and industry best practices in this area, and practical tips to evaluate the generalizability and robustness of your model for production readiness.

This talk is aimed at NLP engineers and researchers looking for a deeper understanding of NLP model evaluation and robustness on real-world inputs. The audience should have at least 1-2 years of experience in ML/NLP. Desirable: knowledge of basic concepts such as robustness, adversarial testing and generalization.

Key takeaways are (a) current gaps in evaluating NLP models, (b) a research overview of rigorous evaluation of NLP models, and (c) how these research findings can be applied in practice to evaluate and improve NLP model robustness.


We motivate why rigorous evaluation of NLP models, beyond simple metrics such as F1-score or accuracy, is needed for real-world deployment, using a few historical use-cases and examples. We then talk about the “CleverHans Moment for NLP”. We discuss the latest research around model evaluation for NLP. We then take up a sentiment analysis task as a case study and discuss the methodology for rigorous evaluation. We conclude by pointing out future work directions in this topic.
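
As a rough illustration of what evaluation beyond a single accuracy number can look like for a sentiment task, here is a minimal sketch. It is not the methodology presented in the talk: the toy lexicon classifier, the example sentences, the perturbation functions and all names (`toy_sentiment`, `typo_last_word`, `neutral_suffix`, `robustness_report`) are hypothetical stand-ins. The idea is to apply label-preserving perturbations to each input and compare accuracy on perturbed inputs against accuracy on the clean inputs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon; stands in for a real trained sentiment model.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

# Hypothetical labelled examples (text, gold label).
EXAMPLES: List[Tuple[str, str]] = [
    ("The movie was great", "pos"),
    ("I hate this terrible phone", "neg"),
    ("An excellent and good meal", "pos"),
    ("The service was awful", "neg"),
]


def toy_sentiment(text: str) -> str:
    """Toy classifier: positive word count minus negative word count."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg"


def typo_last_word(text: str) -> str:
    """Label-preserving typo: swap the first two characters of the last word."""
    words = text.split()
    if not words or len(words[-1]) < 2:
        return text
    last = words[-1]
    words[-1] = last[1] + last[0] + last[2:]
    return " ".join(words)


def neutral_suffix(text: str) -> str:
    """Label-preserving distractor: append a sentiment-free sentence."""
    return text + ". I watched it on Tuesday."


PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "typo_last_word": typo_last_word,
    "neutral_suffix": neutral_suffix,
}


def accuracy(model: Callable[[str], str],
             examples: List[Tuple[str, str]],
             transform: Callable[[str], str] = lambda t: t) -> float:
    """Fraction of examples whose (optionally transformed) text is labelled correctly."""
    correct = sum(model(transform(text)) == label for text, label in examples)
    return correct / len(examples)


def robustness_report(model: Callable[[str], str],
                      examples: List[Tuple[str, str]],
                      perturbations: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Accuracy on clean inputs and under each perturbation."""
    report = {"clean": accuracy(model, examples)}
    for name, transform in perturbations.items():
        report[name] = accuracy(model, examples, transform)
    return report


if __name__ == "__main__":
    for name, acc in robustness_report(toy_sentiment, EXAMPLES, PERTURBATIONS).items():
        print(f"{name}: {acc:.2f}")
```

On this toy data the classifier is perfect on the clean inputs, yet the single-character typo perturbation lowers its accuracy, which is the kind of gap a lone accuracy or F1 figure would hide. A real evaluation would use a trained model, a much larger and more varied perturbation set, and adversarial as well as non-adversarial inputs.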


Participants should have intermediate knowledge of NLP model building and tuning. Knowledge of concepts such as robustness, adversarial evaluation and generalization is desirable but not essential.

Speaker bio

Sandya Mannarswamy is an independent NLP researcher. She was previously a senior research scientist in the Natural Language Processing research group at Conduent Labs India. She holds a Ph.D. in computer science from the Indian Institute of Science, Bangalore. Her research interests span natural language processing, machine learning and compilers. Her research career spans over 19 years at various R&D labs, including Hewlett Packard Ltd and IBM Research. She has co-organized a number of workshops, including workshops at International Conference on Data Management and the Machine Learning Debates workshop at ICML-2018. Her current research is focused on software testing and rigorous evaluation of Natural Language Processing applications (“using NLP to evaluate NLP”). She has a number of international research publications and patents in the area of natural language processing. She co-authored a paper at the International Joint Conference on Artificial Intelligence (IJCAI) 2018, which focused on the challenges in taking AI applications from research to the real world. She is the author of the popular “CodeSport” column in Open Source For You magazine.