Opening the NLP Blackbox - Analysis, Evaluation and Testing of NLP Models
Rapid progress in NLP research has translated swiftly into real-world commercial deployment. While a number of NLP success stories have emerged, there have also been considerable failures in translating scientific progress in NLP into real-world software. Evaluation of NLP models is often limited to held-out test set accuracy on a handful of datasets, and analysis of NLP models is often limited to ablation studies. A lack of rigorous evaluation leads to overestimation of the model’s generalization performance, and a lack of understanding of the model’s inner workings results in ‘Clever Hans’ models that fail in real-world deployments. One reason many NLP models fail to generalize in the real world is that they are neither evaluated in detail over a comprehensive set of inputs nor examined with model analysis methods that would expose their weaknesses and the biases they encode.
Of late, there has been considerable research interest in analysis methods for NLP models and in evaluation techniques that go beyond test set performance metrics. However, this area of work is still not widely disseminated through the NLP community. This talk aims to address this gap by providing a detailed overview of NLP model analysis and evaluation methods, discussing their strengths and weaknesses, and pointing towards future research directions in this area.
This talk is intended to provide an in-depth overview of the analysis and evaluation methods for NLP models, covering existing techniques, challenges and research opportunities in this space.
We motivate why rigorous evaluation of NLP models, beyond simple metrics such as F1 score or accuracy, is needed for real-world deployment, using specific use cases and examples. We then talk about the “Clever Hans moment for NLP”, wherein models learn dataset-specific features and solve the dataset instead of learning to solve the actual task at hand. This sets the context for why robust methods of model analysis and evaluation are needed.
Next, in the context of NLP model analysis and evaluation, we focus on four important questions:
1. What is the model’s internal structure in terms of the knowledge it has captured?
2. What is its behaviour with respect to different inputs?
3. How do we visualize the model’s inner workings?
4. How do we quantify the model’s strengths and weaknesses?
For each of these questions, we discuss the existing methods, point out their comparative advantages and disadvantages, and briefly outline possible future research directions.
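As a rough illustration of the second and fourth questions, the sketch below probes a model's behaviour with label-preserving rewrites of its inputs and counts how often the prediction changes. Everything in it is an assumption made for illustration: `toy_sentiment_model` is a hypothetical stand-in for a trained classifier that has learned a 'Clever Hans' shortcut (it keys on the word "not"), and `invariance_failures` and the example sentence pairs are invented here, not taken from any particular method covered in the talk.

```python
from typing import Callable, List, Tuple


def toy_sentiment_model(text: str) -> str:
    """Hypothetical classifier that relies on a shortcut feature:
    it predicts 'negative' whenever the token 'not' appears."""
    tokens = text.lower().split()
    return "negative" if "not" in tokens else "positive"


def invariance_failures(
    model: Callable[[str], str],
    pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the pairs whose label-preserving rewrite changes the prediction."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


# Each pair holds an input and a rewrite that a human would label the same way
# (illustrative examples only).
pairs = [
    ("The food was great", "The food was really great"),
    ("The movie was good", "The movie was not bad"),
    ("I loved the service", "I could not fault the service"),
]

failures = invariance_failures(toy_sentiment_model, pairs)
failure_rate = len(failures) / len(pairs)

for original, rewrite in failures:
    print(f"Prediction changed: {original!r} -> {rewrite!r}")
print(f"Invariance failure rate: {failure_rate:.2f}")
```

A model like this could still score well on a held-out test set in which "not" happens to correlate with negative labels, which is why a failure rate over targeted rewrites can reveal weaknesses that aggregate accuracy hides.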