The Fifth Elephant 2020 edition

On data governance, engineering for data privacy and data science

The ninth edition of The Fifth Elephant will be held in Bangalore on 16 and 17 July 2020.

The Fifth Elephant brings together over one thousand data scientists, ML engineers, data engineers and analysts to discuss:

  1. Data governance.
  2. Data privacy and engineering for privacy, including engineering for the Personal Data Protection (PDP) bill.
  3. Data cleaning, annotation, instrumentation and productionizing data science.
  4. Identifying and handling fraud, and data security at scale.
  5. Feature engineering and ML platforms.
  6. What it takes to create data-driven cultures in organizations of different scales.

Event details:

Dates: 16-17 July 2020
Venue: NIMHANS Convention Centre, Dairy Circle, Bangalore

Why you should attend:

  1. Network with peers and practitioners from the data ecosystem.
  2. Share approaches to solving expensive problems such as cleanliness of training data, annotation, model management and versioning data.
  3. Demo your ideas in the demo sessions.
  4. Join Birds of a Feather (BOF) sessions for productive discussions on focussed topics, or start your own BOF session.

Contact details:
For more information about The Fifth Elephant, call +91-7676332020 or email sales@hasgeek.com


Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.
Sandya Mannarswamy

@sandyasm

Is Your NLP Model Solving the Dataset Or the Actual Task? - Identifying, Analyzing and Mitigating Spurious Dataset Cues in NLP Applications

Submitted Mar 24, 2020

Natural Language Processing (NLP) models are susceptible to learning spurious, shallow patterns in a dataset that do not generalize well to real-world data. Because the dataset serves only as a proxy for the actual task at hand, deep learning NLP models often learn these shallow patterns instead of solving the task itself. The presence of such brittle, non-robust features has been shown to lead to poor real-world generalization in tasks such as sentiment analysis, question answering and natural language inference. Hence, strong performance on a handful of test datasets often does not translate to the real world. It is therefore essential, first, to analyze the data and identify whether the model depends on such shallow and spurious patterns and, second, to mitigate their impact on the learnt model in order to improve real-world performance. In this talk, we discuss current state-of-the-art methods for identifying shallow surface cues in NLP datasets and cover a range of techniques to mitigate such cues and build models that don’t depend on them.

Outline

Part I - Is your NLP model solving the dataset instead of the actual task?
We cover the existing research literature on identifying shallow surface cues and patterns in datasets, and thus motivate the need for building models that don’t depend on such surface cues. We also briefly cover work showing how adversarial examples can arise from a model’s dependence on such shallow cues.
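As a rough illustration of what identifying a surface cue can look like, the sketch below scores how strongly each token in a labelled text is associated with each label using pointwise mutual information (PMI). This is only one simple diagnostic and is not necessarily the method covered in the talk; the function name `cue_scores`, the `min_count` threshold and the toy natural language inference hypotheses are all assumptions made up for this example.

```python
import math
from collections import Counter

# Toy NLI hypotheses with labels (hypothetical data, for illustration only).
EXAMPLES = [
    ("a man is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the dog is not running", "contradiction"),
    ("a person is outside", "entailment"),
    ("an animal is running", "entailment"),
    ("a man is sleeping", "entailment"),
]


def cue_scores(examples, min_count=2):
    """Return {(token, label): PMI} for tokens seen in at least min_count examples."""
    token_label = Counter()  # examples containing the token AND carrying the label
    token_total = Counter()  # examples containing the token
    label_total = Counter()  # examples carrying the label
    for text, label in examples:
        label_total[label] += 1
        for tok in set(text.lower().split()):  # count each token once per example
            token_label[(tok, label)] += 1
            token_total[tok] += 1

    n = len(examples)
    scores = {}
    for (tok, label), joint in token_label.items():
        if token_total[tok] < min_count:
            continue  # skip rare tokens, whose PMI estimates are unreliable
        p_joint = joint / n
        p_tok = token_total[tok] / n
        p_label = label_total[label] / n
        scores[(tok, label)] = math.log(p_joint / (p_tok * p_label))
    return scores


scores = cue_scores(EXAMPLES)
# Print the three strongest token/label associations.
for (tok, label), pmi in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{tok!r} -> {label}: PMI = {pmi:.3f}")
```

In this toy data the word "not" appears only in contradiction examples, so it receives a positive score, while "is" appears in every example and scores zero. A token with a high score for one label is a candidate cue that a model could exploit without reading the premise at all.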
Part II - Identifying, Analyzing and Mitigating spurious dataset cues from NLP models.
We then cover techniques that can identify a model’s dependence on such spurious cues, and discuss techniques for mitigating or eliminating that dependence. We consider two real-world NLP tasks, namely natural language inference and question answering, and show how these techniques help eliminate shallow surface cues from what the model learns.
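To make the idea of mitigation concrete, here is a minimal sketch of one approach discussed in the debiasing literature: down-weighting training examples that a bias-only model (for instance, one that sees only the hypothesis) already predicts correctly. It is not presented as the specific technique used in the talk. The function names and all probability values are hypothetical, and a real setup would plug these weights into the training loss of an actual model.

```python
import math


def example_weights(bias_probs_for_gold):
    """Weight each example by 1 - p_bias(gold label).

    Examples the bias-only model finds easy get weights near 0,
    examples it gets wrong get weights near 1.
    """
    return [1.0 - p for p in bias_probs_for_gold]


def reweighted_nll(main_probs_for_gold, bias_probs_for_gold):
    """Average negative log-likelihood of the main model, scaled per example."""
    weights = example_weights(bias_probs_for_gold)
    total = sum(-w * math.log(p) for w, p in zip(weights, main_probs_for_gold))
    return total / len(main_probs_for_gold)


# Hypothetical probabilities assigned to the gold label for three examples.
bias_probs = [0.9, 0.5, 0.1]  # bias-only model: first example is "solved" by the cue
main_probs = [0.8, 0.8, 0.8]  # main model being trained

weights = example_weights(bias_probs)
loss = reweighted_nll(main_probs, bias_probs)
print("weights:", [round(w, 2) for w in weights])
print("reweighted loss:", round(loss, 4))
```

The design choice here is that the main model receives little training signal from examples where the shortcut alone suffices, which encourages it to fit the examples that require solving the underlying task.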
This talk is aimed at intermediate and advanced NLP developers and practitioners who are interested in building robust NLP models. Prior knowledge of basic NLP is assumed.

References
(1) https://arxiv.org/abs/2002.04108
(2) https://arxiv.org/abs/1908.10763
(3) https://arxiv.org/abs/1909.03683
(4) https://arxiv.org/abs/1911.03861
(5) https://arxiv.org/abs/2001.01565
(6) https://arxiv.org/abs/1905.02175

Requirements

None

Speaker bio

Sandya Mannarswamy is an independent researcher in Natural Language Processing. She was a senior research scientist at Conduent Labs India in the Natural Language Processing research group. She holds a Ph.D. in computer science from the Indian Institute of Science, Bangalore. She has 19 years of software industry experience at various R&D labs, including Hewlett Packard Ltd, IBM and Xerox Research. Her work spans a number of areas, including natural language processing, machine learning, compiler optimizations, developer tools and file systems, and has resulted in a number of publications and patents.
