Anthill Inside 2019

A conference on AI and Deep Learning


Annotate This! A simple tool that seeks to simplify the creation of annotated text datasets - primarily for Hindi and other vernacular Languages

Submitted by karmanya aggarwal (@calmdownkarm) on Sunday, 28 April 2019



Section: Crisp talk
Technical level: Beginner
Session type: Demo


Abstract

This talk gives a quick overview of an open source tool that allows for faster and relatively error-free annotation of text data by leveraging language models and gamification principles. I will then go over the tech that powers it and demonstrate how it works.

Deep Learning is more accessible today than ever - super cheap GPUs, almost plug-and-play models that require very little fine-tuning for standard use cases, and, more so than ever, an abundance of data. Deep learning based language models in particular have had a tremendous impact on Natural Language Processing.

Unfortunately, two of the three items on that list disappear very quickly when we look at languages other than English. State-of-the-art classifiers for even simple tasks like sentiment analysis perform drastically worse even in mainstream languages like Hindi, and are almost nonexistent in less popular / vernacular languages.

Clean, well annotated datasets for languages other than English, German and Spanish are few and far between. For example, Sentiment140 is an English-language Twitter dataset with 1.6 million annotated samples. Bhaav, a similar dataset in Hindi, has 3 thousand samples, and is possibly the most comprehensive Hindi sentiment analysis dataset that I’ve ever worked with.

Why is this? It is not for lack of raw text: there is a stunningly large amount of Hindi language text that’s easily available in the public domain. As a result, modern transfer learning approaches to NLP have no dearth of language models - from Common Crawl Hindi language word embeddings, to multi-layer language models like Hindi2Vec and Hindi language vectors for ELMo. There’s a tremendous amount of focus on building and disseminating these models, and others for languages like Tamil and Telugu, through Fast.ai, IndicNLP and other organizations. The scarce resource is labelled data, which still has to be annotated by hand.

What’s worse is that tools don’t exist to make the creation of these datasets simple. Though premier institutions have spent decades working on NLP applications in Indian languages, their primary areas of focus seem to be machine translation and wordnet-esque lexical corpora of Indian languages. Prodigy.AI is probably the best text-based annotation tool out there, but it’s incredibly expensive, even for individual contributors. Meanwhile, most open source communities in India have a large number of motivated students who are looking for ways to contribute to open source projects.

In this talk, I’d like to introduce a tool I’ve released that seeks to gamify the creation of such datasets - hopefully enticing the large number of motivated students of machine learning across India to contribute toward the creation of datasets, and rewarding them for their time. The tool lets anyone host a web app where contributors can:

  1. annotate datasets
  2. set tests to evaluate domain expertise
  3. export these datasets in a manner that allows for consensus amongst multiple annotators.

This makes the annotations somewhat robust. Dataset annotators are further rewarded for their time via GitHub contributions. The tool itself uses word embeddings to assist in the annotation process and deals fairly well with the typical problems that text annotation poses - from simple typos inflating the number of distinct annotations, to inter-annotator agreement. Configurable agreement thresholds allow for relatively bias-free annotations. Though so far it’s been used primarily for Hindi, it can be used for any language, and can thus benefit many small communities around the world.
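To make the idea of a configurable agreement threshold concrete, here is a minimal sketch in Python. It is not taken from the tool's source: the function names `consensus_label` and `export_consensus`, the `min_agreement` parameter, and its default of 0.6 are all assumptions made for illustration. A sample is exported only when a large enough share of its annotators chose the same label.

```python
from collections import Counter


def consensus_label(labels, min_agreement=0.6):
    """Return the majority label if its vote share reaches min_agreement.

    Hypothetical helper for illustration; returns None when the annotators
    do not agree strongly enough or when there are no labels at all.
    """
    if not labels:
        return None
    # Normalise case and surrounding whitespace so trivial variants
    # ("Positive " vs "positive") count as the same vote.
    votes = Counter(label.strip().lower() for label in labels)
    top_label, top_count = votes.most_common(1)[0]
    share = top_count / len(labels)
    return top_label if share >= min_agreement else None


def export_consensus(annotations, min_agreement=0.6):
    """Keep only the samples whose annotators reached consensus.

    `annotations` maps each text sample to the list of labels it received
    from different annotators (an assumed data layout).
    """
    exported = {}
    for text, labels in annotations.items():
        label = consensus_label(labels, min_agreement)
        if label is not None:
            exported[text] = label
    return exported


if __name__ == "__main__":
    sample = {
        "yeh film bahut acchi hai": ["positive", "Positive ", "negative"],
        "theek thaak thi": ["positive", "negative"],
    }
    print(export_consensus(sample))
```

Raising `min_agreement` trades dataset size for label quality: with the default of 0.6 the second sample above is dropped because its two annotators split evenly.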

Outline

  • An overview of AnnotateThis!
  • A shallow dive into word embeddings and how they’re used for a better annotation experience
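As a rough illustration of how word embeddings could help with annotation, the sketch below maps a newly typed tag onto an existing tag when their vectors are close enough, so that spelling variants do not inflate the tag set. It is an assumption-laden example, not the tool's actual method: the function `nearest_existing_tag`, the `threshold` value, and the three-dimensional toy vectors are all hypothetical. In practice the vectors would come from pretrained embeddings, and whether a misspelling lands near the intended word depends on the embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_existing_tag(new_tag, existing_tags, vectors, threshold=0.8):
    """Return the existing tag most similar to new_tag, or None.

    Hypothetical helper: a match is reported only if the cosine similarity
    is at least `threshold`, so unrelated tags are left alone.
    """
    if new_tag not in vectors:
        return None
    best_tag, best_score = None, threshold
    for tag in existing_tags:
        if tag not in vectors:
            continue
        score = cosine(vectors[new_tag], vectors[tag])
        if score >= best_score:
            best_tag, best_score = tag, score
    return best_tag


if __name__ == "__main__":
    # Toy vectors invented for this example (romanised Hindi tokens).
    toy_vectors = {
        "khush": [0.90, 0.10, 0.00],
        "khussh": [0.88, 0.12, 0.02],  # spelling variant of "khush"
        "dukhi": [0.10, 0.90, 0.00],
        "tatasth": [0.00, 0.00, 1.00],
    }
    print(nearest_existing_tag("khussh", ["khush", "dukhi"], toy_vectors))
    print(nearest_existing_tag("tatasth", ["khush", "dukhi"], toy_vectors))
```

An annotation interface could use a lookup like this to suggest "did you mean khush?" instead of silently creating a new tag, leaving the final choice to the annotator.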

Speaker bio

I’m a developer at Gramener - where I focus on building tooling for people interested in Data Science, Machine Learning and Data Visualization. I’m also a Research Assistant at IIIT Delhi’s MIDAS Research Group, where I focus on various applications of NLP, with emphasis on Indian Languages.

Links

Slides

https://docs.google.com/presentation/d/13-UMtpyK4y_0lM7gnsNpfYPE3GH8THtUa3GAOqfavoY/edit?usp=sharing

Preview video

https://youtu.be/GoMUpXL8GZs

Comments

  • Abhishek Balaji (@booleanbalaji) Reviewer 5 months ago

    Hi Karmanya,

    Thank you for submitting a proposal. For us to evaluate your proposal, we need to see more detailed slides. Your slides must take the following points into consideration:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/options available in the market to solve this problem? How did you evaluate these, and what metrics did you use for the evaluation? Why did you decide to build your own ML model?
    • Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how was the fraud/ghosting after implementing the solution you picked and built? Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What are the privacy, regulatory and ethical considerations when building this solution?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?

    As next steps, we’d need to see the detailed and/or updated slides by 21 May, in order to close the decision on your proposal. If we don’t receive an update by 21 May, we’d have to move the proposal for consideration for a future conference.

    • karmanya aggarwal (@calmdownkarm) Proposer 5 months ago

      Hey, I’ve updated the slides.

      • Abhishek Balaji (@booleanbalaji) Reviewer 4 months ago

        Thanks, moving this to evaluation.
