Annotate This! A tool that simplifies the creation of annotated text datasets - primarily for Hindi and other vernacular languages
In this talk, I will give a quick overview of an open source tool that enables faster and relatively error-free annotation of text data by leveraging language models and gamification principles. I will then go over the tech that powers it and demonstrate how it works.
Deep learning is more accessible today than ever: GPUs are cheap, near plug-and-play models need very little fine-tuning for standard use cases, and data is more abundant than ever. Deep learning based language models in particular have had a tremendous impact on Natural Language Processing.
Unfortunately, two of the three items on that list disappear very quickly when we look at languages other than English. State-of-the-art classifiers for even simple tasks like sentiment analysis perform very poorly even in mainstream languages like Hindi, and are almost nonexistent for less popular / vernacular languages.
Clean, well-annotated datasets for languages other than English, German and Spanish are few and far between. For example, Sentiment140 is an English-language Twitter dataset with 1.6 million annotated samples. Bhaav, a similar dataset in Hindi, has 3 thousand samples, and is possibly the most comprehensive Hindi sentiment analysis dataset that I’ve ever worked with.
Why is this? It is not for lack of raw text: there is a stunningly large amount of Hindi language text easily available in the public domain. As a result, modern transfer learning approaches to NLP have no dearth of language models, from Common Crawl Hindi word embeddings to multi-layer language models like Hindi2Vec and Hindi vectors for ELMo. There’s a tremendous amount of focus on building and disseminating these models, and others for languages like Tamil and Telugu, through Fast.ai, IndicNLP and other organizations. What is missing is annotated data.
What’s worse is that tools don’t exist to make the creation of these datasets simple. Though premier institutions have spent decades working on NLP applications in Indian languages, their primary areas of focus seem to be machine translation and WordNet-esque lexical corpora of Indian languages. Prodigy.AI is probably the best text annotation tool out there, but it’s incredibly expensive, even for individual contributors. Meanwhile, most open source communities in India have a large number of motivated students looking for ways to contribute to open source projects.
In this talk, I’d like to introduce a tool I’ve released that seeks to gamify the creation of such datasets, hopefully enticing the large number of motivated machine learning students across India to contribute to dataset creation, and rewarding them for their time. The tool lets anyone host a web app that allows contributors to:
- annotate datasets
- set tests to evaluate domain expertise
- export these datasets in a manner that allows for consensus amongst multiple annotators.
This makes the annotations somewhat robust. Annotators are further rewarded for their time via GitHub contributions. The tool itself uses word embeddings to assist in the annotation process, and it deals fairly well with the typical problems that text annotation poses, from simple typos inflating the number of annotations to inter-annotator agreement. Configurable thresholds allow for relatively bias-free annotations. Though so far it has been used primarily for Hindi, it can be used for any language, and can thus benefit many small communities around the world.
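To make the idea of consensus with a configurable threshold concrete, here is a minimal sketch in Python. It is not the tool's actual implementation: the function name `consensus_label`, the default threshold of 0.6, and the simple case and whitespace normalisation are all assumptions chosen for illustration.

```python
from collections import Counter


def consensus_label(labels, threshold=0.6):
    """Return the majority label if enough annotators agree, else None.

    `labels` holds one label per annotator for a single sample.
    `threshold` is the minimum share of votes the top label must reach
    (a hypothetical default, meant to be configurable).
    """
    if not labels:
        return None
    # Fold trivially different spellings (case, stray whitespace) together
    # so that they do not inflate the number of distinct annotations.
    normalised = [label.strip().lower() for label in labels]
    top_label, top_count = Counter(normalised).most_common(1)[0]
    agreement = top_count / len(normalised)
    return top_label if agreement >= threshold else None


if __name__ == "__main__":
    # Two of three annotators agree, which clears the 0.6 threshold.
    print(consensus_label(["positive", "Positive ", "negative"]))
    # An even split does not clear it, so the sample stays unresolved.
    print(consensus_label(["positive", "negative"]))
```

Raising the threshold trades coverage for reliability: fewer samples get a label, but the labels that survive reflect stronger agreement.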
- An overview of AnnotateThis!
- A shallow dive into word embeddings and how they’re used for a better annotation experience
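As a rough illustration of how word embeddings can assist annotation, the sketch below suggests an already-used label when a newly typed one is close to it in embedding space. Everything here is hypothetical: the function names, the 0.8 similarity cut-off, and the three-dimensional toy vectors (real embeddings would be loaded from a pretrained model and have hundreds of dimensions). It is not a description of how the tool itself does this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suggest_label(word, known_labels, vectors, min_similarity=0.8):
    """Return the known label most similar to `word`, or None.

    `vectors` maps words to embedding vectors. A suggestion is made only
    if the best similarity reaches `min_similarity` (an assumed cut-off).
    """
    if word not in vectors:
        return None
    best, best_sim = None, min_similarity
    for label in known_labels:
        if label in vectors:
            sim = cosine(vectors[word], vectors[label])
            if sim >= best_sim:
                best, best_sim = label, sim
    return best


# Toy vectors for transliterated Hindi words, made up for this example:
# "khush" (happy), "prasann" (pleased), "udaas" (sad).
TOY_VECTORS = {
    "khush": [0.9, 0.1, 0.0],
    "prasann": [0.85, 0.2, 0.05],
    "udaas": [0.0, 0.1, 0.95],
}

if __name__ == "__main__":
    # "prasann" sits close to the existing label "khush" in this toy space.
    print(suggest_label("prasann", ["khush", "udaas"], TOY_VECTORS))
```

Surfacing a suggestion instead of silently merging labels keeps the annotator in control while still nudging contributors toward a shared label set.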
I’m a developer at Gramener - where I focus on building tooling for people interested in Data Science, Machine Learning and Data Visualization. I’m also a Research Assistant at IIIT Delhi’s MIDAS Research Group, where I focus on various applications of NLP, with emphasis on Indian Languages.