The Fifth Elephant 2017

On data engineering and application of ML in diverse domains


How to prepare your language for Machine Learning and NLP with an open audio documentation toolkit

Submitted by Subhashish Panigrahi (@psubhashish) on Sunday, 28 May 2017


Section: Full talk for Data in Government track
Technical level: Intermediate


Pronunciation libraries are key to building machine learning tools and to much Natural Language Processing research and product development. In the age of personal assistant apps, voice-based apps can help people with visual disabilities, and everyone else, access information and contribute back to the knowledge commons. There is a need for a range of native-language solutions, from talking dictionaries and educational games to language learning applications and accessibility tools like text-to-speech and speech-to-text.

My talk will focus on a project called Kathabhidhana, an open source audio documentation toolkit. I started it as a tool for creating pronunciations for Wiktionary, Wikipedia’s sister project and a multilingual dictionary, and it later grew into a full-fledged toolkit that can be used to document any language. It can help record a large word list, clean up the audio, and create both a pronunciation library and a curated dataset. While a pronunciation library is key to building tools ranging from text-to-speech engines to complex deep learning research, the dataset is equally valuable from an open data perspective. Most importantly, the talk will focus on the need to build resources that help the millions of people in this country who need accessibility support.
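As a rough illustration of the "clean up the audio" and "curated dataset" steps, here is a minimal Python sketch using only the standard library. It is not Kathabhidhana's actual code: the function names, the amplitude threshold, the 16 kHz sample rate, and the manifest columns are all assumptions made for this example. It trims near-silent samples from both ends of a 16-bit mono recording and builds a small CSV manifest that maps each word to its audio file.

```python
import array
import csv
import io


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples quieter than `threshold`.

    `samples` is an array of signed 16-bit PCM values (hypothetical input;
    the threshold of 500 is an arbitrary illustrative choice).
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        # Nothing above the threshold: treat the whole clip as silence.
        return array.array("h")
    return samples[loud[0]:loud[-1] + 1]


def build_manifest(entries):
    """Return CSV text listing each word, its file, and its duration.

    `entries` is a list of (word, filename, sample_count, sample_rate)
    tuples; the column names are assumptions for this sketch.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "file", "duration_s"])
    for word, filename, count, rate in entries:
        writer.writerow([word, filename, f"{count / rate:.3f}"])
    return buf.getvalue()


if __name__ == "__main__":
    clip = array.array("h", [0, 10, 900, -1200, 30, 0])
    trimmed = trim_silence(clip)
    print(list(trimmed))
    print(build_manifest([("ଘର", "ghara.wav", 8000, 16000)]))
```

A real pipeline would read and write audio files and would likely use a more robust silence detector; the sketch only shows the shape of the two steps.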


Currently, many Indian languages lack good-quality pronunciation recordings. India is home to over 18 million people with visual impairment, of whom 7.8 million are fully blind. Similarly, 30% of India’s population is illiterate. India is home to the highest number of visually impaired and illiterate people in the entire world, and that is not good news. The recent Google-KPMG report states that over 70% of internet users trust content in their native language more than content in English. However, native-language support is widely lacking across platforms, from government programs to various apps to several other public utilities. Even political parties have not yet localized their public addresses into native languages. There is a great need for Free/Libre and open source tools that can help not just those with visual impairment or illiteracy access knowledge, but also the more than 70% of India’s population that is primarily monolingual. With the Internet growing swiftly and about 500 million Indians already connected to the web, it is important that they can access the wealth of information in their own language. Pronunciation libraries are key to developing content-centric tools that are useful for everyone. Thanks to AR/VR, the scope of education, entertainment, and other content-based applications is expanding really fast. This talk will detail how Indian languages can be empowered with digital tools in an open way and can leverage the available technical innovations.


Laptops (Linux/Mac preferred)

Speaker bio

Subhashish is a long-time Community, Communications and Outreach Catalyst in the Openness movement. His work at global nonprofits like Mozilla, the Centre for Internet and Society, and the Wikimedia Foundation specializes in shaping communications, partnerships, and educational outreach, and in building innovative products with openness at their core to address needs in the global south. Subhashish has helped Mozilla shape its global communications strategy for the Campus Clubs program, structured Mozilla’s Diversity and Inclusion strategy across Asia, and conducted research across 20 top tech, law and business universities across India to assess the state of open source. He spearheaded more than 200 Wikimedia outreach programs that reached out to more than 6000 potential contributors to 23 Indian-language Wikipedias. This work helped increase Wikipedia’s reach (readers) by more than 60% (3x global readership), and participation (editors) by 50% [as compared to ~10% global growth]. He has spoken at many international conferences across 11 countries, and at a few hundred more in India. He advises, in a personal capacity, some of the most notable nonprofits and other open collectives, such as Global Voices, Open Knowledge International, and the OER Conference.




