PyCon Pune 2017

A conference on the Python programming language

Gaurav Godhwani

@gggodhwani

Making Indian Budgets Machinable using Python

Submitted Nov 30, 2016

Indian budget documents, across the various tiers of government, contain detailed information on the allocations made and resources raised in a financial year. Unfortunately, these documents are published as messy PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data. This session will delve into how we can create a data pipeline and leverage computer vision techniques to parse these documents into clean, machine-readable formats, using popular Python libraries (such as PyPDF2, OpenCV and NumPy) along with other open-source tools like Tabula and CKAN.

What’s in it for you?

Building data pipelines for civic engagement is still at an embryonic stage in India. This talk will give data enthusiasts an opportunity to learn about, produce and contribute to open data in their own geographies. Attendees will also explore how simple Python scripts and open-source tools can be used to deal with complex and varied data formats.

Outline

The talk will be organized as:

  • Setting the scene
  • Issues with Indian Budget Documents
  • Overview of the data pipeline
  • Custom scraping techniques using XPath (via lxml)
  • Table detection using OpenCV and other Python libraries
  • Integration with Tabula (Java)
  • Basic data wrangling using regex and Pandas
  • Data publishing via CKAN
  • Demo: OpenBudgetsIndia in action
  • Future
  • Questions
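
The "basic data wrangling using regex" step in the outline might look roughly like the following. This is a minimal sketch, not the speaker's code: the sample rows, the NIL_MARKERS set and the helper names parse_amount and parse_row are all hypothetical. It assumes that extracted cells are separated by two or more spaces and that amounts use Indian digit grouping (for example 1,23,456.78), with parentheses or a leading minus marking negatives. It uses only the standard library and targets Python 3; in a real pipeline the parsed rows would typically be loaded into a Pandas DataFrame afterwards.

```python
import re

# Hypothetical rows, shaped like text a PDF table extractor might return.
RAW_ROWS = [
    "Revenue Receipts   1,23,456.78   2,34,567.00",
    "Capital Outlay     45,678.90     ...",
    "Grants-in-Aid      (1,200.50)    980.25",
]

# Assumed placeholders that budget tables use for "no value".
NIL_MARKERS = {"", "-", "..."}


def parse_amount(cell):
    """Convert one extracted cell to a float, or None for a nil marker."""
    text = cell.strip()
    if text in NIL_MARKERS:
        return None
    # Treat "(1,200.50)" and "-1,200.50" as negative amounts.
    negative = text.startswith("-") or (
        text.startswith("(") and text.endswith(")")
    )
    # Drop grouping commas, brackets and signs; keep digits and the dot.
    digits = re.sub(r"[^\d.]", "", text)
    if not re.match(r"^\d+(\.\d+)?$", digits):
        raise ValueError("not an amount: %r" % cell)
    value = float(digits)
    return -value if negative else value


def parse_row(line):
    """Split a row on runs of 2+ spaces into a head and its amounts."""
    cells = re.split(r"\s{2,}", line.strip())
    return {
        "head": cells[0],
        "amounts": [parse_amount(cell) for cell in cells[1:]],
    }


if __name__ == "__main__":
    for row in RAW_ROWS:
        print(parse_row(row))
```

Splitting on runs of two or more spaces, rather than on any whitespace, keeps multi-word heads such as "Capital Outlay" in a single cell; it would fail on rows where the extractor collapses column gaps to one space.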

Requirements

Knowledge of Python 2.7 and familiarity with basic data mining.
