Making Indian Budgets Machinable using Python

Feb 2017: 16 Thu and 17 Fri, 09:00 AM – 06:00 PM IST

Amanora The Fern Hotels and Club, Pune

Submitted Nov 30, 2016

Technical level: Beginner

Indian budget documents, across various tiers of government, contain detailed information on the allocations made and resources raised in a financial year. Unfortunately, these documents are published as messy PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data. This session will delve into how we can build a data pipeline and leverage computer vision techniques to parse these documents into clean, machine-readable formats, using popular Python libraries (such as PyPDF2, OpenCV and NumPy) along with other open-source tools such as Tabula and CKAN.
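As a rough illustration of the table-detection idea, here is a minimal sketch of finding ruling lines with a projection profile: count the dark pixels along each row and column of a binarized page, and treat rows or columns that are almost entirely dark as table lines. It is a dependency-free stand-in written for this page, not the OpenCV code used in the project. The function names, the `min_fill` threshold and the toy `page` image are all assumptions made for illustration.

```python
def find_ruling_lines(binary_image, min_fill=0.8):
    """Return (rows, cols) indices of likely table ruling lines.

    binary_image: list of equal-length rows, 1 = dark pixel, 0 = background.
    min_fill: fraction of dark pixels needed along a row or column
              (hypothetical default, tune per document).
    """
    if not binary_image or not binary_image[0]:
        return [], []
    height = len(binary_image)
    width = len(binary_image[0])
    # Horizontal projection: rows that are mostly dark.
    rows = [y for y, row in enumerate(binary_image)
            if sum(row) >= min_fill * width]
    # Vertical projection: columns that are mostly dark.
    cols = [x for x in range(width)
            if sum(row[x] for row in binary_image) >= min_fill * height]
    return rows, cols


def group_consecutive(indices):
    """Collapse runs of adjacent indices (thick lines) into (start, end) pairs."""
    groups = []
    for i in indices:
        if groups and i == groups[-1][1] + 1:
            groups[-1] = (groups[-1][0], i)
        else:
            groups.append((i, i))
    return groups


# Toy 7x9 "page": a bordered table with one inner horizontal
# and one inner vertical line, i.e. a 2x2 grid of cells.
page = [[0] * 9 for _ in range(7)]
for y in (0, 3, 6):
    page[y] = [1] * 9
for row in page:
    for x in (0, 4, 8):
        row[x] = 1

line_rows, line_cols = find_ruling_lines(page)
print(line_rows, line_cols)
```

Consecutive pairs of detected lines bound the table cells, and those bounding boxes are the kind of region one could then hand to a table extractor. On real scans the image would first need thresholding and deskewing, which this sketch leaves out.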

What’s in it for you?
Building data pipelines for civic engagement is still at an embryonic stage in India. This talk will give data enthusiasts an opportunity to learn about, produce and contribute to open data in their own geographies. Attendees will also explore how simple Python scripts and open-source tools can handle complex, varied data formats.

Outline

The talk will be organized as:

Setting the scene
Issues with Indian Budget Documents
Overview of the data pipeline
Custom scraping techniques using XPath (via lxml)
Table detection using OpenCV and other Python libraries
Integration with Tabula (Java)
Basic data wrangling using regex and Pandas
Data publishing via CKAN
Demo: OpenBudgetsIndia in action
Future
Questions
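To give a flavour of the data-wrangling step in the outline, here is a minimal sketch of cleaning one row of text extracted from a budget table, using only the standard-library `re` module. The row layout (a head description followed by amount columns), the nil markers and the bracketed-negative convention are assumptions made for illustration, not the formats the project actually handles.

```python
import re

# A token that looks like an amount: optional brackets or minus sign,
# digits with thousands separators, optional decimal part.
AMOUNT = re.compile(r"^\(?-?\d[\d,]*(?:\.\d+)?\)?$")

# Placeholders assumed to mean "no value" in a cell (hypothetical set).
NIL_MARKERS = {"-", "..", "...", "Nil"}


def parse_amount(token):
    """Convert one amount token to a float, or None for a nil marker."""
    if token in NIL_MARKERS:
        return None
    # Assumption: bracketed figures such as (250.00) denote negatives.
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_row(line):
    """Split a text row into a head description and trailing amounts."""
    tokens = line.split()
    amounts = []
    # Peel amount-like tokens off the right, always leaving at least
    # one token behind to serve as the head.
    while len(tokens) > 1 and (AMOUNT.match(tokens[-1])
                               or tokens[-1] in NIL_MARKERS):
        amounts.insert(0, parse_amount(tokens.pop()))
    return {"head": " ".join(tokens), "amounts": amounts}


record = parse_row("2202 General Education 1,234.56 (250.00) ...")
print(record)
```

A list of such records could then be loaded into a Pandas DataFrame for further cleaning. Pandas is left out here only so the sketch runs without extra dependencies.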

Requirements

Knowledge of Python 2.7 and some acquaintance with basic data mining.

Speaker bio

Open Source contributor
Python programmer
Independent civic technologist
Building OpenBudgetsIndia (beta launch scheduled for late December 2016) at the Centre for Budget and Governance Accountability
Chapter Leader at DataKind Bangalore

PyCon Pune 2017