Making Indian Budgets Machine-Readable Using Python
Indian budget documents, across the various tiers of government, contain detailed information on the allocations made and resources raised in a financial year. Unfortunately, these documents are published as messy PDFs, which makes it difficult for researchers, economists and the general public to analyse and use this crucial data. This session will delve into how we can create a data pipeline and leverage computer vision techniques to parse these documents into clean, machine-readable formats, using popular Python libraries (such as PyPDF2, OpenCV and NumPy) along with other open-source tools such as Tabula and CKAN.
What’s in it for you?
Building data pipelines for civic engagement is still at an embryonic stage in India. This talk will give data enthusiasts an opportunity to learn about, produce and contribute to open data in their geographies. Attendees will also explore how simple Python scripts and open-source tools can be used to deal with complex and varied data formats.
The talk will be organized as:
- Setting the scene
- Issues with Indian Budget Documents
- Overview of the data pipeline
- Custom scraping techniques using XPath (via lxml)
- Table detection using OpenCV and other Python libraries
- Integration with Tabula (Java)
- Basic data wrangling using regex and Pandas
- Data publishing via CKAN
- Demo: OpenBudgetsIndia in action
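To give a flavour of the data wrangling step in the outline above, here is a minimal sketch of regex-based cleaning for cells extracted from a budget table. It uses only the standard library and rests on assumptions rather than on the project's actual code: amounts are assumed to use Indian comma grouping (for example `1,23,456.78`), negatives are assumed to appear either with a leading minus or in parentheses, and anything else (such as `...` for a nil entry) is treated as missing. The names `parse_amount` and `clean_row` are hypothetical.

```python
import re

# Optional sign ("-" or an opening parenthesis), then digits with commas
# and an optional decimal part, then an optional closing parenthesis.
_AMOUNT = re.compile(
    r"^(?P<sign>-|\()?\s*(?P<number>\d[\d,]*(?:\.\d+)?)\s*\)?$"
)


def parse_amount(cell):
    """Convert a cell such as '1,23,456.78' to a float, or None if not numeric."""
    match = _AMOUNT.match(cell.strip())
    if match is None:
        return None  # blank, "...", or a text label
    value = float(match.group("number").replace(",", ""))
    return -value if match.group("sign") else value


def clean_row(cells):
    """Treat the first cell as a label and the remaining cells as amounts."""
    label = re.sub(r"\s+", " ", cells[0]).strip()
    return [label] + [parse_amount(cell) for cell in cells[1:]]


if __name__ == "__main__":
    print(clean_row(["Revenue  Receipts ", "1,23,456.78", "(1,500)", "..."]))
```

The same `parse_amount` function could be applied to a pandas column with `Series.map` once the extracted rows are loaded into a DataFrame.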
Prerequisites: knowledge of Python 2.7 and acquaintance with basic data mining.
About the speaker:
- Open-source contributor
- Python programmer
- Independent Civic-technologist
- Building OpenBudgetsIndia (beta launch scheduled for late December 2016) at the Centre for Budget and Governance Accountability
- Chapter Leader at DataKind Bangalore