Submit a talk on data

Submit talks on data engineering, data science, machine learning, big data and analytics throughout 2018

Making sense of messy data to track disease outbreaks in India

Submitted by Akash Tandon (@akashtandon) on Monday, 9 April 2018


Technical level

Intermediate

Status

Submitted

Total votes: +3

Abstract

Although open data portals are cropping up across multiple domains, working with the datasets they provide remains difficult. In our bid to identify disease outbreaks and aid preventive healthcare, we came across one such data source.

The Ministry of Health and Family Welfare (MoHFW) in India runs the Integrated Disease Surveillance Programme (IDSP) to identify disease outbreaks at the sub-district and village level across India. Under this scheme, it releases weekly outbreak data as PDF documents. PDFs are notorious for being hard to parse and incorporate into data science workflows. We'll outline how we leverage Python- and R-based open source solutions, including Apache Airflow, along with in-house tools to structure this data and derive useful insights from it.
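
To give a flavour of the "structure this data" step, here is a minimal sketch of parsing one table row after PDF text extraction. The field names, pipe separator and sample row are hypothetical stand-ins; the real bulletin's layout is messier and the actual parser is not shown here.

```python
import re

# Hypothetical example of one outbreak-table row after PDF text extraction.
# The real IDSP bulletin layout differs; this only illustrates the idea.
sample_row = "12/ABC/2018 | Maharashtra | Pune | Dengue | 15 | 0 | 09-04-2018"

ROW_RE = re.compile(
    r"(?P<outbreak_id>\S+)\s*\|\s*"
    r"(?P<state>[^|]+?)\s*\|\s*"
    r"(?P<district>[^|]+?)\s*\|\s*"
    r"(?P<disease>[^|]+?)\s*\|\s*"
    r"(?P<cases>\d+)\s*\|\s*"
    r"(?P<deaths>\d+)\s*\|\s*"
    r"(?P<reported_on>[\d-]+)"
)

def parse_row(line):
    """Turn one extracted text line into a structured record, or None."""
    m = ROW_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["cases"] = int(rec["cases"])   # numeric fields as ints, not strings
    rec["deaths"] = int(rec["deaths"])
    return rec

record = parse_row(sample_row)
```

Once rows are structured like this, the downstream wrangling and alerting steps can work with tabular data instead of raw PDF text.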

Outline

  • Introduction and Background
  • Architecture
    • Generalizing the DAG creation workflow on Airflow
    • Getting PDFs from the IDSP website
    • Extracting data out of PDFs
    • Data wrangling using Python and R
    • Geography Identification
    • Insights and alert generation
  • Demo (with code snippets)
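
The pipeline steps above could be wired together as an Airflow DAG roughly like this. This is a sketch under assumptions: the DAG id, task names and placeholder callables are illustrative, not the actual pipeline code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def fetch_pdfs():          pass  # download weekly bulletins from the IDSP site
def extract_tables():      pass  # pull outbreak tables out of the PDFs
def wrangle_data():        pass  # clean and reshape records
def identify_geography():  pass  # map place names to canonical geographies
def generate_alerts():     pass  # derive insights and fire alerts

dag = DAG(
    dag_id="idsp_outbreak_pipeline",
    schedule_interval="@weekly",      # IDSP bulletins are released weekly
    start_date=datetime(2018, 1, 1),
    catchup=False,
)

fetch = PythonOperator(task_id="fetch_pdfs", python_callable=fetch_pdfs, dag=dag)
extract = PythonOperator(task_id="extract_tables", python_callable=extract_tables, dag=dag)
wrangle = PythonOperator(task_id="wrangle_data", python_callable=wrangle_data, dag=dag)
geo = PythonOperator(task_id="identify_geography", python_callable=identify_geography, dag=dag)
alerts = PythonOperator(task_id="generate_alerts", python_callable=generate_alerts, dag=dag)

# Linear dependency chain mirroring the outline.
fetch >> extract >> wrangle >> geo >> alerts
```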

Speaker bio

Akash Tandon is a member of the data engineering team at SocialCops, where he is the primary maintainer of their geography identification and entity recognition system. He also contributes to multiple components of the data pipeline.
Prior to this, he was a data engineer at RedCarpetUp. He has also participated in the Google Summer of Code program, as both a student and a mentor.

Links

Slides

https://speakerdeck.com/analyticalmonk/making-sense-of-messy-data-to-track-disease-outbreaks-in-india-fifth-elephant-2018

Preview video

https://youtu.be/9TSB5mYhb-I

Comments

  • 1
    Zainab Bawa (@zainabbawa) Reviewer 8 months ago

    What options did you evaluate besides Apache Airflow for your use case? Why did you narrow down on the solution you finally chose?

    • 2
      Akash Tandon (@akashtandon) Proposer 8 months ago (edited 8 months ago)

      When we started our search, the top desired features were:
      - Resilience to failures
      - Ease of monitoring and logging
      - Programmatic control (rather than static configuration)
      - Presence of an intuitive UI

      Based on the above, we narrowed our choices down to Luigi and Airflow. Their open source nature and active developer communities also helped.
      However, several factors led us to finally choose Airflow. The prominent ones were a more intuitive UI, better monitoring and greater flexibility (through Airflow operators). Airflow offers everything we needed out of the box. Being an Apache project also brings an increased sense of reliability and ongoing maintenance.

  • 1
    Venkata Pingali (@venkatapingali) 7 months ago

    The choice of Airflow is good, but these days Airflow deployments are common; there are at least two other talks discussing Airflow. Although it is central to your talk in its current form, I would recommend de-emphasizing it and focusing more on the application-specific aspects. If there is reusable code or an open source project, that would be interesting to the community.

    • 1
      Akash Tandon (@akashtandon) Proposer 7 months ago (edited 7 months ago)

      Hi, thanks for your comment.
      If you go through the proposal, outline and/or attached slides closely, it should be clear that Airflow isn't central to the session. The idea behind the talk is to share our learnings from building the IDSP data pipeline mentioned above. Airflow is one component of the system, but there are others (the PDF parser, geography identification, etc.) that are equally important. In addition, the talk will feature a demo, and the associated code/notebook will be open source.
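
      As a flavour of the geography identification component mentioned in this reply, a toy sketch is given below. The district list and the stdlib fuzzy-matching approach are illustrative stand-ins, not SocialCops' actual system.

```python
import difflib

# Toy canonical list; a real system would use a full gazetteer of
# Indian states, districts and villages.
DISTRICTS = ["Pune", "Patna", "Purnia", "Paschim Bardhaman"]

def identify_district(raw):
    """Map a noisy district string from a parsed PDF to a canonical name."""
    cleaned = " ".join(raw.strip().split()).title()  # normalize whitespace/case
    matches = difflib.get_close_matches(cleaned, DISTRICTS, n=1, cutoff=0.8)
    return matches[0] if matches else None
```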

      • 1
        Venkata Pingali (@venkatapingali) 7 months ago

        Point taken. Two suggestions to increase the value for the audience:

        (a) Can you expand on those aspects (parsing, post-processing, etc.) and any particular challenges you faced? (b) If you have organized and packaged the code (e.g., as a Python package) to increase reusability, you could discuss the interface as well.

        • 1
          Akash Tandon (@akashtandon) Proposer 6 months ago

          I've taken your points into consideration and will incorporate them in the talk, if selected. Thanks for the suggestions.
