The Fifth Elephant 2016

India's most renowned data science conference


Increasing Trust and Efficiency of Data Science using dataset versioning

Submitted by Venkata Pingali (@pingali) on Sunday, 27 March 2016

Section: Crisp talk
Technical level: Intermediate


Total votes:  +12


As data science grows and matures as a domain, decision makers are asking
harder questions about the trust and efficiency of the data science
process. These include:

  • Lineage/Auditability: Where did the numbers come from?
  • Reproducibility/Replicability: Is this an accident? Does it hold now?
  • Efficiency/Automation: Can you do it faster, cheaper, better?

A significant amount of data scientists' time goes towards generating,
shaping, and using datasets. This work is laborious and error-prone.

In this talk, we introduce dgit, an open source git wrapper for managing
dataset versions, discuss why dgit was developed, and show how the data
science process can be rebuilt around versioned datasets:


  1. Current process is iterative, expensive, and error prone
    • Does not account for imperfect knowledge of the problem, process, and organization
    • 80% of companies report strategic decisions going wrong due to flawed data
  2. Basic requirements of improved process - trust and efficiency
    • Trust requires auditability and reproducibility of results
    • Efficiency requires standardization and automation
  3. Dataset is a fundamental abstraction of data science
    • Every data science task creates, transforms, validates, and applies datasets
    • Nesting and branching semantics
  4. New process around versioned datasets
    • Import ideas from software engineering - versioning, CI, testing
    • Git & Github-like experience for datasets
  5. dgit - enables git-like management of datasets
    • Python package, open source, MIT licence
    • Uses git for versioning
    • Focuses on capabilities that are specific to dataset management
      • Metadata management
      • Inter-dataset dependency tracking
      • Scanning for dataset updates
      • Validation and generation
      • Support for metadata backends
  6. dgit implementation and demo
    • Architecture and flexibility
    • Demos
      • Simplicity (automation)
      • Timeline (lineage)
      • Validation of data and model results (trust, automation)
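The core idea in points 3 through 5 can be sketched in a few lines of Python: each dataset commit records a content hash, free-form metadata, and a pointer to its parent version, which is enough to answer the lineage question ("where did the numbers come from?"). This is a hypothetical illustration of the concept only, not dgit's actual API; dgit itself delegates versioning to git.

```python
# Hypothetical sketch of git-like dataset versioning: content-addressed
# commits with metadata and parent pointers. Not dgit's actual API.
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetVersion:
    content_hash: str
    metadata: dict
    parent: Optional[str] = None  # hash of the previous version, if any

class DatasetRepo:
    def __init__(self):
        self.versions = {}  # content_hash -> DatasetVersion
        self.head = None

    def commit(self, data: bytes, **metadata) -> str:
        """Record a new version of the dataset with arbitrary metadata."""
        h = hashlib.sha256(data).hexdigest()
        self.versions[h] = DatasetVersion(h, dict(metadata), parent=self.head)
        self.head = h
        return h

    def lineage(self):
        """Walk parent pointers from HEAD: the audit trail for the dataset."""
        h = self.head
        while h is not None:
            v = self.versions[h]
            yield h, v.metadata
            h = v.parent

repo = DatasetRepo()
raw = repo.commit(b"id,amount\n1,100\n", source="billing export")
clean = repo.commit(b"id,amount\n1,100.0\n", transform="normalized amounts")
for h, meta in repo.lineage():
    print(h[:8], meta)
```

Because versions are addressed by content hash, re-running a pipeline on unchanged inputs produces the same hash, which is the hook for the reproducibility and update-scanning checks mentioned above.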


This is not a hands-on session. However, if you wish to install and play with dgit, you will need Python 3 and virtualenv+pip installed.

Speaker bio

Dr. Venkata Pingali is the Founder of Scribble Data, a data science automation company. He was formerly VP, Analytics at FourthLion Technologies, where he led analytics work for large political campaigns and business customers. Before that, he was Founder and CEO of eLuminos, an energy analytics company. He holds a BTech from IIT Mumbai and a PhD in systems from the University of Southern California, Los Angeles.




  • Pavan Yara (@yarapavan) 3 years ago

    Very interesting. Looking forward to hearing more in the final Fifth Elephant program.

  • Harsha Hegde (@harshahegde) 3 years ago (edited 3 years ago)

    Very pertinent and commonly asked questions. Answered scientifically. Looking forward to this talk.

    • Venkata Pingali (@pingali) Proposer 3 years ago (edited 3 years ago)

      Thanks, Pavan and Harsha, for considering the proposal. I am looking forward to an energetic conversation as well. If you are a hands-on type, do give dgit a spin. It is officially in alpha!

  • Harshad Saykhedkar (@harshss) 3 years ago

    Very apt topic. Looking forward to hearing more.

    • Venkata Pingali (@pingali) Proposer 3 years ago

      The user experience of dgit, among other things, was refined based on input from a Sokrati alum (your former colleague)! Further feedback is welcome.
