Anthill Inside 2019

A conference on AI and Deep Learning

Hangar; git for your data

Submitted by Sherin Thomas (@hhsecond) on Sep 5, 2019

Section: Full talk Technical level: Intermediate Session type: Demo Status: Waitlisted

Abstract

Software development is entering an era where the behavior of programs depends critically on the data they were trained on. In this setting, data is the new source code, which opens the door to challenges like versioning and collaboration on numerical data. Enter Hangar, an open-source tool by [tensor]werk that brings Git-style version control to n-dimensional arrays. It supports versioning, branching, merging, time travel, diffing, remote repositories, and partial fetching, with data loaders for the major deep learning frameworks. At its core, Hangar is designed to solve many of the same problems faced by traditional code version control systems (i.e. Git), adapted for numerical data:

  • Time travel through the historical evolution of a dataset
  • Zero-cost branching to enable exploratory analysis and collaboration
  • Cheap merging to build datasets over time (with multiple collaborators)
  • Completely abstracted organization and management of data files on disk
  • Ability to retrieve only a small portion of the data (as needed) while still maintaining a complete historical record
  • Ability to push and pull changes directly to collaborators or a central server (i.e. a truly distributed version control system)
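To make the Git analogy concrete, here is a toy, pure-Python sketch of content-addressed array versioning. All names here are illustrative, not Hangar's actual API or implementation; the point is only to show why branching can be a pointer copy and why checkout only materializes references rather than copying history:

```python
import hashlib
import numpy as np

def digest(arr):
    """Content-address an array by hashing its dtype, shape, and bytes."""
    h = hashlib.blake2b(digest_size=20)
    h.update(str(arr.dtype).encode())
    h.update(str(arr.shape).encode())
    h.update(arr.tobytes())
    return h.hexdigest()

class MiniRepo:
    """Toy repo: each commit maps sample names to content hashes."""

    def __init__(self):
        self.objects = {}               # hash -> array (deduplicated storage)
        self.commits = {}               # commit id -> {"parent", "tree", "message"}
        self.branches = {"master": None}

    def commit(self, branch, tree_arrays, message=""):
        tree = {}
        for name, arr in tree_arrays.items():
            h = digest(arr)
            self.objects.setdefault(h, arr.copy())  # each unique array stored once
            tree[name] = h
        parent = self.branches[branch]
        cid = hashlib.blake2b(
            (repr(sorted(tree.items())) + repr(parent) + message).encode(),
            digest_size=20).hexdigest()
        self.commits[cid] = {"parent": parent, "tree": tree, "message": message}
        self.branches[branch] = cid
        return cid

    def branch(self, new, from_branch):
        # Zero-cost: a branch is just a pointer to an existing commit.
        self.branches[new] = self.branches[from_branch]

    def checkout(self, commit_id):
        # Cheap: materialize name -> array from stored hashes, copying nothing else.
        tree = self.commits[commit_id]["tree"]
        return {name: self.objects[h] for name, h in tree.items()}

repo = MiniRepo()
a = np.arange(4)
c1 = repo.commit("master", {"x": a}, "add x")
repo.branch("exp", "master")                       # zero-cost branch
c2 = repo.commit("exp", {"x": a, "y": a * 2}, "add y")
old = repo.checkout(c1)                            # time travel to the first commit
```

Because unchanged arrays hash to the same object, the second commit stores only the new array, and checking out an old commit is a dictionary lookup rather than a bulk copy.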

The ability of version control systems to perform these tasks for codebases is largely taken for granted by almost every developer today. However, we are standing on the shoulders of giants: decades of engineering have produced these phenomenally useful tools. Now that a new era of "data-defined software" is taking hold, there is a strong need for analogous version control systems designed to handle numerical data at large scale. Welcome to Hangar, a version control system for your data, written entirely in Python.

The problem of versioning datasets is not new, and there have been a few attempts to solve it reliably, such as Git LFS, DVC, and Pachyderm. While each of these works for different use cases, none of them solves every data-versioning problem that Software 2.0 will have to deal with. For instance, if each commit stores a huge amount of data, checking out between commits can take a long time (hours, in practice). Hangar solves such problems with a completely different approach, which lets it perform a few orders of magnitude better than existing tools. That said, these perks come with downsides in certain cases. Since Hangar stores data as NumPy arrays (tensors), data (or a model) that is not convertible to NumPy arrays cannot be stored unless the user works around the limitation. We will discuss the advantages and disadvantages of using Hangar, with a few examples, in the talk.
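As an illustration of that NumPy-only constraint, here is one possible workaround, sketched with plain NumPy (this is not a Hangar API): data that is not naturally a tensor, such as text, can be round-tripped through a uint8 byte array so an array-only store can hold it.

```python
import numpy as np

# Text is not a tensor, but its UTF-8 bytes can live in a uint8 array.
text = "hello hangar"
as_array = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)

# Recovering the original text is a lossless round trip.
restored = as_array.tobytes().decode("utf-8")
print(restored)
```

The same trick applies to any serializable blob (e.g. a pickled model), at the cost of losing array-level semantics like shape and dtype for the underlying object.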

Since Hangar is a toolkit written entirely in Python, exploring its source is easy. Getting the hang of Hangar requires understanding its data philosophy. We will go through the internals, the core concepts, and a few basic examples of working with data in general and in deep learning pipelines using Hangar.
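As a rough mental model of that data philosophy (the names below are illustrative, not Hangar's actual API), data can be pictured as named samples living inside named columns of homogeneous arrays, rather than as files on disk:

```python
import numpy as np

# Illustrative sketch: each column holds same-shaped samples keyed by name,
# and a training pair is addressed by (column, sample key).
dataset = {
    "images": {"sample_0": np.zeros((2, 2)), "sample_1": np.ones((2, 2))},
    "labels": {"sample_0": np.array(0), "sample_1": np.array(1)},
}

x = dataset["images"]["sample_1"]
y = dataset["labels"]["sample_1"]
```

Organizing data this way, instead of as opaque files, is what makes array-level diffing, merging, and partial fetching possible.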

Outline

  • What is data
  • Hangar fundamentals
    • Branching, merging, conflict management
    • Hangar data philosophy
    • Versioning and time travel
  • Other players in the data versioning world
    • What hangar is good for
    • What hangar is not good for
  • Hangar remote
  • Hangar CLI
    • Performance
    • Import and export
    • Other operations
  • Python APIs
  • Deep learning pipeline with hangar

Speaker bio

I work on the development team of [Tensor]werk, an infrastructure development company focused on deep learning deployment problems. My team and I build open-source tools for setting up a seamless deep learning workflow, including RedisAI and Hangar. I have been programming since 2012, started using Python in 2014, and moved to deep learning in 2015. I am an open-source enthusiast and have contributed to the core of several widely used projects, such as PyTorch. I spend most of my research time on improving the interpretability of AI models using TuringNetwork. I have authored a deep learning book. I go by hhsecond on the internet.

Slides

https://docs.google.com/presentation/d/1Xz3pB5xZymvTJgCK51lUduSYcD9t49XfsvDU5ee1bFw/edit#slide=id.p
