Metadata Catalogue - Making sense of all your data, whether stream or store, the self serve way

Jul 25, 2019 (Thu), 09:15 AM – 05:45 PM IST

Jul 26, 2019 (Fri), 09:20 AM – 05:30 PM IST

NIMHANS Convention Centre, Bengaluru

Submitted Mar 30, 2019

What

This talk makes the case for a central metadata catalogue: a single repository and service for metadata discovery, cataloguing, and control. It is another step towards enabling self-service on your streams. We built ours by forking Apache Atlas and establishing a central metadata repository that captures metadata across datasets and surfaces it through a single platform, simplifying data discovery and lineage tracing irrespective of formats, locations and tools.

Why should you care though?

Because the community is starting to care. Several companies are building their own solutions (notably Twitter, LinkedIn and Netflix), and there is Apache Atlas, which shipped its first GA release, 1.0, roughly six months ago. We adopted it while the project was still in incubation, and we are happy we did!

What does this cover

Here is a brief overview of what the platform allows its users to do:

Discover data and data-related artifacts: Data Sources, Events, Databases, Tables, Attributes, ETL Processes, Workflows, etc.
Trace the origin and owner of data
Understand data definitions, semantics, and constraints as intended by the producers
Trace data flow, evolution, transformations and dependencies
Enable automated, programmatic checks of metadata consistency and dependencies through an API
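
To make the lineage and consistency-check capabilities above concrete, here is a minimal sketch of the idea as a toy in-memory catalogue. It is not the Apache Atlas API and not the platform described in this talk. The `Catalogue` class, its method names, and the dataset names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Catalogue:
    """Toy in-memory metadata catalogue (hypothetical, not the Apache Atlas API)."""

    def __init__(self):
        self.entities = {}                # name -> {"type": ..., "owner": ...}
        self.upstream = defaultdict(set)  # name -> names it is derived from

    def register(self, name, type_, owner):
        """Record an entity together with its type and owner."""
        self.entities[name] = {"type": type_, "owner": owner}

    def add_process(self, inputs, output):
        """Record that `output` is produced from `inputs` (e.g. by an ETL job)."""
        self.upstream[output].update(inputs)

    def origins(self, name):
        """Walk lineage upstream and return the root sources of `name`."""
        roots, stack, seen = set(), [name], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            parents = self.upstream.get(current, set())
            if not parents:
                roots.add(current)  # nothing upstream: this is an origin
            stack.extend(parents)
        return roots

    def dangling_references(self):
        """Consistency check: names used in lineage but never registered."""
        mentioned = set(self.upstream)
        for parents in self.upstream.values():
            mentioned |= parents
        return sorted(mentioned - set(self.entities))


# Hypothetical datasets: a stream feeding a raw table feeding a daily rollup.
catalogue = Catalogue()
catalogue.register("orders_events", "stream", "checkout-team")
catalogue.register("orders_raw", "table", "data-platform")
catalogue.register("orders_daily", "table", "analytics")
catalogue.add_process(["orders_events"], "orders_raw")
catalogue.add_process(["orders_raw", "currency_rates"], "orders_daily")

print(sorted(catalogue.origins("orders_daily")))  # ['currency_rates', 'orders_events']
print(catalogue.dangling_references())            # ['currency_rates']
```

Here the check flags `currency_rates` because a process reads from it but nobody has registered it. A programmatic API makes it possible to catch this kind of gap automatically.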

Outline

Why you need a central metadata catalogue too.
Why your schema registry is not enough.
A quick overview of the open source solutions from Twitter, LinkedIn and Netflix, and the Apache offering too.
Why we based our solution on Apache Atlas, and why we maintain a fork of it internally.
How it helped us make sense of every piece of data and every message, in flight or at rest.

Speaker bio

Shiv is a passionate engineer who loves building scalable, fault-tolerant and highly available platforms. Shiv has contributed to multiple open source projects, including Apache Pulsar, MySQL and Apache Atlas. Shiv has worked on a variety of products, ranging from backend platforms to infrastructure to web applications, and loves collaborating with people, sharing and gathering knowledge through the open source community. Shiv has previously spoken at multiple open source conferences, including FOSSASIA and Open Source India.

The Fifth Elephant 2019