The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Metadata Catalogue - Making sense of all your data, whether stream or store, the self serve way

Submitted by Shivji Kumar Jha (@shiv4289) on Mar 30, 2019

Status: Rejected

Abstract

What

This talk makes the case for a central metadata catalogue: a single service for metadata discovery, cataloguing, and control. It is another step towards enabling self-service on your streams. We built ours by forking Apache Atlas and establishing a central metadata repository that captures metadata across datasets and surfaces it through a single platform, which simplifies data discovery and lineage tracing irrespective of formats, locations, and tools.

Why should you care though?

Because the community is starting to care. Several companies are building their own solutions (Twitter, LinkedIn, and Netflix, among others), and Apache Atlas released its first GA version, 1.0, roughly 6 months ago. We adopted it while the project was still in incubation, and we are happy we did!

What does this cover

Here is a brief overview of what the platform allows its users to do:
1. Discover data and data-related artifacts: data sources, events, databases, tables, attributes, ETL processes, workflows, etc.
2. Trace the origin and owner of data
3. Understand data definitions, semantics, and constraints as intended by the producers
4. Trace data flow, evolution, transformations, and dependencies
5. Run automated, programmatic checks for metadata consistency and dependencies through an API
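
To make the capabilities above concrete, here is a minimal sketch of a toy in-memory catalogue in Python. It is not the Apache Atlas API or the platform described in this talk: the `Catalogue` class, its method names, and the sample dataset names are all hypothetical, chosen only to illustrate registering entities with owners, tracing lineage, and running a programmatic consistency check.

```python
class Catalogue:
    """Toy in-memory metadata catalogue (hypothetical, not the Atlas API)."""

    def __init__(self):
        self.entities = {}   # entity name -> {"type": ..., "owner": ...}
        self.upstream = {}   # entity name -> set of names it is derived from

    def register(self, name, type_, owner):
        """Add a data source, table, stream, etc. together with its owner."""
        self.entities[name] = {"type": type_, "owner": owner}

    def add_process(self, inputs, output):
        """Record that `output` is produced from `inputs` (e.g. by an ETL job)."""
        self.upstream.setdefault(output, set()).update(inputs)

    def lineage(self, name):
        """Return every entity that `name` transitively depends on."""
        seen, stack = set(), [name]
        while stack:
            for parent in self.upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def dangling(self):
        """Consistency check: names used in lineage but never registered."""
        referenced = set(self.upstream)
        for parents in self.upstream.values():
            referenced |= parents
        return referenced - set(self.entities)


# Hypothetical sample data: one stream feeding two derived tables.
catalogue = Catalogue()
catalogue.register("orders_topic", "stream", "checkout-team")
catalogue.register("orders_raw", "table", "data-platform")
catalogue.register("daily_revenue", "table", "analytics")
catalogue.add_process(["orders_topic"], "orders_raw")
catalogue.add_process(["orders_raw", "fx_rates"], "daily_revenue")

print(sorted(catalogue.lineage("daily_revenue")))  # ['fx_rates', 'orders_raw', 'orders_topic']
print(catalogue.dangling())                        # {'fx_rates'}
```

In this sketch the check flags `fx_rates` because it appears in a lineage edge but was never registered, which is the kind of inconsistency an API-driven check could catch automatically.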

Outline

  1. Why you need a central metadata catalogue too.
  2. Why your schema registry is not enough.
  3. A quick overview of the open source solutions from Twitter, LinkedIn, and Netflix, plus the Apache offering.
  4. Why we based our solution on Apache Atlas, and why we maintain an internal fork of it.
  5. How it helped us make sense of every piece of data and every message, in flight or at rest.

Speaker bio

Shiv is a passionate engineer who loves building scalable, fault-tolerant, and highly available platforms. Shiv has contributed to multiple open source projects, including Apache Pulsar, MySQL, and Apache Atlas. Shiv has worked on a variety of products, ranging from backend platforms to infrastructure to web applications, and loves collaborating with people, sharing and gathering knowledge through the open source community. Shiv has previously spoken at multiple open source conferences, including FOSSASIA and Open Source India.
