The Fifth Elephant 2019

The eighth edition of India's best data conference

Metadata Catalogue - Making sense of all your data, whether stream or store, the self serve way

Submitted by Shivji Kumar Jha (@shiv4289) on Saturday, 30 March 2019

Abstract

What

This talk presents the case for a central metadata catalogue: a single repository and service for metadata discovery, cataloguing, and control. It is another step towards enabling self-service on your streams. We did this by forking Apache Atlas and establishing a central metadata repository that captures metadata across datasets and surfaces it through a single platform, simplifying data discovery and lineage tracing irrespective of formats, locations and tools.

Why should you care though?

Because the community is starting to care. Multiple companies are building their own solutions (Twitter, LinkedIn and Netflix, among others), and there is Apache Atlas, which released its first GA version, 1.0, roughly 6 months ago. We adopted it while the project was still in incubation, and we are happy we did!

What does this cover?

Here is a brief overview of what the platform allows its users to do:
1. Discover data and data-related artifacts: Data Sources, Events, Databases, Tables, Attributes, ETL Processes, Workflows etc.
2. Trace the origin and owner of data
3. Understand data definitions, semantics, and constraints as intended by the producers
4. Trace data flow, evolution, transformations and dependencies
5. Enable automated, programmatic checks of metadata consistency and dependencies through an API
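
To make the capabilities above concrete, here is a minimal, in-memory sketch of a catalogue that supports discovery, ownership lookup, lineage tracing and a programmatic consistency check. It is an illustration only: it does not use Apache Atlas or its API, and every name in it (`Entity`, `Catalogue`, `trace_origins`, `dangling_references`, the sample datasets and teams) is a hypothetical chosen for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A catalogued artifact (table, event, ETL process, ...). Hypothetical model."""
    name: str
    kind: str
    owner: str
    attributes: dict = field(default_factory=dict)


class Catalogue:
    """Toy central catalogue: entities plus 'derived from' lineage edges."""

    def __init__(self):
        self.entities = {}
        # Maps an entity name to the names of the entities it is derived from.
        self.upstream = defaultdict(set)

    def register(self, entity):
        self.entities[entity.name] = entity

    def add_lineage(self, source, target):
        self.upstream[target].add(source)

    def search(self, kind=None, owner=None):
        """Discovery: filter registered entities by kind and/or owner."""
        return sorted(
            e.name
            for e in self.entities.values()
            if (kind is None or e.kind == kind)
            and (owner is None or e.owner == owner)
        )

    def trace_origins(self, name):
        """Lineage: breadth-first walk over upstream edges, returning all ancestors."""
        seen, queue = set(), deque([name])
        while queue:
            current = queue.popleft()
            for parent in self.upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return sorted(seen)

    def dangling_references(self):
        """Consistency check: names used in lineage edges but never registered."""
        missing = set()
        for target, sources in self.upstream.items():
            for name in (target, *sources):
                if name not in self.entities:
                    missing.add(name)
        return sorted(missing)


catalogue = Catalogue()
catalogue.register(Entity("orders_events", "event", "checkout-team"))
catalogue.register(Entity("orders_raw", "table", "data-platform"))
catalogue.register(Entity("orders_daily", "table", "analytics"))
catalogue.add_lineage("orders_events", "orders_raw")
catalogue.add_lineage("orders_raw", "orders_daily")
catalogue.add_lineage("refunds_raw", "orders_daily")  # deliberately never registered

print(catalogue.search(kind="table"))
print(catalogue.trace_origins("orders_daily"))
print(catalogue.dangling_references())
```

A production catalogue would persist this graph and expose it over an API; the sketch only shows why holding entities and lineage edges in one place makes the discovery, ownership and dependency questions cheap to answer.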

Outline

  1. Why you need a central metadata catalogue too.
  2. Why your schema registry is not enough.
  3. A quick overview of the open source solutions: those from Twitter, LinkedIn and Netflix, and the Apache offering too.
  4. Why we based our solution on Apache Atlas, and why we maintain a fork of it internally.
  5. How it helped us make sense of each and every piece of data / message in flight or at rest.

Speaker bio

Shiv is a passionate engineer who loves building scalable, fault-tolerant and highly available platforms. Shiv has contributed to multiple open source projects, including Apache Pulsar, MySQL and Apache Atlas. Shiv has worked on a variety of products, ranging from backend platforms to infra to web applications, and loves collaborating with people, sharing and gathering knowledge through the open source community. Shiv has previously spoken at multiple open source conferences, including FOSSASIA and Open Source India.

Comments

  • Anwesha Sarkar (@anweshaalt) Reviewer 3 months ago

    Thank you for submitting the proposal. Please submit your slides and preview video by 20th April at the latest; this helps us close the review process.

  • Zainab Bawa (@zainabbawa) Reviewer 2 months ago

    Some comments on your proposal:

    1. What is the readiness required – in the systems – for adopting Apache Atlas?
    2. Are there use cases where Apache Atlas doesn’t work?
    3. Apart from large companies such as LinkedIn, Twitter, and Netflix, what examples of Apache Atlas users can participants from mid-sized companies relate to?
    4. In your specific case, you have to explain the before-and-after situation of Apache Atlas. How has life changed after adoption of Apache Atlas? Did you have to make tradeoffs with the adoption? How did your team make adjustments?

    We’ll need to see draft slides – by 27 May – which help us understand your thinking and assess the fit of your proposal for The Fifth Elephant. Since this proposal was submitted a while ago with no further updates, please also let us know if your plans have changed and whether you want us to move your proposal to future editions of Rootconf and/or The Fifth Elephant.