Central Metadata Catalogue - understanding data in your pipelines and data stores, the self-serve way
The data ecosystem has come a long way in the last decade. The ride from structured to unstructured data, in a bid to support the 3Vs (volume, variety and velocity) of big data, has been quick. While it's great to have bits flowing at great volumes, wouldn't it be great to capture the semantics somehow? Wouldn't it be great to pick up a message in your stream and know the authoritative source of that message, what hops it has passed through, and what cleansing it has undergone?
Enter the metadata catalogue - a metadata discovery, cataloguing, and control service. It is something that a lot of organizations have been working on simultaneously: we studied at least four open source versions, all released in the last year, by Hortonworks, LinkedIn, Twitter and Netflix. That alone emphasizes why it is needed. The fact that there are so many solutions hints at a difficult problem, and we will shed light on how we solved it with a forked version of Apache Atlas.
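To make the lineage question above concrete, here is a minimal sketch of walking an entity's upstream lineage out of the kind of payload Apache Atlas's v2 REST API returns from `GET /api/atlas/v2/lineage/{guid}`. The JSON below is a hand-written sample; the entity names and GUIDs are hypothetical, and a real response carries more fields than shown.

```python
# Sketch: tracing a dataset back to its authoritative source using an
# Atlas-style lineage payload (guidEntityMap + relations edge list).
sample_lineage = {
    "baseEntityGuid": "g3",  # the entity whose lineage we asked for
    "guidEntityMap": {
        "g1": {"attributes": {"qualifiedName": "raw.events@source_db"}},
        "g2": {"attributes": {"qualifiedName": "etl.cleanse_events"}},
        "g3": {"attributes": {"qualifiedName": "clean.events@warehouse"}},
    },
    # Each relation is a directed edge: data flowed from -> to.
    "relations": [
        {"fromEntityId": "g1", "toEntityId": "g2"},
        {"fromEntityId": "g2", "toEntityId": "g3"},
    ],
}

def upstream_path(lineage):
    """Return qualified names from the base entity back to its root source."""
    # Invert the edges so each entity points at its upstream parent.
    parents = {r["toEntityId"]: r["fromEntityId"] for r in lineage["relations"]}
    guid = lineage["baseEntityGuid"]
    path = []
    while guid is not None:
        path.append(lineage["guidEntityMap"][guid]["attributes"]["qualifiedName"])
        guid = parents.get(guid)  # None once we reach the root source
    return path

print(upstream_path(sample_lineage))
# ['clean.events@warehouse', 'etl.cleanse_events', 'raw.events@source_db']
```

A real pipeline would fetch this payload over HTTP and handle branching lineage (multiple parents per entity), but the hop-by-hop walk is the same idea.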
In this talk, we will discuss:
1. Why we needed a metadata catalogue in our ecosystem.
2. The four available open source solutions.
3. Why we did not use any of them as is.
4. How we changed Apache Atlas to suit our needs.
5. How this stabilized our data pipelines.
Shiv is a passionate engineer who loves building scalable, fault-tolerant and highly available platforms. Shiv has contributed to multiple open source projects, including Apache Pulsar, MySQL and Apache Atlas. He has worked on a variety of products ranging from backend platforms to infrastructure to web applications, and loves collaborating with people, sharing and gathering knowledge through the open source community. Shiv has previously spoken at multiple open source conferences, including FOSSASIA and Open Source India.