The Fifth Elephant

The Fifth Elephant 2024 Annual Conference (12th & 13th July)

Maximising the Potential of Data — Discussions around data science, machine learning & AI

Jul 2024

8 Mon

9 Tue

10 Wed

11 Thu

12 Fri

13 Sat 09:00 AM – 06:05 PM IST

14 Sun

Bangalore International Centre, Bangalore

Nested Evolution and Schema Transformation (NEST) Framework for Managing Schema Evolution in Spark

Submitted May 31, 2024

Session type: 30 mins talk

Overview

The NEST Framework automates the handling of dynamic and nested schemas, making it easier for developers to manage schema changes and maintain accurate, deduplicated tables in Spark. We are excited to present this innovative solution at the Data Engineering Conference.

Who is the audience for our session:

This session is designed for professionals who work with streaming data and manage evolving schema versions at the database level. It is aimed in particular at those who work with deeply nested event tables and need to manage schema evolution and deduplication across versions in cases that Delta/Iceberg cannot handle.

What problem/pain are we trying to solve?

When consuming streaming data, the schema changes over time as new fields are added and old fields are removed or altered. When these fields are nested, for example a struct of array of struct of (string, int, map), taking the union of data from different schema versions becomes cumbersome, and masking or hashing PII in the nested fields adds to the difficulty. The data engineer has to write queries to transform nested fields such as arrays and structs.
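To make the union problem concrete, here is a minimal pure-Python sketch. It is not the NEST Framework's code and does not use Spark. It assumes a toy schema representation (a struct is a dict, an array is a one-element list, a primitive is a type-name string), and the names `merge_schemas` and `conform` are hypothetical. Map fields and type changes are deliberately left out.

```python
def merge_schemas(old, new):
    """Return the union of two nested schemas (toy representation, see lead-in)."""
    if isinstance(old, dict) and isinstance(new, dict):
        merged = dict(old)
        for name, new_type in new.items():
            # Recurse into fields present in both versions; adopt new fields as-is.
            merged[name] = merge_schemas(old[name], new_type) if name in old else new_type
        return merged
    if isinstance(old, list) and isinstance(new, list):
        return [merge_schemas(old[0], new[0])]
    if old == new:
        return old
    raise TypeError(f"incompatible types: {old!r} vs {new!r}")


def conform(record, schema):
    """Reshape a record to the schema, filling missing fields with None."""
    if record is None:
        return None
    if isinstance(schema, dict):
        return {name: conform(record.get(name), t) for name, t in schema.items()}
    if isinstance(schema, list):
        return [conform(item, schema[0]) for item in record]
    return record


# Two hypothetical versions of the same event schema.
v1 = {"id": "string", "items": [{"sku": "string"}]}
v2 = {"id": "string", "items": [{"sku": "string", "qty": "int"}], "email": "string"}

unified = merge_schemas(v1, v2)
row_v1 = conform({"id": "a1", "items": [{"sku": "x"}]}, unified)
print(row_v1)
```

Once every record is conformed to the unified schema, rows from both versions share one shape and can be combined without hand-written per-version queries.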

Even though Delta/Iceberg tables support schema evolution, they break if the data type of a column changes or if there are map-type fields.
Maintaining complex SQL is also an overhead on top of regular tasks, and may result in failures if changes are not handled immediately.
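One way to catch such type changes before they cause a failure is to compare schema versions field by field. The sketch below is illustrative only and is not taken from the NEST Framework. It assumes the same kind of toy schema representation (struct as dict, array as one-element list, primitive as a type-name string), and `type_conflicts` is a hypothetical helper name.

```python
def type_conflicts(old, new, path=""):
    """List (path, old_type, new_type) for every field whose type differs."""
    if isinstance(old, dict) and isinstance(new, dict):
        found = []
        # Only fields present in both versions can have a type conflict.
        for name in sorted(old.keys() & new.keys()):
            child = f"{path}.{name}" if path else name
            found += type_conflicts(old[name], new[name], child)
        return found
    if isinstance(old, list) and isinstance(new, list):
        return type_conflicts(old[0], new[0], path + "[]")
    return [] if old == new else [(path, old, new)]


# Two hypothetical schema versions in which two fields changed type.
v1 = {"user": {"age": "int", "tags": ["string"]}, "amount": "int"}
v2 = {"user": {"age": "string", "tags": ["string"]}, "amount": "double"}

conflicts = type_conflicts(v1, v2)
for field_path, old_type, new_type in conflicts:
    print(f"{field_path}: {old_type} -> {new_type}")
```

How each conflict is resolved (casting, widening, or keeping both variants) is a policy decision that this sketch does not address.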

Broad areas we plan to cover during the session

Schema evolution management using Spark.
Incremental deduplication of table versions using Delta.
Masking and small transformations during the deduplication process.
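The deduplication and masking topics above can be sketched in plain Python as follows. This is an assumption-laden illustration, not the framework's implementation and not Delta's merge: the table is a dict keyed by a primary key, the field names `event_id` and `updated_at` are made up, and `mask_path` and `merge_batch` are hypothetical helpers. PII is replaced with its SHA-256 hex digest.

```python
import hashlib


def mask_path(value, path):
    """Return a copy of value with the field at `path` replaced by a SHA-256 digest.

    Lists are traversed transparently, so ("contacts", "email") reaches the
    email inside every element of the contacts array.
    """
    if value is None:
        return None
    if isinstance(value, list):
        return [mask_path(item, path) for item in value]
    if not path:
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if not isinstance(value, dict) or path[0] not in value:
        return value
    copy = dict(value)
    copy[path[0]] = mask_path(value[path[0]], path[1:])
    return copy


def merge_batch(table, batch, key="event_id", order_by="updated_at", pii_paths=()):
    """Upsert a batch into the table, keeping only the latest row per key."""
    result = dict(table)
    for row in batch:
        for path in pii_paths:
            row = mask_path(row, path)
        current = result.get(row[key])
        # Stale rows (older than what the table already holds) are ignored.
        if current is None or row[order_by] >= current[order_by]:
            result[row[key]] = row
    return result


pii = [("contacts", "email")]
batch1 = [
    {"event_id": 1, "updated_at": 1, "contacts": [{"email": "a@example.com"}]},
    {"event_id": 1, "updated_at": 2, "contacts": [{"email": "b@example.com"}]},
]
batch2 = [
    {"event_id": 1, "updated_at": 1, "contacts": [{"email": "a@example.com"}]},
    {"event_id": 2, "updated_at": 5, "contacts": []},
]

table = merge_batch({}, batch1, pii_paths=pii)
table = merge_batch(table, batch2, pii_paths=pii)
print(sorted(table))
```

Because each batch is merged into the existing table rather than recomputed from scratch, only new rows are processed on each run, which is the sense of "incremental" assumed here.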

How will participants benefit from our session?

Participants will gain practical knowledge through slides and a hands-on session on using the framework. They will learn how to dynamically manage schema evolution and deduplication processes, and will see the framework in action.
