The Fifth Elephant 2024 Annual Conference (12th & 13th July)

Maximising the Potential of Data — Discussions around data science, machine learning & AI

Manob Chakraborty

Nested Evolution and Schema Transformation (NEST) Framework for Managing Schema Evolution in Spark

Submitted May 31, 2024

Overview

The NEST Framework automates the handling of dynamic and nested schemas, making it easier for developers to manage schema changes and maintain accurate, deduplicated tables in Spark. We are excited to present this innovative solution at the Data Engineering Conference.

Who is the audience for our session?

This session is designed for professionals who work with streaming data and manage evolving schema versions at the database level. It is especially relevant if you work with deeply nested event tables and need to handle schema evolution and deduplication across versions in cases that Delta/Iceberg cannot manage.

What problem/pain are we trying to solve?

When consuming streaming data, the schema changes over time as new fields are added and old fields are removed or altered. When these fields are nested, for example a struct of array of struct of (string, int, map), taking the union of data from different schema versions becomes cumbersome, and masking or hashing PII in the nested fields adds to the difficulty. The data engineer has to write queries that transform nested fields such as arrays and structs.

  • Delta/Iceberg tables support schema evolution, but they break if the data type of a column changes or if there are map type fields.
  • Maintaining complex SQL is an overhead on top of regular tasks, and it may result in failures if schema changes are not handled immediately.
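
To make the nested-union problem concrete, here is a minimal sketch in plain Python (not Spark, so it runs without a cluster) of merging two versions of a nested schema. It is not the NEST implementation: the schema notation, the function name `merge_schemas`, and the sample fields are all assumptions made for illustration. A dict stands for a struct, a one-element list for an array, and a string for a primitive or map type name.

```python
from typing import Any

# Hypothetical notation: dict = struct, one-element list = array,
# string = primitive (or map) type name.
Schema = Any


def merge_schemas(old: Schema, new: Schema, path: str = "") -> Schema:
    """Union two schema versions, recursing into structs and arrays."""
    if isinstance(old, dict) and isinstance(new, dict):
        merged = {}
        # Keep the old field order, then append fields that only exist in new.
        for name in list(old) + [k for k in new if k not in old]:
            child = f"{path}.{name}" if path else name
            if name in old and name in new:
                merged[name] = merge_schemas(old[name], new[name], child)
            else:
                merged[name] = old[name] if name in old else new[name]
        return merged
    if isinstance(old, list) and isinstance(new, list):
        return [merge_schemas(old[0], new[0], f"{path}[]")]
    if old == new:
        return old
    # A changed data type needs an explicit policy; this sketch only reports it.
    raise TypeError(f"type conflict at {path or '<root>'}: {old!r} vs {new!r}")


# Illustrative schema versions (field names are made up).
v1 = {
    "user": {"id": "string", "tags": ["string"]},
    "items": [{"sku": "string", "qty": "int"}],
}
v2 = {
    "user": {"id": "string", "email": "string"},
    "items": [{"sku": "string", "qty": "int", "attrs": "map<string,string>"}],
}
merged = merge_schemas(v1, v2)
```

Fields present in only one version are carried into the merged schema, while a changed leaf type (for example `int` becoming `string`) surfaces as an error naming the nested path. Deciding what to do in that case, such as casting to a wider type, is the part that requires a policy, and this sketch deliberately leaves it open.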

Broad areas we plan to cover during the session

  • Schema evolution management using Spark.
  • Incremental deduplication of table versions using Delta.
  • Masking and small transformations during the deduplication process.
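
As a rough illustration of the last two areas, the sketch below shows incremental deduplication combined with masking, again in plain Python rather than Spark or Delta. It is an assumption-laden toy, not the framework's code: the helper names `mask_pii` and `dedupe_latest`, the choice of SHA-256 for masking, and the `event_id`/`ts` columns are all hypothetical.

```python
import hashlib
from typing import Any


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def mask_pii(value: Any, pii_fields: set) -> Any:
    """Return a copy of a nested record with string PII fields hashed."""
    if isinstance(value, dict):
        return {
            k: _sha256(v) if k in pii_fields and isinstance(v, str)
            else mask_pii(v, pii_fields)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [mask_pii(v, pii_fields) for v in value]
    return value


def dedupe_latest(existing, batch, key, ts, pii_fields):
    """Fold a new batch into existing rows, keeping the latest row per key.

    Rows in `existing` are assumed to be masked already; only the incoming
    batch is masked here.
    """
    latest = {row[key]: row for row in existing}
    for row in batch:
        masked = mask_pii(row, pii_fields)
        current = latest.get(masked[key])
        if current is None or masked[ts] >= current[ts]:
            latest[masked[key]] = masked
    return list(latest.values())


# Illustrative batches (all values are made up).
batch1 = [
    {"event_id": "e1", "ts": 1, "user": {"email": "a@example.com"}},
    {"event_id": "e1", "ts": 2, "user": {"email": "a@example.com"}},
    {"event_id": "e2", "ts": 1, "user": {"email": "b@example.com"}},
]
batch2 = [
    {"event_id": "e2", "ts": 5, "user": {"email": "b@example.com"}},
    {"event_id": "e3", "ts": 3, "user": {"email": "c@example.com"}},
]
table = dedupe_latest([], batch1, "event_id", "ts", {"email"})
table = dedupe_latest(table, batch2, "event_id", "ts", {"email"})
```

Each call processes only the new batch against the current table state, which is the incremental part. Masking is applied to incoming rows before they are compared, so the stored table never holds raw PII values.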

How will participants benefit from our session?

Participants will gain practical knowledge through slides and a hands-on session on using the framework. They will learn how to dynamically manage schema evolution and deduplication processes, and see the framework in action.
