Manob Chakraborty

Nested Evolution and Schema Transformation (NEST) Framework for Managing Schema Evolution in Spark

Submitted May 31, 2024

Overview

The NEST Framework automates the handling of dynamic and nested schemas, making it easier for developers to manage schema changes and maintain accurate, deduplicated tables in Spark. We are excited to present this innovative solution at the Data Engineering Conference.

Who is the audience for our session?

This session is designed for professionals who work with streaming data and manage evolving schema versions at the database level. It is especially relevant if you work with deeply nested event tables and need to manage schema evolution and deduplication across versions in cases that Delta/Iceberg cannot handle.

What problem/pain are we trying to solve?

When consuming streaming data, the schema changes over time as new fields are added and old fields are removed or altered. When these fields are nested, for example a struct of an array of structs of (string, int, map), taking the union of data from different schema versions becomes cumbersome, and masking or hashing PII in the nested fields adds to the difficulty. The data engineer has to write queries to transform nested fields such as arrays and structs.

  • Even though Delta/Iceberg tables support schema evolution, they break if the data type of a column changes or if there are map type fields.
  • Maintaining complex SQL is also an overhead on top of regular tasks, and it may result in failures if schema changes are not handled immediately.
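
To make the nested-union problem concrete, here is a minimal sketch in plain Python rather than Spark. It is not the NEST implementation. The schema encoding (type names as strings, structs as dicts, arrays as one-element lists), the function name merge_schemas, and the fallback to "string" on a type conflict are all assumptions made for illustration.

```python
from typing import Any, List, Tuple

# Hypothetical schema encoding: "int" / "string" / ... for leaf types,
# {"field": schema} for a struct, [schema] for an array of that element type.
Schema = Any


def merge_schemas(old: Schema, new: Schema, path: str = "") -> Tuple[Schema, List[str]]:
    """Union two nested schemas and report the paths whose types conflict."""
    conflicts: List[str] = []
    if isinstance(old, dict) and isinstance(new, dict):
        merged = {}
        # Keep the old field order, then append fields that only exist in new.
        for name in list(old) + [n for n in new if n not in old]:
            child = f"{path}.{name}" if path else name
            if name in old and name in new:
                merged[name], sub = merge_schemas(old[name], new[name], child)
                conflicts.extend(sub)
            else:
                merged[name] = old[name] if name in old else new[name]
        return merged, conflicts
    if isinstance(old, list) and isinstance(new, list):
        element, sub = merge_schemas(old[0], new[0], f"{path}[]")
        return [element], sub
    if old == new:
        return old, conflicts
    # Assumption: on a type change, record the path and fall back to string.
    conflicts.append(path)
    return "string", conflicts


v1 = {"user": {"id": "int", "email": "string"},
      "items": [{"sku": "string", "qty": "int"}]}
v2 = {"user": {"id": "string", "phone": "string"},
      "items": [{"sku": "string", "qty": "int", "attrs": "map<string,string>"}]}

merged, conflicts = merge_schemas(v1, v2)
print(merged)
print(conflicts)
```

In this sketch the changed type of user.id is reported as a conflict instead of failing the union, and fields that exist in only one version are carried into the merged schema.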

Broad areas we plan to cover during the session

  • Schema evolution management using Spark.
  • Incremental deduplication of table versions using Delta.
  • Masking and small transformations during the deduplication process.
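
The last two areas can be sketched together in plain Python. This is an illustration under stated assumptions, not the framework's code: the names mask_path and merge_batch, the use of SHA-256 for masking, and the "keep the row with the highest ordering value per key" rule are hypothetical choices. In Spark, the upsert step would typically be expressed as a Delta MERGE instead of a dictionary update.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def mask_path(value: Any, path: List[str]) -> Any:
    """Return a copy of value with the field at a nested path SHA-256 hashed.

    Arrays met along the path are traversed element by element.
    """
    if value is None:
        return None
    if not path:
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if isinstance(value, list):
        return [mask_path(item, path) for item in value]
    if isinstance(value, dict) and path[0] in value:
        out = dict(value)
        out[path[0]] = mask_path(value[path[0]], path[1:])
        return out
    return value  # path not present in this record version


def merge_batch(table: Dict[Any, dict], batch: Iterable[dict], key: str,
                order_by: str, pii_paths: List[List[str]]) -> Dict[Any, dict]:
    """Mask each incoming row, then keep only the newest row per key."""
    for row in batch:
        for path in pii_paths:
            row = mask_path(row, path)
        current = table.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            table[row[key]] = row
    return table


table: Dict[Any, dict] = {}
pii = [["user", "email"], ["contacts", "phone"]]
merge_batch(table, [
    {"id": 1, "ts": 1, "user": {"email": "a@x.io"}},
    {"id": 1, "ts": 2, "user": {"email": "b@x.io"},
     "contacts": [{"phone": "111"}, {"phone": "222"}]},
], key="id", order_by="ts", pii_paths=pii)
# A later batch carrying an older version of the same key does not overwrite.
merge_batch(table, [{"id": 1, "ts": 1, "user": {"email": "old@x.io"}}],
            key="id", order_by="ts", pii_paths=pii)
print(table[1])
```

Masking before the comparison means the stored table never holds raw PII, and processing each batch against the existing table keeps the deduplication incremental.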

How will participants benefit from our session?

Participants will gain practical knowledge through slides and a hands-on session on using the framework. They will learn how to dynamically manage schema evolution and deduplication processes and see the framework in action.
