Enterprise-ready & Compliant Synthetic Data Generation for Data Governance

This submission has been added to the schedule

Submitted Nov 10, 2025

Type of submission: 15 mins talk

In AI-driven workflows, a critical bottleneck is the scarcity of realistic training datasets, and strict privacy and security rules forbid using actual customer records. To address these challenges, we have developed an agentic synthetic data generation pipeline that produces domain-rich, realistic, and coherent datasets for training PII (Personally Identifiable Information) and sensitive-reference detection models while preserving customers' privacy. This LLM-driven workflow autonomously curates synthetic samples across varied industries, such as finance, healthcare, and legal, while incorporating guardrails to ensure that generated content remains non-toxic, unbiased, and contextually safe.

In this session, we will present NetApp’s end-to-end framework for detecting sensitive data and sensitive references, powered by synthetic data. Our approach demonstrates how synthetic datasets can effectively bridge the data availability gap while maintaining strong alignment with real-world linguistic patterns. Through extensive experimentation, we observed progressive improvements in detection accuracy as synthetic data volume and diversity increased. The session will delve into the architecture of the agentic pipeline, data quality validation strategies, and domain adaptation techniques. Attendees will gain insights into how synthetic data can enable responsible AI development, reinforce data governance, and ensure compliance without exposing or relying on real customer information.
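As a rough illustration of the generate-then-validate idea described above, the sketch below builds a tiny labeled dataset across the three example domains. It is not NetApp's pipeline: a template filler stands in for the LLM, the blocklist stands in for real toxicity and bias guardrails, and every name in it (TEMPLATES, FAKE_VALUES, BLOCKLIST, generate, passes_guardrails, build_dataset) is hypothetical.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Sample:
    domain: str
    text: str
    spans: list  # (start, end, label) character offsets of synthetic PII


# Hypothetical templates and fake values; a real pipeline would call an LLM here.
TEMPLATES = {
    "finance": "Wire transfer for {NAME} was sent from account {ACCOUNT}.",
    "healthcare": "Patient {NAME} can be reached at {EMAIL} after discharge.",
    "legal": "Counsel for {NAME} filed the motion; contact {EMAIL}.",
}
FAKE_VALUES = {
    "NAME": ["Asha Rao", "Tom Keller"],
    "ACCOUNT": ["0000-1111-2222", "9999-8888-7777"],
    "EMAIL": ["asha@example.com", "tom@example.org"],
}
BLOCKLIST = {"idiot", "stupid"}  # placeholder for a real safety classifier
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def generate(domain, rng):
    """Fill one template and record where each fake PII value landed."""
    template = TEMPLATES[domain]
    parts, spans, cursor, length = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        label = match.group(1)
        value = rng.choice(FAKE_VALUES[label])
        spans.append((length, length + len(value), label))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return Sample(domain, "".join(parts), spans)


def passes_guardrails(sample):
    """Reject samples with blocklisted words or out-of-bounds spans."""
    words = {w.strip(".,;").lower() for w in sample.text.split()}
    if words & BLOCKLIST:
        return False
    return all(0 <= s < e <= len(sample.text) for s, e, _ in sample.spans)


def build_dataset(n_per_domain, seed=0):
    """Generate n_per_domain samples per domain and keep those that pass."""
    rng = random.Random(seed)
    dataset = []
    for domain in TEMPLATES:
        for _ in range(n_per_domain):
            sample = generate(domain, rng)
            if passes_guardrails(sample):
                dataset.append(sample)
    return dataset
```

Calling build_dataset(2) yields six samples, each carrying character-offset labels that a detection model could be trained on. No real customer record is involved at any step.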

Takeaways:

Real-world data scarcity no longer bottlenecks model training or fine-tuning; high-quality synthetic corpora can fill the gap. Diverse, coherent synthetic datasets are key to achieving robust, generalizable performance across domains.
By leveraging agentic synthetic-data generation, we create datasets that mimic real-world documents so closely that they are indistinguishable from genuine records. We have observed consistent performance improvements with each increment of quality synthetic samples, which motivates continued investment in this approach.

Target audiences

This session will be particularly beneficial for machine learning engineers, data scientists, and AI researchers working on privacy-sensitive applications or responsible AI initiatives. It will also provide valuable insights for leaders and architects working in sensitive or high-security domains where data governance and compliance play an important role. Attendees from organizations dealing with regulated data, such as those in the finance, healthcare, and government sectors, will gain an understanding of how synthetic data can be strategically leveraged to enhance model performance while maintaining strict privacy guarantees.

Authors

Presenter:
Darshan Adiga,
Senior Data Scientist at NetApp

Co-author:
Lakshya Daulani,
Data Scientist at NetApp

The Fifth Elephant 2025 Winter Edition