Vrunda Gadesha

@vrunda91

A Sovereign Stack for Turning Unstructured Text into Multi-Turn Conversations

Submitted Jun 24, 2026

Training and aligning enterprise models for regional languages like Telugu is held back by scarce text and the high cost of manual data creation. Without an automated pipeline, building high-quality conversational data means hiring bilingual domain experts to read documents, extract topics, and draft realistic dialogue trees by hand. That process is slow, expensive, and difficult to scale. This session demonstrates a production-ready, fully automated data engineering pipeline that transformed an unstructured, 700-page Telugu agricultural PDF into a curated, multi-turn conversation dataset without relying on traditional human annotation.

Our pipeline is sovereign by design: the entire data engineering infrastructure and model execution stack run locally on enterprise hardware. Because it uses open-weight models inside a secure local environment, valuable intellectual property and sensitive domain knowledge are never exposed to third-party APIs. We walk through the technical execution of our entity-guided chunking strategy, showing how the pipeline extracts core focus topics (like “Rice” or “Paddy”) from each text segment and uses them to prime the model. This priming helps the sdg-hub framework generate natural, persona-driven queries, single-turn interactions, and multi-turn dialogue streams. Finally, we share how we enforced automated quality control with dual-gate verification, which checks both Faithfulness (factual grounding) and Answerability (logical resolution) to filter out defective conversations programmatically before the pipeline finishes.
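To make the entity-guided chunking idea concrete, here is a minimal Python sketch. It is not the pipeline presented in the session and it does not use the sdg-hub API. Every name in it (`split_into_chunks`, `extract_focus_entities`, `build_prompt`, the fixed vocabulary list, and the persona string) is a hypothetical stand-in, and the substring match is a deliberately simple placeholder for whatever extraction step the real pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class PrimedChunk:
    text: str
    focus_entities: list[str]


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def extract_focus_entities(chunk: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary terms found in the chunk, in vocabulary order.

    Assumption: plain case-insensitive substring matching. This is naive
    (for example, "rice" also matches inside "price").
    """
    lowered = chunk.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def prime_chunks(text: str, vocabulary: list[str], max_chars: int = 200) -> list[PrimedChunk]:
    """Chunk the text and attach the focus entities found in each chunk."""
    return [
        PrimedChunk(chunk, extract_focus_entities(chunk, vocabulary))
        for chunk in split_into_chunks(text, max_chars)
    ]


def build_prompt(primed: PrimedChunk, persona: str) -> str:
    """Build a question-generation prompt primed with the chunk's focus entities."""
    topics = ", ".join(primed.focus_entities) or "the passage"
    return (
        f"You are {persona}. Using only the passage below, ask one natural "
        f"question about {topics}.\n\nPassage:\n{primed.text}"
    )


if __name__ == "__main__":
    sample = "Rice needs standing water.\n\nPaddy nurseries are sown in June."
    for primed in prime_chunks(sample, ["Rice", "Paddy"], max_chars=30):
        print(build_prompt(primed, "a smallholder farmer"))
        print("---")
```

The point of the sketch is the data flow: each chunk carries its own focus entities, so the generation prompt can name a concrete topic instead of asking for a generic question about the passage.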

Key Takeaways

  • Understand the architecture of an automated pipeline that extracts focus entities from raw text chunks to guide language models in generating natural, contextually grounded questions.
  • Learn how to use sdg-hub to orchestrate language model execution loops, moving from unstructured PDFs to deep multi-turn dialogue streams while filtering the output through dual-gate checks.
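The dual-gate filtering mentioned in the takeaways can be sketched as two checks applied in sequence. This is an illustrative Python sketch only, not the session's implementation and not the sdg-hub API. The `Conversation` type, the `dual_gate_filter` function, and both toy gates are hypothetical. In a real pipeline each gate would most likely be a model-based judge rather than the string checks used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Conversation:
    source_chunk: str
    turns: list[tuple[str, str]]  # (question, answer) pairs


Gate = Callable[[Conversation], bool]


def dual_gate_filter(
    conversations: list[Conversation],
    faithfulness_gate: Gate,
    answerability_gate: Gate,
) -> tuple[list[Conversation], list[tuple[Conversation, str]]]:
    """Keep conversations that pass both gates; record which gate rejected the rest."""
    kept: list[Conversation] = []
    rejected: list[tuple[Conversation, str]] = []
    for conv in conversations:
        if not faithfulness_gate(conv):
            rejected.append((conv, "faithfulness"))
        elif not answerability_gate(conv):
            rejected.append((conv, "answerability"))
        else:
            kept.append(conv)
    return kept, rejected


def toy_faithfulness(conv: Conversation) -> bool:
    """Toy stand-in: every answer must appear verbatim in the source chunk."""
    return all(answer in conv.source_chunk for _, answer in conv.turns)


def toy_answerability(conv: Conversation) -> bool:
    """Toy stand-in: every turn needs a question ending in '?' and a non-empty answer."""
    return all(q.strip().endswith("?") and a.strip() for q, a in conv.turns)


if __name__ == "__main__":
    chunk = "Paddy nurseries are sown in June."
    good = Conversation(chunk, [("When are paddy nurseries sown?", "sown in June")])
    ungrounded = Conversation(chunk, [("When are paddy nurseries sown?", "sown in March")])
    kept, rejected = dual_gate_filter([good, ungrounded], toy_faithfulness, toy_answerability)
    print(len(kept), [reason for _, reason in rejected])
```

Recording the rejecting gate alongside each dropped conversation makes it easier to see whether defects come mostly from ungrounded answers or from unanswerable questions.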

Target Audience

Data Engineers, AI Infrastructure Architects, and Technical Product Owners who build production data pipelines, orchestrate automated data workflows, or manage high-density enterprise datasets under strict privacy and resource constraints.

Author’s bio

Vrunda Gadesha
AI Advocate | Technical Content Author

Vrunda Gadesha is a Data Scientist, Ph.D. scholar, and AI enthusiast with expertise in Large Language Models, Natural Language Processing, Machine Learning, and technical content creation. Skilled in Python programming, she has led AI solution development and shared her knowledge through academic writing and corporate training. She is passionate about advancing AI and data science and is committed to continuous learning and impactful innovation.

Hosted by

Jumpstart better data engineering and AI futures