Speak at The Fifth Elephant 2026 Annual Conference
Share your work with the community
Fri, 31 Jul 2026, 09:00 AM – 06:00 PM IST
Submitted Jun 24, 2026
Training and aligning enterprise models for regional languages such as Telugu is severely limited by the scarcity of text and the high cost of manual data creation. Without an automated pipeline, creating high-quality conversational data means hiring bilingual domain experts to read documents, extract topics, and draft realistic dialogue trees by hand, a process that is slow, expensive, and difficult to scale. This session demonstrates a production-ready, fully automated data engineering pipeline that transformed an unstructured, 700-page Telugu agricultural PDF into a curated, multi-turn conversation dataset without traditional human annotation.
Our pipeline is fully sovereign: the entire data engineering infrastructure and model execution stack run locally on enterprise hardware. Because it uses open-weight models within a secure local environment, valuable intellectual property and sensitive domain knowledge are never exposed to third-party APIs. We walk through the technical execution of our entity-guided chunking strategy, showing how the pipeline extracts core focus topics (like “Rice” or “Paddy”) from text segments and uses them to prime the model. This priming helps the sdg-hub framework generate natural, persona-driven queries, single-turn interactions, and multi-turn dialogue streams. Finally, we share how we enforced automated quality control with dual-gate verification, which checks both Faithfulness (factual grounding) and Answerability (logical resolution) to programmatically filter out conversational defects before the pipeline finishes.
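The entity-guided priming and dual-gate filtering described above can be illustrated with a minimal sketch in plain Python. It does not use the sdg-hub API and is not the pipeline presented in the session. Every name here (split_into_chunks, extract_topic, build_prompt, dual_gate, toy_faithful, toy_answerable) is hypothetical. The glossary-based topic extractor and the word-overlap judges are deliberately simple stand-ins for the model calls a real pipeline would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set


@dataclass
class Chunk:
    text: str
    topic: str  # focus entity used to prime the generator, e.g. "Rice"


@dataclass
class QAPair:
    question: str
    answer: str
    source: Chunk


def split_into_chunks(document: str, max_words: int = 120) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of about max_words words."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in document.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def extract_topic(text: str, glossary: Sequence[str]) -> Optional[str]:
    """Toy extractor (assumption): the glossary term that occurs most often as a substring."""
    lowered = text.lower()
    counts = {term: lowered.count(term.lower()) for term in glossary}
    best = max(counts, key=counts.get, default=None)
    return best if best is not None and counts[best] > 0 else None


def build_prompt(chunk: Chunk, persona: str) -> str:
    """Prime the generator with the chunk's focus topic and a persona."""
    return (
        f"You are {persona}. Ask one question about {chunk.topic} "
        f"that the passage below can answer.\n\nPassage:\n{chunk.text}"
    )


Judge = Callable[[QAPair], bool]


def dual_gate(pairs: Sequence[QAPair], is_faithful: Judge, is_answerable: Judge) -> List[QAPair]:
    """Keep only the pairs that pass both gates."""
    return [p for p in pairs if is_faithful(p) and is_answerable(p)]


def _words(text: str) -> Set[str]:
    """Lowercased whitespace-separated tokens with surrounding punctuation stripped."""
    return {w.strip(".,;:?!\"'").lower() for w in text.split()}


def toy_faithful(pair: QAPair) -> bool:
    """Stand-in for a judge model: every answer word must appear in the source chunk."""
    return _words(pair.answer) <= _words(pair.source.text)


def toy_answerable(pair: QAPair) -> bool:
    """Stand-in for a judge model: a real question with a non-empty answer."""
    return pair.question.strip().endswith("?") and bool(pair.answer.strip())
```

In this sketch the two gates are independent callables, so either one can be replaced by a call to a locally hosted judge model without changing the filtering step.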
How to use sdg-hub to orchestrate language model execution loops, safely moving from unstructured PDFs to high-depth multi-turn dialogue streams using dual-gate filters.
Data Engineers, AI Infrastructure Architects, and Technical Product Owners who build production data pipelines, orchestrate automated data workflows, or manage high-density enterprise datasets under strict privacy and resource constraints.
Vrunda Gadesha
AI Advocate | Technical Content Author
Vrunda Gadesha is a Data Scientist, Ph.D. scholar, and AI enthusiast with expertise in Large Language Models, Natural Language Processing, Machine Learning, and technical content creation. Skilled in Python programming, she has led AI solution development and shared her knowledge through academic writing and corporate training. She is passionate about advancing AI and data science and is committed to continuous learning and impactful innovation.