Vrunda Gadesha

@vrunda91

Preparing Data for LLM Applications using Data Prep Kit

Submitted Mar 28, 2025

Abstract

Preparing high-quality datasets is a critical yet time-consuming step when building large language models (LLMs). Data Prep Kit, an open-source Python toolkit, simplifies and automates data preparation tasks, enabling faster and more efficient workflows for LLM applications. This session will explore how Data Prep Kit addresses key challenges such as text extraction, deduplication, and data quality scoring, along with insights from real-world use cases such as building a retrieval-augmented generation (RAG) application. Attendees will learn how to use the toolkit to streamline their data pipelines, improve dataset quality, and speed up LLM development.

Overview of Features and Concepts for Data Preparation

  • Text Extraction from Diverse Formats - Extracts text from PDFs, HTML, Word documents, and other formats while preserving document structure like tables and headings.
  • Duplicate and Near-Duplicate Removal - Uses hash-based exact deduplication and fuzzy deduplication with MinHash to eliminate redundant data and improve dataset quality.
  • Document Quality Scoring - Calculates metrics like word count, bad word detection, placeholder text (e.g., Lorem Ipsum), and bullet point ratios to filter out low-quality data.
  • PII and Malicious Content Removal - Identifies and removes personally identifiable information (PII) and malicious code to keep datasets safe and compliant.
  • Scalability for Large Datasets - Runs efficiently on a laptop for small datasets and scales to distributed runtimes such as Ray and Spark for large-scale data preparation.
  • Integration with Open-Source Tools - Leverages libraries like Docling for text extraction and ClamAV for malware detection, making it extensible and modular.
  • Real-World Use Case - Successfully used to prepare terabytes of data for IBM’s Granite LLM, demonstrating its effectiveness in enterprise-scale applications.
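
The deduplication idea above (exact hashing plus MinHash-based fuzzy matching) can be illustrated with a minimal, standard-library sketch. This is not Data Prep Kit's API or implementation; the function names, the 3-word shingle size, the 64-hash signature length, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re

NUM_HASHES = 64    # assumed signature length
SHINGLE_SIZE = 3   # assumed word-shingle width


def exact_key(text: str) -> str:
    """SHA-256 of lowercased, whitespace-normalized text (exact dedup key)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = SHINGLE_SIZE) -> set[str]:
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates first, then near-duplicates above the threshold."""
    seen_keys: set[str] = set()
    kept: list[str] = []
    kept_signatures: list[list[int]] = []
    for doc in docs:
        key = exact_key(doc)
        if key in seen_keys:
            continue  # exact duplicate after normalization
        signature = minhash_signature(doc)
        if any(estimated_similarity(signature, s) >= threshold for s in kept_signatures):
            continue  # near-duplicate of a document already kept
        seen_keys.add(key)
        kept.append(doc)
        kept_signatures.append(signature)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic and only suitable for small collections; at scale, MinHash signatures are typically grouped with locality-sensitive hashing so that only likely matches are compared.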

Takeaways

  • A comprehensive understanding of how to use Data Prep Kit to automate and enhance data preparation for LLM pipelines.
  • Insights into best practices for cleaning, deduplicating, and scoring data quality.
  • Practical knowledge of scalable workflows for preparing multimodal datasets.
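
As a rough illustration of the quality-scoring metrics named in the feature overview (word count, bad-word detection, placeholder text, bullet point ratio), the sketch below computes them with plain Python. It is not Data Prep Kit's implementation; the block list, the bullet prefixes, and the filter thresholds are hypothetical values chosen only for the example.

```python
import re

# Hypothetical block list and bullet markers, for illustration only.
BAD_WORDS = {"badword1", "badword2"}
BULLET_PREFIXES = ("•", "-", "*")


def quality_metrics(text: str) -> dict:
    """Compute simple per-document quality signals."""
    words = re.findall(r"\w+", text.lower())
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    bullet_lines = sum(1 for line in lines if line.startswith(BULLET_PREFIXES))
    return {
        "word_count": len(words),
        "bad_word_count": sum(1 for w in words if w in BAD_WORDS),
        "has_lorem_ipsum": "lorem ipsum" in text.lower(),
        "bullet_ratio": bullet_lines / len(lines) if lines else 0.0,
    }


def passes_filter(metrics: dict, min_words: int = 5, max_bullet_ratio: float = 0.9) -> bool:
    """Keep a document only if every signal is within the assumed thresholds."""
    return (
        metrics["word_count"] >= min_words
        and metrics["bad_word_count"] == 0
        and not metrics["has_lorem_ipsum"]
        and metrics["bullet_ratio"] <= max_bullet_ratio
    )
```

In practice the thresholds would be tuned against a sample of the target corpus rather than fixed up front.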

Which Audiences Will Your Session Benefit?

  • AI engineers and practitioners working on training or fine-tuning LLMs.
  • Data scientists and data engineers responsible for preparing datasets for machine learning applications.

Additional Resources

Here are some key resources to explore and understand the capabilities of Data Prep Kit and its applications in LLM development:

  • GitHub Repository - The full source code, documentation, and examples for Data Prep Kit are available on GitHub.
  • IBM Granite Open Source LLM - The transforms developed in this toolkit were instrumental in preparing data for IBM’s Granite models, which are now available on Hugging Face.
  • LF AI & Data Foundation - Data Prep Kit is hosted as a project under the LF AI & Data Foundation, reflecting its importance and adoption in the open-source AI community.
  • IBM Developer Learning Path - A step-by-step guide to getting started with Data Prep Kit, including details on its architecture and sample use cases such as RAG and fine-tuning with real-world data.

These resources provide technical depth, practical examples, and community-driven insights to help you fully leverage Data Prep Kit in your projects.

Speaker’s Bio

Vrunda Gadesha - AI Advocate | IBM

She is a data scientist, Ph.D. scholar, and AI enthusiast with expertise in large language models, natural language processing, machine learning, and technical content creation. Skilled in Python programming, she has led AI solution development and shared her knowledge through academic writing and corporate training. She is passionate about advancing AI and data science and is committed to continuous learning and impactful innovation.

Note: We are open to presenting this topic as either a full 30-minute talk or a lightning talk.
