The Fifth Elephant 2024 Annual Conference (12th & 13th July)

Maximising the Potential of Data — Discussions around data science, machine learning & AI

Abhijeet Kumar

@abhijeet3922

RAG Vs Fine-Tuning: Implementation Anecdotes from Data Catalog Enrichment Solution

Submitted Jun 12, 2024

Abstract

This talk will take the audience through our experience building a content-generation solution for a data catalog enrichment effort, from a modeling perspective (RAG with a pre-trained model and RAG with a fine-tuned model).

For this use case, I will talk about the approach taken to:

  • Understand the data inputs in the prompt.
  • Enrich the prompt.
  • Construct a few-shot setup using RAG.
  • Compare fine-tuned Llama with pre-trained Llama and GPT-3.5-turbo.
  • Apply multiple evaluation metrics from a monitoring and governance perspective.
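The few-shot RAG step above can be sketched as follows. This is a minimal illustration: retrieval here is a toy token-overlap similarity over a hypothetical pool of curated examples, standing in for the embedding-based retrieval a real pipeline would use, and all table names and descriptions are invented.

```python
# Sketch: build a few-shot prompt by retrieving the most similar curated
# table descriptions (RAG). Toy retrieval uses token overlap; a real setup
# would use an embedding index. All example data below is hypothetical.

EXAMPLES = [
    {"table": "cust_txn_daily", "columns": "cust_id, txn_amt, txn_dt",
     "description": "Daily customer transaction amounts keyed by customer id."},
    {"table": "acct_balance", "columns": "acct_id, bal_amt, as_of_dt",
     "description": "End-of-day account balances per account."},
    {"table": "emp_roster", "columns": "emp_id, dept, hire_dt",
     "description": "Employee roster with department and hire date."},
]

def normalize(s: str) -> str:
    """Split snake_case names and comma lists into plain tokens."""
    return s.replace("_", " ").replace(",", " ")

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens (stand-in for embedding similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_fewshot_prompt(table: str, columns: str, k: int = 2) -> str:
    """Retrieve the k nearest examples and prepend them as few-shot shots."""
    query = normalize(f"{table} {columns}")
    ranked = sorted(
        EXAMPLES,
        key=lambda ex: similarity(query, normalize(f"{ex['table']} {ex['columns']}")),
        reverse=True,
    )
    shots = "\n\n".join(
        f"Table: {ex['table']}\nColumns: {ex['columns']}\nDescription: {ex['description']}"
        for ex in ranked[:k]
    )
    return f"{shots}\n\nTable: {table}\nColumns: {columns}\nDescription:"

prompt = build_fewshot_prompt("cust_txn_monthly", "cust_id, txn_amt, txn_mo")
print(prompt)
```

The retrieved shots anchor the model's style and terminology to already-curated catalog entries, which is the main reason to prefer retrieval over a fixed set of hand-picked examples.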

I will talk about fine-tuning details using the LoRA technique, and will also compare results from three models: few-shot pre-trained Llama2-13B, few-shot fine-tuned Llama2-7B and GPT-3.5-turbo.
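The appeal of LoRA for this kind of fine-tuning is that it trains only small low-rank adapter matrices B and A (so W' = W + (alpha/r)·BA) while the full weight matrix W stays frozen. A back-of-the-envelope sketch of the parameter savings; the hidden size matches Llama2-7B's layers, while the rank and alpha are common choices, not necessarily the exact configuration used in the talk:

```python
# LoRA replaces a full d x d weight update with two low-rank factors
# B (d x r) and A (r x d), applied as W' = W + (alpha / r) * B @ A.
# Numbers below are illustrative, not the talk's actual hyperparameters.

d = 4096        # hidden size of Llama2-7B transformer layers
r, alpha = 8, 16  # common LoRA rank / scaling choices (assumed)

full_params = d * d              # parameters if we fine-tuned W directly
lora_params = d * r + r * d      # parameters in the B and A adapters

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.4%}")
```

For this layer the adapters hold well under 1% of the full matrix's parameters, which is what makes fine-tuning a 7B model tractable on modest hardware.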

The talk will draw various insights about the behaviour of these models in terms of content generation, including accuracy (against ground truth), alignment (factual consistency with prompt inputs) and toxicity detection.
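As a rough intuition for the alignment dimension, one can check whether column-like tokens in a generated description actually appear in the prompt inputs. The sketch below is a deliberately simple stdlib stand-in; the metrics discussed in the talk are model-based scorers (e.g. BERTScore F1 for accuracy), and all data here is invented:

```python
# Toy "alignment" check in the spirit of a factual-consistency metric:
# flag generated descriptions that mention column-like (snake_case)
# tokens absent from the prompt inputs. Illustrative only.
import re

def alignment_score(generated: str, prompt_columns: list[str]) -> float:
    """Fraction of snake_case tokens in the output that appear in the inputs."""
    mentioned = re.findall(r"\b[a-z]+(?:_[a-z]+)+\b", generated.lower())
    if not mentioned:
        return 1.0  # nothing column-like claimed, so nothing to contradict
    known = {c.lower() for c in prompt_columns}
    return sum(tok in known for tok in mentioned) / len(mentioned)

cols = ["cust_id", "txn_amt", "txn_dt"]
good = "Stores txn_amt per cust_id keyed by txn_dt."
bad = "Stores txn_amt per cust_id along with acct_bal."  # acct_bal is hallucinated
print(alignment_score(good, cols), alignment_score(bad, cols))
```

A score below 1.0 signals the generation introduced a column the prompt never supplied, which is exactly the failure mode a governance-oriented consistency metric is meant to catch.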

Use-case

An enterprise data catalog is a large effort in any enterprise: it keeps curated metadata about the data for user reference, which mostly means writing descriptions of tables and their columns for business consumption. This has always been a manual effort.

Here, we are talking about hundreds of database schemas, thousands of database tables and millions of columns in the data catalog. Often, curated content covers merely 3-5% of it. The objective is to enrich the data catalog using an AI solution.

Intended Audience

This talk is intended for data engineers, data scientists and researchers in the GenAI space who want to understand model behaviour under different constructs (RAG, fine-tuning etc.).

It is also intended for data leaders, data stewards and data SMEs who are close to enterprise data; the topic may interest them as an initiative to enrich the metadata of enterprise data catalogs.

In general, the talk will be relevant to any professional working on a GenAI use case with Python.

Outline

  • Defining the scope of the use case
  • LLM Solution - Design and Implementation
  • Prompt Engineering
    • Prompt Enrichment
    • Few-shot using RAG
  • Implemented Models: Fine-tuned Llama2-7B, Llama2-13B, GPT-3.5 Turbo
  • Evaluation Metrics and Interpretation
    • BERTScore F1 (Accuracy)
    • Factual Consistency Score (Alignment)
    • Toxicity Detection
  • Learnings and Challenges - Lessons

Impact

  • The solution improves metadata coverage, aiming to enrich the catalog from the current 3-5% to 25% (curated and prepared zones).
  • It saves data stewards' content-curation time by an estimated 50% (notional).
  • It improves search over the data catalog.

