The Fifth Elephant 2025 Annual Conference CfP

Abhijeet Kumar

@abhijeet3922

Vision-RAG: Developing Visual Augmented Q&A using Vision & Multimodal LLM

Submitted May 6, 2025

Problem Statement: Most documents include infographics (visual elements such as tables, charts and images) that convey complex information to readers. Multi-modal LLMs are powerful tools for question answering over such complex documents. However, two challenges limit their productivity and value add:

  1. Multi-modal LLM performance degrades with longer context (the "lost in the middle" issue)
  2. Existing Text-RAG based Q&A systems can only process textual information (text embeddings)

Solution: Vision-RAG systems are modern state-of-the-art architectures that encode text and infographics jointly to answer user queries. Vision Language Models such as ColPali can encode visual elements along with textual information.

Why does this matter? Many analyst teams in businesses or captive units manually research complex documents, with turnaround times of days to weeks.
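The core of ColPali-style retrieval is late-interaction (MaxSim) scoring over multi-vector embeddings. A minimal NumPy sketch of that scoring step, assuming per-token query vectors and per-patch page vectors have already been produced by a model (the shapes and data here are toy stand-ins, not ColPali's actual API):

```python
import numpy as np

def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score: each query-token vector is
    matched to its most similar page-patch vector, and those best
    similarities are summed."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sim = q @ p.T                 # (n_query_tokens, n_patches) cosine matrix
    return sim.max(axis=1).sum()  # best patch per query token, summed

# Toy multi-vector embeddings; in practice a model such as ColPali
# emits one vector per query token and one per image patch.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))                        # 4 token vectors, dim 8
page_a = rng.normal(size=(16, 8))                      # unrelated page
page_b = np.vstack([query, rng.normal(size=(12, 8))])  # page containing the query content

assert maxsim_score(query, page_b) > maxsim_score(query, page_a)
```

Because each query token independently picks its best-matching patch, a page scores well only if it covers all parts of the query, which is what lets visual patches (table cells, chart regions) contribute alongside text.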

Outline
It will be a hands-on session for participants and will cover the following modules:

  • Module 1: What is Visual Augmented Q&A? (talk)

    • Introduction to Multi-modal LLMs
    • Introduction to Visual Language Models
  • Module 2: Foundation: Prompting for Q&A using a Multi-modal LLM

  • Module 3: Setting up Vision-based RAG:

    • Vision Embedding using ColPali (talk)
    • Setting up late-interaction retrieval using ColPali
    • Hands-on: Develop an end-to-end Visual Augmented Q&A
  • Module 4: Practical challenges with Vision-based RAG (talk)

  • Module 5: Integration with a Vector DB

    • Overall architecture (talk)
    • Hands-on: Storing multi-vector representations in a Vector DB
    • Hands-on: Embedding-based retrieval & ColPali-based re-ranking
    • End-to-end Python process: demo
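Module 5's retrieve-then-re-rank flow can be sketched end to end in plain NumPy: a cheap single-vector search over mean-pooled embeddings (the flat vector one would store in a vector DB alongside the multi-vectors), followed by MaxSim re-ranking of the shortlisted candidates. All names and data below are illustrative, not any specific vector DB's API:

```python
import numpy as np

def normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(query_vecs, page_vecs):
    # Late-interaction score: best patch per query token, summed.
    return (normalise(query_vecs) @ normalise(page_vecs).T).max(axis=1).sum()

rng = np.random.default_rng(1)
dim = 8
query = rng.normal(size=(4, dim))        # multi-vector query (4 token vectors)
pages = [rng.normal(size=(16, dim)) for _ in range(20)]
pages[7] = np.tile(query, (4, 1))        # plant a page whose patches match the query

# Stage 1: coarse retrieval on mean-pooled single vectors
# (the flat embedding stored in the vector DB for ANN search).
pooled = normalise(np.stack([p.mean(axis=0) for p in pages]))
q_pooled = normalise(query.mean(axis=0))
top_k = np.argsort(pooled @ q_pooled)[::-1][:5]   # 5 candidate pages

# Stage 2: MaxSim re-ranking over the shortlisted multi-vectors.
best_page = max(top_k, key=lambda i: maxsim(query, pages[i]))
```

In a real deployment, stage 1 is the vector DB's approximate-nearest-neighbour search over the stored pooled embeddings, and stage 2 fetches the stored multi-vector representations for just the shortlisted pages, keeping the expensive late-interaction scoring off the full corpus.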

Takeaways
By the end of the Multi-modal RAG workshop, participants will be able to:

  • Understand how to work with vision-based Vector DBs
  • Develop an end-to-end process for Visual Augmented Q&A
  • Recognize practical challenges & strategies to address them

Audience

  • Aspiring Data Scientists
  • AI/DevOps Engineers
  • Researchers in the Gen AI space

Biography
I am a Director, Data Science at Fidelity Investments with 12+ years of experience solving problems using advanced analytics, machine learning and deep learning techniques. I started my career as a computer scientist at a government research organization (Bhabha Atomic Research Center), where I carried out research across a variety of domains such as conversational speech, satellite imagery and text.

As part of my work, I have published and presented several research papers at multiple conferences over the years. I have had the opportunity to speak at past Fifth Elephant and PyCon conferences. I have also trained professionals in machine learning (M.Tech course) as a Guest Faculty in the BITS Pilani WILP program.

Workshop Material
In Progress: https://github.com/abhijeet3922/vision-RAG/

