Building Large-Scale Visual Augmented Q&A with Vision Language Models

Submitted May 6, 2025

I am submitting for: Speaking at the Fifth Elephant 2025 Annual Conference
Type of submission: 30 mins talk
Choose the topic your submission falls under: Applied AI Engineering & Agentic AI track

Existing Retrieval Augmented Generation (RAG) based Q&A systems process only textual information and cannot answer from the visual elements of documents, such as tables, charts and images, which limits their value and productivity.

Vision language models encode visual elements along with textual information, which makes them usable for retrieval over complex documents. However, there are a few challenges in scaling them, such as:

Multi-vector representation (not supported by popular vector DBs)
In-memory computation required for late interaction retrieval
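
To make the two challenges concrete, here is a minimal sketch of late interaction scoring in the MaxSim style, assuming each query and each page is represented by a matrix of embedding vectors rather than a single vector. The function name, the toy 2-dimensional vectors and the use of NumPy are illustrative assumptions, not the implementation presented in the talk.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late interaction (MaxSim) score between one query and one page.

    query_vecs: (num_query_tokens, dim)
    page_vecs:  (num_page_patches, dim)
    """
    # Full similarity matrix: every query vector against every page vector.
    # This is the part that has to be computed in memory per candidate page.
    sim = query_vecs @ page_vecs.T
    # For each query vector keep its best-matching page vector, then sum.
    return float(sim.max(axis=1).sum())

# Toy data (hypothetical): 2 query vectors, pages with 3 and 1 vectors.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
page_b = np.array([[0.5, 0.5]])

score_a = maxsim_score(query, page_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query, page_b)  # 0.5 + 0.5 = 1.0
```

Because the score depends on the whole matrix of page vectors, a store that indexes one vector per document cannot compute it directly.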

The talk presents an efficient, state-of-the-art visual augmented search and question-answering system at scale, built by integrating vision embeddings with popular vector databases (OpenSearch, Elasticsearch, FAISS). The RAG-based solution retrieves the best matches, applies late interaction re-ranking and uses a multimodal LM to generate exact answers. Our benchmarking results show high accuracy in a scalable setting.
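
The retrieve-then-re-rank flow described above can be sketched as follows. This is a hedged illustration only: the proposal does not say how the multi-vector embeddings are stored in the vector database, so mean pooling to a single vector per page is an assumption made here, and a brute-force NumPy dot product stands in for the OpenSearch, Elasticsearch or FAISS index. All function names and the toy data are hypothetical.

```python
import numpy as np

def mean_pool(vecs: np.ndarray) -> np.ndarray:
    """Collapse a multi-vector embedding into one unit-length vector (assumed strategy)."""
    pooled = vecs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late interaction score: best page match per query vector, summed."""
    return float((query_vecs @ page_vecs.T).max(axis=1).sum())

def retrieve_then_rerank(query_vecs, pages, top_k, top_n):
    # Stage 1: single-vector search. A real system would query a vector DB here;
    # this brute-force dot product is only a stand-in.
    index = np.stack([mean_pool(p) for p in pages])
    coarse = index @ mean_pool(query_vecs)
    candidates = np.argsort(-coarse)[:top_k]
    # Stage 2: late interaction re-ranking, computed only on the top_k candidates.
    reranked = sorted(candidates, key=lambda i: maxsim(query_vecs, pages[i]), reverse=True)
    return [int(i) for i in reranked[:top_n]]

# Toy data (hypothetical): one query and three pages with different numbers of vectors.
query = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pages = [
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    np.array([[0.6, 0.8, 0.0]]),
    np.array([[0.0, 0.0, 1.0]]),
]

ranked = retrieve_then_rerank(query, pages, top_k=2, top_n=2)
```

In this toy case the pooled first stage ranks page 1 ahead of page 0, and the late interaction stage reverses that order. The full similarity matrix is computed only for the `top_k` candidates, which is what keeps the in-memory cost bounded.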

Outline:

Problem Statement: Pain point & use-cases of Visual augmented Q&A (2 min)
Basic architecture of Multi-modal RAG & Advantage of Visual Language Model (5 min)
- Visual LM embedding & Late Interaction Retrieval
- Multimodal Language model
Key challenges in scaling Vision-based RAG (3 min)
Building Scalable Q&A using Vision Embedding Retriever & Late Interaction Re-ranker (12 min)
- Vector DB based Retriever
- Late Interaction Re-ranker
- Final Architecture
Results & Conclusion (3 min)
- Benchmarking results (ViDoRe)
- Current state of Vector DBs
Q&A (5 min)

Takeaways:

Awareness about Vision-RAG (a modern RAG which may replace text-RAG)
Challenges of storing vision embeddings in vector DBs
Learn a scalable implementation of vision-RAG (Improving efficiency)

Audience

Data scientists & Researchers
Product Managers (AI)
AI & DevOps engineers
Architects

Biography
I am a Director, Data Science, with 12+ years of relevant experience in solving problems using advanced analytics, machine learning and deep learning techniques. I started my career as a computer scientist in a government research organization (Bhabha Atomic Research Center) and did research in a variety of domains such as conversational speech, satellite imagery and text.

As part of my work, I have published and presented several research papers at multiple research conferences over the years. I have had the opportunity to speak at past Fifth Elephant and PyCon conferences. I have also trained professionals in machine learning (M.Tech course) as Guest Faculty in the WILP program at BITS Pilani.

Slides
Draft here: Presentation

The Fifth Elephant 2025 Annual Conference CfP
