The Fifth Elephant 2025 Annual Conference CfP

Speak at The Fifth Elephant 2025 Annual Conference

Abhijeet Kumar

@abhijeet3922

Building Large-Scale Visual Augmented Q&A with Vision Language Models

Submitted May 6, 2025

Existing Retrieval Augmented Generation (RAG) based Q&A systems process only textual information. They cannot answer from the visual elements of a document, such as tables, charts and images, which limits their value and the productivity gains they offer.

Vision language models encode visual elements along with textual information, which makes them suitable for retrieval over complex documents. However, scaling them poses a few challenges:

  1. Multi-vector representation (not supported by popular vector DBs)
  2. Late interaction retrieval requires in-memory computation.
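
To make the two challenges concrete, here is a minimal sketch of late interaction scoring in its usual MaxSim form. It is an illustration under stated assumptions, not the system presented in the talk: the 2-dimensional toy vectors and the names `dot` and `maxsim_score` are hypothetical, and real vision embedding models emit far more and far larger vectors per page.

```python
from typing import Sequence

Vector = Sequence[float]
MultiVector = Sequence[Vector]  # one vector per query token or per page patch


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: MultiVector, page_vecs: MultiVector) -> float:
    """Late interaction (MaxSim) score.

    For every query vector, keep the similarity of its best-matching page
    vector, then sum those maxima over the whole query.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Hypothetical toy embeddings: 2 query vectors, 2-3 vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.5, 0.5]]

score_a = maxsim_score(query, page_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query, page_b)  # 0.5 + 0.5 = 1.0
```

Each page is stored as a set of vectors rather than a single one, and scoring a page needs all of its vectors at query time. That is why a single-vector index cannot hold this representation directly and why exhaustive scoring tends to happen in memory.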

The talk presents an efficient, state-of-the-art visual augmented search and question-answering system that works at scale by integrating vision embeddings with popular vector databases (OpenSearch, Elasticsearch, FAISS). The RAG-based solution retrieves the best matches, re-ranks them with late interaction, and uses a multimodal LM to generate exact answers. Our benchmarking results show high accuracy in a scalable setting.
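
The retrieve-then-re-rank flow described above could be sketched as follows. This is a hedged illustration only. The abstract does not say how the multi-vector embeddings are made compatible with a vector database, so mean pooling each page into one vector is an assumption. A brute-force sort stands in for the vector DB query, and all names and toy values are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
MultiVector = Sequence[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: MultiVector) -> List[float]:
    """Collapse a multi-vector embedding into one vector (assumed strategy)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def maxsim(query_vecs: MultiVector, page_vecs: MultiVector) -> float:
    """Late interaction score: sum of per-query-vector maximum similarities."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def retrieve_then_rerank(
    query_vecs: MultiVector,
    pages: Dict[str, MultiVector],
    candidates: int,
    top_k: int,
) -> List[str]:
    # Stage 1: cheap single-vector search. In a real deployment this would be
    # a nearest-neighbour query against a vector DB; here it is a plain sort.
    pooled_query = mean_pool(query_vecs)
    pooled_index = {pid: mean_pool(vecs) for pid, vecs in pages.items()}
    shortlist = sorted(
        pooled_index,
        key=lambda pid: dot(pooled_query, pooled_index[pid]),
        reverse=True,
    )[:candidates]

    # Stage 2: late interaction re-ranking, applied to the shortlist only.
    reranked = sorted(
        shortlist,
        key=lambda pid: maxsim(query_vecs, pages[pid]),
        reverse=True,
    )
    return reranked[:top_k]


# Hypothetical toy corpus of three pages with 2-d patch vectors.
pages = {
    "chart": [[1.0, 0.0], [0.0, 1.0]],
    "table": [[0.8, 0.8], [0.4, 0.4]],
    "text": [[-1.0, 0.0], [0.0, -1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

result = retrieve_then_rerank(query, pages, candidates=2, top_k=2)
```

In this toy corpus the pooled search ranks "table" ahead of "chart", while the late interaction pass reverses that order. The expensive multi-vector scoring only runs on the shortlist, which is what keeps the second stage affordable as the corpus grows.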

Outline:

  1. Problem Statement: Pain points & use cases of Visual Augmented Q&A (2 min)
  2. Basic architecture of Multi-modal RAG & Advantage of Visual Language Model (5 min)
    • Visual LM embedding & Late Interaction Retrieval
    • Multimodal Language model
  3. Key Challenges in Scaling Vision-based RAG (3 min)
  4. Building Scalable Q&A using Vision Embedding Retriever & Late Interaction Re-ranker (12 min)
    • Vector DB based Retriever
    • Late Interaction Re-ranker
    • Final Architecture
  5. Results & Conclusion (3 min)
    • Benchmarking results (ViDoRe)
    • Current state of Vector DBs
  6. Q&A (5 min)

Takeaways:

  1. Awareness of Vision-RAG (a modern RAG approach that may replace text-RAG)
  2. Challenges of storing vision embeddings in vector DBs
  3. A scalable, more efficient implementation of Vision-RAG

Audience

  • Data scientists & Researchers
  • Product Managers (AI)
  • AI & DevOps engineers
  • Architects

Biography
I am Director, Data Science at Fidelity Investments, with 12+ years of experience solving problems using advanced analytics, machine learning and deep learning techniques. I started my career as a computer scientist at a government research organization (Bhabha Atomic Research Centre), where I did research in a variety of domains such as conversational speech, satellite imagery and text.

As part of my work, I have published and presented several research papers at multiple research conferences over the years. I have previously spoken at The Fifth Elephant and PyCon conferences. I have also trained professionals in machine learning (M.Tech course) as Guest Faculty in the WILP program at BITS Pilani.

Slides
Coming soon.

Hosted by

Jump starting better data engineering and AI futures