Thank you for the overwhelming response. This workshop is now full, and registrations for in-person attendance are closed.
The workshop will be live-streamed. Stream links will be shared with 2025 edition ticket holders and annual members the day before the workshop.
Most business and research documents today are rich with infographics - tables, charts, and images - that carry essential context for decision-making. While multi-modal LLMs can query such content directly, traditional Text-RAG systems fall short because they process only text. Even modern multi-modal LLMs struggle with long context windows, often missing critical information in the middle.
In this hands-on workshop, participants will learn how to build a Vision-RAG system that jointly encodes text and visual information for powerful Visual-Augmented Q&A. Using ColPali (a Vision Language Model) and vector databases, the session will teach participants how to process complex documents and enable image+text Q&A workflows at scale.
- The workshop is 4 hours long.
- It is an advanced, hands-on workshop.
- Beginners are welcome, but must be prepared to catch up on:
- Python proficiency
- A basic understanding of LLMs and embeddings
- Familiarity with vector databases, and some exposure to multi-modal or image-processing concepts
This workshop is best suited for participants with intermediate to advanced AI/ML backgrounds.
- Seats for in-person participation are limited.
- A live stream is available for The Fifth Elephant members to participate remotely.
- Workshop materials and code (work in progress): GitHub Repository
Module 1: Introduction to Visual Augmented Q&A
- What are Multi-modal LLMs?
- Overview of Visual Language Models
Module 2: Foundation — Prompting for Q&A using Multi-modal LLMs
- Introduction to basic prompt engineering for multi-modal systems
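Multi-modal prompting typically means pairing the question text with one or more document-page images in a single request. As an illustration only (the workshop may use a different model, provider, or SDK), here is the general payload shape for an OpenAI-style chat API with an inline base64-encoded image; the model name is a placeholder:

```python
import base64

def build_multimodal_prompt(question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat payload pairing a question with a page image.

    Illustrative sketch: model name and message schema follow the
    OpenAI Chat Completions convention, but any vision-capable model works.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # placeholder: any vision-capable model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Example: ask about a (dummy) page image
payload = build_multimodal_prompt(
    "What does the chart on this page show?", b"\x89PNG-dummy-bytes"
)
```

The key idea the module builds on: the image travels alongside the text in the same prompt, so the model can ground its answer in the page's visual content.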
Module 3: Setting up Vision-based RAG
- Vision Embeddings using ColPali (talk)
- Late interaction retrieval using ColPali
- Hands-on: Build an end-to-end Visual Augmented Q&A workflow
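Late interaction means the query and each page are embedded as *sets* of vectors (one per query token, one per image patch), and relevance is scored at query time by matching each query token to its best patch. A minimal NumPy sketch of this MaxSim scoring, the late-interaction mechanism ColPali inherits from ColBERT (the dimensions and random data below are illustrative, not ColPali's actual sizes):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score between a query and one page.

    query_vecs: (num_query_tokens, dim) - one embedding per query token
    page_vecs:  (num_patches, dim)      - one embedding per image patch
    """
    # Normalize rows so dot products become cosine similarities
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sim = q @ p.T                         # (num_query_tokens, num_patches)
    # For each query token, keep its best-matching patch, then sum
    return float(sim.max(axis=1).sum())

# Rank a few (random, illustrative) pages for one query
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                      # 8 query tokens
pages = [rng.normal(size=(196, 128)) for _ in range(3)]  # 196 patches/page
scores = [maxsim_score(query, pv) for pv in pages]
best_page = int(np.argmax(scores))
```

Because scoring happens per token-patch pair, fine-grained details (a number in a table cell, a label on a chart) can drive retrieval, which single-vector embeddings tend to wash out.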
Module 4: Practical Challenges with Vision-based RAG
- Discuss limitations, pitfalls, and mitigation strategies for productionizing Vision-RAG systems
Module 5: Integration with Vector Databases
- Architecture overview (talk)
- Hands-on:
- Storing multi-vector representations in a Vector DB
- Embedding-based retrieval and ColPali-based re-ranking
- End-to-end Python implementation demo
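The retrieve-then-re-rank pattern above can be sketched end to end: store a pooled single vector per page for fast first-stage retrieval (the role a vector DB's ANN index plays), then re-rank the candidates with late-interaction MaxSim over the full multi-vector representations. This is a self-contained toy with random data, not the workshop's actual code; a real system would keep the pooled vectors in a vector DB such as FAISS or Milvus:

```python
import numpy as np

# Toy corpus: each page is a set of patch embeddings (multi-vector)
rng = np.random.default_rng(42)
corpus = {f"page_{i}": rng.normal(size=(196, 128)) for i in range(50)}

def pool(vecs: np.ndarray) -> np.ndarray:
    """Mean-pool a multi-vector set into one unit-norm vector (stage-1 index)."""
    v = vecs.mean(axis=0)
    return v / np.linalg.norm(v)

pooled = {pid: pool(v) for pid, v in corpus.items()}  # what the ANN index holds

def coarse_retrieve(query_vecs: np.ndarray, k: int = 10) -> list[str]:
    """Stage 1: top-k pages by cosine similarity of pooled vectors."""
    q = pool(query_vecs)
    sims = {pid: float(q @ v) for pid, v in pooled.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Stage 2: late-interaction (MaxSim) score over the full multi-vectors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    return float((q @ p.T).max(axis=1).sum())

query = rng.normal(size=(8, 128))
candidates = coarse_retrieve(query, k=10)
reranked = sorted(candidates, key=lambda pid: maxsim(query, corpus[pid]),
                  reverse=True)
```

The design trade-off: the cheap pooled index narrows 50 pages to 10 candidates, and the expensive MaxSim pass runs only on those, keeping per-query cost manageable at scale.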
Conclusion + Q&A
- Python 3.8+ environment
- Familiarity with LLMs, embeddings, and retrieval systems
- Basic understanding of vector databases (e.g., FAISS, Milvus, or others)
- Interest in multi-modal AI applications and document Q&A pipelines
- Aspiring Data Scientists and AI Engineers
- DevOps/ML Ops Engineers working on AI infrastructure
- Researchers and ML practitioners in the GenAI space
- Product engineers interested in multi-modal AI systems
Note: Beginners with some Python and ML experience can participate but should review multi-modal LLM and embedding concepts in advance.
- How to build a Vision-RAG system for Visual Augmented Q&A
- How to work with multi-vector vision embeddings and vector databases
- Integration strategies for ColPali with Vector DBs
- Practical challenges in building multi-modal retrieval systems and how to address them
Abhijeet Kumar is a data science leader with over 12 years of experience applying advanced analytics, machine learning, and deep learning to real-world problems. He began his career as a computer scientist at the Bhabha Atomic Research Centre (BARC), conducting research in domains such as conversational speech, satellite imagery, and document processing. Abhijeet has published multiple research papers, presented at The Fifth Elephant 2024 and at PyCon, and taught machine learning as guest faculty for the BITS Pilani WILP M.Tech program.
Rachna Saxena is a data scientist with 8 years of experience in the AI domain. She holds a Master's degree in ML from Georgia Tech. Before moving into AI, she gained extensive experience in the semiconductor industry, working across the software stack, from firmware to application development for consumer electronics products. She has authored research papers and holds patents in the field of machine learning.
This workshop is open for The Fifth Elephant members and for The Fifth Elephant 2025 annual conference ticket buyers.
Seats are limited and available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, call +91-7676332020 or write to info@hasgeek.com.