Seeing is retrieving: search chatbots with vision embeddings

Hands-on workshop - The Fifth Elephant 2025 Annual Conference

Housefull!

Thank you for the overwhelming response. This workshop is housefull! Registrations are closed for in-person attendance.
The workshop will be live streamed. Stream links will be shared with 2025 edition ticket holders and annual members on the day before the workshop.

🔍 Workshop overview

Most business and research documents today are rich with infographics: tables, charts, and images that carry essential context for decision-making. Multi-modal LLMs can answer questions over such content, but traditional Text-RAG systems fall short because they process only text. Even modern multi-modal LLMs struggle with long context windows, often missing critical information in the middle of the input.

In this hands-on workshop, participants will learn how to build a Vision-RAG system that jointly encodes text and visual information for powerful Visual Augmented Q&A. Using ColPali (a Vision Language Model) and vector databases, the session will teach participants how to process complex documents and enable image+text Q&A workflows at scale.
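
As a preview of the core building block, here is a minimal sketch of embedding page images and text queries into multi-vector representations with ColPali. It assumes the open-source colpali-engine package and the public vidore/colpali-v1.2 checkpoint; the workshop's exact models, versions, and code may differ, and the page image file name is hypothetical.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

MODEL = "vidore/colpali-v1.2"  # assumption: any public ColPali checkpoint works here
model = ColPali.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained(MODEL)

# Each page image becomes a bag of patch-level vectors, not one pooled vector.
pages = [Image.open("report_page_1.png")]  # hypothetical scanned page
queries = ["What does the revenue chart show for Q3?"]

with torch.no_grad():
    page_embs = model(**processor.process_images(pages))      # (1, n_patches, 128)
    query_embs = model(**processor.process_queries(queries))  # (1, n_tokens, 128)

# Built-in late-interaction (MaxSim) scoring; covered in Module 3 below.
scores = processor.score_multi_vector(query_embs, page_embs)
print(scores)  # (n_queries, n_pages) relevance matrix
```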

Note

  • This workshop is 4 hours long.
  • It is an advanced hands-on workshop.
  • Beginners are welcome, but must be prepared to catch up on:
    • Python proficiency
    • A basic understanding of LLMs and embeddings
    • Familiarity with vector databases and some exposure to multi-modal or image processing concepts
  • This workshop is best suited for those with intermediate to advanced AI/ML backgrounds.
  • Limited seats available for participation.
  • Live stream available for The Fifth Elephant members to participate remotely.
  • Workshop materials and code (in progress): GitHub Repository

🧭 Agenda

  • Module 1: Introduction to Visual Augmented Q&A

    • What are Multi-modal LLMs?
    • Overview of Visual Language Models
  • Module 2: Foundation — Prompting for Q&A using Multi-modal LLMs

    • Introduction to basic prompt engineering for multi-modal systems
  • Module 3: Setting up Vision-based RAG

    • Vision Embeddings using ColPali (talk)
    • Late interaction retrieval using ColPali (a MaxSim scoring sketch follows this agenda)
    • Hands-on: Build an end-to-end Visual Augmented Q&A workflow
  • Module 4: Practical Challenges with Vision-based RAG

    • Discuss limitations, pitfalls, and mitigation strategies for productionizing Vision-RAG systems
  • Module 5: Integration with Vector Databases

    • Architecture overview (talk)
    • Hands-on:
      • Storing multi-vector representations in a Vector DB (an illustrative sketch follows this agenda)
      • Embedding-based retrieval and ColPali-based re-ranking
      • End-to-end Python implementation demo
  • Conclusion + Q&A
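
For reference ahead of Module 3: late interaction retrieval scores a query against a page with the ColBERT-style MaxSim operator, where each query token vector is matched to its best document patch vector and the per-token maxima are summed. A minimal PyTorch sketch, with random tensors standing in for real ColPali embeddings:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score, as in ColBERT/ColPali.

    query_emb: (n_query_tokens, dim) -- one vector per query token
    doc_emb:   (n_doc_patches, dim)  -- one vector per document image patch
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_patches)
    return sim.max(dim=1).values.sum()   # best patch per query token, summed

# Random stand-ins: a 20-token query against a page with 1030 patch vectors,
# each 128-dimensional (ColPali's embedding width), L2-normalized.
q = F.normalize(torch.randn(20, 128), dim=-1)
d = F.normalize(torch.randn(1030, 128), dim=-1)
print(maxsim_score(q, d))
```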
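
And ahead of Module 5: some vector databases can hold one multi-vector point per page and apply MaxSim scoring server-side. The sketch below uses Qdrant purely as an illustration (the workshop does not prescribe a particular database); the collection name, payload fields, and random embeddings are stand-ins:

```python
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # throwaway in-process instance for the demo

# Each point's "vector" is itself a list of 128-d vectors; with this config,
# Qdrant compares multivectors using the MaxSim operator.
client.create_collection(
    collection_name="pages",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM,
        ),
    ),
)

# Stand-ins for ColPali outputs (see the sketches above).
page_vectors = np.random.rand(1030, 128).tolist()  # patch vectors for one page
query_vectors = np.random.rand(20, 128).tolist()   # token vectors for one query

client.upsert(
    collection_name="pages",
    points=[models.PointStruct(id=0, vector=page_vectors,
                               payload={"doc": "report.pdf", "page": 1})],
)

hits = client.query_points(collection_name="pages", query=query_vectors, limit=5)
print(hits.points[0].score)  # MaxSim relevance of the best-matching page
```

In a two-stage setup, a cheaper single-vector search can shortlist pages first, with ColPali-style MaxSim used only to re-rank the shortlist, which is the retrieval-plus-re-ranking pattern covered in Module 5.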

💻 Prerequisites

  • Python 3.8+ environment
  • Familiarity with LLMs, embeddings, and retrieval systems
  • Basic understanding of vector databases (e.g., FAISS, Milvus, or others)
  • Interest in multi-modal AI applications and document Q&A pipelines

👥 Who should attend

  • Aspiring Data Scientists and AI Engineers
  • DevOps/ML Ops Engineers working on AI infrastructure
  • Researchers and ML practitioners in the GenAI space
  • Product engineers interested in multi-modal AI systems
Note: Beginners with some Python and ML experience can participate, but should review multi-modal LLM and embedding concepts in advance.

📚 What will participants learn?

  • How to build a Vision-RAG system for Visual Augmented Q&A
  • How to work with multi-vector vision embeddings and vector databases
  • Integration strategies for ColPali with Vector DBs
  • Practical challenges in building multi-modal retrieval systems and how to address them

👨‍🏫 Instructor bios

Abhijeet Kumar is a data science leader with over 12 years of experience applying advanced analytics, machine learning, and deep learning solutions to real-world problems. He began his career as a computer scientist at the Bhabha Atomic Research Centre (BARC), conducting research in domains such as conversational speech, satellite imagery, and document processing. Abhijeet has published multiple research papers, presented at The Fifth Elephant 2024 and at past PyCon conferences, and taught Machine Learning as Guest Faculty for the BITS Pilani WILP M.Tech program.

Rachna Saxena is a Data Scientist with 8 years of experience in the AI domain. She holds a Master’s degree in ML from Georgia Tech. Before that, she gained extensive experience in the semiconductor industry, working across the software stack, from firmware to application development for consumer electronics products. She has authored research papers and holds patents in the field of machine learning.

How to attend this workshop

This workshop is open for The Fifth Elephant members and for The Fifth Elephant 2025 annual conference ticket buyers.

Seats are limited and available on a first-come, first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, call +91-7676332020 or write to info@hasgeek.com.

Venue

Underline Centre, 3rd Floor
Above Blue Tokai, 24, 3rd A Cross, 1st Main Road
Bengaluru - 560071
Karnataka, IN

Hosted by

The Fifth Elephant: jump starting better data engineering and AI futures