Thank you for the overwhelming response. This workshop is now full, and registrations for in-person attendance are closed.
The workshop will be live-streamed. Stream links will be shared with 2025 edition ticket holders and annual members the day before the workshop.
Most business and research documents today are rich with infographics - tables, charts, and images - that carry essential context for decision-making. While multi-modal LLMs can query such content directly, traditional Text-RAG systems fall short because they process only text. Even modern multi-modal LLMs struggle with long context windows, often missing critical information in the middle.
In this hands-on workshop, participants will learn how to build a Vision-RAG system that jointly encodes text and visual information for powerful Visual-Augmented Q&A. Using ColPali (a Vision Language Model) and vector databases, the session will teach participants how to process complex documents and enable image+text Q&A workflows at scale.
- The workshop is 4 hours long.
- It is an advanced, hands-on workshop.
- Beginners are welcome, but must be prepared to catch up on:
- Python proficiency
- A basic understanding of LLMs and embeddings
- Familiarity with vector databases, and some exposure to multi-modal or image-processing concepts
This workshop is best suited for participants with intermediate to advanced AI/ML backgrounds.
- Seats for in-person participation are limited.
- A live stream is available for The Fifth Elephant members to participate remotely.
- Workshop materials and code (work in progress): GitHub Repository
Module 1: Introduction to Visual Augmented Q&A
- What are Multi-modal LLMs?
- Overview of Visual Language Models
Module 2: Foundation — Prompting for Q&A using Multi-modal LLMs
- Introduction to basic prompt engineering for multi-modal systems
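Multi-modal prompting typically means pairing the question text with one or more document-page images in a single request. As an illustration only (the workshop may use a different model, provider, or SDK), here is the general payload shape for an OpenAI-style chat API with an inline base64-encoded image; the model name is a placeholder:

```python
import base64

def build_multimodal_prompt(question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat payload pairing a question with a page image.

    Illustrative sketch: model name and message schema follow the
    OpenAI Chat Completions convention, but any vision-capable model works.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # placeholder: any vision-capable model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Example: ask about a (dummy) page image
payload = build_multimodal_prompt(
    "What does the chart on this page show?", b"\x89PNG-dummy-bytes"
)
```

The key idea the module builds on: the image travels alongside the text in the same prompt, so the model can ground its answer in the page's visual content.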
Module 3: Setting up Vision-based RAG
- Vision Embeddings using ColPali (talk)
- Late interaction retrieval using ColPali
- Hands-on: Build an end-to-end Visual Augmented Q&A workflow
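Late interaction means the query and each page are embedded as *sets* of vectors (one per query token, one per image patch), and relevance is scored at query time by matching each query token to its best patch. A minimal NumPy sketch of this MaxSim scoring, the late-interaction mechanism ColPali inherits from ColBERT (the dimensions and random data below are illustrative, not ColPali's actual sizes):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score between a query and one page.

    query_vecs: (num_query_tokens, dim) - one embedding per query token
    page_vecs:  (num_patches, dim)      - one embedding per image patch
    """
    # Normalize rows so dot products become cosine similarities
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sim = q @ p.T                         # (num_query_tokens, num_patches)
    # For each query token, keep its best-matching patch, then sum
    return float(sim.max(axis=1).sum())

# Rank a few (random, illustrative) pages for one query
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                      # 8 query tokens
pages = [rng.normal(size=(196, 128)) for _ in range(3)]  # 196 patches/page
scores = [maxsim_score(query, pv) for pv in pages]
best_page = int(np.argmax(scores))
```

Because scoring happens per token-patch pair, fine-grained details (a number in a table cell, a label on a chart) can drive retrieval, which single-vector embeddings tend to wash out.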
Module 4: Practical Challenges with Vision-based RAG
- Discuss limitations, pitfalls, and mitigation strategies for productionizing Vision-RAG systems
Module 5: Integration with Vector Databases
- Architecture overview (talk)
- Hands-on:
- Storing multi-vector representations in a Vector DB
- Embedding-based retrieval and ColPali-based re-ranking
- End-to-end Python implementation demo
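The retrieve-then-re-rank pattern above can be sketched end to end: store a pooled single vector per page for fast first-stage retrieval (the role a vector DB's ANN index plays), then re-rank the candidates with late-interaction MaxSim over the full multi-vector representations. This is a self-contained toy with random data, not the workshop's actual code; a real system would keep the pooled vectors in a vector DB such as FAISS or Milvus:

```python
import numpy as np

# Toy corpus: each page is a set of patch embeddings (multi-vector)
rng = np.random.default_rng(42)
corpus = {f"page_{i}": rng.normal(size=(196, 128)) for i in range(50)}

def pool(vecs: np.ndarray) -> np.ndarray:
    """Mean-pool a multi-vector set into one unit-norm vector (stage-1 index)."""
    v = vecs.mean(axis=0)
    return v / np.linalg.norm(v)

pooled = {pid: pool(v) for pid, v in corpus.items()}  # what the ANN index holds

def coarse_retrieve(query_vecs: np.ndarray, k: int = 10) -> list[str]:
    """Stage 1: top-k pages by cosine similarity of pooled vectors."""
    q = pool(query_vecs)
    sims = {pid: float(q @ v) for pid, v in pooled.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Stage 2: late-interaction (MaxSim) score over the full multi-vectors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    return float((q @ p.T).max(axis=1).sum())

query = rng.normal(size=(8, 128))
candidates = coarse_retrieve(query, k=10)
reranked = sorted(candidates, key=lambda pid: maxsim(query, corpus[pid]),
                  reverse=True)
```

The design trade-off: the cheap pooled index narrows 50 pages to 10 candidates, and the expensive MaxSim pass runs only on those, keeping per-query cost manageable at scale.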
Conclusion + Q&A
- Python 3.8+ environment
- Familiarity with LLMs, embeddings, and retrieval systems
- Basic understanding of vector databases (e.g., FAISS, Milvus, or others)
- Interest in multi-modal AI applications and document Q&A pipelines
- Aspiring Data Scientists and AI Engineers
- DevOps/ML Ops Engineers working on AI infrastructure
- Researchers and ML practitioners in the GenAI space
- Product engineers interested in multi-modal AI systems
Note: Beginners with some Python and ML experience can participate but should review multi-modal LLM and embedding concepts in advance.
- How to build a Vision-RAG system for Visual Augmented Q&A
- How to work with multi-vector vision embeddings and vector databases
- Integration strategies for ColPali with Vector DBs
- Practical challenges in building multi-modal retrieval systems and how to address them
Abhijeet Kumar is a data science leader with over 12 years of experience applying advanced analytics, machine learning, and deep learning to real-world problems. He began his career as a computer scientist at the Bhabha Atomic Research Centre (BARC), conducting research in domains such as conversational speech, satellite imagery, and document processing. Abhijeet has published multiple research papers, presented at The Fifth Elephant 2024 and at PyCon, and taught machine learning as guest faculty for the BITS Pilani WILP M.Tech program.
Rachna Saxena is a data scientist with 8 years of experience in the AI domain. She holds a Master's degree in ML from Georgia Tech. Before moving into AI, she gained extensive experience in the semiconductor industry, working across the software stack, from firmware to application development for consumer electronics products. She has authored research papers and holds patents in the field of machine learning.
This workshop is open for The Fifth Elephant members and for The Fifth Elephant 2025 annual conference ticket buyers.
Seats are limited and available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, call +91-7676332020 or write to info@hasgeek.com.