Anustup Mukherjee

@anustup900

Prateek Gupte

@superprat

Journey of Finetuning Open Source Stable Diffusion Models

Submitted May 29, 2025

Abstract:
AI-generated images are transforming the world in many ways, from high-quality creative art to fashion imagery, anime, and more, and diffusion models are at the heart of these use cases. Businesses are building niche products around this technology. However, adapting these general-purpose models to specific use cases, for example, "Can a diffusion model generate an image of an Indo-Nepalese woman with the correct aesthetic?", is always difficult and calls for conventional fine-tuning methods.

Full fine-tuning, though, is genuinely hard: it has multiple pillars that all need to be right. This talk focuses on how we at Caimera fine-tuned the SDXL diffusion model from scratch to solve Indian fashion use cases, and on filling the void of practical information around fine-tuning. The audience will learn the steps involved in fine-tuning a model, the things that need to be taken care of, and some best practices from our learnings.

Agenda:

  • Why fine-tune a model?

    • Understanding when a business actually needs to fully fine-tune a foundation model, rather than training a basic LoRA or solving the use case with another approach such as prompting. Fine-tuning being a GPU-melting job, this question needs to be answered first.
    • We will dive into what to expect when you fine-tune a model: what can be fixed and what cannot. Many people end up fine-tuning just to replicate a style; if I expect a model to regenerate my own images, I need LoRA training, not a full fine-tune, but if I want the model to learn an entire concept or domain of information, I need fine-tuning (see the LoRA vs. full fine-tune sketch after the agenda). We will also show why we at Caimera fine-tuned a model.
  • The gateway to fine-tuning: data collection

    • Data is always the heart of any fine-tuning approach: if we train on a garbage dataset, we should expect garbage as output.
    • We will discuss how we worked out which problems to solve with fine-tuning and how we built a dataset for them. This includes data collection from different web sources, how to select a dataset and deal with copyright, and how much diversity the images should cover.
    • How to augment real-world data with synthetic images, the steps involved, and how we answered the question of the right ratio of real-world to synthetic images.
    • A comparison demo of training results with different data collection methods.
  • Data Pre-Processing:

    • Datasets scraped from the web have multiple issues, mainly around quality and how well they represent the concept we are trying to fine-tune for. This is the key step for removing garbage from a dataset (a minimal quality-filter sketch follows the agenda).
    • We will talk about the approaches we used and how we handled these issues.
    • A comparison demo of training results with different data-processing techniques.
  • Captioning is the Key:

    • We will share our experience with captioning and how it impacts training.
    • Best practices for captioning a dataset, and how to phrase the information so that the model's text encoders learn it well.
    • A comparative analysis of manual captioning versus auto-captioning with LLMs (an auto-captioning sketch follows the agenda).
    • A comparison demo of training results with different captioning strategies, and what worked best.
  • How to choose which model to fine-tune and the best configuration for fine-tuning:

    • The questions that need to be answered to choose a base model for fine-tuning, mostly around output quality and model architecture.
    • The open-source trainers available, such as Kohya-SS, OneTrainer, SimpleTuner, and Diffusers; which one we used, and the reasoning behind choosing the best trainer for a use case.
    • The importance of getting the training configuration right: how learning rates, optimizers, loss functions, and network dimensions work and how they impact training (see the training-step sketch after the agenda).
    • A comparison demo of training results with different configurations, and how we selected the best config.
  • How to tell that a training run is going wrong while it is still running:

    • Training being a GPU-heavy task, there is always the worry that the model turns out poorly and the compute is wasted. We will discuss how we learned to judge, while training is still in progress, whether a run is on track (see the monitoring sketch after the agenda).
    • This includes reading training samples and loss curves.
    • We will also talk about a custom approach of extracting learning maps from each training layer to show what the model is learning.
  • Evaluation Criteria:

    • How to set up an evaluation metric for training, and how we set up ours (an automatic-metric sketch follows the agenda).
    • Best practices for testing each training iteration and finalizing the best-performing version based on metrics.
  • The pathway to superior quality using model merging:

    • Generative AI models are usually treated as single experts, but as Mixtral-style mixture-of-experts models showed, combining multiple experts can produce remarkable results.
    • We will talk about how we scaled our results to a much higher quality by merging our fine-tuned version with other models, along with the steps for model merging (a weight-merging sketch follows the agenda).
  • Out-of-the-box approaches we tried for fine-tuning and how they worked:

    • Alongside conventional fine-tuning, we tried different research- and community-driven approaches such as DPO (Direct Preference Optimization), distillation techniques with ByteDance models, and replacing the CLIP text encoders with a LLaMA model (as proposed in the Playground v3 paper). We will shed light on how we did these, how they worked out, and what we learned.
  • Conclusion

  • QnA
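The sketches referenced in the agenda follow. They are minimal illustrations written against the Hugging Face diffusers ecosystem, which is an assumption on our part; model names, paths, and hyperparameters are placeholders rather than the exact setup used at Caimera.

LoRA vs. full fine-tune. A sketch of the practical difference discussed in the first agenda item, assuming the diffusers and peft libraries:

```python
# Sketch: LoRA adapter vs. full fine-tune on an SDXL UNet (illustrative only).
# In practice you would pick one of the two options, not both.
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet",
    torch_dtype=torch.float16,
)

# Option 1: LoRA -- small low-rank adapters on the attention projections.
# Enough when the goal is to replicate a style or a single subject.
lora_config = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
lora_unet = get_peft_model(unet, lora_config)
lora_unet.print_trainable_parameters()  # only a tiny fraction of the UNet trains

# Option 2: full fine-tune -- every UNet weight is trainable. Needed when the
# model must learn an entire new concept or domain, at a much higher GPU cost.
for p in unet.parameters():
    p.requires_grad_(True)
```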
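Data pre-processing. A first-pass quality filter of the kind described in the pre-processing item; the thresholds are illustrative, and the cleaning pipeline covered in the talk goes further:

```python
# Sketch: drop scraped images that are corrupt, too small, or oddly shaped.
from pathlib import Path
from PIL import Image

MIN_SIDE = 1024        # SDXL trains at roughly 1024px
MAX_ASPECT_RATIO = 2.0

def keep(path: Path) -> bool:
    try:
        with Image.open(path) as img:
            img.verify()               # cheap corruption check
        with Image.open(path) as img:  # reopen: verify() exhausts the file
            w, h = img.size
    except Exception:
        return False
    if min(w, h) < MIN_SIDE:
        return False
    if max(w, h) / min(w, h) > MAX_ASPECT_RATIO:
        return False
    return True

clean = [p for p in Path("dataset/raw").glob("*.jpg") if keep(p)]
print(f"kept {len(clean)} images")
```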
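Auto-captioning. A sketch of bulk auto-captioning with an off-the-shelf vision-language model; BLIP is used here only as a stand-in, while the talk itself compares manual captions against captions from LLM-based pipelines:

```python
# Sketch: caption every image in a folder and write one .txt file per image,
# the layout most open-source trainers (e.g. Kohya-SS) expect.
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

for path in Path("dataset/images").glob("*.jpg"):
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    caption = processor.decode(out[0], skip_special_tokens=True)
    path.with_suffix(".txt").write_text(caption)
```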
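Training configuration. A sketch of a single denoising training step, showing where the learning rate, optimizer, and loss function from the configuration item enter the loop; all values are placeholders and the SDXL-specific conditioning is passed in by the caller:

```python
# Sketch: one SDXL denoising training step (placeholder hyperparameters).
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
noise_scheduler = DDPMScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
# Learning rate and optimizer choice are two of the configuration knobs
# discussed in the talk.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5, weight_decay=1e-2)

def training_step(latents, text_embeddings, added_cond_kwargs):
    """latents: VAE-encoded images; added_cond_kwargs: SDXL pooled text
    embeddings and size conditioning ("text_embeds" and "time_ids")."""
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    # The UNet predicts the added noise; the loss is MSE against the true noise.
    pred = unet(
        noisy_latents, timesteps,
        encoder_hidden_states=text_embeddings,
        added_cond_kwargs=added_cond_kwargs,
    ).sample
    loss = F.mse_loss(pred.float(), noise.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```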
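Monitoring a run. A sketch of catching a bad run early by logging the loss curve and rendering the same validation prompts with a fixed seed every few hundred steps; the prompt and cadence are illustrative:

```python
# Sketch: fixed-seed validation samples during training.
import torch
from diffusers import StableDiffusionXLPipeline

VALIDATION_PROMPTS = [
    "editorial photo of a woman in a red silk saree, studio lighting",
]

def log_validation(pipeline: StableDiffusionXLPipeline, step: int) -> None:
    # A fixed seed keeps the comparison across checkpoints apples-to-apples.
    generator = torch.Generator(device="cuda").manual_seed(42)
    for i, prompt in enumerate(VALIDATION_PROMPTS):
        image = pipeline(
            prompt, num_inference_steps=25, generator=generator
        ).images[0]
        image.save(f"samples/step_{step:06d}_prompt_{i}.png")

# Inside the training loop (writer being e.g. a TensorBoard SummaryWriter):
#   writer.add_scalar("train/loss", loss, global_step)   # the loss curve
#   if global_step % 500 == 0:
#       log_validation(pipeline, global_step)
```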
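Evaluation. One possible automatic metric, CLIP score for prompt adherence via torchmetrics; this is an assumption for illustration, and the talk's actual evaluation criteria are broader and include human review:

```python
# Sketch: CLIP score as one automatic signal for ranking checkpoints.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def score_batch(images: torch.Tensor, prompts: list[str]) -> float:
    # `images` are uint8 tensors of shape (N, 3, H, W) in the 0-255 range.
    return clip_score(images, prompts).item()

# Score each checkpoint on a fixed prompt set and shortlist the best ones
# for human review of the generated samples.
```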
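Model merging. A sketch of the simplest form of merging, a weighted average of two compatible UNets; the weights and paths are placeholders, and the talk covers how the actual merge partners and ratios were chosen:

```python
# Sketch: weighted average of two compatible SDXL UNets (placeholder paths).
import torch
from diffusers import UNet2DConditionModel

unet_a = UNet2DConditionModel.from_pretrained("path/to/our-finetune", subfolder="unet")
unet_b = UNet2DConditionModel.from_pretrained("path/to/other-model", subfolder="unet")

alpha = 0.6  # weight given to our fine-tuned model
state_a, state_b = unet_a.state_dict(), unet_b.state_dict()
merged_state = {
    key: alpha * state_a[key] + (1.0 - alpha) * state_b[key]
    for key in state_a
}
unet_a.load_state_dict(merged_state)
unet_a.save_pretrained("merged-unet")
```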

Takeaways for audience:

  • An understanding of the business needs that determine how and when to fine-tune a model for better quality.
  • An understanding of best practices for fine-tuning a model.
  • An understanding of what a basic end-to-end training approach looks like.
  • An understanding of what fine-tuning can and cannot solve, and a way past the assumption that training always works.

About the speaker:

MLE at Caimera AI; former MLE at Newton School, Dark Horse, Shell, WRI, and Metvy. Contributed to Google TensorFlow (GSoC), Samsung (Prism), and IIT Patna (projects). Founded MBK Health Tech, backed by Supreme Ventures, to apply AI for early detection of cardiac diseases and build a hyper-local support network for patients using wearables. Holds 4 patents on medical-imaging automation using AI algorithms, has multiple research papers, and received the Indian Young Achievers award for contributions to artificial intelligence. Has spoken at Py-Bangalore, the Belgium Py conference, Keras Community Day 23, the Girlscript India Summit, MIT TECH X, HPAIR (as a delegate), and multiple meetups, hackathons, and events.
(LinkedIn: https://www.linkedin.com/in/anustupmukherjee/)

