The Fifth Elephant 2025 Annual Conference CfP
Submitted May 29, 2025
Wrist injuries are among the most common reasons for emergency imaging, yet accurate interpretation of wrist X-rays remains challenging — especially in time-sensitive or resource-limited settings. Fractures are frequently overcalled or missed due to complex bone overlaps, variable views, and the subtle presentation of critical injuries like scaphoid fractures. Vision-Language Models (VLMs) have shown potential, but they often fail to capture such clinical nuance. This talk introduces a hybrid architecture combining visual embeddings, structured findings, and an instruction-tuned language model to deliver interpretable, robust, and clinically useful wrist X-ray diagnostics. We’ll explore how a BioWrist encoder (inspired by BioViL-T), a hybrid classifier, and a fine-tuned LLaVA-7B work together to support accurate reporting, even in difficult or ambiguous cases.
High Misdiagnosis Rate: Wrist fractures are often misread — some are overcalled due to misleading views, while others (especially subtle or non-displaced fractures) are completely missed.
Emergency Setting Challenges: Many wrist X-rays are taken during off-hours when radiology experts may not be available, leading to greater reliance on automated triage or decision support.
Complexity in Detection: Scaphoid and ulnar styloid fractures are particularly difficult to detect early and may not be clearly visible in standard projections.
Visual Ambiguity: Dense bone overlap, varied wrist orientations, and low-quality images make visual-only interpretation unreliable.
Language Models Alone Are Insufficient: Standard LLMs lack domain-specific medical grounding, increasing the risk of irrelevant, incorrect, or hallucinated outputs.
Visual Encoding: The BioWrist encoder, a domain-adapted model inspired by BioViL-T, extracts high-level semantic features tailored to wrist anatomy and pathology.
Clinical Token Injection: A hybrid CNN-Transformer classifier predicts structured pathology findings (e.g., “scaphoid fracture suspected”, “no cortical disruption”), which are then injected into the language model’s context.
Prompt Construction: Image tokens, structured findings, and task-specific instructions are combined into a prompt tailored to diagnostic or reporting needs.
Language Generation: A custom fine-tuned LLaVA-7B model, trained on wrist radiology reports, generates clinically grounded outputs — full impressions or targeted responses (e.g., fracture presence, region, uncertainty).
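To make the four stages above concrete, here is a minimal sketch in PyTorch. The FindingsClassifier, the FINDINGS label set, and the prompt template are hypothetical stand-ins, not the talk's actual components; only the overall wiring follows the description above.

```python
import torch
import torch.nn as nn

# Hypothetical finding vocabulary; the real classifier's label set is an
# assumption for illustration.
FINDINGS = [
    "scaphoid fracture suspected",
    "distal radius fracture",
    "ulnar styloid fracture",
    "no cortical disruption",
]

class FindingsClassifier(nn.Module):
    """Stand-in for the hybrid CNN-Transformer classifier: a small CNN
    backbone whose patch features are pooled by a transformer encoder."""

    def __init__(self, n_findings: int = len(FINDINGS), dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_findings)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(x)                        # (B, dim, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        pooled = self.transformer(tokens).mean(dim=1)
        return torch.sigmoid(self.head(pooled))    # multi-label probabilities

def findings_to_text(probs: torch.Tensor, threshold: float = 0.5) -> str:
    """Render classifier probabilities as the structured-findings string."""
    hits = [f for f, p in zip(FINDINGS, probs.tolist()) if p >= threshold]
    return "; ".join(hits) if hits else "no high-confidence findings"

def build_prompt(findings_text: str, instruction: str) -> str:
    # "<image>" is the placeholder that LLaVA-style processors expand into
    # visual-encoder tokens at inference time.
    return (
        f"<image>\nStructured findings: {findings_text}\n"
        f"Instruction: {instruction}"
    )

clf = FindingsClassifier().eval()
xray = torch.randn(1, 1, 64, 64)  # stand-in for a preprocessed wrist X-ray
with torch.no_grad():
    probs = clf(xray)[0]
prompt = build_prompt(
    findings_to_text(probs),
    "Write a brief impression for this wrist X-ray.",
)
print(prompt)
```

In the full system, the resulting prompt and the image features would then be handed to the fine-tuned LLaVA-7B for report generation.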
Structured Context Injection
Classifier findings act as high-level guides, grounding the LLM in meaningful anatomical and pathological signals.
Robustness in Ambiguous Cases
When fractures are partially hidden, scaphoid lines are unclear, or images are noisy, the classifier still provides stable and informative labels.
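As an illustration of such stable labels, the sketch below maps an ambiguous classifier score onto a three-way label instead of forcing a hard yes/no. The thresholds are assumptions, not the system's actual values.

```python
def band_finding(name: str, prob: float,
                 lo: float = 0.25, hi: float = 0.75) -> str:
    """Map a multi-label probability onto a three-way clinical label so an
    ambiguous score still yields useful context for the prompt."""
    if prob >= hi:
        return f"{name}: suspected"
    if prob <= lo:
        return f"{name}: not seen"
    return f"{name}: indeterminate, recommend dedicated views"

print(band_finding("scaphoid fracture", 0.48))
# -> scaphoid fracture: indeterminate, recommend dedicated views
```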
Clinically Focused Generation
By injecting suspected regions or abnormalities into the prompt, the language model can prioritize the most relevant structures and avoid speculative or vague reporting.
Faster & More Reliable Output
With domain-aligned hints (e.g., “no evidence of fracture, but scaphoid not fully visualized”), the model converges on clinically safe outputs faster than blind generation.
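A hedged sketch of this hint-conditioned generation step, using the Hugging Face transformers LLaVA classes; the checkpoint name and image path are hypothetical placeholders rather than released artifacts.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical fine-tuned checkpoint; the talk's weights are not assumed
# to be public.
MODEL_ID = "your-org/biowrist-llava-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("wrist_pa_view.png")  # placeholder wrist X-ray path
prompt = (
    "USER: <image>\n"
    "Hint: no evidence of fracture, but scaphoid not fully visualized.\n"
    "Write the impression. ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```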
Part 1 – The Diagnostic Gap (0–5 mins)
Part 2 – Architecture Walkthrough (5–15 mins)
Part 3 – Results & Real-World Impact (15–25 mins)
Part 4 – Lessons & Future Opportunities (25–30 mins)
Elakkiya R – Data Scientist, 5C Network
Elakkiya is a Data Scientist specializing in computer vision, deep learning, and vision-language models (VLMs) in medical imaging. She designs and builds end-to-end deep learning pipelines to enhance diagnostic accuracy across various radiology modalities. Her work focuses on integrating multi-modal AI systems to deliver clinically relevant, interpretable, and robust solutions.