The Fifth Elephant 2025 Annual Conference CfP

Speak at The Fifth Elephant 2025 Annual Conference

Elakkiya R
@elakkiyar

Beyond Pixels: Enhancing Wrist X-ray Diagnostics with Hybrid Vision-Language Models

Submitted May 29, 2025

Wrist injuries are among the most common reasons for emergency imaging, yet accurate interpretation of wrist X-rays remains challenging — especially in time-sensitive or resource-limited settings. Fractures are frequently overcalled or missed due to complex bone overlaps, variable views, and the subtle presentation of critical injuries like scaphoid fractures. Vision-Language Models (VLMs) have shown potential, but they often fail to capture such clinical nuance. This talk introduces a hybrid architecture combining visual embeddings, structured findings, and an instruction-tuned language model to deliver interpretable, robust, and clinically useful wrist X-ray diagnostics. We’ll explore how a BioWrist encoder (inspired by BioViL-T), a hybrid classifier, and a fine-tuned LLaVA-7B work together to support accurate reporting, even in difficult or ambiguous cases.

The Problem

  • High Misdiagnosis Rate: Wrist fractures are often misread — some are overcalled due to misleading views, while others (especially subtle or non-displaced fractures) are completely missed.

  • Emergency Setting Challenges: Many wrist X-rays are taken during off-hours when radiology experts may not be available, leading to greater reliance on automated triage or decision support.

  • Complexity in Detection: Fractures like scaphoid or ulnar styloid are particularly difficult to detect early or may not be clearly visible in standard projections.

  • Visual Ambiguity: Dense bone overlap, varied wrist orientations, and low-quality images make visual-only interpretation unreliable.

  • Language Models Alone Are Insufficient: Standard LLMs lack domain-specific medical grounding, increasing the risk of irrelevant, incorrect, or hallucinated outputs.

The Proposed Solution: A Hybrid Modular Pipeline

  • Visual Encoding: BioWrist, a domain-adapted visual encoder inspired by BioViL-T, extracts high-level semantic features tailored to wrist anatomy and pathology.

  • Clinical Token Injection: A hybrid CNN-Transformer classifier predicts structured pathology findings (e.g., “scaphoid fracture suspected”, “no cortical disruption”).

  • Prompt Construction: Image tokens, structured findings, and task-specific instructions are combined into a prompt tailored to diagnostic or reporting needs.

  • Language Generation: A custom fine-tuned LLaVA-7B model, trained on wrist radiology reports, generates clinically grounded outputs — full impressions or targeted responses (e.g., fracture presence, region, uncertainty). A minimal end-to-end sketch of this pipeline follows below.
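To make the data flow concrete, here is one plausible way the four stages could be strung together. This is an illustrative sketch, not the speaker's implementation: `classify_wrist_findings` is a stub standing in for the BioWrist encoder plus the hybrid CNN-Transformer classifier, the image path is hypothetical, and the public `llava-hf/llava-1.5-7b-hf` checkpoint is used where the custom fine-tuned wrist model would be loaded.

```python
# Illustrative pipeline sketch only. classify_wrist_findings() is a stub for the
# BioWrist encoder + hybrid CNN-Transformer classifier described above, and the
# public llava-1.5 checkpoint stands in for the custom fine-tuned wrist model.
from typing import List

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


def classify_wrist_findings(image: Image.Image) -> List[str]:
    """Stand-in for stages 1-2: visual encoding + structured pathology prediction."""
    return ["scaphoid fracture suspected", "no cortical disruption"]


image = Image.open("wrist_pa_view.png").convert("RGB")   # hypothetical input path
findings = classify_wrist_findings(image)

# Stage 3: prompt construction - image token + structured findings + task instruction.
prompt = (
    "USER: <image>\n"
    f"Structured findings: {'; '.join(findings)}\n"
    "Task: write a concise wrist X-ray impression, noting fracture presence, "
    "region, and any uncertainty.\n"
    "ASSISTANT:"
)

# Stage 4: language generation (swap in the fine-tuned wrist LLaVA-7B here).
model_id = "llava-hf/llava-1.5-7b-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype).to(device)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

In the speaker's architecture the BioWrist features also reach the language model directly as image tokens; for brevity, the sketch delegates visual encoding to LLaVA's own processor and injects only the structured findings as text.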

What Makes This Approach Effective?

  • Structured Context Injection
    Classifier findings act as high-level guides, grounding the LLM in meaningful anatomical and pathological signals.

  • Robustness in Ambiguous Cases
    When fractures are partially hidden, scaphoid lines are unclear, or images are noisy, the classifier still provides stable and informative labels.

  • Clinically Focused Generation
    By injecting suspected regions or abnormalities into the prompt, the language model can prioritize the most relevant structures and avoid speculative or vague reporting.

  • Faster, More Reliable Output
    With domain-aligned hints (e.g., “no evidence of fracture, but scaphoid not fully visualized”), the model converges on clinically safe outputs faster than with blind generation (see the sketch after this list).
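As one way to picture those domain-aligned hints, the small sketch below turns classifier probabilities into hedged finding strings, including an explicit "not fully visualized" hint when a region is obscured. The label names, thresholds, and visibility flag are illustrative assumptions, not the speaker's actual classifier configuration.

```python
# Hypothetical sketch: mapping classifier scores to hedged finding strings.
# Labels, thresholds, and the visibility flag are illustrative assumptions.
from typing import Dict, List

POSITIVE_THRESHOLD = 0.70   # report a suspected finding above this score
NEGATIVE_THRESHOLD = 0.30   # report "no evidence" below this score


def findings_to_hints(scores: Dict[str, float], scaphoid_visible: bool) -> List[str]:
    """Turn per-pathology scores into short, hedged hint strings for the prompt."""
    hints = []
    for label, score in scores.items():
        if score >= POSITIVE_THRESHOLD:
            hints.append(f"{label} suspected (score {score:.2f})")
        elif score <= NEGATIVE_THRESHOLD:
            hints.append(f"no evidence of {label}")
        else:
            hints.append(f"{label} indeterminate; recommend clinical correlation")
    if not scaphoid_visible:
        hints.append("scaphoid not fully visualized")
    return hints


# Example: an ambiguous early-scaphoid case.
print(findings_to_hints(
    {"distal radius fracture": 0.12, "scaphoid fracture": 0.55},
    scaphoid_visible=False,
))
# -> ['no evidence of distal radius fracture',
#     'scaphoid fracture indeterminate; recommend clinical correlation',
#     'scaphoid not fully visualized']
```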

Talk Structure (30 Minutes)

Part 1 – The Diagnostic Gap (0–5 mins)

  • Challenges in wrist fracture interpretation
  • Emergency imaging scenarios with limited expert access
  • Pitfalls of end-to-end VLMs in high-risk, subtle diagnostic settings

Part 2 – Architecture Walkthrough (5–15 mins)

  • Overview of BioWrist encoder, classifier, prompt design, and LLaVA-7B
  • Integration from X-ray to structured report
  • Examples of evolving prompts for wrist-specific cases

Part 3 – Results & Real-World Impact (15–25 mins)

  • Before vs. after classifier injection: Output clarity, factual correctness
  • Scenarios with ambiguous or subtle fractures (e.g., early scaphoid fracture)
  • How the system aids triage in emergency or low-resource setups

Part 4 – Lessons & Future Opportunities (25–30 mins)

  • Adapting this model for other extremity imaging (e.g., ankle, elbow)
  • Bridging AI outputs with PACS and radiology workflow
  • Open challenges: uncertainty quantification, interpretability, validation

Key Takeaways

  • Wrist X-ray diagnostics demand a clinically grounded, modular AI system—not generic image captioning.
  • Hybrid pipelines that fuse visual embeddings with structured pathology cues can catch subtle fractures missed by end-to-end models.
  • Grounding LLMs with domain knowledge is key to safe, useful AI in emergency and orthopedic radiology.

Speaker

Elakkiya R – Data Scientist, 5C Network
Elakkiya is a Data Scientist specializing in computer vision, deep learning, and Vision-Language Models (VLMs) for medical imaging. She designs and builds end-to-end deep learning pipelines to improve diagnostic accuracy across radiology modalities. Her work focuses on integrating multi-modal AI systems to deliver clinically relevant, interpretable, and robust solutions.


