The Fifth Elephant 2025 Annual Conference CfP
Submitted May 29, 2025
Wrist injuries are among the most common reasons for emergency imaging, yet accurate interpretation of wrist X-rays remains challenging — especially in time-sensitive or resource-limited settings. Fractures are frequently overcalled or missed due to complex bone overlaps, variable views, and the subtle presentation of critical injuries like scaphoid fractures. Vision-Language Models (VLMs) have shown potential, but they often fail to capture such clinical nuance. This talk introduces a hybrid architecture combining visual embeddings, structured findings, and an instruction-tuned language model to deliver interpretable, robust, and clinically useful wrist X-ray diagnostics. We’ll explore how a BioWrist encoder (inspired by BioViL-T), a hybrid classifier, and a fine-tuned LLaVA-7B work together to support accurate reporting, even in difficult or ambiguous cases.
High Misdiagnosis Rate: Wrist fractures are often misread — some are overcalled due to misleading views, while others (especially subtle or non-displaced fractures) are completely missed.
Emergency Setting Challenges: Many wrist X-rays are taken during off-hours when radiology experts may not be available, leading to greater reliance on automated triage or decision support.
Complexity in Detection: Scaphoid and ulnar styloid fractures are particularly difficult to detect early and may not be clearly visible in standard projections.
Visual Ambiguity: Dense bone overlap, varied wrist orientations, and low-quality images make visual-only interpretation unreliable.
Language Models Alone Are Insufficient: Standard LLMs lack domain-specific medical grounding, increasing the risk of irrelevant, incorrect, or hallucinated outputs.
Visual Encoding: The BioWrist encoder, a domain-adapted model inspired by BioViL-T, extracts high-level semantic features tailored to wrist anatomy and pathology.
Clinical Token Injection: A hybrid CNN-Transformer classifier predicts structured pathology findings (e.g., “scaphoid fracture suspected”, “no cortical disruption”), which are then injected into the language model’s context.
Prompt Construction: Image tokens, structured findings, and task-specific instructions are combined into a prompt tailored to diagnostic or reporting needs.
Language Generation: A custom fine-tuned LLaVA-7B model, trained on wrist radiology reports, generates clinically grounded outputs — full impressions or targeted responses (e.g., fracture presence, region, uncertainty).
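To make the four stages above concrete, here is a minimal sketch in PyTorch. The FindingsClassifier, the FINDINGS label set, and the prompt template are hypothetical stand-ins, not the talk's actual components; only the overall wiring follows the description above.

```python
import torch
import torch.nn as nn

# Hypothetical finding vocabulary; the real classifier's label set is an
# assumption for illustration.
FINDINGS = [
    "scaphoid fracture suspected",
    "distal radius fracture",
    "ulnar styloid fracture",
    "no cortical disruption",
]

class FindingsClassifier(nn.Module):
    """Stand-in for the hybrid CNN-Transformer classifier: a small CNN
    backbone whose patch features are pooled by a transformer encoder."""

    def __init__(self, n_findings: int = len(FINDINGS), dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_findings)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(x)                        # (B, dim, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        pooled = self.transformer(tokens).mean(dim=1)
        return torch.sigmoid(self.head(pooled))    # multi-label probabilities

def findings_to_text(probs: torch.Tensor, threshold: float = 0.5) -> str:
    """Render classifier probabilities as the structured-findings string."""
    hits = [f for f, p in zip(FINDINGS, probs.tolist()) if p >= threshold]
    return "; ".join(hits) if hits else "no high-confidence findings"

def build_prompt(findings_text: str, instruction: str) -> str:
    # "<image>" is the placeholder that LLaVA-style processors expand into
    # visual-encoder tokens at inference time.
    return (
        f"<image>\nStructured findings: {findings_text}\n"
        f"Instruction: {instruction}"
    )

clf = FindingsClassifier().eval()
xray = torch.randn(1, 1, 64, 64)  # stand-in for a preprocessed wrist X-ray
with torch.no_grad():
    probs = clf(xray)[0]
prompt = build_prompt(
    findings_to_text(probs),
    "Write a brief impression for this wrist X-ray.",
)
print(prompt)
```

In the full system, the resulting prompt and the image features would then be handed to the fine-tuned LLaVA-7B for report generation.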
Structured Context Injection
Classifier findings act as high-level guides, grounding the LLM in meaningful anatomical and pathological signals.
Robustness in Ambiguous Cases
When fractures are partially hidden, scaphoid lines are unclear, or images are noisy, the classifier still provides stable and informative labels.
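As an illustration of such stable labels, the sketch below maps an ambiguous classifier score onto a three-way label instead of forcing a hard yes/no. The thresholds are assumptions, not the system's actual values.

```python
def band_finding(name: str, prob: float,
                 lo: float = 0.25, hi: float = 0.75) -> str:
    """Map a multi-label probability onto a three-way clinical label so an
    ambiguous score still yields useful context for the prompt."""
    if prob >= hi:
        return f"{name}: suspected"
    if prob <= lo:
        return f"{name}: not seen"
    return f"{name}: indeterminate, recommend dedicated views"

print(band_finding("scaphoid fracture", 0.48))
# -> scaphoid fracture: indeterminate, recommend dedicated views
```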
Clinically Focused Generation
By injecting suspected regions or abnormalities into the prompt, the language model can prioritize the most relevant structures and avoid speculative or vague reporting.
Faster & More Reliable Output
With domain-aligned hints (e.g., “no evidence of fracture, but scaphoid not fully visualized”), the model converges on clinically safe outputs faster than blind generation.
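A hedged sketch of this hint-conditioned generation step, using the Hugging Face transformers LLaVA classes; the checkpoint name and image path are hypothetical placeholders rather than released artifacts.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical fine-tuned checkpoint; the talk's weights are not assumed
# to be public.
MODEL_ID = "your-org/biowrist-llava-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("wrist_pa_view.png")  # placeholder wrist X-ray path
prompt = (
    "USER: <image>\n"
    "Hint: no evidence of fracture, but scaphoid not fully visualized.\n"
    "Write the impression. ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```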
Part 1 – The Diagnostic Gap (0–5 mins)
Part 2 – Architecture Walkthrough (5–15 mins)
Part 3 – Results & Real-World Impact (15–25 mins)
Part 4 – Lessons & Future Opportunities (25–30 mins)
Elakkiya R – Data Scientist, 5C Network
Elakkiya is a Data Scientist specializing in computer vision, deep learning, and vision-language models (VLMs) in medical imaging. She designs and builds end-to-end deep learning pipelines to enhance diagnostic accuracy across various radiology modalities. Her work focuses on integrating multi-modal AI systems to deliver clinically relevant, interpretable, and robust solutions.