MedVCoT: Bridging Modality Gap in Medical VQA through Latent Visual Reasoning

¹Dalian University of Technology, China ²RMIT University, Australia

Abstract

With the rising demand for trustworthy AI in clinical practice, interpretability has become as critical a requirement as accuracy. However, medical visual question answering suffers from a severe modality gap: when continuous visual signals are forcibly projected into a discrete text space for reasoning, essential diagnostic information is lost, degrading precision and rendering the model an opaque black box.

To address this problem, we propose MedVCoT, which incorporates latent visual reasoning into the medical visual question answering (VQA) domain. Rather than merely integrating modules, MedVCoT leverages the specialized expertise of MedSAM to train a large vision-language model to autonomously generate consistent, continuous latent visual tokens within a Visual Chain-of-Thought. This mechanism forces the model to explicitly "see" the lesion in the latent space before formulating a textual diagnosis, ensuring answers are causally rooted in verifiable visual evidence rather than statistical hallucination. We achieve this through a progressive three-stage training procedure: medical feature alignment, visual reasoning learning with the generated latent tokens, and instruction tuning for complex clinical scenarios.
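To make the three-stage procedure concrete, the sketch below shows how such a curriculum might be wired up in PyTorch. It is a minimal sketch, not the released implementation: MedVCoTModel, the three data loaders, the choice of trainable parameters per stage, and all hyperparameters are assumptions for illustration.

```python
# Hypothetical sketch of the three-stage curriculum; MedVCoTModel, the
# loaders, and all hyperparameters are illustrative assumptions.
import torch

def train_stage(model, loader, params, epochs, lr):
    """Generic training loop: only `params` receive gradient updates."""
    optimizer = torch.optim.AdamW(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

model = MedVCoTModel()  # VLM plus projector toward MedSAM (hypothetical)

# Stage 1: medical feature alignment -- tune only the vision-language projector.
train_stage(model, alignment_loader, model.projector.parameters(), epochs=1, lr=1e-3)

# Stage 2: visual reasoning learning -- supervise the generated latent visual
# tokens (e.g., against MedSAM-derived targets) alongside the language loss.
train_stage(model, reasoning_loader, model.parameters(), epochs=1, lr=2e-5)

# Stage 3: instruction tuning on complex clinical VQA conversations.
train_stage(model, instruction_loader, model.parameters(), epochs=1, lr=2e-5)
```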

Extensive experiments show that MedVCoT achieves state-of-the-art performance on multiple benchmarks, outperforming prior methods by large margins. At the same time, it provides pixel-level segmentation masks that validate its diagnostic reasoning.


Methodology

Medical Visual Question Answering (Medical VQA) is a cornerstone of AI-assisted diagnosis, demanding both high accuracy and rigorous interpretability. A "black-box" diagnosis, even when correct, is unconvincing to a patient without verifiable visual evidence. However, current state-of-the-art Medical Vision-Language Models (VLMs) face a fundamental bottleneck: the modality gap. When continuous, fine-grained visual signals (such as lesion boundaries or texture gradients) are forcibly projected into a discrete textual space for reasoning, much of the visual evidence is inevitably discarded. This information loss is particularly detrimental in clinical practice, where even the subtlest visual cues can determine a diagnosis. Consequently, the inability to directly visualize the reasoning process has been a barrier to deploying medical AI systems in clinical scenarios.


Figure: Overview of the proposed MedVCoT methodology.

To address these challenges, we propose MedVCoT, the first approach to incorporate a Visual Chain-of-Thought (Visual CoT) into medical VQA. Unlike traditional methods that reason only in the text space, our model autonomously generates a sequence of continuous latent visual tokens within a dedicated thought span, delimited by special tags, before answering. These tokens act as a bridge between linguistic reasoning and visual perception. In particular, under the guidance of the specialized medical visual expert MedSAM, the VLM learns to generate these tokens as an intermediate reasoning output. A projector can further map the generated tokens to trigger MedSAM to decode explicit segmentation masks. This mechanism forces the model to "see" and localize the lesion in latent space before "speaking" the diagnosis, ensuring that answers are causally grounded in verifiable visual evidence.
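The following is a minimal sketch of this inference flow, under assumed interfaces: generate_latent_reasoning, generate_answer, projector, and medsam.decode_masks stand in for the actual components.

```python
# Illustrative inference flow; all method names below are hypothetical
# stand-ins for the real MedVCoT components.
import torch

@torch.no_grad()
def medvcot_infer(vlm, projector, medsam, image, question):
    # 1) The VLM first emits continuous latent visual tokens inside the
    #    thought span, instead of jumping straight to text.
    latent_tokens = vlm.generate_latent_reasoning(image, question)  # (T, d_vlm)

    # 2) A projector maps the latent tokens toward MedSAM, which decodes
    #    an explicit segmentation mask as pixel-level evidence.
    prompt_embeds = projector(latent_tokens)                        # (T, d_sam)
    mask = medsam.decode_masks(image, prompt_embeds)

    # 3) Conditioned on its own latent visual reasoning, the VLM produces
    #    the final textual answer.
    answer = vlm.generate_answer(image, question, latent_tokens)
    return answer, mask
```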


Figure: Training Pipeline of the proposed MedVCoT methodology.

Experiment

We benchmarked MedVCoT against 14 leading models, including closed-source (GPT-4o, Gemini 2.0) and open-source (Qwen2.5-VL, LLaVA-Med, Lingshu, etc.) baselines, across four medical VQA datasets. For a fair comparison, all baseline models were evaluated using their official pre-trained weights directly on the test sets under a unified evaluation protocol.
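As a rough illustration of such a unified protocol, the sketch below scores one model across the test sets; load_model, the sample fields, and the exact-match normalization are hypothetical assumptions, not the actual evaluation code.

```python
# Hypothetical harness for a unified evaluation protocol; load_model and
# the sample fields are illustrative assumptions.
def normalize(text):
    return text.strip().lower()

def evaluate(model_name, datasets):
    model = load_model(model_name)  # official pre-trained weights, no extra tuning
    scores = {}
    for name, test_set in datasets.items():
        correct = sum(
            int(normalize(model.answer(s["image"], s["question"]))
                == normalize(s["answer"]))
            for s in test_set
        )
        scores[name] = correct / len(test_set)
    return scores
```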


Figure: Main Result.

Based on the ablation study visualized in the figure below, we analyze the cumulative impact of our three-stage training strategy compared to the baseline. The baseline model, lacking domain-specific adaptation, achieves a modest overall accuracy of 43.3%.


Figure: Ablation Study.

To rigorously assess robustness and generalization boundaries, we go beyond aggregate metrics and conduct a fine-grained capability analysis based on imaging modality. We categorize test samples into 14 distinct sub-categories, spanning macroscopic imaging (e.g., Dermoscopy, Endoscopy), microscopic pathology (e.g., Histopathology), diverse radiological modalities (e.g., CT, MRI, X-ray), and a miscellaneous 'Others' class. This stratified evaluation comprehensively profiles the model's consistency across varying imaging mechanisms and anatomical structures, elucidating its cross-modal strengths.
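The stratified breakdown itself reduces to a group-by accuracy computation; a minimal sketch follows, where the result-record fields "modality" and "correct" are assumed for illustration.

```python
# Per-modality accuracy over a list of evaluation records; the record
# fields "modality" and "correct" are assumed names.
from collections import defaultdict

def per_modality_accuracy(results):
    """results: list of dicts like {"modality": "CT", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["modality"]] += 1
        hits[r["modality"]] += int(r["correct"])
    # One accuracy per sub-category (Dermoscopy, Endoscopy, CT, MRI, X-ray, ...).
    return {m: hits[m] / totals[m] for m in totals}
```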


Figure: Fine-grained capability analysis.