V-Reflection introduces a "think-then-look" visual reflection mechanism where latent states act as dynamic probes that actively interrogate the visual feature space — grounding each reasoning step for task-critical evidence.
(a) Traditional MLLMs treat visual information as a static input, leading to perception hallucinations (e.g., "Kevlar") by prioritizing language priors over visual evidence. (b) V-Reflection's "think-then-look" mechanism uses evolving latent states as dynamic probes (Qdyn) to retrace global visual features, accurately localizing task-critical evidence (e.g., the rubber glove) for a precise answer.
MLLMs remain prone to perception-related hallucinations because their reasoning is confined to the language domain, treating visual input as a static, reasoning-agnostic preamble. We propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. A two-stage distillation strategy synthesizes explicit visual grounding with continuous latent reasoning: Box-Guided Compression (BCM) establishes stable pixel-to-latent targets; Dynamic Autoregressive Compression (DAC) distills this spatial expertise into dynamic latent probes. During inference, both modules remain entirely inactive, preserving purely end-to-end autoregressive decoding with no added inference cost.
A two-stage distillation strategy that bridges explicit spatial grounding with continuous latent reasoning
Fig. 1 — V-Reflection Architecture. (a) Stage 1: BCM distills regional patches into grounded latent tokens ZT. (b) Stage 2: DAC trains hidden states H as dynamic probes that interrogate global features. (c) Inference: both modules remain entirely inactive; purely end-to-end autoregressive decoding.
BCM uses RoI-Align to extract local region features from bounding boxes, then compresses them into grounded latent tokens ZT via cross-attention. A Stochastic Decoupled Alignment strategy prevents representation collapse by alternating gradient flow between latent tokens ZT and hidden states H.
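The BCM pipeline above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the released implementation: the module name, dimensions, and the simple crop-and-pool stand-in for RoI-Align are all assumptions, and the Stochastic Decoupled Alignment is reduced to an MSE loss with randomly alternated gradient stopping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def extract_roi(feat_map, box, out=7):
    """Crop a box from a (C, H, W) feature map and pool to a fixed grid.
    A simple stand-in for RoI-Align (which interpolates sub-pixel bins)."""
    x1, y1, x2, y2 = box
    crop = feat_map[:, y1:y2, x1:x2]                 # (C, h, w)
    return F.adaptive_avg_pool2d(crop, out)          # (C, out, out)

class BoxGuidedCompressor(nn.Module):
    """Toy BCM sketch: compress RoI features into K grounded latent tokens
    Z_T via cross-attention with learned latent queries."""
    def __init__(self, dim=256, num_latents=4, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_map, boxes):
        # feat_map: (C, H, W); boxes: list of (x1, y1, x2, y2) in feature coords
        rois = torch.stack([extract_roi(feat_map, b) for b in boxes])  # (R, C, 7, 7)
        kv = rois.flatten(2).transpose(1, 2)                           # (R, 49, C)
        q = self.latents.unsqueeze(0).expand(len(boxes), -1, -1)       # (R, K, C)
        z_t, _ = self.attn(q, kv, kv)                                  # cross-attn
        return z_t                                                     # (R, K, C)

def stochastic_decoupled_alignment(z_t, h, p=0.5):
    """Alternate gradient flow: with prob p stop gradients through Z_T,
    otherwise through H, so neither branch can collapse onto the other."""
    if torch.rand(()) < p:
        z_t = z_t.detach()
    else:
        h = h.detach()
    return F.mse_loss(z_t, h)
```

The alternating `detach` is the key trick: each step, exactly one side of the alignment receives gradient, which is one plausible reading of "alternating gradient flow between latent tokens Z_T and hidden states H".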
DAC projects the LLM's evolving hidden states H into dynamic queries Qdyn, which interrogate the global visual feature map. MSE plus KL-divergence distillation from the BCM teacher transfers its spatial grounding expertise without requiring bounding-box priors at inference.
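A minimal PyTorch sketch of the DAC student and its distillation loss follows. All names and dimensions are illustrative assumptions; the real DAC sits inside the LLM, whereas this toy only shows the query projection, the attention over global patches, and the MSE + KL objective against a (detached) teacher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicProbe(nn.Module):
    """Toy DAC sketch: project hidden states H into dynamic queries Q_dyn
    that attend over the global visual feature map."""
    def __init__(self, hid=512, vis=256):
        super().__init__()
        self.q_proj = nn.Linear(hid, vis)
        self.scale = vis ** 0.5

    def forward(self, h, vis_feats):
        # h: (B, T, hid) latent hidden states; vis_feats: (B, P, vis) global patches
        q = self.q_proj(h)                                              # Q_dyn
        attn = torch.softmax(q @ vis_feats.transpose(1, 2) / self.scale, dim=-1)
        return attn, attn @ vis_feats          # (B, T, P) map, (B, T, vis) readout

def distill_loss(student_attn, teacher_attn, student_out, teacher_out):
    """MSE on readouts + KL between attention distributions; the teacher
    (BCM) is detached so gradients only train the student."""
    kl = F.kl_div(student_attn.clamp_min(1e-8).log(),
                  teacher_attn.detach(), reduction='batchmean')
    mse = F.mse_loss(student_out, teacher_out.detach())
    return mse + kl
```

Because the student's queries come only from its own hidden states, nothing at inference time depends on bounding boxes: the box-guided teacher exists purely to shape the student's attention during Stage 2.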
Both BCM and DAC remain entirely inactive at inference. The model decodes purely autoregressively: each latent reasoning step's hidden state is fed back as the next-step input embedding, adding no extra inference cost.
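The inference loop reduces to a plain feedback recurrence, sketched below. The `step_fn` stands in for one forward pass of the model and is hypothetical; the point is only that the output hidden state directly becomes the next input embedding, with no BCM/DAC call in the loop.

```python
def latent_decode(step_fn, x0, num_latent_steps=3):
    """Sketch of latent-space autoregressive decoding: each step's hidden
    state is reused verbatim as the next-step input embedding."""
    x = x0
    states = []
    for _ in range(num_latent_steps):
        h = step_fn(x)       # one autoregressive forward pass (hypothetical)
        states.append(h)
        x = h                # hidden state -> next input embedding
    return states

# Tiny numeric stand-in for step_fn to show the recurrence:
print(latent_decode(lambda x: x * 2, 1, 3))  # -> [2, 4, 8]
```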
Latent reasoning autonomously localizes task-critical visual evidence — no external tools or bounding box priors
Latent reasoning visualization at inference. Dynamic probes interrogate the visual feature space step-by-step. Each latent step progressively sharpens the model's visual focus onto the task-critical region, driving a precise final answer without any external grounding tool.
Teacher vs. Student attention maps during Stage 2 distillation. The BCM teacher (top row) produces spatially precise attention over ground-truth regions; the DAC student (bottom row) progressively learns to match this spatial focus using only its own latent states as queries.
V-Reflection consistently outperforms the Qwen2.5-VL-7B baseline across all six perception-intensive benchmarks
| Benchmark | Qwen2.5-VL-7B | V-Reflection (ours) | Gain |
|---|---|---|---|
| MMVP | 66.7 | 72.3 | +5.6 |
| BLINK | 54.5 | 56.4 | +1.9 |
| V* Bench | 78.5 | 81.7 | +3.2 |
| HRBench-4K | 68.0 | 72.6 | +4.6 |
| HRBench-8K | 63.8 | 66.3 | +2.5 |
| MME-RealWorld-Lite | 45.8 | 53.9 | +8.1 |
Get started with V-Reflection in a few commands
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
# Stage 1: Box-Guided Compression (BCM)
bash scripts_release/train/sft_7b_stage1_box_resampler.sh
# Set --data_path and --image_folder in the script if needed
# Stage 2: Dynamic Autoregressive Compression (DAC)
export CHECKPOINT_PATH="path/to/stage1_checkpoint"
bash scripts_release/train/sft_7b_stage2_distillation.sh
# Full benchmark evaluation (BLINK, MMVP, VSTAR, HRBench4K/8K, MME-RealWorld-Lite)
export EVAL_CHECKPOINT_PATH="path/to/checkpoint"
bash scripts_release/evaluation/evaluation_7b_stage2.sh