Transform MLLMs into
Active Interrogators.

V-Reflection introduces a "think-then-look" visual reflection mechanism in which evolving latent states act as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence.

V-Reflection teaser visualization

(a) Traditional MLLMs treat visual information as a static input, leading to perception hallucinations (e.g., "Kevlar") by prioritizing language priors over visual evidence. (b) V-Reflection's "think-then-look" mechanism uses evolving latent states as dynamic probes (Qdyn) to retrace global visual features, accurately localizing task-critical evidence (e.g., the rubber glove) for a precise answer.

Abstract

MLLMs remain prone to perception-related hallucinations because their reasoning is confined to the language domain, treating visual input as a static, reasoning-agnostic preamble. We propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. A two-stage distillation strategy bridges explicit visual grounding with continuous latent reasoning: a Box-Guided Compression module (BCM) establishes stable pixel-to-latent targets, and a Dynamic Autoregressive Compression module (DAC) distills this spatial expertise into dynamic latent probes. During inference, both modules remain entirely inactive, preserving purely end-to-end autoregressive decoding with no extra inference cost.

How V-Reflection Works

A two-stage distillation strategy that bridges explicit spatial grounding with continuous latent reasoning

V-Reflection framework overview

Fig. 1 — V-Reflection Architecture. (a) Stage 1: BCM distills regional patches into grounded latent tokens ZT. (b) Stage 2: DAC trains hidden states H as dynamic probes that interrogate global features. (c) Inference: both modules remain entirely inactive; purely end-to-end autoregressive decoding.

Stage 1

Box-Guided Compression (BCM)

BCM uses RoI-Align to extract local region features from bounding boxes, then compresses them into grounded latent tokens ZT via cross-attention. A Stochastic Decoupled Alignment strategy prevents representation collapse by alternating gradient flow between latent tokens ZT and hidden states H.

Pixel-to-Latent · Box-Guided
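To make the Stage-1 compression concrete, here is a minimal NumPy sketch: patch features inside a bounding box are pooled into a few latent tokens by cross-attention. All names, dimensions, and the fixed (non-learned) latent queries are illustrative assumptions; the actual implementation uses RoI-Align over the feature map and trained cross-attention layers, while this sketch simply snaps the box to the patch grid.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_roi(roi_feats, latent_queries):
    """Cross-attention pooling: latent queries attend over RoI patch
    features and return grounded latent tokens Z_T (num_latents x d)."""
    d = latent_queries.shape[-1]
    attn = softmax(latent_queries @ roi_feats.T / np.sqrt(d))  # (K, N_patches)
    return attn @ roi_feats

rng = np.random.default_rng(0)
feat_map = rng.normal(size=(32, 32, 256))      # ViT patch grid (hypothetical size)
x1, y1, x2, y2 = 4, 4, 20, 20                  # bbox snapped to patch coordinates
roi = feat_map[y1:y2, x1:x2].reshape(-1, 256)  # stand-in for RoI-Align extraction
latents = rng.normal(size=(4, 256)) * 0.02     # learnable queries in the real model
z_t = compress_roi(roi, latents)               # grounded latent tokens Z_T
```

The Stochastic Decoupled Alignment step (alternating gradient flow between Z_T and H) is a training-schedule detail and is not shown here.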
Stage 2

Dynamic Autoregressive Compression (DAC)

DAC projects the LLM's evolving hidden states H into dynamic queries Qdyn, which interrogate the global visual feature map. MSE and KL-divergence distillation from the BCM teacher transfers spatial-grounding expertise without requiring bounding-box priors at inference.

Dynamic Probes · KL Distillation
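A shape-level sketch of the Stage-2 objective, under stated assumptions: the hidden states are projected into dynamic queries, attention over the global visual features produces student latent tokens and attention maps, and these are scored against the BCM teacher with MSE and KL terms. Every name, shape, and weighting here is hypothetical; only the MSE + KL structure comes from the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dac_distill_loss(h, v_global, w_q, z_teacher, a_teacher, eps=1e-9):
    """Project hidden states H into dynamic queries Q_dyn, attend over the
    global visual feature map, and distill from the BCM teacher via MSE on
    pooled features plus KL on attention maps (unweighted sum, a sketch)."""
    q = h @ w_q                                     # Q_dyn: (T, d)
    logits = q @ v_global.T / np.sqrt(q.shape[-1])  # (T, P) attention logits
    a_student = softmax(logits)
    z_student = a_student @ v_global                # student latent tokens
    mse = np.mean((z_student - z_teacher) ** 2)
    kl = np.mean(np.sum(a_teacher * (np.log(a_teacher + eps)
                                     - np.log(a_student + eps)), axis=-1))
    return mse + kl

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 64))             # hidden states for 5 latent steps
v = rng.normal(size=(49, 64))            # global visual features (7x7 patches)
w_q = rng.normal(size=(64, 64)) * 0.05   # query projection (learned in practice)
a_t = softmax(rng.normal(size=(5, 49)))  # teacher attention from BCM
z_t = a_t @ v                            # teacher latent targets
loss = dac_distill_loss(h, v, w_q, z_t, a_t)
```

Because the teacher signal is attention over features, no bounding boxes are needed once the student matches it; that is what lets inference drop the box prior entirely.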
Inference

End-to-End Latent Decoding

Both BCM and DAC remain entirely inactive. The model decodes purely autoregressively in latent space, feeding each latent state back as the next-step input embedding, so latent reasoning adds no extra inference cost.

Pure Autoregressive Decoding · No Bbox Prior · Latent Reasoning
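The inference loop can be sketched in a few lines: each step's final hidden state is appended as the next input embedding, with no auxiliary modules in the path. The `step_fn` stand-in below is an identity function used only to show the data flow; the real model is the frozen MLLM forward pass.

```python
import numpy as np

def latent_decode(step_fn, embeds, n_latent_steps=4):
    """Append each step's final hidden state as the next input embedding,
    so reasoning proceeds purely autoregressively in latent space; no
    BCM/DAC modules and no bbox priors are involved at inference."""
    for _ in range(n_latent_steps):
        hidden = step_fn(embeds)  # (T, d) hidden states for current sequence
        embeds = np.concatenate([embeds, hidden[-1:]], axis=0)
    return embeds

# toy stand-in for the frozen MLLM forward pass (identity, shapes only)
trace = latent_decode(lambda e: e, np.zeros((3, 8)))
```

Starting from 3 input embeddings, 4 latent steps grow the sequence to 7 positions; in the real model each appended embedding carries the probe state that re-attends to the visual features at the next step.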

Visualizations

Latent reasoning autonomously localizes task-critical visual evidence — no external tools or bounding box priors

Latent reasoning visualization at inference. Dynamic probes interrogate the visual feature space step-by-step. Each latent step progressively sharpens the model's visual focus onto the task-critical region, driving a precise final answer without any external grounding tool.

Teacher vs. Student attention maps during Stage 2 distillation. The BCM teacher (top row) produces spatially precise attention over ground-truth regions; the DAC student (bottom row) progressively learns to match this spatial focus using only its own latent states as queries.

Training attention maps — Teacher vs Student

Benchmark Results

V-Reflection consistently outperforms the Qwen2.5-VL-7B baseline across all six perception-intensive benchmarks

Benchmark Qwen2.5-VL-7B V-Reflection (ours) Gain
MMVP 66.7 72.3 +5.6
BLINK 54.5 56.4 +1.9
V* Bench 78.5 81.7 +3.2
HRBench-4K 68.0 72.6 +4.6
HRBench-8K 63.8 66.3 +2.5
MME-RealWorld-Lite 45.8 53.9 +8.1
Download the model weights from Hugging Face and reproduce all results with the provided evaluation scripts.

Quick Start

Get started with V-Reflection in a few commands

# Environment setup
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

# Stage 1: Box-Guided Compression (BCM)
# Set --data_path and --image_folder in the script if needed
bash scripts_release/train/sft_7b_stage1_box_resampler.sh

# Stage 2: Dynamic Autoregressive Compression (DAC)
export CHECKPOINT_PATH="path/to/stage1_checkpoint"
bash scripts_release/train/sft_7b_stage2_distillation.sh

# Full benchmark evaluation (BLINK, MMVP, VSTAR, HRBench4K/8K, MME-RealWorld-Lite)
export EVAL_CHECKPOINT_PATH="path/to/checkpoint"
bash scripts_release/evaluation/evaluation_7b_stage2.sh