Transform MLLMs into
Active Interrogators.

V-Reflection introduces a "think-then-look" visual reflection mechanism in which evolving latent states act as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence.

V-Reflection teaser visualization

(a) Traditional MLLMs treat visual information as a static input, leading to perception hallucinations (e.g., "Kevlar") by prioritizing language priors over visual evidence. (b) V-Reflection's "think-then-look" mechanism uses evolving latent states as dynamic probes (Qdyn) to retrace global visual features, accurately localizing task-critical evidence (e.g., the rubber glove) for a precise answer.

Abstract

MLLMs remain prone to perception-related hallucinations because their reasoning is confined to the language domain, treating visual input as a static, reasoning-agnostic preamble. We propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. A two-stage distillation strategy bridges explicit visual grounding with continuous latent reasoning: a Box-Guided Compression module (BCM) establishes stable pixel-to-latent targets, and a Dynamic Autoregressive Compression module (DAC) distills this spatial expertise into dynamic latent probes. During inference, both modules remain entirely inactive, preserving purely end-to-end autoregressive decoding with no extra inference cost.

How V-Reflection Works

A two-stage distillation strategy that bridges explicit spatial grounding with continuous latent reasoning

V-Reflection framework overview

Fig. 1 — V-Reflection Architecture. (a) Stage 1: BCM distills regional patches into grounded latent tokens ZT. (b) Stage 2: DAC trains hidden states H as dynamic probes that interrogate global features. (c) Inference: both modules remain entirely inactive; purely end-to-end autoregressive decoding.

Stage 1

Box-Guided Compression (BCM)

BCM uses RoI-Align to extract local region features from bounding boxes, then compresses them into grounded latent tokens ZT via cross-attention. A Stochastic Decoupled Alignment strategy prevents representation collapse by alternating gradient flow between latent tokens ZT and hidden states H.

Pixel-to-Latent · Box-Guided
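To make the Stage-1 compression concrete, here is a minimal NumPy sketch: patch features inside a bounding box are pooled into a few latent tokens by cross-attention. All names, dimensions, and the fixed (non-learned) latent queries are illustrative assumptions; the actual implementation uses RoI-Align over the feature map and trained cross-attention layers, while this sketch simply snaps the box to the patch grid.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_roi(roi_feats, latent_queries):
    """Cross-attention pooling: latent queries attend over RoI patch
    features and return grounded latent tokens Z_T (num_latents x d)."""
    d = latent_queries.shape[-1]
    attn = softmax(latent_queries @ roi_feats.T / np.sqrt(d))  # (K, N_patches)
    return attn @ roi_feats

rng = np.random.default_rng(0)
feat_map = rng.normal(size=(32, 32, 256))      # ViT patch grid (hypothetical size)
x1, y1, x2, y2 = 4, 4, 20, 20                  # bbox snapped to patch coordinates
roi = feat_map[y1:y2, x1:x2].reshape(-1, 256)  # stand-in for RoI-Align extraction
latents = rng.normal(size=(4, 256)) * 0.02     # learnable queries in the real model
z_t = compress_roi(roi, latents)               # grounded latent tokens Z_T
```

The Stochastic Decoupled Alignment step (alternating gradient flow between Z_T and H) is a training-schedule detail and is not shown here.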
Stage 2

Dynamic Autoregressive Compression (DAC)

DAC projects the LLM's evolving hidden states H into dynamic queries Qdyn, which interrogate the global visual feature map. MSE and KL-divergence distillation from the BCM teacher transfers spatial-grounding expertise without requiring bounding-box priors at inference.

Dynamic Probes · KL Distillation
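A shape-level sketch of the Stage-2 objective, under stated assumptions: the hidden states are projected into dynamic queries, attention over the global visual features produces student latent tokens and attention maps, and these are scored against the BCM teacher with MSE and KL terms. Every name, shape, and weighting here is hypothetical; only the MSE + KL structure comes from the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dac_distill_loss(h, v_global, w_q, z_teacher, a_teacher, eps=1e-9):
    """Project hidden states H into dynamic queries Q_dyn, attend over the
    global visual feature map, and distill from the BCM teacher via MSE on
    pooled features plus KL on attention maps (unweighted sum, a sketch)."""
    q = h @ w_q                                     # Q_dyn: (T, d)
    logits = q @ v_global.T / np.sqrt(q.shape[-1])  # (T, P) attention logits
    a_student = softmax(logits)
    z_student = a_student @ v_global                # student latent tokens
    mse = np.mean((z_student - z_teacher) ** 2)
    kl = np.mean(np.sum(a_teacher * (np.log(a_teacher + eps)
                                     - np.log(a_student + eps)), axis=-1))
    return mse + kl

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 64))             # hidden states for 5 latent steps
v = rng.normal(size=(49, 64))            # global visual features (7x7 patches)
w_q = rng.normal(size=(64, 64)) * 0.05   # query projection (learned in practice)
a_t = softmax(rng.normal(size=(5, 49)))  # teacher attention from BCM
z_t = a_t @ v                            # teacher latent targets
loss = dac_distill_loss(h, v, w_q, z_t, a_t)
```

Because the teacher signal is attention over features, no bounding boxes are needed once the student matches it; that is what lets inference drop the box prior entirely.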
Inference

End-to-End Latent Decoding

Both BCM and DAC remain entirely inactive. The model decodes purely autoregressively in latent space, feeding each latent state back as the next-step input embedding, so latent reasoning adds no extra inference cost.

Pure Autoregressive Decoding · No Bbox Prior · Latent Reasoning
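The inference loop can be sketched in a few lines: each step's final hidden state is appended as the next input embedding, with no auxiliary modules in the path. The `step_fn` stand-in below is an identity function used only to show the data flow; the real model is the frozen MLLM forward pass.

```python
import numpy as np

def latent_decode(step_fn, embeds, n_latent_steps=4):
    """Append each step's final hidden state as the next input embedding,
    so reasoning proceeds purely autoregressively in latent space; no
    BCM/DAC modules and no bbox priors are involved at inference."""
    for _ in range(n_latent_steps):
        hidden = step_fn(embeds)  # (T, d) hidden states for current sequence
        embeds = np.concatenate([embeds, hidden[-1:]], axis=0)
    return embeds

# toy stand-in for the frozen MLLM forward pass (identity, shapes only)
trace = latent_decode(lambda e: e, np.zeros((3, 8)))
```

Starting from 3 input embeddings, 4 latent steps grow the sequence to 7 positions; in the real model each appended embedding carries the probe state that re-attends to the visual features at the next step.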

Visualizations

Latent reasoning autonomously localizes task-critical visual evidence — no external tools or bounding box priors

Latent reasoning visualization at inference. Dynamic probes interrogate the visual feature space step-by-step. Each latent step progressively sharpens the model's visual focus onto the task-critical region, driving a precise final answer without any external grounding tool.

Teacher vs. Student attention maps during Stage 2 distillation. The BCM teacher (top row) produces spatially precise attention over ground-truth regions; the DAC student (bottom row) progressively learns to match this spatial focus using only its own latent states as queries.

Training attention maps — Teacher vs Student

Benchmark Results

V-Reflection consistently outperforms the Qwen2.5-VL-7B baseline across all six perception-intensive benchmarks

Benchmark Qwen2.5-VL-7B V-Reflection (ours) Gain
MMVP 66.7 72.3 +5.6
BLINK 54.5 56.4 +1.9
V* Bench 78.5 81.7 +3.2
HRBench-4K 68.0 72.6 +4.6
HRBench-8K 63.8 66.3 +2.5
MME-RealWorld-Lite 45.8 53.9 +8.1
Download the model weights from Hugging Face and reproduce all results with the provided evaluation scripts.

Quick Start

Get started with V-Reflection in a few commands

# Environment setup
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

# Stage 1: Box-Guided Compression (BCM)
# Set --data_path and --image_folder in the script if needed
bash scripts_release/train/sft_7b_stage1_box_resampler.sh

# Stage 2: Dynamic Autoregressive Compression (DAC)
export CHECKPOINT_PATH="path/to/stage1_checkpoint"
bash scripts_release/train/sft_7b_stage2_distillation.sh

# Full benchmark evaluation (BLINK, MMVP, VSTAR, HRBench4K/8K, MME-RealWorld-Lite)
export EVAL_CHECKPOINT_PATH="path/to/checkpoint"
bash scripts_release/evaluation/evaluation_7b_stage2.sh