Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Jun 18, 2026 · Author A, Xuankun Yang, Author C, Author D, Author E · 1 min read
EADP overview
Abstract
Visual token pruning is a crucial strategy for accelerating Vision-Language Models (VLMs) by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this work, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise, which corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Then, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly encouraging a holistic and non-redundant visual representation. Extensive experiments show that EADP improves the accuracy-efficiency trade-off of VLMs, preserving fine-grained visual cues under strict token budgets on challenging multimodal benchmarks.
Type
Publication
In European Conference on Computer Vision (ECCV), 2026

Motivation

Motivation examples for EADP

(a) illustrates a limitation of global guidance: it tends to attend to background regions. (b) highlights the dispersion phenomenon caused by textual noise. (c) reveals the issues of feature fragmentation and selection redundancy.

Method Overview

Overview of the EADP method

EADP acts as a lightweight, plug-and-play module that compresses the original set of visual tokens into a smaller, more informative subset before the downstream LLM consumes them.

Stage 1: Entropy-Aware Dense Scoring

EADP computes dense cross-modal similarities between non-EOS text tokens and visual tokens, then estimates the spatial entropy of each text token’s similarity distribution. High-entropy tokens are treated as dispersed textual noise and filtered or down-weighted. The remaining low-entropy dense guidance is fused with the global EOS score to produce an instruction relevance map with both local precision and global semantic stability.
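The scoring stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `entropy_aware_scores`, the softmax temperature, the keep-ratio rule for discarding high-entropy text tokens, and the convex fusion weight `alpha` are all assumptions made for illustration, and the paper defines the actual filtering and fusion rules.

```python
import numpy as np


def _l2norm(x):
    # Normalise along the feature axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def _softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_aware_scores(text_feats, eos_feat, vis_feats,
                         temperature=0.05, keep_ratio=0.5, alpha=0.5):
    """Hypothetical sketch of entropy-aware dense scoring.

    text_feats: (T, D) non-EOS text token features
    eos_feat:   (D,)   global EOS text feature
    vis_feats:  (N, D) visual token features
    Returns (score, entropy, kept_text_indices).
    """
    t, e, v = _l2norm(text_feats), _l2norm(eos_feat), _l2norm(vis_feats)

    # Dense cross-modal similarity: one distribution over visual tokens per text token.
    p = _softmax((t @ v.T) / temperature, axis=1)            # (T, N)

    # Entropy of each text token's distribution over visual tokens.
    # High entropy means the token's attention is dispersed across the image.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)           # (T,)

    # Assumption: keep a fixed fraction of the lowest-entropy text tokens.
    n_keep = max(1, int(round(keep_ratio * len(entropy))))
    keep = np.argsort(entropy)[:n_keep]
    dense = p[keep].mean(axis=0)                             # (N,)

    # Global guidance from the EOS token, then an assumed convex fusion.
    global_score = _softmax((v @ e) / temperature)           # (N,)
    score = alpha * dense + (1.0 - alpha) * global_score
    return score, entropy, keep
```

In this sketch, a text token whose similarity is spread evenly over all visual tokens receives the maximum entropy `log(N)` and is dropped first, while a token that matches a single patch keeps its sharp peak in the fused score.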

Stage 2: Structured Token Selection

After scoring, EADP refines the relevance map with spatial smoothing and score polarization. Gaussian smoothing propagates local structure, while polarization sharpens core visual entities against the background. Instead of selecting tokens with naive Top-K, EADP formulates token selection as a facility-location submodular maximization problem, encouraging non-redundant coverage of the original visual content.
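The two steps of this stage can be sketched as follows, under stated assumptions. The power-transform form of polarization, the relevance-weighted facility-location objective, and the names `refine_scores` and `greedy_facility_location` are illustrative choices rather than the paper's exact formulation; in particular, the spatial prior is represented here only through the smoothed relevance weights.

```python
import numpy as np


def refine_scores(score_map, sigma=1.0, gamma=2.0):
    """Gaussian-smooth an (H, W) relevance map, then polarize it.

    Polarization is assumed to be min-max normalisation followed by a
    power transform (gamma > 1 pushes low scores toward zero).
    """
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()

    # Separable convolution with edge padding: rows first, then columns.
    padded = np.pad(score_map, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    smooth = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

    lo, hi = smooth.min(), smooth.max()
    return ((smooth - lo) / (hi - lo + 1e-8)) ** gamma


def greedy_facility_location(vis_feats, relevance, k):
    """Greedily maximise F(S) = sum_i relevance[i] * max_{j in S} sim(i, j).

    vis_feats: (N, D) visual token features
    relevance: (N,)   non-negative relevance weights (flattened score map)
    Returns the sorted indices of the k selected tokens.
    """
    v = vis_feats / (np.linalg.norm(vis_feats, axis=1, keepdims=True) + 1e-8)
    # Clip to non-negative similarities so the objective stays monotone.
    sim = np.clip(v @ v.T, 0.0, None)                        # (N, N)

    n = len(v)
    best = np.zeros(n)                  # how well each token is covered so far
    chosen = np.zeros(n, dtype=bool)
    for _ in range(min(k, n)):
        # Marginal gain of adding candidate j (one column per candidate).
        gains = (relevance[:, None]
                 * np.maximum(sim - best[:, None], 0.0)).sum(axis=0)
        gains[chosen] = -np.inf         # never pick the same token twice
        j = int(np.argmax(gains))
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return sorted(np.flatnonzero(chosen).tolist())
```

Because the marginal gain of a token shrinks once a similar token has been selected, near-duplicate patches are unlikely to be chosen together, which is the behavior naive Top-K cannot provide. For monotone submodular objectives such as this one, the standard greedy procedure carries the well-known (1 - 1/e) approximation guarantee.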

Results

Results on LLaVA-1.5


Results on LLaVA-1.6


Results on Qwen2.5-VL


Results on LLaVA-Video


Efficiency Analysis


More results are provided in our paper.