Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning
[Figure: EADP overview]
Motivation

(a) illustrates a limitation of global guidance: attention tends to drift toward background regions. (b) highlights the dispersion caused by textual noise. (c) reveals feature fragmentation and selection redundancy.
Method Overview

EADP acts as a lightweight, plug-and-play module that compresses the original set of visual tokens into a smaller, more informative subset before the downstream LLM consumes them.
Stage 1: Entropy-Aware Dense Scoring
EADP computes dense cross-modal similarities between non-EOS text tokens and visual tokens, then estimates the spatial entropy of each text token’s similarity distribution. High-entropy tokens are treated as dispersed textual noise and filtered or down-weighted. The remaining low-entropy dense guidance is fused with the global EOS score to produce an instruction relevance map with both local precision and global semantic stability.
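The scoring stage described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: the function name `entropy_aware_dense_scores`, the softmax `temperature`, the quantile-based entropy cutoff `keep_quantile`, and the linear fusion weight `alpha` are all assumptions made for the example. It also uses hard filtering of high-entropy text tokens, whereas the method may down-weight them instead.

```python
import numpy as np


def entropy_aware_dense_scores(text_feats, eos_feat, visual_feats,
                               temperature=0.07, keep_quantile=0.5, alpha=0.5):
    """Illustrative entropy-aware dense scoring (all hyperparameters are assumed).

    text_feats:   (T, D) embeddings of the non-EOS text tokens
    eos_feat:     (D,)   embedding of the EOS (global) text token
    visual_feats: (N, D) embeddings of the visual tokens
    Returns (relevance over N visual tokens, per-text-token entropy, keep mask).
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    t, g, v = l2norm(text_feats), l2norm(eos_feat), l2norm(visual_feats)

    # Dense cross-modal similarity, turned into a distribution over visual tokens.
    logits = (t @ v.T) / temperature                      # (T, N)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)

    # Spatial entropy of each text token's similarity distribution.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)        # (T,)

    # Assumption: keep text tokens whose entropy is at or below a quantile.
    keep = entropy <= np.quantile(entropy, keep_quantile)

    # Low-entropy dense guidance, averaged over the kept text tokens.
    dense = p[keep].mean(axis=0)                          # (N,)

    # Global guidance from the EOS token.
    g_logits = (v @ g) / temperature
    g_logits -= g_logits.max()
    global_score = np.exp(g_logits)
    global_score /= global_score.sum()

    # Assumption: simple convex combination of dense and global scores.
    relevance = alpha * dense + (1.0 - alpha) * global_score
    return relevance, entropy, keep
```

In this sketch a text token whose similarity is spread evenly over all visual tokens reaches the maximum entropy `log(N)` and is dropped, while a token that matches a single region keeps a sharply peaked distribution and contributes to the relevance map.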
Stage 2: Structured Token Selection
After scoring, EADP refines the relevance map with spatial smoothing and score polarization. Gaussian smoothing propagates local structure, while polarization sharpens core visual entities against the background. Instead of selecting tokens with naive Top-K, EADP formulates token selection as a facility-location submodular maximization problem, encouraging non-redundant coverage of the original visual content.
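The refinement and selection steps above can be sketched as follows. This is a hedged illustration rather than the paper's exact formulation: the helper names, the min-max normalization followed by a power `gamma` as the polarization step, the relevance-weighted form of the facility-location objective, and the clipped cosine similarity are all assumptions. The greedy loop is the standard approximation strategy for monotone submodular maximization.

```python
import numpy as np


def gaussian_smooth(score_map, sigma=1.0):
    """Separable Gaussian blur of a 2D score map (edge-padded)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(score_map, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def refine_relevance(relevance, grid_hw, sigma=1.0, gamma=2.0):
    """Smooth the relevance map, then polarize it (assumed: min-max + power)."""
    h, w = grid_hw
    smoothed = gaussian_smooth(relevance.reshape(h, w), sigma)
    lo, hi = smoothed.min(), smoothed.max()
    normed = (smoothed - lo) / (hi - lo + 1e-8)
    return (normed ** gamma).reshape(-1)   # gamma > 1 suppresses low scores


def facility_location_select(features, relevance, k):
    """Greedy maximization of F(S) = sum_i w_i * max_{j in S} sim(i, j).

    Assumption: w_i is the refined relevance of token i and sim is
    cosine similarity clipped at zero.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(f @ f.T, 0.0, None)                     # (N, N)
    n = sim.shape[0]
    best = np.zeros(n)                 # how well each token is covered so far
    taken = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # Marginal gain of adding candidate j to the selected set.
        gains = (relevance[:, None] * np.maximum(sim - best[:, None], 0.0)).sum(axis=0)
        gains[taken] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        taken[j] = True
        best = np.maximum(best, sim[:, j])
    return selected
```

The contrast with Top-K shows up when several high-scoring tokens are near-duplicates: once one of them is selected, the others add almost no marginal coverage, so the greedy step moves on to a token from a different region even if its raw score is lower.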
Results
Results on LLaVA-1.5

Results on LLaVA-1.6

Results on Qwen2.5-VL

Results on LLaVA-Video

Efficiency Analysis

More results are provided in our paper.