<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vision-Language Models | Learn more about Xuankun Yang</title><link>https://xuankunyang.github.io/tags/vision-language-models/</link><atom:link href="https://xuankunyang.github.io/tags/vision-language-models/index.xml" rel="self" type="application/rss+xml"/><description>Vision-Language Models</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 18 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://xuankunyang.github.io/media/icon_hu_2097bf43f9a65fef.png</url><title>Vision-Language Models</title><link>https://xuankunyang.github.io/tags/vision-language-models/</link></image><item><title>Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning</title><link>https://xuankunyang.github.io/publication/2026-eccv-eadp/</link><pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate><guid>https://xuankunyang.github.io/publication/2026-eccv-eadp/</guid><description>&lt;h3 id="motivation"&gt;Motivation&lt;/h3&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/motivation.png"
alt="Motivation examples for EADP"&gt;
&lt;/figure&gt;
&lt;p&gt;(a) illustrates a limitation of global guidance: it tends to attend to background regions.
(b) highlights the dispersion phenomenon caused by textual noise.
(c) reveals the issues of feature fragmentation and selection redundancy.&lt;/p&gt;
&lt;h3 id="method-overview"&gt;Method Overview&lt;/h3&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/featured.png"
alt="Overview of the EADP method"&gt;
&lt;/figure&gt;
&lt;p&gt;EADP acts as a lightweight, plug-and-play module that compresses the original set of visual tokens into a smaller, more informative subset before the downstream LLM consumes them.&lt;/p&gt;
&lt;h4 id="stage-1-entropy-aware-dense-scoring"&gt;Stage 1: Entropy-Aware Dense Scoring&lt;/h4&gt;
&lt;p&gt;EADP computes dense cross-modal similarities between every non-EOS text token and the visual tokens, then estimates the spatial entropy of each text token&amp;rsquo;s similarity distribution. A token whose similarity is spread almost evenly across the image localizes nothing, so high-entropy tokens are treated as dispersed textual noise and filtered or down-weighted. The remaining low-entropy dense guidance is fused with the global EOS score, producing an instruction relevance map that combines local precision with global semantic stability.&lt;/p&gt;
&lt;h4 id="stage-2-structured-token-selection"&gt;Stage 2: Structured Token Selection&lt;/h4&gt;
&lt;p&gt;After scoring, EADP refines the relevance map with spatial smoothing and score polarization. Gaussian smoothing propagates local structure, while polarization sharpens core visual entities against the background. Instead of selecting tokens with naive Top-K, EADP formulates token selection as a facility-location submodular maximization problem, encouraging non-redundant coverage of the original visual content.&lt;/p&gt;
&lt;h3 id="results"&gt;Results&lt;/h3&gt;
&lt;h4 id="results-on-llava-15"&gt;Results on LLaVA-1.5&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/llava_1p5_7b.png"
alt="Results on LLaVA-1.5"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-llava-16"&gt;Results on LLaVA-1.6&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/llava_1p6_7b.png"
alt="Results on LLaVA-1.6"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-qwen25-vl"&gt;Results on Qwen2.5-VL&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/qwen2p5_vl_7b.png"
alt="Results on Qwen2.5-VL"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-llava-video"&gt;Results on LLaVA-Video&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/llava_video_7b.png"
alt="Results on LLaVA-Video" width="75%"&gt;
&lt;/figure&gt;
&lt;h4 id="efficiency-analysis"&gt;Efficiency Analysis&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/publication/2026-eccv-eadp/efficiency_analysis.png"
alt="Efficiency analysis" width="75%"&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;More results are provided in our paper.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>EADP</title><link>https://xuankunyang.github.io/project/EADP/</link><pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate><guid>https://xuankunyang.github.io/project/EADP/</guid><description>&lt;h2 id="combating-textual-noise-and-redundancy-entropy-aware-dense-visual-token-pruning"&gt;Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ECCV 2026&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Author A, Xuankun Yang, Author C, Author D, Author E&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Affiliations:&lt;/strong&gt; Institution A, Institution B, Institution C&lt;/p&gt;
&lt;p&gt;&lt;a href="#resources"&gt;Paper&lt;/a&gt; | &lt;a href="#resources"&gt;Code&lt;/a&gt; | &lt;a href="#citation"&gt;BibTeX&lt;/a&gt;&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; EADP is a plug-and-play visual token pruning framework for VLMs/MLLMs. It combines entropy-aware dense scoring with submodular token selection to preserve fine-grained visual cues under strict token budgets.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;!-- TODO: Replace author list, affiliations, venue metadata, links, and figures after the camera-ready/project assets are finalized. --&gt;
&lt;h3 id="abstract"&gt;Abstract&lt;/h3&gt;
&lt;p&gt;Visual token pruning is a crucial strategy for accelerating Vision-Language Models by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose &lt;strong&gt;E&lt;/strong&gt;ntropy-&lt;strong&gt;A&lt;/strong&gt;ware &lt;strong&gt;D&lt;/strong&gt;ense &lt;strong&gt;P&lt;/strong&gt;runing (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-$K$ selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly guaranteeing a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP significantly improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving state-of-the-art performance on challenging multimodal benchmarks.&lt;/p&gt;
&lt;h3 id="motivation"&gt;Motivation&lt;/h3&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/motivation.png"
alt="Motivation examples for EADP"&gt;
&lt;/figure&gt;
&lt;p&gt;(a) illustrates a limitation of global guidance: it tends to attend to background regions.
(b) highlights the dispersion phenomenon caused by textual noise.
(c) reveals the issues of feature fragmentation and selection redundancy.&lt;/p&gt;
&lt;h3 id="method-overview"&gt;Method Overview&lt;/h3&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/featured.png"
alt="Overview of the EADP method"&gt;
&lt;/figure&gt;
&lt;p&gt;EADP acts as a lightweight, plug-and-play module that compresses the original set of visual tokens into a smaller, more informative subset before the downstream LLM consumes them.&lt;/p&gt;
&lt;h4 id="stage-1-entropy-aware-dense-scoring"&gt;Stage 1: Entropy-Aware Dense Scoring&lt;/h4&gt;
&lt;p&gt;EADP computes dense cross-modal similarities between every non-EOS text token and the visual tokens, then estimates the spatial entropy of each text token&amp;rsquo;s similarity distribution. A token whose similarity is spread almost evenly across the image localizes nothing, so high-entropy tokens are treated as dispersed textual noise and filtered or down-weighted. The remaining low-entropy dense guidance is fused with the global EOS score, producing an instruction relevance map that combines local precision with global semantic stability.&lt;/p&gt;
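&lt;p&gt;As a rough illustration of this scoring stage, the dependency-free Python sketch below ranks text tokens by the entropy of their softmax-normalized similarities, keeps the lowest-entropy fraction, and blends their per-patch maximum with the EOS score. The function names, the softmax temperature, the hard keep ratio, the max aggregation, and the min-max fusion weight &lt;code&gt;alpha&lt;/code&gt; are all assumptions made for this example and may differ from the formulation in the paper.&lt;/p&gt;

```python
import math


def softmax(xs, temperature):
    # Numerically stable softmax over one text token's similarities.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(ps):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in ps if p)


def minmax(xs):
    # Rescale to [0, 1]; a constant input maps to all zeros.
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in xs]


def entropy_aware_scores(dense_sim, eos_sim, keep_ratio=0.5, alpha=0.5,
                         temperature=0.05):
    """Hypothetical helper, not the authors' implementation.

    dense_sim[t][n]: similarity of non-EOS text token t to visual token n.
    eos_sim[n]:      global EOS similarity of visual token n.
    Returns (per-visual-token relevance scores, indices of kept text tokens).
    """
    # 1. Entropy of each text token's similarity distribution over patches.
    ents = [entropy(softmax(row, temperature)) for row in dense_sim]
    # 2. Keep the lowest-entropy (most spatially focused) text tokens.
    n_keep = max(1, round(keep_ratio * len(dense_sim)))
    kept = sorted(range(len(dense_sim)), key=lambda t: ents[t])[:n_keep]
    # 3. Dense guidance: best similarity of each patch to any kept token.
    n_vis = len(eos_sim)
    dense = [max(dense_sim[t][n] for t in kept) for n in range(n_vis)]
    # 4. Fuse dense and global guidance after rescaling both.
    d, g = minmax(dense), minmax(eos_sim)
    scores = [alpha * d[n] + (1 - alpha) * g[n] for n in range(n_vis)]
    return scores, kept


# Toy input: text token 0 is focused on patch 0, text token 1 is uniform noise.
dense_sim = [[0.9, 0.1, 0.1, 0.1],
             [0.3, 0.3, 0.3, 0.3]]
eos_sim = [0.6, 0.2, 0.5, 0.1]
scores, kept = entropy_aware_scores(dense_sim, eos_sim)
print(kept, scores)
```

&lt;p&gt;In this toy input the uniform text token has the maximum possible entropy and is dropped, so only the focused token shapes the dense part of the relevance map.&lt;/p&gt;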
&lt;h4 id="stage-2-structured-token-selection"&gt;Stage 2: Structured Token Selection&lt;/h4&gt;
&lt;p&gt;After scoring, EADP refines the relevance map with spatial smoothing and score polarization. Gaussian smoothing propagates local structure, while polarization sharpens core visual entities against the background. Instead of selecting tokens with naive Top-K, EADP formulates token selection as a facility-location submodular maximization problem, encouraging non-redundant coverage of the original visual content.&lt;/p&gt;
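&lt;p&gt;The selection step can be sketched with the standard greedy algorithm for facility-location maximization. The example below is a minimal, dependency-free illustration that takes an already refined relevance map as input and omits the smoothing and polarization steps. Weighting each token&amp;rsquo;s coverage by its relevance, clipping negative cosine similarities to zero, and the helper names are assumptions of this sketch and may differ from the objective used in the paper.&lt;/p&gt;

```python
import math


def normalize(v):
    # Scale a feature vector to unit length (zero vectors are left as-is).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def facility_location_select(features, relevance, budget):
    """Hypothetical helper, not the authors' implementation.

    Greedily picks `budget` token indices that maximize
        F(S) = sum_i relevance[i] * max_{j in S} sim(i, j),
    where sim is the cosine similarity clipped at zero.
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    sim = [[max(0.0, dot(feats[i], feats[j])) for j in range(n)]
           for i in range(n)]
    covered = [0.0] * n  # best similarity of each token to the selection
    selected = []
    for _ in range(min(budget, n)):
        def gain(j):
            # Marginal increase of F when token j joins the selection.
            return sum(relevance[i] * max(0.0, sim[i][j] - covered[i])
                       for i in range(n))
        candidates = [j for j in range(n) if j not in selected]
        best = max(candidates, key=gain)
        selected.append(best)
        covered = [max(covered[i], sim[i][best]) for i in range(n)]
    return selected


# Toy input: tokens 0 and 1 are duplicates, token 2 shows different content.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
relevance = [0.9, 0.8, 0.3]
selected = facility_location_select(features, relevance, budget=2)
print(selected)
```

&lt;p&gt;With a budget of two, ranking by relevance alone would keep the two duplicate tokens, whereas the greedy facility-location pass returns tokens 0 and 2, because the second duplicate adds no coverage once the first is selected.&lt;/p&gt;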
&lt;!-- TODO: Add the final method diagram and, if useful, a short pseudocode block adapted for the project page. --&gt;
&lt;h3 id="results"&gt;Results&lt;/h3&gt;
&lt;h4 id="results-on-llava-15"&gt;Results on LLaVA-1.5&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/llava_1p5_7b.png"
alt="Results on LLaVA-1.5"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-llava-16"&gt;Results on LLaVA-1.6&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/llava_1p6_7b.png"
alt="Results on LLaVA-1.6"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-qwen25-vl"&gt;Results on Qwen2.5-VL&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/qwen2p5_vl_7b.png"
alt="Results on Qwen2.5-VL"&gt;
&lt;/figure&gt;
&lt;h4 id="results-on-llava-video"&gt;Results on LLaVA-Video&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/llava_video_7b.png"
alt="Results on LLaVA-Video" width="75%"&gt;
&lt;/figure&gt;
&lt;h4 id="efficiency-analysis"&gt;Efficiency Analysis&lt;/h4&gt;
&lt;figure class="eadp-figure"&gt;&lt;img src="https://xuankunyang.github.io/project/EADP/efficiency_analysis.png"
alt="Efficiency analysis" width="75%"&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;More results are provided in our paper.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id="citation"&gt;Citation&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bibtex" data-lang="bibtex"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nc"&gt;@inproceedings&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;eadp2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;author&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Author A and Xuankun Yang and Author C and Author D and Author E}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;booktitle&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{European Conference on Computer Vision (ECCV)}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;year&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{2026}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;note&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Placeholder citation. Replace with the official camera-ready metadata.}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item></channel></rss>