Highlights
- Our FlashVID significantly outperforms previous state-of-the-art acceleration frameworks (e.g., VisionZip, FastVID) across three representative VLLMs (i.e., LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL) on five widely used video understanding benchmarks (i.e., VideoMME, EgoSchema, LongVideoBench, MVBench, MLVU).
- FlashVID can serve as a training-free and plug-and-play module for extending the video frame input, enabling a 10x increase in the number of frames fed to Qwen2.5-VL and a relative improvement of 8.6% within the same computational budget.
- Existing efficient Video LLM methods often independently compress spatial and temporal redundancy, overlooking the intrinsic spatiotemporal relationships in videos. To address this, we present a simple yet effective solution: Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy compression.
Abstract
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks their spatiotemporal relationships and thereby leads to suboptimal compression. Due to the dynamic nature of video, highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending the video frame input, which enables a 10x increase in video frames for Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang-v/FlashVID.
Motivation
Temporal redundancy is not bound to fixed spatial locations. Semantically consistent elements in videos often shift in spatial position, scale, or appearance due to motion and scene dynamics, making rigid spatial correspondence across frames unreliable.

Spatial and temporal redundancy are inherently coupled. Redundant regions within a single frame frequently persist across multiple frames. Decoupled spatiotemporal redundancy compression overlooks the intrinsic spatiotemporal relationships, leading to suboptimal compression.
Method
Illustration of FlashVID. Visual tokens are compressed by two synergistic modules:
- Attention and Diversity-based Token Selection (ADTS) prioritizes spatiotemporally informative tokens while ensuring feature diversity by solving a calibrated Max-Min Diversity Problem (MMDP);
- Tree-based Spatiotemporal Token Merging (TSTM) models redundancy with spatiotemporal redundancy trees, which effectively capture fine-grained video dynamics. Each redundancy tree is aggregated into a single token representation.
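To make the ADTS idea concrete, the sketch below shows a generic greedy approximation to the Max-Min Diversity Problem: seed with the most attention-salient token, then repeatedly pick the token whose minimum distance to the already-selected set, weighted by saliency, is largest. The product weighting of distance and saliency is an assumption for illustration, not the paper's exact calibration.

```python
import numpy as np

def greedy_max_min_select(features: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Greedily select k tokens that are both salient and mutually diverse.

    features: (N, D) token features; scores: (N,) attention-based saliency.
    Returns the indices of the selected tokens. Generic greedy MMDP sketch;
    the calibrated objective in ADTS may differ.
    """
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = [int(np.argmax(scores))]  # seed with the most salient token
    # minimum cosine distance from every token to the selected set
    min_dist = 1.0 - feats @ feats[selected[0]]
    for _ in range(k - 1):
        gain = min_dist * scores        # assumed diversity-saliency weighting
        gain[selected] = -np.inf        # never re-pick a selected token
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - feats @ feats[nxt])
    return np.array(selected)
```

Greedy selection runs in O(Nk) similarity computations, which keeps the module lightweight enough for training-free inference-time use.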
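The TSTM step can be sketched as follows: each token is attached to its most similar token in the previous frame when cosine similarity exceeds a threshold, chains of such attachments form redundancy trees, and each tree is mean-pooled into one token. The similarity threshold, parent rule, and mean-pooled aggregation are all assumptions for illustration; the paper's exact tree construction may differ.

```python
import numpy as np

def tree_merge(tokens: np.ndarray, frame_ids: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Merge spatiotemporally redundant tokens via redundancy trees (sketch).

    tokens: (N, D) token features; frame_ids: (N,) frame index per token.
    Returns one mean-pooled token per redundancy tree.
    """
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    root = np.arange(len(tokens))  # union-find style parent pointers

    def find(i: int) -> int:
        while root[i] != i:
            root[i] = root[root[i]]  # path compression
            i = root[i]
        return i

    for f in np.unique(frame_ids)[1:]:
        cur = np.where(frame_ids == f)[0]
        prev = np.where(frame_ids == f - 1)[0]
        if len(prev) == 0:
            continue
        sim = feats[cur] @ feats[prev].T  # cosine similarity to previous frame
        best = np.argmax(sim, axis=1)
        for li, gi in enumerate(cur):
            if sim[li, best[li]] >= tau:   # redundant: attach to parent's tree
                root[gi] = find(int(prev[best[li]]))

    roots = np.array([find(i) for i in range(len(tokens))])
    return np.stack([tokens[roots == r].mean(axis=0) for r in np.unique(roots)])
```

Because each tree collapses to a single token, a static region that persists across many frames costs one token instead of one per frame, while tokens below the similarity threshold survive untouched.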
Visualization of TSTM (Example 1)
Visualization of TSTM (Example 2)
Visualization of TSTM (Example 3)
Experiments
BibTeX
@inproceedings{fan2026flashvid,
title={FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging},
author={Fan, Ziyang and Chen, Keyu and Xing, Ruilong and Li, Yulin and Jiang, Li and Tian, Zhuotao},
booktitle={Proceedings of the 14th International Conference on Learning Representations},
year={2026}
}