arXiv 2026

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Selim Kuzucu2,†, Alessio Tonioni1, Vasile Lup1, Bernt Schiele1, Federico Tombari1,3, Muhammad Ferjad Naeem1

1Google, 2Max Planck Institute for Informatics, SIC, 3Technical University of Munich

†Work done while interning at Google.

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, and because attention cost grows quadratically with sequence length, these visual tokens become a computational bottleneck at inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression induces spectral aliasing that obscures fine-grained detail, while query-only compression sacrifices explicit grid-aligned spatial structure. PARCEL resolves this conflict by establishing pooled spatial tokens as low-frequency layout anchors and conditioning elastic query tokens on these anchors through Pool-Conditioned Query Resampling. Across 27 benchmarks, PARCEL improves the performance-efficiency Pareto frontier over prior elastic baselines while preserving the “train once, deploy anywhere” paradigm.

Teaser

Elastic inference across tasks and visual-token budgets.

A single PARCEL checkpoint can be deployed at multiple operating points, spanning aggressive 16-token compression up to richer 256-token inference.

Overview of elastic inference. PARCEL supports a single training recipe while exposing multiple deployment operating points, from severe 16-token compression to richer 256-token inference.

Overview

PARCEL separates spatial anchoring from semantic exploration.

Existing elastic LVLM compression methods typically fail by overcommitting to one of these roles. PARCEL explicitly splits them across pooled anchors and conditioned query tokens.

Observation

Existing elastic LVLM compression methods fail in opposite ways under aggressive budgets. Spatial-only pooling blurs detail through aliasing, while query-only resampling weakens spatial relationships.

Design

PARCEL divides the work explicitly: pooled 2D anchors preserve low-frequency layout, and pool-aware query tokens recover complementary visual detail from the full feature grid.

Effect

Across 27 benchmarks, PARCEL shifts the Pareto frontier, outperforming M3 and MQT at matched budgets while retaining the practical “train once, deploy anywhere” story.

Method

Pool-Conditioned Query Resampling conditions queries on spatial anchors.

The pooled anchors preserve the low-frequency layout, so the explorer query tokens can spend their capacity on complementary visual detail.

Pool-Conditioned Query Resampling in motion. The anchor tokens establish spatial structure first; conditioned queries then become targeted semantic explorers instead of redundant layout encoders.

01

2D Anchors

Budget-aware pooling creates deterministic spatial anchors that preserve low-frequency geometry and explicit layout.

02

Pool-Aware Queries

Query tokens first attend to the anchors, so they no longer need to infer coarse layout from scratch.

03

Semantic Explorers

The conditioned queries cross-attend to the full ViT features to recover complementary details that the pooled grid drops.

04

Budget-Aware Routing

PARCEL dynamically scales anchor size and query count so the model stays effective across both severe and generous budgets.
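
The first three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: it assumes average pooling for the anchors, single-head attention without learned projections, residual updates on the queries, and a final token set formed by concatenating anchors and queries. The helper names (`pool_anchors`, `attend`, `resample`) and the random inputs are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q: np.ndarray, kv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention, no learned projections (assumption)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv


def pool_anchors(feats: np.ndarray, grid: int) -> np.ndarray:
    """Average-pool an (H, W, D) feature map into grid*grid anchor tokens.

    Assumes H and W are divisible by `grid`.
    """
    h, w, d = feats.shape
    blocks = feats.reshape(grid, h // grid, grid, w // grid, d)
    return blocks.mean(axis=(1, 3)).reshape(grid * grid, d)


def resample(feats: np.ndarray, queries: np.ndarray, grid: int) -> np.ndarray:
    anchors = pool_anchors(feats, grid)            # 01: 2D anchors
    q = queries + attend(queries, anchors)         # 02: queries read the anchors first
    flat = feats.reshape(-1, feats.shape[-1])      # full feature grid as a token list
    q = q + attend(q, flat)                        # 03: queries explore the full features
    return np.concatenate([anchors, q], axis=0)    # anchors followed by queries


rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 32))   # stand-in for ViT patch features
queries = rng.normal(size=(16, 32))     # stand-in for learned query tokens
tokens = resample(feats, queries, grid=4)
print(tokens.shape)  # (32, 32): 16 anchors + 16 queries, each of width 32
```

In this sketch the anchors are a deterministic function of the features, while the queries are the only part that would carry learned parameters. That mirrors the division described above, although a trained model would also include projection weights and normalisation.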

Budget-Aware Routing

The anchor-query split is determined by a piecewise routing strategy.

PARCEL balances spatial anchoring and semantic exploration according to the available budget, defining two distinct routing regimes that never exceed the allocated tokens.

Budget-aware piecewise routing. PARCEL dynamically determines the resolution of the spatial anchor and the number of complementary query tokens based on the budget constraints.

16 ≤ B < 64 Tokens

Low Budgets

Visual features are pooled into a 4x4 spatial anchor grid to preserve minimal layout, yielding 16 structural tokens. The remaining budget (B - 16) is allocated to pool-conditioned query tokens.

64 ≤ B ≤ 256 Tokens

Medium-to-High Budgets

The model upgrades to an 8x8 spatial anchor grid (64 tokens) for a richer spatial base. Any additional budget (B - 64) is assigned to the complementary query pathway to recover high-frequency details.
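
The two regimes above amount to a small piecewise function of the budget B. The following is a minimal sketch that reads the rule literally; the names `Route` and `route_budget` are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Route(NamedTuple):
    grid: int          # side length of the pooled anchor grid
    num_anchors: int   # grid * grid pooled spatial tokens
    num_queries: int   # pool-conditioned query tokens


def route_budget(budget: int) -> Route:
    """Split a visual-token budget B into anchor tokens and query tokens."""
    if not 16 <= budget <= 256:
        raise ValueError("budget must lie in [16, 256]")
    grid = 4 if budget < 64 else 8   # 4x4 anchors below 64 tokens, 8x8 from 64 upward
    num_anchors = grid * grid
    return Route(grid, num_anchors, budget - num_anchors)


for b in (16, 32, 64, 256):
    print(b, route_budget(b))
```

Read this way, the query pathway receives no tokens at B = 16 and at B = 64, where the anchor grid alone fills the budget, and it receives 192 tokens at B = 256.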

Spectral Analysis

Frequency analysis highlights the "division of labor".

The architecture is not only intuitive at the design level; its anchor-query decomposition is also reflected in measurable spectral behavior on ChartQA samples.

Baseband concentration. PARCEL concentrates compressed spatial tokens in the low frequencies more cleanly than M3 on ChartQA samples, indicating a cleaner low-frequency baseband for spatial anchoring.

Spectral disentanglement. On ChartQA, the pool tokens capture layout while the query pathway keeps access to higher-frequency detail beyond the pooled grid.

These visualizations are drawn from ChartQA samples, and the same separation is reflected in downstream performance gains of +4.7 points at 64 tokens and +3.4 points at 256 tokens over M3.
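
One way to make "baseband concentration" concrete is to measure how much of a token map's 2D spectral energy lies near the zero-frequency component. The sketch below assumes that definition, which may differ from the metric used in the paper; the function name, the radius threshold, and the synthetic inputs are all hypothetical, and it handles a single-channel map (a feature map would be processed per channel and averaged).

```python
import numpy as np


def baseband_concentration(token_map: np.ndarray, radius: float) -> float:
    """Fraction of 2D spectral energy within `radius` bins of the DC component."""
    spectrum = np.fft.fftshift(np.fft.fft2(token_map))   # move DC to the centre
    energy = np.abs(spectrum) ** 2
    h, w = token_map.shape
    fy = np.arange(h) - h // 2                            # vertical frequency index
    fx = np.arange(w) - w // 2                            # horizontal frequency index
    dist = np.hypot(fy[:, None], fx[None, :])             # distance from DC
    return float(energy[dist <= radius].sum() / energy.sum())


x = np.linspace(0.0, 1.0, 16, endpoint=False)
smooth = np.cos(2 * np.pi * x)[:, None] + np.cos(2 * np.pi * x)[None, :]
i, j = np.indices((16, 16))
checker = (-1.0) ** (i + j)                               # highest-frequency pattern

print(baseband_concentration(smooth, radius=2))    # close to 1
print(baseband_concentration(checker, radius=2))   # close to 0
```

A smooth layout-like map keeps almost all of its energy inside the low-frequency disc, while a checkerboard keeps almost none. Under this assumed metric, that is the kind of contrast a cleaner baseband would show.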

Results

PARCEL improves the efficiency-retention frontier across 27 benchmarks.

Across 27 benchmarks, PARCEL improves the aggregate retention-efficiency trade-off while remaining compatible with the train-once, deploy-anywhere elastic inference setup.

Aggregate retention-efficiency trade-off. PARCEL improves mean retention over 27 benchmarks while scaling cleanly across image and video budgets.

Dense Recognition

Up to +8.9 over baselines

Across the RefCOCO suite, PARCEL improves over both MQT and M3 baselines, with gains reaching up to +8.9 points over M3.

Video Aggregate

Up to +4.4 over baselines

PARCEL reaches 98.0%, 97.9%, and 95.0% retention on the video aggregate, compared to second-best baseline retentions of 94.4%, 93.5%, and 94.0%.

Resolution-Sensitive

Up to +3.8 over baselines

On the top-3 resolution-sensitive tasks, including ChartQA and DocVQA, PARCEL reaches 90.8%, 90.0%, and 77.1% retention across 256, 64, and 16 tokens, while the second-best baseline reaches 90.0%, 86.2%, and 75.5%.
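
The "+3.8" headline follows directly from the retention figures quoted above. The snippet below is only a worked check of that arithmetic, using the numbers stated in this section; the variable names are illustrative.

```python
budgets = (256, 64, 16)
parcel = (90.8, 90.0, 77.1)        # PARCEL retention (%), as quoted above
second_best = (90.0, 86.2, 75.5)   # second-best baseline retention (%), as quoted above

# Per-budget margin, rounded to one decimal to absorb floating-point noise.
gains = {b: round(p - s, 1) for b, p, s in zip(budgets, parcel, second_best)}
best_budget = max(gains, key=gains.get)

print(gains)                            # {256: 0.8, 64: 3.8, 16: 1.6}
print(best_budget, gains[best_budget])  # 64 3.8
```

The largest margin falls at the 64-token budget, with smaller margins at 256 and 16 tokens.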

ChartQA (Human)

Up to +2.9 over baselines

PARCEL improves over the second-best baseline by +1.6, +2.9, and +1.3 points at 256, 64, and 16 tokens.

MSR-VTT-Cap

Up to +4.4 over baselines

PARCEL improves over the second-best baseline by +1.8, +4.4, and +1.0 points at 256, 64, and 16 tokens.

BibTeX

Citation

If you find PARCEL useful in your work, please cite the paper using the BibTeX entry below.

@article{kuzucu2026parcel,
  title   = {PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding},
  author  = {Kuzucu, Selim and Tonioni, Alessio and Lup, Vasile and Schiele, Bernt and Tombari, Federico and Naeem, Muhammad Ferjad},
  journal = {arXiv preprint arXiv:2605.30126},
  year    = {2026}
}