🔗 Source: arXiv

More Images, More Problems? A Controlled Analysis of VLM Failure Modes.

🚀 Technical Novelty

  • Mechanism: Procedural synthetic data generation combined with layer-wise causal attention masking that restricts vision tokens to intra-image attention during fine-tuning.
  • Nuance: Unlike prior work that repurposes existing datasets, this approach isolates individual reasoning failures (aggregation, tracking, distractors) in a controlled testbed and constrains attention during fine-tuning to curb spurious cross-image interactions.
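
The masking idea in the Mechanism bullet can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `TEXT` marker, the function names, and the choice to let text tokens attend causally to everything (while vision tokens see only earlier tokens of their own image) are all assumptions made for the example. `mask_for_layer` is a hypothetical helper showing how the restriction might be applied only at selected layers.

```python
TEXT = -1  # hypothetical marker for non-vision (text) tokens


def causal_mask(n):
    """Standard causal mask: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def intra_image_mask(image_ids):
    """Causal mask in which vision tokens attend only within their own image.

    image_ids[k] is the index of the image that token k belongs to,
    or TEXT for a text token (an assumed encoding for this sketch).
    """
    mask = causal_mask(len(image_ids))
    for i, query_image in enumerate(image_ids):
        if query_image == TEXT:
            continue  # text tokens keep the plain causal pattern
        for j in range(i + 1):
            # Vision query: keep only keys from the same image.
            mask[i][j] = image_ids[j] == query_image
    return mask


def mask_for_layer(layer, restricted_layers, image_ids):
    """Use the restricted mask only at the chosen layers (assumed scheme)."""
    if layer in restricted_layers:
        return intra_image_mask(image_ids)
    return causal_mask(len(image_ids))


# Two images with two vision tokens each, followed by one text token.
image_ids = [0, 0, 1, 1, TEXT]
mask = intra_image_mask(image_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy sequence the tokens of the second image cannot see the first image, while the trailing text token can still read both, which is one plausible way to suppress cross-image interaction among vision tokens without hiding any image from the language side.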

💡 Yield

  • Reports new state-of-the-art results on the MuirBench, Blink, MMIU, and MIMIC benchmarks; cuts computational cost by ~81% (measured in FLOPs) while improving multi-concept tracking and information aggregation.

⚠️ Limitations

  • The benchmark draws only on MS-COCO, a restriction chosen to isolate controlled variables; adaptive resolution strategies for fine-grained pixel tasks are unexplored; validation focuses primarily on open-weight architectures.