Hierarchical Vision-LLM Fusion
🔗 Source: arXiv
Hierarchical Pre-Training of Vision Encoders with Large Language Models
🚀 Technical Novelty
- Mechanism: Introduces a hierarchical cross-attention framework that injects multi-level visual features from a vision encoder into an LLM across multiple layers, stabilized by a three-stage progressive training pipeline (projector alignment → joint optimization → end-to-end fine-tuning).
- Nuance: Differs from standard late-fusion or frozen-LLM paradigms (e.g., BLIP-2, Flamingo) by explicitly pre-training the vision encoder through structured cross-modal gradient flow, preserving fine-grained spatial hierarchies rather than collapsing embeddings into a single flattened vector.
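A minimal sketch of the multi-level injection idea, in plain Python so it runs without dependencies. It is an illustration under stated assumptions, not the paper's implementation. The names `cross_attend`, `fuse`, and `injection_map`, the level names, the layer indices, and all sizes are hypothetical. It uses single-head attention with random weights, assumes every visual level has already been projected to the LLM width `D`, and omits the LLM's own self-attention and MLP blocks.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attend(text, visual, w_q, w_k, w_v):
    """Text tokens query visual tokens; the result is added residually."""
    q, k, v = matmul(text, w_q), matmul(visual, w_k), matmul(visual, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    context = matmul(softmax_rows(scores), v)
    return [[t + c for t, c in zip(t_row, c_row)] for t_row, c_row in zip(text, context)]


def fuse(text, visual_levels, injection_map, weights, num_llm_layers):
    """Walk the LLM layers and inject one visual level at each mapped layer."""
    hidden = text
    for layer in range(num_llm_layers):
        # The LLM block itself (self-attention + MLP) would run here; omitted.
        level = injection_map.get(layer)
        if level is not None:
            hidden = cross_attend(hidden, visual_levels[level], *weights[layer])
    return hidden


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D = 8  # assumed shared hidden width after projection
text = rand_matrix(5, D, rng)  # 5 text tokens
visual_levels = {
    "early": rand_matrix(16, D, rng),  # many tokens, fine spatial detail
    "mid": rand_matrix(4, D, rng),
    "late": rand_matrix(1, D, rng),    # few tokens, coarse semantics
}
injection_map = {1: "early", 3: "mid", 5: "late"}  # hypothetical layer choice
weights = {
    layer: tuple(rand_matrix(D, D, rng) for _ in range(3))  # (w_q, w_k, w_v)
    for layer in injection_map
}

fused = fuse(text, visual_levels, injection_map, weights, num_llm_layers=6)
print(len(fused), len(fused[0]))
```

Because each mapped layer attends over a different visual level, gradients in a trainable version would reach the corresponding encoder stage directly instead of passing only through the final embedding. Under the staged pipeline described above, one plausible reading (an assumption here) is that different parameter groups are unfrozen stage by stage.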
💡 Yield
- Consistently outperforms self-attention baselines on image classification (ImageNet, CIFAR) and complex vision-language reasoning benchmarks (MME, GQA, OK-VQA, ScienceQA).
- Delivers a 3× speedup in per-epoch training time and reduces peak GPU memory consumption by 55% while enabling granular gradient propagation into early encoder layers.
⚠️ Limitations
- Ablations on connection density were minimal (“without extensive ablations”), so the optimal integration sparsity is not rigorously quantified.
- Validated only on static images and specific LLM scales; extensions to video and dynamic layer selection are deferred to future work.