Task-Aware Multimodal ICL
🔗 Source: arXiv
Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-Aware Demonstrations
🚀 Technical Novelty
- Mechanism: SabER deploys a lightweight decoder-only transformer with task-aware attention to autoregressively select and order in-context demonstrations (ICDs), iteratively refining cross-modal feature extraction and explicit task mapping.
- Nuance: Unlike prior heuristic or similarity-based retrieval methods that treat selection and ordering as disjoint steps, SabER jointly optimizes ICD configuration end-to-end using explicit task-specific queries and hierarchical masking to resolve modality misalignment and complex input-output mappings.
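To make the autoregressive idea concrete, here is a minimal sketch of selecting and ordering ICDs one at a time, with each pick conditioned on the query, a task signal, and the picks made so far. It is not SabER's implementation: the learned decoder-only transformer is replaced by a hand-written cosine score, and every name in it (`select_icds`, `query`, `task`, `candidates`, `k`) is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def select_icds(query, task, candidates, k):
    """Greedily build an ordered ICD sequence of length at most k.

    query, task: embedding vectors (assumed to come from a frozen encoder).
    candidates:  dict mapping a demonstration id to its embedding.
    The cosine score below is a stand-in for a learned scoring model.
    """
    remaining = dict(candidates)
    sequence = []
    # The initial state mixes the query with the explicit task signal.
    state = normalize([q + t for q, t in zip(query, task)])
    while remaining and len(sequence) < k:
        # Score every unused candidate against the current state.
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda name: dot(state, normalize(remaining[name])),
        )
        sequence.append(best)
        chosen = normalize(remaining.pop(best))
        # Fold the pick into the state so the next choice depends on it.
        state = normalize([s + c for s, c in zip(state, chosen)])
    return sequence


# Toy usage with 2-dimensional embeddings.
demo_pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
ordered = select_icds([1.0, 0.0], [0.0, 0.0], demo_pool, k=2)
```

Because the order of picks is the order of the returned list, selection and ordering come out of a single loop rather than two separate stages, which is the property the Nuance bullet highlights.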
💡 Yield
- Consistently outperforms prior state-of-the-art methods across 5 LVLMs and 9 benchmarks (gains of +2.00% to +9.26%), with the largest improvements on complex VQA tasks (+8.41% on VizWiz).
- Provides empirical evidence that Task Recognition (TR) dominates Task Learning (TL) in LVLMs, and that explicit task semantics (queries) matter more than ground-truth labels for guiding ICL.
- Ablations indicate that hierarchical layer interaction and binary gating modules are the best-performing design choices for cross-modal grounding, avoiding both overfitting and feature loss.
⚠️ Limitations
- A trade-off exists between instruction detail and task recognition accuracy: overly verbose prompts can skew semantics and hinder model convergence.
- Performance gains are smaller on simpler tasks (e.g., image captioning) than on complex, composition-heavy benchmarks.
- Relies on fixed pre-trained CLIP encoders for initial embedding generation, which may limit adaptability to novel visual domains without re-encoding or adapter fine-tuning.