Visual ICL Demo Selection
🔗 Source: arXiv
Learning to Select Visual In-Context Demonstrations
🚀 Technical Novelty
- Mechanism: A Dueling DQN agent with a query-centric Transformer Decoder builds the demonstration set one example at a time, trained with a reward equal to the negative MAE of the downstream MLLM's predictions.
- Nuance: Shifts from static, similarity-first kNN retrieval to a dynamic RL policy that explicitly trades visual redundancy for label-space diversity and regression boundary coverage.
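The mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dueling aggregation shown is the standard mean-subtracted form, and `score_fn` is a hypothetical stand-in for the Transformer-Decoder Q-network, whose real inputs and outputs are not specified in this summary.

```python
from statistics import mean


def dueling_q(value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    baseline = mean(advantages)
    return [value + a - baseline for a in advantages]


def negative_mae_reward(predictions, targets):
    """Reward is the negative mean absolute error of the MLLM's predictions."""
    return -mean(abs(p - t) for p, t in zip(predictions, targets))


def select_sequentially(candidate_ids, score_fn, k):
    """Greedily build a demonstration set of size k, one pick per step.

    score_fn(selected, remaining) -> (value, advantages) is a hypothetical
    placeholder for the learned Q-network; advantages has one entry per
    remaining candidate.
    """
    selected = []
    remaining = list(candidate_ids)
    for _ in range(min(k, len(remaining))):
        value, advantages = score_fn(selected, remaining)
        q_values = dueling_q(value, advantages)
        best = max(range(len(remaining)), key=q_values.__getitem__)
        selected.append(remaining.pop(best))
    return selected
```

At training time the reward from `negative_mae_reward` would be computed after the MLLM answers the query with the selected demonstrations in context; the greedy loop above only shows the inference-time use of the Q-values.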
💡 Yield
- LSD (the proposed selector) significantly outperforms kNN on objective visual regression benchmarks (e.g., UTKFace, KADID-10k), whereas kNN remains the better choice for subjective preference tasks.
- The learned policy generalizes to unseen MLLMs (Gemma, Qwen, Phi) and exhibits emergent label-awareness despite receiving no explicit label supervision during training.
- Empirical analysis confirms the set of selected demonstrations matters far more than their sequential order for downstream performance.
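One way to check a set-versus-order claim like the last point is to compare the spread of a metric across orderings of one fixed set with the spread across different sets. The sketch below assumes a hypothetical `evaluate(demos)` callable that returns a score (for example, negative MAE) for an ordered list of demonstrations; it is not the paper's evaluation protocol.

```python
from itertools import permutations
from statistics import mean, pstdev


def order_sensitivity(demo_set, evaluate):
    """Mean and spread of the score over every ordering of one fixed set."""
    scores = [evaluate(list(order)) for order in permutations(demo_set)]
    return mean(scores), pstdev(scores)


def set_sensitivity(demo_sets, evaluate):
    """Mean and spread of the score over different demonstration sets."""
    scores = [evaluate(list(demos)) for demos in demo_sets]
    return mean(scores), pstdev(scores)
```

If the spread from `order_sensitivity` is much smaller than the spread from `set_sensitivity`, set composition dominates ordering. Enumerating all permutations is only practical for small sets; sampling orderings would be the natural substitute for larger ones.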
⚠️ Limitations
- The diversity-seeking policy can introduce unnecessary variance for subjective preference tasks, where strict visual similarity (kNN) remains superior.
- Performance gains are task-dependent, requiring careful reward design to avoid over-optimizing for specific regression boundaries rather than general ICL robustness.
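The similarity-versus-diversity trade-off behind these limitations can be made concrete with a kNN baseline and a simple post-hoc diagnostic. This is an illustrative sketch: cosine similarity over embeddings is an assumed choice for the kNN retriever, and `label_coverage` is a hypothetical diagnostic, not a metric taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_select(query, pool, k):
    """Similarity-first baseline: pool is a list of (embedding, label) pairs."""
    ranked = sorted(pool, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]


def label_coverage(selected, label_min, label_max):
    """Fraction of the label range spanned by the selected demonstrations."""
    labels = [label for _, label in selected]
    if not labels or label_max == label_min:
        return 0.0
    return (max(labels) - min(labels)) / (label_max - label_min)
```

A kNN set drawn from near-duplicate images tends to score low on `label_coverage`, which is harmless when similar images share similar subjective ratings but leaves the regression range poorly covered on objective tasks.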