🔗 Source: arXiv

Learning to Select Visual In-Context Demonstrations

Mechanism: A Dueling DQN agent equipped with a query-centric Transformer Decoder builds the demonstration set one example at a time, trained to maximize downstream MLLM accuracy (the reward is the negative mean absolute error, MAE).
Nuance: Shifts from static, similarity-first kNN retrieval to a dynamic RL policy that explicitly trades visual redundancy for label-space diversity and regression boundary coverage.
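
The contrast between static kNN retrieval and sequential, value-based selection can be illustrated with a minimal sketch. It is not the paper's implementation: the query-centric Transformer Decoder is replaced by a hypothetical `toy_score_fn` that hand-codes a similarity-minus-redundancy score, cosine similarity for the kNN baseline is an assumption, and the MLLM call that would produce the predictions fed to the reward is omitted. Only the dueling aggregation Q = V + (A - mean(A)) and the negative-MAE reward follow from standard definitions.

```python
import numpy as np

def dueling_q(value, advantages):
    """Standard dueling aggregation: Q = V + (A - mean(A))."""
    return value + advantages - advantages.mean()

def negative_mae_reward(predictions, targets):
    """Reward = negated mean absolute error of the downstream predictions."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return -float(np.mean(np.abs(predictions - targets)))

def knn_select(query, pool, k):
    """Static baseline (assumed cosine similarity): top-k most similar items."""
    q = query / np.linalg.norm(query)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return [int(i) for i in np.argsort(-(p @ q))[:k]]

def toy_score_fn(query, pool, chosen, redundancy_weight=2.0):
    """Hypothetical stand-in for the learned network (NOT the paper's model).

    Returns (state value, per-candidate advantages). Candidates similar to
    the query score higher; candidates similar to already-chosen items are
    penalized, mimicking a diversity-seeking policy.
    """
    sim = pool @ query
    if chosen:
        redundancy = (pool @ pool[chosen].T).max(axis=1)
    else:
        redundancy = np.zeros(len(pool))
    return 0.0, sim - redundancy_weight * redundancy

def select_sequentially(query, pool, k, score_fn=toy_score_fn):
    """Build the demonstration set one item at a time via greedy argmax-Q."""
    chosen = []
    for _ in range(k):
        value, advantages = score_fn(query, pool, chosen)
        q_values = dueling_q(value, advantages)
        if chosen:
            q_values[chosen] = -np.inf  # mask already-selected candidates
        chosen.append(int(np.argmax(q_values)))
    return chosen

# Toy pool: items 0 and 1 are near-duplicates, item 2 is visually different.
pool = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
query = np.array([1.0, 0.0])
knn_choice = knn_select(query, pool, k=2)          # picks the two look-alikes
seq_choice = select_sequentially(query, pool, k=2)  # skips the near-duplicate
```

In this toy setup kNN returns the two near-duplicate items, while the sequential selector takes the closest item first and then prefers the dissimilar one. During training, a reward such as `negative_mae_reward` would be computed from the MLLM's predictions given the selected set; here it is shown only as a standalone function.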

LSD significantly outperforms kNN on objective visual regression benchmarks (e.g., UTKFace, KADID-10k), while kNN remains optimal for subjective preference tasks.
The learned policy generalizes across unseen MLLMs (Gemma, Qwen, Phi) and exhibits emergent label-awareness despite receiving no explicit label supervision during training.
Empirical analysis confirms the set of selected demonstrations matters far more than their sequential order for downstream performance.

The diversity-seeking policy can introduce unnecessary variance for subjective preference tasks, where strict visual similarity (kNN) remains superior.
Performance gains are task-dependent, and the reward must be designed carefully so that the policy does not over-optimize for specific regression boundaries at the expense of general ICL robustness.