🔗 Source: arXiv

Visual In-Context Learning for Large Vision-Language Models

Mechanism: Implements a three-stage pipeline: (1) visual demonstration retrieval with a ViT encoder, (2) cross-modal reranking with CLIP, and (3) intent-oriented image summarization, which converts each visual example into a concise, task-specific text description. The resulting descriptions are then composed into a language-only demonstration prompt.
Nuance: Unlike standard LVLM ICL, which concatenates raw images (causing token bloat and cross-modal representation mismatches), VICL fully translates demonstrations into language so that the LLM’s native reasoning pathways can process them. The same setup also enables in-context unlearning.
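
The retrieve, rerank, summarize, and compose flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: embeddings are plain float lists standing in for ViT and CLIP outputs, `summarize` is a hypothetical callable standing in for the LVLM's intent-oriented captioning, and the names `Demo`, `retrieve`, `rerank`, and `compose_prompt` are invented for this example.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence


@dataclass
class Demo:
    image_id: str
    image_emb: List[float]  # assumption: stand-in for a ViT image embedding
    text_emb: List[float]   # assumption: stand-in for a CLIP embedding of the demo's description
    label: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_img_emb: Sequence[float], pool: List[Demo], k: int) -> List[Demo]:
    """Stage 1: keep the k demos whose image embedding is closest to the query."""
    return sorted(pool, key=lambda d: cosine(query_img_emb, d.image_emb), reverse=True)[:k]


def rerank(query_clip_emb: Sequence[float], candidates: List[Demo], n: int) -> List[Demo]:
    """Stage 2: reorder candidates by a cross-modal score and keep the top n."""
    return sorted(candidates, key=lambda d: cosine(query_clip_emb, d.text_emb), reverse=True)[:n]


def compose_prompt(
    demos: List[Demo],
    summarize: Callable[[Demo, str], str],  # hypothetical LVLM captioning hook
    intent: str,
    question: str,
) -> str:
    """Stage 3: turn each demo into text and build a text-only demonstration prompt."""
    lines = [f"Task: {intent}"]
    for i, demo in enumerate(demos, start=1):
        lines.append(f"Example {i}: {summarize(demo, intent)} -> {demo.label}")
    lines.append(f"Query: {question}")
    return "\n".join(lines)


# Toy usage with made-up 2-D embeddings.
pool = [
    Demo("a", [1.0, 0.0], [0.9, 0.1], "cat"),
    Demo("b", [0.8, 0.6], [0.2, 0.9], "dog"),
    Demo("c", [0.0, 1.0], [0.0, 1.0], "bird"),
]
candidates = retrieve([1.0, 0.1], pool, k=2)
selected = rerank([0.1, 1.0], candidates, n=2)
prompt = compose_prompt(
    selected,
    summarize=lambda d, intent: f"description of image {d.image_id} for '{intent}'",
    intent="classify the animal",
    question="What animal is shown in the query image?",
)
print(prompt)
```

Passing `summarize` as a callable keeps the sketch runnable without a model. In a real system that hook would call the LVLM, which is where the dependence on captioning accuracy enters.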

Achieves consistent performance gains across five visual reasoning datasets by aligning demonstration semantics with task intent.
Information flow analysis indicates that text-only demonstrations improve cross-modal interaction efficiency and reduce representation disparity.
Demonstrates viable in-context unlearning, allowing models to selectively discard or reset specific knowledge via prompt engineering without parameter updates.

Relies heavily on the quality of external retrieval (ViT/CLIP) and the LVLM’s own captioning accuracy for intent summarization.
In-context unlearning is presented as a promising exploratory finding rather than a rigorously validated standalone technique.
Still subject to standard LLM context window constraints, albeit mitigated by reduced token counts per demonstration.