LatentLens Interpretable Visual Tokens

🔗 Source: arXiv

LATENTLENS: Revealing Highly Interpretable Visual Tokens in LLMs

🚀 Technical Novelty

Mechanism: Maps latent visual token representations to natural language descriptions by finding nearest neighbors in a large indexed corpus of contextualized text embeddings across intermediate LLM layers.
Nuance: Replaces vocabulary-bound subword matching (LogitLens/EmbeddingLens) with full-sentence contextual comparisons, uncovering that early visual tokens align with mid-layer semantic representations rather than lexical inputs.
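The lookup described above can be sketched as a cosine-similarity nearest-neighbor search. This is a minimal illustration, not the paper's implementation: the function names (`build_index`, `nearest_contexts`), the brute-force search, the 3-dimensional toy vectors, and the example sentences are all assumptions made for demonstration. A real index would hold contextual embeddings extracted from an LLM and would likely use an approximate nearest-neighbor library.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def build_index(embeddings_by_layer, contexts_by_layer):
    """Stack per-layer contextual text embeddings into one searchable matrix.

    Hypothetical layout: each row is one token-in-context embedding, and the
    metadata records which layer it came from and the sentence it appeared in.
    """
    vectors, meta = [], []
    for layer, embs in embeddings_by_layer.items():
        vectors.append(l2_normalize(np.asarray(embs, dtype=np.float64)))
        meta.extend((layer, ctx) for ctx in contexts_by_layer[layer])
    return np.concatenate(vectors, axis=0), meta

def nearest_contexts(visual_token, index_vectors, index_meta, k=3):
    """Return the k index entries most similar to one visual token vector."""
    query = l2_normalize(np.asarray(visual_token, dtype=np.float64))
    sims = index_vectors @ query          # cosine similarity to every entry
    top = np.argsort(-sims)[:k]           # indices of the k highest scores
    return [(index_meta[i][0], index_meta[i][1], float(sims[i])) for i in top]

# Toy data (made up): two layers, two contextualized tokens each.
embeddings_by_layer = {
    0: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    8: [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]],
}
contexts_by_layer = {
    0: ["a red [apple]", "the [dog] barked"],
    8: ["a striped [cat] on a sofa", "a [kitten] sleeping"],
}
index_vectors, index_meta = build_index(embeddings_by_layer, contexts_by_layer)

visual_token = [0.1, 0.0, 2.0]            # stand-in for a projected image patch
results = nearest_contexts(visual_token, index_vectors, index_meta, k=2)
for layer, context, score in results:
    print(f"layer {layer}: {context!r} (cosine {score:.3f})")
```

Because every entry keeps its layer and full-sentence context, the result names both a description and the layer at which the visual token best matches, which is what distinguishes this style of lookup from matching against a fixed subword vocabulary.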

💡 Yield

Across 10 VLMs, LATENTLENS renders 72% of visual tokens interpretable, versus 30% and 23% for the prior lenses. It also reveals a “Mid-Layer Leap”: early visual tokens align with mid-layer semantic representations, indicating that frozen LLMs natively align vision with semantic language structures given only a minimal projection.

⚠️ Limitations

Requires constructing and storing a massive index of contextual embeddings, limiting real-time deployment; primarily an interpretability diagnostic tool rather than a training or inference accelerator; focuses on frozen LLMs with simple connectors, leaving complex fine-tuned VLM pipelines less examined.
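To give a feel for the storage cost, a back-of-the-envelope estimate can be sketched as follows. Every figure here (one million indexed tokens, 8 stored layers, hidden size 4096, 2 bytes per value) is a hypothetical placeholder, not a number from the paper, and `index_size_bytes` is an illustrative helper.

```python
def index_size_bytes(num_tokens, num_layers, hidden_dim, bytes_per_value=2):
    """Raw size of a dense index storing one vector per token per layer."""
    return num_tokens * num_layers * hidden_dim * bytes_per_value

# Hypothetical configuration, chosen only to illustrate the scaling.
size = index_size_bytes(
    num_tokens=1_000_000,   # contextualized tokens in the corpus (assumed)
    num_layers=8,           # intermediate layers kept in the index (assumed)
    hidden_dim=4096,        # LLM hidden size (assumed)
    bytes_per_value=2,      # 16-bit floats (assumed)
)
size_gib = size / 2**30
print(f"{size:,} bytes = {size_gib:.2f} GiB")
```

The cost grows linearly in each factor, so indexing more layers or a larger corpus multiplies the footprint directly; compression or approximate search structures would change the constant but not that basic trade-off.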