LatentLens Interpretable Visual Tokens
🔗 Source: arXiv
LATENTLENS: Revealing Highly Interpretable Visual Tokens in LLMs
🚀 Technical Novelty
- Mechanism: Maps latent visual token representations to natural language descriptions by finding nearest neighbors in a large indexed corpus of contextualized text embeddings across intermediate LLM layers.
- Nuance: Replaces vocabulary-bound subword matching (LogitLens/EmbeddingLens) with full-sentence contextual comparisons, uncovering that early visual tokens align with mid-layer semantic representations rather than lexical inputs.
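The retrieval step described above can be illustrated with a minimal sketch. It is a toy illustration, not the paper's implementation: cosine similarity as the distance metric, the `(layer, text, embedding)` index layout, and every name and vector below are assumptions chosen for readability.

```python
import math
from typing import List, Tuple

Vector = List[float]
# Hypothetical index entry: (LLM layer the embedding came from, source text, embedding)
IndexEntry = Tuple[int, str, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_descriptions(
    visual_token: Vector, index: List[IndexEntry], k: int = 3
) -> List[Tuple[float, int, str]]:
    """Return the k index entries most similar to the visual token,
    as (similarity, layer, text), best match first."""
    scored = [(cosine(visual_token, emb), layer, text) for layer, text, emb in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional data (made up): one lexical-style entry at layer 0
# and two contextual sentence entries at a middle layer.
toy_index: List[IndexEntry] = [
    (0, "dog", [1.0, 0.0, 0.0]),
    (12, "a brown dog running on grass", [0.6, 0.8, 0.0]),
    (12, "a red car parked outside", [0.0, 0.1, 1.0]),
]
visual_token: Vector = [0.5, 0.9, 0.05]

top = nearest_descriptions(visual_token, toy_index, k=2)
for similarity, layer, text in top:
    print(f"layer {layer:>2}  sim={similarity:.3f}  {text}")
```

In this toy setup the mid-layer sentence outranks the layer-0 word, which mirrors the kind of comparison the lens makes. A real index would hold far more entries and would likely need an approximate nearest-neighbor library rather than a linear scan.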
💡 Yield
- Across 10 VLMs, LATENTLENS renders 72% of visual tokens interpretable, versus 30% and 23% for the prior lenses.
- Reveals a “Mid-Layer Leap”: evidence that frozen LLMs natively align vision with semantic language structures through a minimal projection.
⚠️ Limitations
- Requires constructing and storing a massive index of contextual embeddings, which limits real-time deployment.
- Primarily an interpretability diagnostic tool rather than a training or inference accelerator.
- Focuses on frozen LLMs with simple connectors, leaving complex fine-tuned VLM pipelines less examined.
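To give a feel for the storage cost, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions, not figures from the paper: one million indexed embeddings, a hidden size of 4096, 8 indexed layers, and 2 bytes per value (16-bit floats).

```python
def index_size_bytes(
    num_embeddings: int, hidden_dim: int, num_layers: int, bytes_per_value: int = 2
) -> int:
    """Raw size of a dense embedding index, ignoring metadata and
    any compression or quantization."""
    return num_embeddings * hidden_dim * num_layers * bytes_per_value


# Hypothetical configuration, chosen only to illustrate the arithmetic.
size = index_size_bytes(
    num_embeddings=1_000_000, hidden_dim=4096, num_layers=8, bytes_per_value=2
)
size_gib = size / 2**30
print(f"{size:,} bytes = {size_gib:.1f} GiB")
```

Under these assumptions the raw index is about 61 GiB. Size grows linearly in each factor, so indexing more text or more layers quickly pushes it beyond what is practical to hold in memory at inference time.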