🔗 Source: arXiv

LATENTLENS: Revealing Highly Interpretable Visual Tokens in LLMs

🚀 Technical Novelty

  • Mechanism: Training-free. Each visual token's latent representation is mapped to a natural-language description by retrieving its nearest neighbors from a large corpus of contextualized text embeddings taken from intermediate LLM layers.
  • Nuance: Replaces static lookups against the unembedding matrix (LogitLens) or the input vocabulary embeddings (EmbeddingLens) with dynamic, sentence-level contextual references, which capture richer semantic alignment and avoid subword/punctuation noise.
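
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: cosine similarity as the distance measure, the function name nearest_descriptions, and the three-dimensional toy vectors are all assumptions made for the example. A real index would hold contextualized embeddings extracted from the target LLM.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def nearest_descriptions(visual_tokens, index_embeddings, index_texts, k=3):
    """Return, per visual token, the k most similar index texts with their scores.

    Assumption: cosine similarity is the ranking metric (the summary above
    only says "nearest neighbors").
    """
    v = l2_normalize(np.asarray(visual_tokens, dtype=np.float64))
    e = l2_normalize(np.asarray(index_embeddings, dtype=np.float64))
    sims = v @ e.T                          # shape: (num_tokens, num_entries)
    k = min(k, e.shape[0])
    top = np.argsort(-sims, axis=1)[:, :k]  # indices sorted by descending similarity
    return [
        [(index_texts[j], float(sims[i, j])) for j in top[i]]
        for i in range(v.shape[0])
    ]

# Hypothetical toy index: three sentences with made-up 3-d "embeddings".
index_texts = ["a dog running on grass", "a red stop sign", "blue sky with clouds"]
index_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])

# Two hypothetical visual-token vectors living in the same space.
visual_tokens = np.array([[0.9, 0.1, 0.0],
                          [0.0, 0.2, 0.8]])

results = nearest_descriptions(visual_tokens, index_embeddings, index_texts, k=2)
for i, neighbors in enumerate(results):
    print(f"token {i}:", neighbors)
```

Unlike a lookup against a fixed vocabulary matrix, the candidates here are whole sentences, so each visual token is described by context rather than by a single subword.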

💡 Yield

  • Evaluated across 10 VLMs, LATENTLENS renders 72% of visual tokens interpretable versus 30% (EmbeddingLens) and 23% (LogitLens).
  • Identifies a “Mid-Layer Leap”: visual tokens at early layers align more closely with mid-layer semantic representations of text than with lexical (input-level) embeddings.
  • Provides consistent, full-sentence descriptions for visual patches across all LLM depths, including accurately decoding rendered text.
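
One way the layer-alignment observation could be probed is to score the same visual tokens against a separate text index per LLM layer and see which layer matches best. The sketch below is hypothetical: the scoring rule (mean of each token's best cosine similarity), the function name best_matching_layer, the layer numbers 0/16/31, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def _unit(x):
    """Convert to float64 and scale each row to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

def best_matching_layer(visual_tokens, layer_indexes):
    """Score every layer's text index against the visual tokens.

    layer_indexes maps a layer number to an array of text embeddings from
    that layer. The score for a layer is the mean, over visual tokens, of
    the highest cosine similarity to any entry in that layer's index.
    """
    v = _unit(visual_tokens)
    scores = {}
    for layer, embeddings in layer_indexes.items():
        sims = v @ _unit(embeddings).T
        scores[layer] = float(sims.max(axis=1).mean())
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data, constructed so that the middle layer matches best.
visual_tokens = [[1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0]]
layer_indexes = {
    0:  [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],    # stand-in for lexical embeddings
    16: [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]],    # stand-in for a mid layer
    31: [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]],  # stand-in for a late layer
}

best, scores = best_matching_layer(visual_tokens, layer_indexes)
print("best layer:", best)
print("per-layer scores:", scores)
```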

⚠️ Limitations

  • Requires precomputing and storing a massive index of contextualized embeddings from the target LLM, which limits real-time deployment.
  • Functions primarily as an interpretability/analysis framework rather than a method for improving model performance or training efficiency.
  • Interpretability metrics rely on external VLM-judge evaluations, which may introduce subjective bias in assessing semantic alignment.
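
To give a feel for the storage limitation, the index size can be estimated with simple arithmetic. Every number below is a hypothetical placeholder (entry count, hidden size, precision, and number of stored layers are not figures from the paper), and the sketch assumes one dense vector per entry per stored layer with no compression.

```python
def index_size_bytes(num_entries, hidden_dim, bytes_per_value, num_layers):
    """Raw size of a dense embedding index: one vector per entry per layer."""
    return num_entries * hidden_dim * bytes_per_value * num_layers

# Hypothetical configuration: 1M text entries, 4096-d hidden states,
# float16 storage (2 bytes per value), embeddings kept for 8 layers.
size = index_size_bytes(
    num_entries=1_000_000,
    hidden_dim=4096,
    bytes_per_value=2,
    num_layers=8,
)
size_gb = size / 1e9
print(f"{size_gb:.1f} GB")
```

Under these assumed values the raw vectors alone occupy tens of gigabytes, before any nearest-neighbor search structure is added.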