🔗 Source: arXiv

Moshi RAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

🚀 Technical Novelty

  • Mechanism: The model emits a special <ret> token during autoregressive speech generation, which asynchronously invokes an external knowledge retrieval system. Retrieval runs in the natural temporal gap between response onset and the key informational content.
  • Nuance: Activates back-end retrieval only when the context demands it, and the result must arrive within a strict ~2-second window. This avoids the fixed-interval calls, synchronous processing bottlenecks, or heavy retraining required by prior full-duplex RAG approaches.
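
The trigger-and-continue pattern described above can be illustrated with a toy simulation. This is a minimal sketch and not the paper's implementation: tokens are plain strings, `fake_retrieve`, `generate`, and `FRAME_SECONDS` are hypothetical names, and the timings are scaled down from the ~2-second budget so the example runs quickly.

```python
import asyncio

RET_TOKEN = "<ret>"
FRAME_SECONDS = 0.01  # assumed, scaled-down stand-in for one generation step


async def fake_retrieve(query: str, delay: float) -> str:
    """Hypothetical backend: waits `delay` seconds, then returns a fact."""
    await asyncio.sleep(delay)
    return f"[fact for: {query}]"


async def generate(tokens, retrieval_delay: float, budget: float):
    """Emit tokens one per frame; start retrieval on <ret> without blocking."""
    loop = asyncio.get_running_loop()
    output = []
    pending = None   # in-flight retrieval task, if any
    deadline = None  # time after which the retrieval result is abandoned
    for tok in tokens:
        if tok == RET_TOKEN and pending is None:
            # Launch retrieval in the background and keep generating.
            pending = asyncio.create_task(
                fake_retrieve("user question", retrieval_delay)
            )
            deadline = loop.time() + budget
            continue
        output.append(tok)
        await asyncio.sleep(FRAME_SECONDS)  # simulate one frame of speech
        if pending is not None:
            if pending.done():
                # Result arrived in time: make it available to later steps.
                output.append(pending.result())
                pending = None
            elif loop.time() > deadline:
                # Budget exceeded: give up and continue without retrieval.
                pending.cancel()
                pending = None
    if pending is not None:
        pending.cancel()
    return output


if __name__ == "__main__":
    stream = ["Well", ",", RET_TOKEN, "the", "answer", "is", "..."]
    print(asyncio.run(generate(stream, retrieval_delay=0.005, budget=0.2)))
```

In this sketch the opening tokens are spoken before any retrieved text appears, which mirrors the gap between response onset and informational content; a backend slower than the budget is simply dropped rather than stalling the stream.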

💡 Yield

  • Achieves factuality parity with top publicly released non-duplex speech LMs while preserving real-time interactivity metrics on full-duplex benchmarks.
  • Enables plug-and-play integration of diverse retrieval backends (e.g., web search, LLMs) at inference time without retraining the base model.
  • Demonstrates strong zero-shot generalization to out-of-domain mathematical reasoning tasks by effectively leveraging external tools for complex problem-solving.
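
One way to picture the plug-and-play claim is a common callable interface that any backend can satisfy, so backends are swapped at inference time without touching the model. The sketch below is an assumption for illustration only: `Retriever`, `RetrievalRouter`, `calculator_backend`, and `make_lookup_backend` are hypothetical names, with a small arithmetic evaluator standing in for an external math tool and a dictionary standing in for web search.

```python
import ast
import operator
from typing import Callable, Dict

# Hypothetical interface: a backend maps a text query to a text result.
Retriever = Callable[[str], str]

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")


def calculator_backend(query: str) -> str:
    """Stand-in for an external math tool."""
    return str(_eval(ast.parse(query, mode="eval").body))


def make_lookup_backend(table: Dict[str, str]) -> Retriever:
    """Stand-in for a search backend, backed by an in-memory table."""
    def lookup(query: str) -> str:
        return table.get(query.lower().strip(), "no result")
    return lookup


class RetrievalRouter:
    """Holds named backends; the speech model itself is never modified."""

    def __init__(self) -> None:
        self._backends: Dict[str, Retriever] = {}

    def register(self, name: str, backend: Retriever) -> None:
        self._backends[name] = backend

    def retrieve(self, name: str, query: str) -> str:
        return self._backends[name](query)


if __name__ == "__main__":
    router = RetrievalRouter()
    router.register("math", calculator_backend)
    router.register("facts", make_lookup_backend({"capital of france": "Paris"}))
    print(router.retrieve("math", "12 * (3 + 4)"))
    print(router.retrieve("facts", "Capital of France"))
```

Registering a new backend under an existing name replaces it, which is the whole swap: nothing upstream of the router needs to change.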

⚠️ Limitations

  • Relies on a separate streaming ASR module for text transcription, introducing potential error propagation and added pipeline complexity.
  • The strict ~2-second retrieval budget may bottleneck highly complex queries that require extended multi-step reasoning, as well as slower external backends.
  • Performance remains tightly coupled to the quality, latency, and availability of external knowledge sources, limiting robustness in offline or restricted-network environments.