🔗 Source: arXiv

Moshi RAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Mechanism: Emits a special <ret> token during autoregressive speech generation to asynchronously invoke an external knowledge retrieval system, exploiting the natural temporal gap between response onset and key informational content.
Nuance: Activates back-end retrieval dynamically, only when the context demands it, and requires results within a strict ~2-second window. This avoids the fixed-interval calls, synchronous processing bottlenecks, and heavy retraining of prior full-duplex RAG approaches.
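
To make the token-triggered, non-blocking flow concrete, here is a minimal Python sketch. It is not the paper's implementation. The names `toy_retriever` and `generate`, the `[ctx: ...]` injection format, and the per-step sleep are all hypothetical, and a list of text tokens stands in for the speech token stream. Only the `<ret>` token and the ~2-second budget come from the summary above.

```python
import asyncio

RET_TOKEN = "<ret>"        # special token named in the summary
RETRIEVAL_BUDGET_S = 2.0   # the ~2-second window; exact value is an assumption


async def toy_retriever(query: str, latency_s: float = 0.02) -> str:
    """Hypothetical backend standing in for web search or an LLM call."""
    await asyncio.sleep(latency_s)  # simulated backend latency
    return f"facts about {query}"


async def generate(tokens, query, retriever=toy_retriever,
                   budget_s=RETRIEVAL_BUDGET_S, step_s=0.01):
    """Emit tokens step by step while retrieval runs in the background."""
    loop = asyncio.get_running_loop()
    out, task, deadline = [], None, None
    for tok in tokens:
        if tok == RET_TOKEN:
            if task is None:
                # Start retrieval without blocking generation.
                task = asyncio.create_task(retriever(query))
                deadline = loop.time() + budget_s
            continue  # control token, never voiced
        if task is not None:
            if task.done():
                out.append(f"[ctx: {task.result()}]")  # inject retrieved text
                task = None
            elif loop.time() > deadline:
                task.cancel()  # budget exceeded: carry on without retrieval
                task = None
        out.append(tok)
        await asyncio.sleep(step_s)  # stands in for decoding one audio frame
    if task is not None:
        task.cancel()  # response ended before retrieval returned
    return out


if __name__ == "__main__":
    stream = ["Sure,", RET_TOKEN, "the", "capital", "of", "France", "is", "Paris."]
    print(asyncio.run(generate(stream, "France")))
```

The opening tokens keep flowing while the backend works, and the retrieved text is injected at the first step after it arrives. If the budget expires first, the task is cancelled and generation continues unassisted.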

Achieves factuality parity with top publicly released non-duplex speech LMs while preserving real-time interactivity metrics on full-duplex benchmarks.
Enables plug-and-play integration of diverse retrieval backends (e.g., web search, LLMs) at inference time without retraining the base model.
Demonstrates strong zero-shot generalization to out-of-domain mathematical reasoning tasks by leveraging external tools to solve complex problems.
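
The plug-and-play claim can be pictured as a registry of interchangeable backends behind one call signature. The sketch below is an illustration under assumptions, not the paper's interface: `Backend`, `BACKENDS`, `retrieve`, and both stub functions are hypothetical names, and the stubs return canned strings instead of calling real services.

```python
from typing import Callable, Dict

# Assumed contract: a backend maps a text query to a text result.
Backend = Callable[[str], str]


def web_search_stub(query: str) -> str:
    """Hypothetical stand-in for a web search backend."""
    return f"[web] results for: {query}"


def llm_stub(query: str) -> str:
    """Hypothetical stand-in for a text LLM used as a knowledge source."""
    return f"[llm] answer to: {query}"


# Swapping backends is a lookup at inference time; the speech model is untouched.
BACKENDS: Dict[str, Backend] = {"web": web_search_stub, "llm": llm_stub}


def retrieve(query: str, backend: str = "web") -> str:
    """Route the query to the chosen backend and return its text result."""
    return BACKENDS[backend](query)
```

Because every backend shares one signature, adding a new source such as a calculator for math queries would only require registering another function.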

Relies on a separate streaming ASR module for text transcription, introducing potential error propagation and added pipeline complexity.
Strict ~2-second retrieval delay constraint may be too tight for highly complex queries that require extended multi-step reasoning, or for slower external backends.
Performance remains tightly coupled to the quality, latency, and availability of external knowledge sources, limiting robustness in offline or restricted-network environments.