Multi-Modal Latent CoT

🔗 Source: arXiv

Mechanism: Replaces static off-the-shelf vision extractors with a diffusion-based latent space learning module that iteratively aligns visual features with textual reasoning steps via noise prediction and denoising.
Nuance: Unlike prior state-of-the-art methods, which fuse shallow, fixed CLIP or DETR features via attention, this approach captures high-level semantic dependencies through the iterative denoising steps, yielding deeper cross-modal representations tailored to complex logical inference.
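
Sketch: The noise-prediction objective named in the mechanism can be illustrated with a generic DDPM-style forward step and loss. This is a minimal stand-in under stated assumptions, not the paper's implementation: the linear beta schedule, the step count, the toy 4-dimensional latent, and all function names are hypothetical, and the UNet is replaced by a "perfect" predictor that returns the true noise. Text conditioning is not modeled.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, eps, alpha_bar):
    """Forward process: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the sampled noise and the predicted noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true)) / len(eps_true)


def recover_latent(zt, eps_pred, alpha_bar):
    """Invert the forward step given a noise estimate to get an estimate of z0."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - b * e) / a for z, e in zip(zt, eps_pred)]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)

z0 = [0.5, -1.0, 2.0, 0.0]               # toy "visual latent" (hypothetical values)
eps = [rng.gauss(0.0, 1.0) for _ in z0]  # Gaussian noise sampled for this step
t = 500                                  # arbitrary timestep for illustration

zt = add_noise(z0, eps, alpha_bars[t])
# Stand-in for the UNet: a perfect predictor that returns the true noise.
z0_hat = recover_latent(zt, eps, alpha_bars[t])
```

In a trained system the noise estimate would come from a learned network, and minimizing `noise_prediction_loss` is what would shape the latent space; here the perfect predictor only shows that an accurate noise estimate recovers the clean latent.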

Achieves state-of-the-art performance on ScienceQA (90.97% base / 93.35% large), surpassing ChatGPT by 18.18 percentage points with under 1B parameters.
Demonstrates strong cross-task generalization, significantly boosting multi-modal machine translation BLEU scores across multiple benchmarks.
Ablation studies indicate that fine-tuning the VAE and UNet components during CoT training, rather than keeping them fixed at their pre-trained weights, is essential for producing reasoning-aligned visual latents.

Relies on a sequential two-stage pipeline (rationale generation followed by answer inference), making it vulnerable to error propagation if initial rationales are flawed.
Requires careful input masking for image-less questions (preferring zero tensors over blank images) to prevent diffusion noise from introducing misleading visual priors.
The iterative diffusion process increases computational overhead during training and inference compared to single-pass feature extraction baselines.
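
Sketch: The first two limitations above can be made concrete with a toy two-stage pipeline. Everything here is hypothetical and for illustration only: the function names, the feature shape, the keyword-matching "models", and the assumption that the generated rationale is appended to the question text before answer inference. It is not the paper's code.

```python
from typing import Callable, List, Optional, Tuple

Features = List[List[float]]


def visual_input(image_feats: Optional[Features], num_patches: int, dim: int) -> Features:
    """Return the image features, or an all-zero placeholder for image-less questions."""
    if image_feats is None:
        # Zero tensor rather than features of a blank image, as the summary recommends.
        return [[0.0] * dim for _ in range(num_patches)]
    return image_feats


def two_stage_answer(
    question: str,
    image_feats: Optional[Features],
    generate_rationale: Callable[[str, Features], str],
    infer_answer: Callable[[str, Features], str],
    num_patches: int = 4,
    dim: int = 8,
) -> Tuple[str, str]:
    """Stage 1 produces a rationale; stage 2 answers conditioned on that rationale."""
    vis = visual_input(image_feats, num_patches, dim)
    rationale = generate_rationale(question, vis)
    # Assumption: stage 2 sees the question with the rationale appended.
    answer = infer_answer(question + " " + rationale, vis)
    return rationale, answer


# Toy stand-ins for the two models (hypothetical, keyword-based).
def good_rationale(question: str, vis: Features) -> str:
    return "Magnets attract iron."


def flawed_rationale(question: str, vis: Features) -> str:
    return "Magnets attract wood."


def toy_infer(text: str, vis: Features) -> str:
    return "iron" if "attract iron" in text else "wood"


question = "What does a magnet attract?"
_, answer_ok = two_stage_answer(question, None, good_rationale, toy_infer)
_, answer_bad = two_stage_answer(question, None, flawed_rationale, toy_infer)
```

Because stage 2 only sees what stage 1 wrote, the flawed rationale carries straight through to the final answer, which is the error-propagation risk noted above.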

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models