🔗 Source: arXiv

Learning to Orchestrate Agents in Natural Language with the Conductor

Mechanism: Trains a 7B LLM with GRPO reinforcement learning to output agentic workflows in natural language. Each workflow lists subtasks, the agent assigned to each, and an access list of which earlier outputs that agent may see; the workflow then coordinates the worker models at inference.
Nuance: Replaces static multi-agent scaffolds and manual prompting with end-to-end RL reward maximization, so coordination topologies and prompt-engineering strategies emerge from training rather than being hand-designed, and the Conductor can adapt to arbitrary agent pools at runtime.
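
To make the workflow format concrete, here is a minimal sketch of how a plan made of subtasks, agent assignments, and access lists could be executed. It is an illustration under assumptions, not the paper's implementation: the `Step` fields, the `run_workflow` function, the prompt layout, and the reading of an access list as "indices of earlier steps whose outputs are visible" are all hypothetical, and workers are modeled as plain callables rather than real model APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical worker interface: takes a prompt string, returns a response string.
Worker = Callable[[str], str]


@dataclass
class Step:
    subtask: str                                     # natural-language instruction
    agent: str                                       # name of the assigned worker
    access: List[int] = field(default_factory=list)  # earlier steps visible to it


def run_workflow(question: str, steps: List[Step], workers: Dict[str, Worker]) -> str:
    """Run steps in order; each worker sees only the outputs on its access list."""
    if not steps:
        raise ValueError("workflow must contain at least one step")
    outputs: List[str] = []
    for i, step in enumerate(steps):
        # An access list may only point at steps that have already run.
        if any(j < 0 or j >= i for j in step.access):
            raise ValueError(f"step {i} references an unavailable output")
        prompt = f"Question: {question}\nSubtask: {step.subtask}"
        if step.access:
            visible = "\n".join(f"[step {j}] {outputs[j]}" for j in step.access)
            prompt += f"\nVisible prior outputs:\n{visible}"
        outputs.append(workers[step.agent](prompt))
    # Assumption: the last step's output is treated as the final answer.
    return outputs[-1]
```

With stub workers such as `{"solver": lambda p: "draft"}`, a two-step plan whose second step has `access=[0]` receives the first step's output in its prompt, while a step with an empty access list sees only the question and its own subtask.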

Achieves state-of-the-art results on LiveCodeBench and GPQA Diamond with a 7B model, surpassing costly multi-agent baselines while using fewer API calls.
Generalizes across diverse math, coding, and science domains by training with randomized agent pools.
Introduces recursive topologies where the Conductor calls itself, enabling tunable inference-time scaling through online iterative adaptation.
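
The recursive topology can be pictured as a Conductor round that may hand the problem back to itself, with a depth budget acting as the inference-time scaling knob. The sketch below is only an illustration under assumptions: `conduct`, `plan_and_run`, and `max_depth` are hypothetical names, and `plan_and_run` stands in for one full round (generate a workflow, execute it, and report whether the Conductor delegated to itself again).

```python
from typing import Callable, Tuple

# Hypothetical single round: (question, earlier answers) -> (answer, wants_another_round)
Round = Callable[[str, Tuple[str, ...]], Tuple[str, bool]]


def conduct(question: str, plan_and_run: Round, max_depth: int,
            history: Tuple[str, ...] = ()) -> str:
    """Run one round, recursing while a retry is requested and budget remains."""
    answer, wants_retry = plan_and_run(question, history)
    if not wants_retry or max_depth == 0:
        return answer
    # The next round sees every earlier answer, so it can adapt its plan online.
    return conduct(question, plan_and_run, max_depth - 1, history + (answer,))
```

Raising `max_depth` allows more self-calls and therefore more test-time compute; setting it to 0 reduces the procedure to a single non-recursive round.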

Performance gains come at the cost of increased test-time compute and latency due to iterative multi-agent coordination.
Effectiveness is contingent on the capabilities and diversity of the available worker agent pool, despite randomization during training.