Chapter 20: Multi-Agent Collaboration

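Reading contract: This chapter answers one question: when Codex coordinates multiple agents, which facts belong to live thread topology, which belong to model-visible communication, and which belong to offline trace reconstruction? As you read, follow thread identity, graph edges, collaboration events, pending reducer edges, and result ownership. By the end, you should be able to explain why a child agent is a thread with lifecycle state rather than an anonymous background prompt.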

Figure: multi-agent collaboration graph, showing parent/child threads, spawn edges, mailbox, wait state, outcome, and trace reduction. Multi-agent work is an explicit edge graph: spawn, mailbox, result, close, live updates, and trace evidence each have a boundary.

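Source boundary: This chapter treats a claim as verified source only when it links to a pinned Codex commit or to the files, structs, enums, handlers, event shapes, graph-store operations, or reducer behavior listed in this chapter's source map. Summarizing live topology and trace topology as two different graphs is a surrounding contract inference drawn from those visible boundaries. This chapter does not claim knowledge of any hidden scheduler or provider-side coordination policy.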

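Chapter 19 closed Part Five by showing how Codex imports external state without inheriting an external runtime. This chapter carries that discipline into runtime collaboration: once Codex has native threads and native extension surfaces, multi-agent work should be expressed as explicit relationships among threads, tools, messages, status, and trace evidence.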

The key shift: multi-agent is not a set of background prompts but a lifecycle graph. A parent thread can spawn a child, send or assign work, wait on status, receive a result, resume a descendant, and close the relationship. Every action has three surfaces:

Surface	Owner	Question answered
live runtime	`AgentControl`, thread manager, graph store	What can be spawned, messaged, waited on, resumed, or closed right now
protocol events	`Collab*Event` shapes	What clients can render without parsing prose
rollout trace	reducer interaction edges	What actually happened, established after raw events are captured

The invariant: coordination must not treat terminal text as the only source of truth.

1. The unit of collaboration is still the thread

A single Codex turn already has model input, streamed output, tool calls, approvals, hooks, persistence, cancellation, and replay. Multi-agent coordination does not invent a second runtime; it creates more threads and records how information moves between them.

The source gives this thread-centered design a concrete control plane. AgentControl is shared within the root session tree. It can spawn threads, send input, interrupt, subscribe to status, close agents, resume agents from rollout, and list live agents. usage_hint_text() also shows that root and subagent sessions can receive different usage hints depending on SessionSource.

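The live graph store is deliberately narrow. ThreadSpawnEdgeStatus has only Open and Closed. LocalAgentGraphStore upserts a parent/child edge, sets the child-edge status, lists children, and lists descendants with an optional status filter.

This narrow boundary is a strength. The graph store should not know how the model describes a task, how the TUI renders a child, or how a trace viewer draws causality. It answers only operational topology questions.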
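To make the four operations concrete, here is a minimal in-memory sketch. It is not the real LocalAgentGraphStore: the class name ToyGraphStore, the method names, and the rule that a non-matching edge prunes its whole subtree are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class EdgeStatus(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class ToyGraphStore:
    """Hypothetical in-memory stand-in; not the real LocalAgentGraphStore."""
    # parent thread id -> {child thread id -> edge status}
    edges: dict[str, dict[str, EdgeStatus]] = field(default_factory=dict)

    def upsert_edge(self, parent: str, child: str, status: EdgeStatus) -> None:
        # Insert the edge, or overwrite its status if it already exists.
        self.edges.setdefault(parent, {})[child] = status

    def set_child_status(self, child: str, status: EdgeStatus) -> None:
        # Update the edge that points at this child, whoever the parent is.
        for children in self.edges.values():
            if child in children:
                children[child] = status

    def list_children(self, parent: str, status: Optional[EdgeStatus] = None) -> list[str]:
        children = self.edges.get(parent, {})
        return sorted(c for c, s in children.items() if status is None or s == status)

    def list_descendants(self, root: str, status: Optional[EdgeStatus] = None) -> list[str]:
        # Assumption: traversal follows only edges that match the filter,
        # so a non-matching edge hides everything beneath it.
        found, stack = [], [root]
        while stack:
            for child in self.list_children(stack.pop(), status):
                found.append(child)
                stack.append(child)
        return sorted(found)


store = ToyGraphStore()
store.upsert_edge("root", "a", EdgeStatus.OPEN)
store.upsert_edge("a", "b", EdgeStatus.OPEN)
store.upsert_edge("root", "c", EdgeStatus.OPEN)
store.set_child_status("c", EdgeStatus.CLOSED)
open_descendants = store.list_descendants("root", EdgeStatus.OPEN)
all_descendants = store.list_descendants("root")
```

Nothing in the sketch stores prompts, rendering hints, or causal links; the store holds only thread IDs and a two-valued status, which is the narrowness the text describes.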

Figure: thread runtime record, showing thread identity, session facade, queues, history, projections, resume state, fork state, and rollback ledger. A child agent is still a thread record, with its own queues, history, projections, resume state, and rollback evidence.

2. Spawn creates a thread, a status edge, and a client event together

The spawn_agent handler lays the lifecycle out clearly. spawn.rs parses arguments, checks depth, emits CollabAgentSpawnBeginEvent, builds the child config, starts the child through agent_control.spawn_agent_with_metadata(), emits CollabAgentSpawnEndEvent, records telemetry, and returns a model-visible result.

Shape-level tool input:

{
  "message": "Audit chapter 17 source links",
  "agent_type": "reviewer",
  "model": "optional override",
  "reasoning_effort": "optional override",
  "fork_context": false
}

Shape-level tool output:

{
  "agent_id": "child-thread-id",
  "nickname": "optional nickname"
}

Live topology is then written in AgentControl. spawn_agent_internal() prepares SessionSource::SubAgent(ThreadSpawn), starts or forks the child thread, notifies clients that the thread was created, persists the spawn edge, and sends the initial input. persist_thread_spawn_edge_for_source() upserts the edge as Open.
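The ordering described above can be sketched as a toy runtime. Everything here is hypothetical: ToyRuntime, MAX_DEPTH, the event names, and the thread-N ID scheme are illustrative and do not mirror the real handler or AgentControl signatures.

```python
from dataclasses import dataclass, field
from itertools import count

MAX_DEPTH = 2  # hypothetical limit; the real depth policy is not shown in this chapter


@dataclass
class ToyRuntime:
    events: list[tuple[str, str]] = field(default_factory=list)       # (event name, thread id)
    edges: dict[tuple[str, str], str] = field(default_factory=dict)   # (parent, child) -> status
    depth: dict[str, int] = field(default_factory=lambda: {"root": 0})
    inbox: dict[str, list[str]] = field(default_factory=dict)         # thread id -> queued input
    _ids: count = field(default_factory=lambda: count(1))

    def spawn_agent(self, parent: str, message: str) -> dict:
        # Reject the request before any event or edge is written.
        if self.depth[parent] + 1 > MAX_DEPTH:
            raise ValueError("spawn depth limit reached")
        self.events.append(("spawn_begin", parent))
        child = f"thread-{next(self._ids)}"
        self.depth[child] = self.depth[parent] + 1
        self.edges[(parent, child)] = "open"   # persist the spawn edge as open
        self.inbox[child] = [message]          # the initial input goes to the child
        self.events.append(("spawn_end", child))
        return {"agent_id": child}             # model-visible result


rt = ToyRuntime()
result = rt.spawn_agent("root", "Audit chapter 17 source links")
```

One call leaves three separate records behind: a pair of client events, an open edge, and a queued input for the child. The returned agent_id is only the model-visible part.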

2.1 Forking has a recovery boundary

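When fork_context is set, the child is not created from an abstract prompt alone. spawn_forked_thread() flushes the parent rollout, reads the stored parent history, optionally truncates it to the last N fork turns, removes the parent usage hints so the child gets fresh hints, filters which rollout items are kept, and starts the forked thread with InitialHistory::Forked.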

This protects three invariants:

Pressure	Simple approach that fails	Source mechanism	Invariant protected
child needs parent context	copy live in-memory state	flush and read the stored rollout before forking	the fork has a durable baseline
parent usage hints differ from child hints	replay parent developer hints into the child	filter configured usage hints	the child prompt matches the child session source
tool/history noise would pollute the fork	keep every rollout item	`keep_forked_rollout_item()` filters item kinds	the child gets useful history, not parent runtime debris
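The truncate-then-filter step can be sketched as a pure function over stored rollout items. The item kinds, the KEPT_KINDS set, and the fork_history name are assumptions; this chapter does not list which item kinds the real keep_forked_rollout_item() keeps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutItem:
    kind: str   # hypothetical kinds: "user", "assistant", "tool_output", "usage_hint"
    turn: int
    text: str


KEPT_KINDS = {"user", "assistant"}  # assumption: the real filter's kinds are not listed here


def fork_history(stored: list[RolloutItem], last_n_turns: Optional[int] = None) -> list[RolloutItem]:
    """Build a child's initial history from the parent's stored (already flushed) rollout."""
    items = stored
    if last_n_turns is not None:
        turns = sorted({item.turn for item in stored})
        # Guard against turns[-0:], which would keep every turn.
        keep_turns = set(turns[-last_n_turns:]) if last_n_turns > 0 else set()
        items = [item for item in items if item.turn in keep_turns]
    # Dropping "usage_hint" removes the parent's hints; dropping "tool_output" removes runtime noise.
    return [item for item in items if item.kind in KEPT_KINDS]


parent = [
    RolloutItem("usage_hint", 0, "root hint"),
    RolloutItem("user", 1, "q1"),
    RolloutItem("tool_output", 1, "ls"),
    RolloutItem("assistant", 1, "a1"),
    RolloutItem("user", 2, "q2"),
    RolloutItem("assistant", 2, "a2"),
]
last_turn_only = [item.text for item in fork_history(parent, last_n_turns=1)]
full_fork = [item.text for item in fork_history(parent)]
```

The function reads only the stored list, never live session state, which is the durable-baseline property in the first table row.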

3. Sending work is mailbox delivery, not shared memory

Sending work to an agent names another thread as a participant. send_input.rs resolves the target thread ID, turns the message or items into user input, optionally interrupts the receiver, emits interaction begin/end events, calls agent_control.send_input(), and returns a submission ID.

Shape-level tool input and output:

{
  "target": "child-thread-id",
  "message": "Check whether this migration claim is source-backed.",
  "interrupt": false
}

{
  "submission_id": "operation-id"
}

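The mailbox model is useful because the sender's and receiver's observations can be separated in the raw event stream. The parent tool call proves that the parent requested delivery; the receiver-side model-visible message proves where in the child the delivery arrived. The two facts are related, but they are not the same fact.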
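A small sketch shows why these are two facts. ToyMailboxes, its two logs, and the sub-N submission IDs are hypothetical; the point is only that a send records a request immediately, while the receiver-side record appears later and separately.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class ToyMailboxes:
    queues: dict[str, deque] = field(default_factory=dict)   # thread id -> queued deliveries
    sender_log: list[dict] = field(default_factory=list)     # what senders requested
    receiver_log: list[dict] = field(default_factory=list)   # what receivers actually saw
    _ids: count = field(default_factory=lambda: count(1))

    def register(self, thread_id: str) -> None:
        self.queues[thread_id] = deque()

    def send_input(self, sender: str, target: str, message: str) -> dict:
        # Sender side: validate the target, enqueue, and record the request.
        if target not in self.queues:
            raise KeyError(f"unknown target thread: {target}")
        submission_id = f"sub-{next(self._ids)}"
        self.queues[target].append((submission_id, sender, message))
        self.sender_log.append({"submission_id": submission_id, "target": target})
        return {"submission_id": submission_id}

    def deliver_next(self, thread_id: str) -> Optional[dict]:
        # Receiver side: the message becomes visible in the child only here.
        if not self.queues[thread_id]:
            return None
        submission_id, sender, message = self.queues[thread_id].popleft()
        record = {"submission_id": submission_id, "from": sender, "message": message}
        self.receiver_log.append(record)
        return record


boxes = ToyMailboxes()
boxes.register("child-1")
sent = boxes.send_input("parent", "child-1", "Check whether this migration claim is source-backed.")
seen_before_delivery = len(boxes.receiver_log)
delivered = boxes.deliver_next("child-1")
```

Between send_input and deliver_next the sender log has an entry and the receiver log does not. That gap is the window the trace reducer later has to bridge.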

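4. Wait and close are lifecycle operations

wait_agent and close_agent make the status/lifecycle boundary explicit.

wait_agent resolves a non-empty set of targets, clamps a positive timeout, emits waiting begin/end events, subscribes to each child's status, returns once at least one final status arrives or the timeout elapses, and outputs a status map plus timed_out.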

{
  "status": {
    "child-thread-id": "completed"
  },
  "timed_out": false
}
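The return condition can be sketched deterministically by replaying a pre-recorded list of status updates instead of subscribing to live threads. The final-status names, the clamp bounds, and the (elapsed, thread, status) update shape are all assumptions made for this example.

```python
FINAL_STATUSES = {"completed", "failed", "cancelled"}   # hypothetical status names
MIN_TIMEOUT_S, MAX_TIMEOUT_S = 1.0, 300.0               # hypothetical clamp bounds


def wait_agent(targets, updates, timeout_s):
    """Replay (elapsed_seconds, thread_id, status) updates, ordered by time.

    Returns as soon as one target reaches a final status, or reports a timeout.
    A real implementation would block on subscriptions; this sketch only replays.
    """
    if not targets:
        raise ValueError("targets must be non-empty")
    watched = set(targets)
    timeout_s = min(max(timeout_s, MIN_TIMEOUT_S), MAX_TIMEOUT_S)
    for elapsed, thread_id, status in updates:
        if elapsed > timeout_s:
            break
        if thread_id in watched and status in FINAL_STATUSES:
            return {"status": {thread_id: status}, "timed_out": False}
    return {"status": {}, "timed_out": True}


updates = [
    (0.5, "child-1", "running"),
    (2.0, "child-2", "completed"),
    (3.0, "child-1", "completed"),
]
first_final = wait_agent(["child-1", "child-2"], updates, timeout_s=10)
too_short = wait_agent(["child-1", "child-2"], updates, timeout_s=0.1)  # clamped up to 1.0
```

The wait returns on the first final status rather than waiting for every target, so a caller that needs all children finished has to wait again on the remaining ones.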

close_agent emits close begin/end events, observes the previous status, and then calls agent_control.close_agent(). close_agent() marks the persisted child spawn edge as Closed before shutting down the agent tree.

Closing a relationship does not delete history. It changes the operational interpretation of the descendants. The graph-store tests show the status-filter behavior: open descendant traversal follows only open edges, and closed traversal follows closed edges. A closed branch can still be audited, but it is no longer treated as active runtime work.
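The close ordering and the point about history can be sketched with a toy tree. ToyAgentTree, the "running" and "shutdown" status strings, and the choice to leave nested edges untouched are assumptions, not the real close_agent() behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgentTree:
    edges: dict[tuple[str, str], str] = field(default_factory=dict)   # (parent, child) -> "open" | "closed"
    status: dict[str, str] = field(default_factory=dict)              # thread id -> runtime status
    history: dict[str, list[str]] = field(default_factory=dict)       # thread id -> transcript items

    def add_child(self, parent: str, child: str) -> None:
        self.edges[(parent, child)] = "open"
        self.status[child] = "running"
        self.history[child] = []

    def subtree(self, root: str) -> list[str]:
        # Collect root plus every thread reachable through spawn edges.
        out, stack = [root], [root]
        while stack:
            node = stack.pop()
            for parent, child in self.edges:
                if parent == node:
                    out.append(child)
                    stack.append(child)
        return out

    def close_agent(self, parent: str, child: str) -> dict:
        previous = self.status[child]              # observe the status before changing anything
        self.edges[(parent, child)] = "closed"     # mark the edge first
        for thread_id in self.subtree(child):      # then shut down the subtree
            self.status[thread_id] = "shutdown"
        return {"previous_status": previous}


tree = ToyAgentTree()
tree.add_child("root", "a")
tree.add_child("a", "b")
tree.history["a"].append("found 3 broken links")
closed = tree.close_agent("root", "a")
```

After the call the edge is closed and both threads are shut down, but the history of "a" is still there to audit.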

5. Collaboration events are product events

Codex emits collaboration events instead of making clients infer multi-agent state from tool prose. The protocol source defines:

Event family	Shape owner	What clients can render
`CollabAgentSpawnBegin/End`	`CollabAgentSpawn*`	sender, child thread, prompt preview, model, effort, status
`CollabAgentInteractionBegin/End`	`CollabAgentInteraction*`	sender, receiver, prompt preview, receiver metadata, status
`CollabWaitingBegin/End`	`CollabWaiting*`	target set, receiver refs, status map
`CollabCloseBegin/End`	`CollabClose*`	close target, receiver metadata, previous status
`CollabResumeBegin/End`	`CollabResume*`	resume target, receiver metadata, status

This is the app-server discipline from Chapter 14 applied to collaboration. Product events are not decoration; they preserve identity and lifecycle so that clients can render status, subscriptions, and history without parsing free-form assistant text.
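A client-side sketch shows what rendering without parsing prose means in practice. The two dataclasses and their field names are hypothetical simplifications; they are not the real Collab*Event shapes from the protocol source.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SpawnEnd:
    # Hypothetical, simplified fields; not the real CollabAgentSpawnEndEvent.
    sender_thread_id: str
    new_thread_id: str
    prompt_preview: str
    status: str


@dataclass(frozen=True)
class WaitingEnd:
    # Hypothetical, simplified fields; not the real CollabWaitingEndEvent.
    sender_thread_id: str
    statuses: dict[str, str]


CollabEvent = Union[SpawnEnd, WaitingEnd]


def render(event: CollabEvent) -> str:
    """Build a status line from typed fields only; no assistant text is inspected."""
    if isinstance(event, SpawnEnd):
        return f"{event.sender_thread_id} spawned {event.new_thread_id} ({event.status})"
    if isinstance(event, WaitingEnd):
        done = ", ".join(f"{tid}={st}" for tid, st in sorted(event.statuses.items()))
        return f"{event.sender_thread_id} finished waiting: {done}"
    raise TypeError(f"unsupported event: {type(event).__name__}")


spawn_line = render(SpawnEnd("root", "child-1", "Audit chapter 17...", "running"))
wait_line = render(WaitingEnd("root", {"child-1": "completed"}))
```

The renderer keys on thread IDs and status fields, so rewording the prompt preview or the assistant's reply cannot change what the client displays.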

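6. The live graph and the trace graph answer different questions

Two graphs are useful here.

The live graph is compact. It stores parent/child spawn edges and open/closed status. It is optimized for runtime questions: which children exist, which descendants are still open, which branches should be shut down or resumed, and which child edge should be marked closed.

The trace graph is semantic. It is built after raw protocol, runtime, tool, and model events have been captured. It is optimized for explanation: which tool call created the child, which mailbox item received the task, which child output produced the parent notice, and which raw payloads support each edge.

Confusing them produces design bugs: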

Confusion	Consequence
treating the live graph as the full trace	the runtime store bloats with UI/explanation policy
treating the trace graph as runtime state	active coordination depends on replay artifact availability
treating transcript text as the graph	compaction or formatting erases coordination facts
treating the nickname as identity	renamed agents break routing and trace joins

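Figure: raw trace events and payload references enter a strict reducer, which outputs a rollout trace graph with pending queues and raw links. The trace graph is reconstructed from recorded facts, so pending queues and raw links must exist explicitly rather than being inferred from prose.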

7. The pending queue makes races explicit

The trace reducer is strict about evidence and tolerant of legitimate ordering races. PendingAgentInteractionEdge holds an edge that is waiting for a recipient-side conversation item. It carries the edge kind, source, target thread ID, message content, an optional spawn fallback thread ID, timestamps, and raw payload IDs.

Reducer lifecycle:

sender tool begin/end observed
  -> queue pending edge with target thread and message content
  -> receiver-side inter-agent message item is reduced
  -> resolve pending edge to exact conversation item
  -> if spawn target item never appears but child thread exists
     resolve spawn to child thread fallback

queue_or_resolve_agent_interaction_edge() resolves immediately when an unlinked matching message item already exists; on duplicate pending observations it merges only when the endpoints agree and otherwise rejects the conflicting data. resolve_pending_agent_edges_for_item() resolves the pending edge when the matching inter-agent message item is reduced. resolve_pending_spawn_edge_fallbacks() materializes a spawn edge to a thread target only when the child thread exists.

A minimal sequence shows why the queue is necessary:

P1: parent calls spawn_agent("Audit links")
P2: parent receives tool result with child_thread_id
C1: child thread starts
C2: child receives model-visible mailbox/task message

If P2 is reduced before C2, the edge waits.
If C2 appears, the target is the conversation item.
If C2 never appears but C1 exists, spawn falls back to the child thread.
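The three outcomes in this sequence can be reproduced with a toy reducer. This is a sketch under stated assumptions, not the real reducer: ToyReducer, the edge-id key, matching on (thread, message) pairs, and the item:/thread: anchor strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingEdge:
    source_thread: str
    target_thread: str
    message: str
    spawn_fallback_thread: Optional[str] = None  # set only for spawn edges


@dataclass
class ToyReducer:
    threads: set[str] = field(default_factory=set)                      # threads seen in the trace
    unlinked: dict[str, tuple[str, str]] = field(default_factory=dict)  # item id -> (thread, message)
    pending: dict[str, PendingEdge] = field(default_factory=dict)       # keyed by a hypothetical edge id
    edges: list[tuple[str, str]] = field(default_factory=list)          # (source thread, target anchor)

    def queue_or_resolve(self, edge_id: str, edge: PendingEdge) -> None:
        known = self.pending.get(edge_id)
        if known is not None:
            # Duplicate observation: merge only when the endpoints agree.
            if (known.source_thread, known.target_thread) != (edge.source_thread, edge.target_thread):
                raise ValueError("conflicting endpoints for pending edge")
            return
        for item_id, seen in list(self.unlinked.items()):
            if seen == (edge.target_thread, edge.message):
                # The receiver-side item was reduced first: resolve immediately.
                del self.unlinked[item_id]
                self.edges.append((edge.source_thread, f"item:{item_id}"))
                return
        self.pending[edge_id] = edge

    def reduce_message_item(self, item_id: str, thread: str, message: str) -> None:
        for edge_id, edge in list(self.pending.items()):
            if (edge.target_thread, edge.message) == (thread, message):
                del self.pending[edge_id]
                self.edges.append((edge.source_thread, f"item:{item_id}"))
                return
        self.unlinked[item_id] = (thread, message)

    def resolve_spawn_fallbacks(self) -> None:
        for edge_id, edge in list(self.pending.items()):
            fallback = edge.spawn_fallback_thread
            if fallback is not None and fallback in self.threads:
                del self.pending[edge_id]
                self.edges.append((edge.source_thread, f"thread:{fallback}"))


# P2 reduced before C2: the edge waits, then resolves to the conversation item.
late = ToyReducer(threads={"child"})
late.queue_or_resolve("call-1", PendingEdge("parent", "child", "Audit links", "child"))
waiting = list(late.pending)
late.reduce_message_item("item-7", "child", "Audit links")

# C2 never appears but C1 exists: the spawn edge falls back to the child thread.
silent = ToyReducer(threads={"child"})
silent.queue_or_resolve("call-1", PendingEdge("parent", "child", "Audit links", "child"))
silent.resolve_spawn_fallbacks()
```

If the child thread is absent from the trace, resolve_spawn_fallbacks() leaves the edge pending instead of inventing a target, which is the strictness the section describes.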

The reducer also avoids false edges. upsert_close_agent_interaction() does not create a close edge for an absent thread. queue_agent_result_interaction_edge() anchors result delivery to the latest assistant item when one is available; if a failed or cancelled child has no final assistant message, it anchors to the child thread.
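Both anchoring rules fit in two small functions. The function names, the (item_id, role) pair shape, and the item:/thread: anchor strings are hypothetical; only the two rules themselves come from the text above.

```python
from typing import Optional


def close_edge(sender: str, target: str, known_threads: set[str]) -> Optional[tuple[str, str]]:
    """Return a close edge only when the target thread exists in the trace."""
    if target not in known_threads:
        return None
    return (sender, f"thread:{target}")


def result_anchor(child: str, child_items: list[tuple[str, str]]) -> str:
    """child_items: (item_id, role) pairs for the child thread, oldest first."""
    assistant_ids = [item_id for item_id, role in child_items if role == "assistant"]
    if assistant_ids:
        return f"item:{assistant_ids[-1]}"   # latest assistant item wins
    return f"thread:{child}"                 # no final message: fall back to the thread


finished = result_anchor("child-1", [("i1", "user"), ("i2", "assistant"), ("i3", "assistant")])
failed = result_anchor("child-2", [("i4", "user")])
missing = close_edge("parent", "child-9", {"child-1", "child-2"})
```

In both cases the fallback is a weaker but true anchor (a thread, or no edge at all) rather than a fabricated conversation item.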

8. Failure modes: identity loss contaminates coordination

Multi-agent systems are hardest to explain when identity is lost. Thread ID, agent path, nickname, tool call ID, model-visible call ID, conversation item ID, and raw payload ID each serve a different purpose.

Identifier	Owner	Misuse
thread ID	runtime thread manager and graph store	substituting a nickname for it
agent path	user-facing agent tree reference	assuming it proves persisted history
tool call ID	parent model/tool lifecycle	treating it as the receiver message identity
conversation item ID	model-visible transcript item	using it before receiver delivery has appeared
raw payload ID	trace evidence ledger	rendering it as user-facing state

The reducer's design lesson is balance: it rejects conflicting endpoints, duplicate model-visible tool relationships, and inconsistent tool-call pairs; it tolerates pending delivery, spawn fallback, missing close targets, and child results with no final assistant item. Reliable coordination is neither "accept every edge" nor "fail on one missing detail". It preserves evidence and materializes semantic edges only when there is support for them.

Trace Ledger

Question	Chapter 20 answer
Where is the user request now?	It may be in the parent thread, a child thread, a mailbox delivery, a wait status observation, a close operation, or a trace edge connecting these facts.
What data structure carries it?	`AgentControl`, `SessionSource::SubAgent(ThreadSpawn)`, graph-store spawn edges, collaboration protocol events, tool outputs, inter-agent messages, and rollout trace interaction edges.
Who owns the next decision?	The model chooses collaboration tools; handlers validate and call `AgentControl`; the graph store records live topology; clients render protocol events; the reducer later reconstructs explanatory edges.
What must stay invariant?	Child agents are threads with identity and lifecycle; live graph state stays operational; trace graph state stays evidentiary; result delivery must not depend on transcript prose alone.
How can it fail here?	Depth limits, invalid targets, missing threads, timed-out waits, closed branches, child failure before final assistant output, conflicting pending edges, or raw events that never get a valid target.

Applying this in practice

Model agents as threads. Use this when sub-work needs lifecycle and replay. Give every agent thread an explicit identity, status, and parentage. Risk: treating subagents as anonymous background prompts.
Separate live topology from trace explanation. Use the graph store to answer operational questions about open/closed descendants, and the reducer to answer questions about causal edges. Risk: stuffing UI/trace policy into the live store.
Emit collaboration events. Use typed events for spawn, send, wait, close, and resume. Risk: making clients infer coordination from tool output prose.
Queue edges across races. Use pending edges when sender and receiver observations can arrive out of order. Risk: creating false endpoints because the best target has not appeared yet.
Anchor failures honestly. For a genuinely spawned child with no message item, use the thread fallback; leave unresolvable payloads on the raw tool evidence. Risk: inventing conversation items to make the graph look complete.

Wrapping up

Multi-agent Codex is still Codex: threads, turns, tools, events, persistence, and replayable state. What is new is explicit information flow across thread boundaries. Chapter 21 carries the same principle into cloud tasks: work can run remotely, but it must still return to the user as typed task state, identity, and locally verifiable changes.

Source map

Concept	Source anchor
Graph edge status	`codex-rs/agent-graph-store/src/types.rs`
Local graph store	`codex-rs/agent-graph-store/src/local.rs`
Agent control lifecycle	`codex-rs/core/src/agent/control.rs`
Session multi-agent integration	`codex-rs/core/src/session/multi_agents.rs`
Multi-agent tool handlers	`codex-rs/core/src/tools/handlers/multi_agents.rs`
Spawn/send/wait/close handlers	`spawn.rs`, `send_input.rs`, `wait.rs`, `close_agent.rs`
Collaboration event protocol	`codex-rs/protocol/src/protocol.rs`
Agent trace reducer	`codex-rs/rollout-trace/src/reducer/tool/agents.rs`