Observability: Capture Facts Before You Interpret Them

Reading Contract: Treat this chapter as the evidence map for a Codex run. Follow the difference between replay persistence, diagnostic trace bundles, reduced graph state, product analytics, OTEL telemetry, response debug context, and bounded local logs. After the chapter, you should be able to explain why the transcript is only one projection of the run.

Figure: Observability evidence lanes separating rollout persistence, trace bundles, reducers, analytics, OTEL, and response debug context. Codex records runtime evidence before it asks any one view to explain the run, and each lane answers a different question.

Source boundary: direct source claims in this chapter are pinned to OpenAI Codex commit 569ff6a1c400bd514ff79f5f1050a684dc3afde3. TraceWriter, TraceBundleManifest, RawTraceEvent, RolloutTrace, replay_bundle, protocol_event, parse_turn_item, AnalyticsFact, AnalyticsReducer, SessionTelemetry, ResponseDebugContext, and StateRuntime::insert_logs are verified against source where linked. Claims about why these owners are separated are contract inferences drawn from those visible types, functions, comments, and tests, not claims about private OpenAI service internals.

Four local terms will carry the argument. Runtime fact means a turn, inference, tool, compaction, protocol event, or log observation before it is made friendly for a particular audience. Trace bundle means the local diagnostic artifact rooted at manifest.json, trace.jsonl, and payloads/. Reduced graph means the RolloutTrace object built by replaying raw events and payload references. Projection means a consumer-specific view: transcript, client event, analytics event, OTEL span, support context, or log query.

Chapter 7 stopped at the provider boundary: HTTP streaming, WebSocket streaming, local providers, Bedrock signing, realtime setup, and backend tasks must become typed runtime events before the turn loop can continue. This chapter follows those events after they enter the runtime. Some evidence must resume or replay a thread. Some evidence must let a developer explain one failed rollout. Some evidence must aggregate across product usage. Some evidence must diagnose transport performance. The system becomes debuggable because those jobs do not collapse into one transcript.

Problem: an agent runtime cannot be debugged from the final assistant message, and it cannot safely push every internal byte into product analytics.

Thesis: Codex captures ordered facts first, then lets replay, trace reduction, analytics, OTEL, debug context, and logs interpret only the subset each audience should own.

Mental model: transcript history is one projection; trace evidence is a separate local artifact; telemetry and analytics are narrower projections with different retention and privacy rules.

Guiding questions: What fact was observed? Which owner stores it? Which reducer interprets it? Which payloads are deliberately kept out of the public view?

1. There Is No Single Observability Plane

The first discipline is to stop asking one artifact to answer every question. The same tool call can appear as a model-visible response item, a protocol event, a raw trace payload, a reduced tool object, a client notification, an analytics event, an OTEL span, and a local log line. That is not duplication if each plane has a distinct owner and retention model.

Plane	Primary question	Typical consumer	What it should not become
Rollout persistence	Can the thread resume, fork, or replay?	runtime reconstruction	unbounded debug dump
Rollout trace bundle	What raw evidence explains this run?	local trace reducer and graph viewer	product analytics stream
Reduced trace graph	Which semantic objects existed?	offline debugging and audits	raw payload store
Product analytics	What happened across sessions?	aggregate product analysis	exact replay source of truth
OTEL telemetry	How did runtime operations perform?	engineering diagnostics	durable transcript
Response debug context	Which upstream request failed?	support and failure triage	copied HTTP response body
Local state logs	What process logs are locally inspectable?	feedback and local inspection	infinite log archive

The architecture payoff is practical. If a user asks why a thread resumed with the wrong history, inspect rollout reconstruction and the reduced trace graph. If an operator asks why provider streaming is slow, inspect OTEL counters, histograms, and spans. If product wants app/plugin usage trends, inspect analytics facts. If support needs to correlate an upstream 401, inspect response debug context. The transcript alone should not carry any of those jobs by accident.
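To make the plane separation concrete, here is a minimal sketch of one runtime fact projected three ways. `ToolCallFact` and the three view functions are hypothetical names invented for illustration; they are not Codex types, and the real planes carry far more structure.

```rust
/// Hypothetical runtime fact for one finished tool call (not a Codex type).
pub struct ToolCallFact {
    pub tool_name: String,
    pub duration_ms: u64,
    pub summary: String,
    pub raw_payload_path: String,
}

/// Transcript projection: only the summarized observation.
pub fn transcript_view(fact: &ToolCallFact) -> String {
    fact.summary.clone()
}

/// Analytics projection: aggregate-friendly fields, no payload content.
pub fn analytics_view(fact: &ToolCallFact) -> (String, u64) {
    (fact.tool_name.clone(), fact.duration_ms)
}

/// Trace projection: a reference to the raw evidence, not a copy of it.
pub fn trace_view(fact: &ToolCallFact) -> &str {
    &fact.raw_payload_path
}
```

Each function owns a different subset of the same fact, which is the sense in which the planes are related without being duplicates.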

2. The Trace Bundle Keeps Capture Simple

Figure: TraceWriter capturing runtime facts into manifest, trace log, payloads, sequence order, and reducer input. `TraceWriter` is intentionally a capture owner: it writes a manifest, an append-only event log, and payload files. It does not try to build the final graph on the hot path.

2.1 The Local Layout Is a Contract

The trace bundle layout is small enough to audit. In codex-rs/rollout-trace/src/bundle.rs, the constants name the manifest, raw event log, payload directory, and reduced state cache. TraceBundleManifest then records the trace identity, rollout identity, root thread, start time, and standard local paths.

pub(crate) const MANIFEST_FILE_NAME: &str = "manifest.json";
pub(crate) const RAW_EVENT_LOG_FILE_NAME: &str = "trace.jsonl";
pub(crate) const PAYLOADS_DIR_NAME: &str = "payloads";
pub const REDUCED_STATE_FILE_NAME: &str = "state.json";

pub(crate) struct TraceBundleManifest {
    pub(crate) schema_version: u32,
    pub(crate) trace_id: String,
    pub(crate) rollout_id: String,
    pub(crate) root_thread_id: AgentThreadId,
    pub(crate) started_at_unix_ms: i64,
    pub(crate) raw_event_log: String,
    pub(crate) payloads_dir: String,
}

One source comment is especially important: replay should fail rather than invent a placeholder root thread. That makes the root thread a required identity anchor, not a viewer convenience. Every reduced object is scoped back into that thread tree.
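The "fail rather than invent" rule can be sketched as follows. This is an illustration under stated assumptions: `ManifestFields` is a hypothetical, loosely parsed stand-in in which the root thread may be absent. In the real `TraceBundleManifest` the field is required, so a missing value would presumably surface as a parse failure instead.

```rust
/// Hypothetical, loosely parsed manifest (not the real TraceBundleManifest).
pub struct ManifestFields {
    pub trace_id: String,
    pub root_thread_id: Option<String>,
}

/// Returns the root thread or an error. There is deliberately no fallback
/// such as "thread-0": a placeholder would give every reduced object a
/// fake owner to hang from.
pub fn root_thread_for_replay(manifest: &ManifestFields) -> Result<&str, String> {
    manifest.root_thread_id.as_deref().ok_or_else(|| {
        format!(
            "manifest for trace {} has no root_thread_id; refusing to invent one",
            manifest.trace_id
        )
    })
}
```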

2.2 The Writer Captures, Then Gets Out of the Way

TraceWriter states its own boundary in the file comment and struct comment: it is the hot-path trace bundle writer; it appends raw events and writes payload files; it does not keep a reduced RolloutTrace in memory. Replay belongs to the reducer.

The creation path writes the manifest, opens trace.jsonl in append mode, and initializes two monotonic counters:

let payloads_dir = bundle_dir.join(PAYLOADS_DIR_NAME);
std::fs::create_dir_all(&payloads_dir)?;

let started_at_unix_ms = unix_time_ms();
let manifest =
    TraceBundleManifest::new(trace_id, rollout_id, root_thread_id, started_at_unix_ms);
write_json_file(&bundle_dir.join(MANIFEST_FILE_NAME), &manifest)?;

let event_log_path = bundle_dir.join(RAW_EVENT_LOG_FILE_NAME);
let event_log = OpenOptions::new()
    .create(true)
    .append(true)
    .open(&event_log_path)?;

// ... later, in the writer's initial state, both counters start at 1:
next_seq: 1,
next_payload_ordinal: 1,

The payload path is the more interesting part. A large model request, response, tool payload, or protocol payload can be written as a separate JSON file; the raw event then carries a reference. The writer creates payload files before the event that references them (writer.rs#L85-L106):

let ordinal = inner.next_payload_ordinal;
inner.next_payload_ordinal += 1;
let raw_payload_id = format!("raw_payload:{ordinal}");
let relative_path = format!("{PAYLOADS_DIR_NAME}/{ordinal}.json");
let absolute_path = inner.payloads_dir.join(format!("{ordinal}.json"));

// Payload files are created before the event that references them.
write_json_file(&absolute_path, value)?;
Ok(RawPayloadRef {
    raw_payload_id,
    kind,
    path: relative_path,
})

That ordering matters. Because the payload file is written first, any event that reaches the log already has its payload on disk; an interrupted session can leave an unreferenced payload file, but never an event that points at a file the writer had not yet written. writer.rs also flushes each JSONL event after writing it (writer.rs#L108-L133). The writer even recovers a poisoned mutex so that a panic in tracing code does not silence later diagnostic events in the very session being debugged (writer.rs#L136-L140).

The invariant is simple: the hot path captures stable facts and raw payload references; it avoids doing graph interpretation while the run is still live.
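The payload-then-event ordering can be sketched with an in-memory stand-in. `SketchWriter` is a hypothetical type, not the real `TraceWriter`: it keeps payload "files" in a vector, hand-formats its JSON lines, and omits locking, so it only illustrates the order of operations.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a bundle writer; not the real `TraceWriter`.
pub struct SketchWriter<W: Write> {
    event_log: W,
    /// (relative path, JSON text) pairs standing in for files under payloads/.
    pub payload_files: Vec<(String, String)>,
    next_seq: u64,
    next_payload_ordinal: u64,
}

impl<W: Write> SketchWriter<W> {
    pub fn new(event_log: W) -> Self {
        Self { event_log, payload_files: Vec::new(), next_seq: 1, next_payload_ordinal: 1 }
    }

    /// Step 1: persist the payload and hand back a path that refers to it.
    pub fn write_payload(&mut self, json: &str) -> String {
        let ordinal = self.next_payload_ordinal;
        self.next_payload_ordinal += 1;
        let path = format!("payloads/{ordinal}.json");
        self.payload_files.push((path.clone(), json.to_string()));
        path
    }

    /// Step 2: append one event line that cites the payload, then flush.
    pub fn append_event(&mut self, kind: &str, payload_path: &str) -> io::Result<u64> {
        let seq = self.next_seq;
        self.next_seq += 1;
        writeln!(
            self.event_log,
            "{{\"seq\":{seq},\"kind\":\"{kind}\",\"payload\":\"{payload_path}\"}}"
        )?;
        self.event_log.flush()?;
        Ok(seq)
    }

    /// Consumes the writer and returns the underlying log sink.
    pub fn into_log(self) -> W {
        self.event_log
    }
}
```

Because `append_event` takes a path that only `write_payload` produces, a caller following this API cannot log a reference before the payload exists.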

3. Raw Events Preserve Order Before Meaning

3.1 `RawTraceEvent` Is a Shared Envelope

RawTraceEvent is the append-only envelope. It assigns a writer-local sequence, wall-clock time, rollout identity, optional thread/turn context, and a typed payload.

pub type RawEventSeq = u64;
pub(crate) const RAW_TRACE_EVENT_SCHEMA_VERSION: u32 = 1;

pub struct RawTraceEvent {
    pub schema_version: u32,
    pub seq: RawEventSeq,
    pub wall_time_unix_ms: i64,
    pub rollout_id: String,
    pub thread_id: Option<AgentThreadId>,
    pub codex_turn_id: Option<CodexTurnId>,
    pub payload: RawTraceEventPayload,
}

pub struct RawTraceEventContext {
    pub thread_id: Option<AgentThreadId>,
    pub codex_turn_id: Option<CodexTurnId>,
}

The source comment says why every event uses the same envelope: partial replay and corruption checks can run before the reducer understands every payload variant. That is the right ordering for diagnostics. The writer assigns sequence numbers and a common envelope up front; the reducer can then parse the log generically, retain payload references, and let typed replay discover missing or malformed payload files only when a semantic arm actually reads them.
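A shared envelope makes payload-agnostic checks possible. The sketch below is hypothetical tooling, not something the Codex reducer is claimed to run: `Envelope`, `EnvelopeProblem`, and `check_envelopes` are invented names, and the payload is kept as undecoded text to show that the check never looks inside it.

```rust
/// Simplified envelope: only fields every event shares (hypothetical type).
pub struct Envelope {
    pub schema_version: u32,
    pub seq: u64,
    /// Left undecoded on purpose; this check never inspects it.
    pub raw_payload: String,
}

#[derive(Debug, PartialEq)]
pub enum EnvelopeProblem {
    UnknownSchema { seq: u64, schema_version: u32 },
    NonIncreasingSeq { previous: u64, found: u64 },
}

/// Scans envelopes in log order and reports problems that can be seen
/// without understanding any payload variant.
pub fn check_envelopes(events: &[Envelope], known_schema: u32) -> Vec<EnvelopeProblem> {
    let mut problems = Vec::new();
    let mut previous: Option<u64> = None;
    for event in events {
        if event.schema_version != known_schema {
            problems.push(EnvelopeProblem::UnknownSchema {
                seq: event.seq,
                schema_version: event.schema_version,
            });
        }
        if let Some(prev) = previous {
            if event.seq <= prev {
                problems.push(EnvelopeProblem::NonIncreasingSeq { previous: prev, found: event.seq });
            }
        }
        previous = Some(event.seq);
    }
    problems
}
```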

3.2 Payload Variants Are Runtime Boundaries

RawTraceEventPayload is not a transcript enum. It names runtime boundaries: rollout start/end, thread start/end, Codex turn start/end, inference lifecycle, tool lifecycle, code-cell lifecycle, compaction requests, compaction installation, agent result delivery, protocol breadcrumbs, and an Other arm for early instrumentation (raw_event.rs#L65-L226).

pub enum RawTraceEventPayload {
    RolloutStarted { trace_id: String, root_thread_id: AgentThreadId },
    ThreadStarted {
        thread_id: AgentThreadId,
        agent_path: String,
        metadata_payload: Option<RawPayloadRef>,
    },
    CodexTurnStarted {
        codex_turn_id: CodexTurnId,
        thread_id: AgentThreadId,
    },
    InferenceStarted {
        inference_call_id: InferenceCallId,
        thread_id: AgentThreadId,
        codex_turn_id: CodexTurnId,
        model: String,
        provider_name: String,
        request_payload: RawPayloadRef,
    },
    ToolCallStarted {
        tool_call_id: ToolCallId,
        model_visible_call_id: Option<String>,
        code_mode_runtime_tool_id: Option<String>,
        requester: RawToolCallRequester,
        kind: ToolCallKind,
        summary: ToolCallSummary,
        invocation_payload: Option<RawPayloadRef>,
    },
    ProtocolEventObserved {
        event_type: String,
        event_payload: RawPayloadRef,
    },
    Other {
        kind: String,
        summary: String,
        payloads: Vec<RawPayloadRef>,
        metadata: Value,
    },
    // ...
}

Several fields are deliberately not model-facing. RawToolCallRequester uses runtime-local identifiers for model-triggered calls versus code-cell-triggered calls; the reducer is the only owner that maps those handles to graph identities such as CodeCellId (raw_event.rs#L51-L63). That keeps raw capture faithful to the runtime while still allowing the reduced graph to use stable semantic IDs.

3.3 Protocol Events Feed Trace Without Becoming the Whole Trace

The trace layer reuses existing session protocol events instead of adding a second hook system in core. The module comment in protocol_event.rs is explicit: long EventMsg matches are intentional because most protocol events are not trace runtime boundaries, and new protocol variants should force a compile-time decision about trace capture.

The borrowed payload enum keeps exact protocol payload shape for end-to-end debugging, while typed trace events provide the reducer boundary (protocol_event.rs#L92-L135):

pub(crate) enum ToolRuntimePayload<'a> {
    ExecCommandBegin(&'a ExecCommandBeginEvent),
    ExecCommandEnd(&'a ExecCommandEndEvent),
    PatchApplyBegin(&'a PatchApplyBeginEvent),
    PatchApplyEnd(&'a PatchApplyEndEvent),
    McpToolCallBegin(&'a McpToolCallBeginEvent),
    McpToolCallEnd(&'a McpToolCallEndEvent),
    CollabAgentSpawnBegin(&'a codex_protocol::protocol::CollabAgentSpawnBeginEvent),
    CollabAgentSpawnEnd(&'a codex_protocol::protocol::CollabAgentSpawnEndEvent),
    CollabAgentInteractionBegin(&'a codex_protocol::protocol::CollabAgentInteractionBeginEvent),
    CollabAgentInteractionEnd(&'a codex_protocol::protocol::CollabAgentInteractionEndEvent),
    // ...
}

This is the second major separation: protocol breadcrumbs are evidence, but the reduced graph should still be built by typed trace semantics. A raw ExecCommandEndEvent is useful for debugging, yet the graph also needs to know which tool call, code cell, terminal operation, turn, and thread own it.
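The "force a compile-time decision" idea from the module comment can be sketched with an exhaustive match. The enums here are hypothetical and heavily reduced; the variant names are illustrative and do not reproduce the real `EventMsg`.

```rust
/// Hypothetical, reduced protocol event set (illustrative names only).
pub enum ProtocolEvent {
    ExecCommandBegin,
    ExecCommandEnd,
    AgentMessageDelta,
    TokenCount,
}

#[derive(Debug, PartialEq)]
pub enum TraceBoundary {
    ToolStarted,
    ToolEnded,
}

/// Maps a protocol event to a trace boundary, or `None` when the event is
/// not a runtime boundary.
pub fn trace_boundary(event: &ProtocolEvent) -> Option<TraceBoundary> {
    // No `_ =>` arm on purpose: adding a variant to `ProtocolEvent` makes
    // this match fail to compile until someone decides how to trace it.
    match event {
        ProtocolEvent::ExecCommandBegin => Some(TraceBoundary::ToolStarted),
        ProtocolEvent::ExecCommandEnd => Some(TraceBoundary::ToolEnded),
        ProtocolEvent::AgentMessageDelta => None,
        ProtocolEvent::TokenCount => None,
    }
}
```

The cost is a longer match; the benefit is that "not traced" is always an explicit choice rather than a silent default.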

4. The Reducer Owns Interpretation

Figure: Raw trace events and payload references entering a strict reducer that emits a RolloutTrace graph with pending queues and raw links. The reducer is strict because it turns append-only evidence into semantic graph objects. Pending queues bridge real ordering gaps without pretending missing owners exist.

4.1 The Reduced Model Is a Graph, Not a Chat Log

The reduced model is declared in codex-rs/rollout-trace/src/model/mod.rs. The file comment says these types describe deterministic replay output and intentionally separate model-visible conversation from runtime/debug objects.

pub struct RolloutTrace {
    pub schema_version: u32,
    pub trace_id: String,
    pub rollout_id: String,
    pub started_at_unix_ms: i64,
    pub ended_at_unix_ms: Option<i64>,
    pub status: RolloutStatus,
    pub root_thread_id: AgentThreadId,
    pub threads: BTreeMap<AgentThreadId, AgentThread>,
    pub codex_turns: BTreeMap<CodexTurnId, CodexTurn>,
    pub conversation_items: BTreeMap<ConversationItemId, ConversationItem>,
    pub inference_calls: BTreeMap<InferenceCallId, InferenceCall>,
    pub code_cells: BTreeMap<CodeCellId, CodeCell>,
    pub tool_calls: BTreeMap<ToolCallId, ToolCall>,
    pub terminal_sessions: BTreeMap<TerminalId, TerminalSession>,
    pub terminal_operations: BTreeMap<TerminalOperationId, TerminalOperation>,
    pub compactions: BTreeMap<CompactionId, Compaction>,
    pub compaction_requests: BTreeMap<CompactionRequestId, CompactionRequest>,
    pub interaction_edges: BTreeMap<EdgeId, InteractionEdge>,
    pub raw_payloads: BTreeMap<RawPayloadId, RawPayloadRef>,
}

That object answers questions a transcript cannot answer cleanly. Which thread spawned which child? Which inference request used which provider and model? Which runtime-local code cell produced a nested tool call? Which terminal operation was a command versus a poll? Which compaction installed replacement history? Which raw payload explains the reduced object?

4.2 Sequence Is Causal Order; Wall Clock Is Display

The session model makes ordering rules visible. ExecutionWindow stores both wall-clock timestamps and raw event sequence numbers; the source comment says sequence numbers are the causal ordering primitive and should be used to pair observations or break same-millisecond ties (session.rs#L68-L80). CodexTurn then warns that a Codex turn is one activation of the runtime for one thread, not a user/assistant message pair (session.rs#L98-L110).

pub struct ExecutionWindow {
    pub started_at_unix_ms: i64,
    pub started_seq: RawEventSeq,
    pub ended_at_unix_ms: Option<i64>,
    pub ended_seq: Option<RawEventSeq>,
    pub status: ExecutionStatus,
}

pub struct CodexTurn {
    pub codex_turn_id: CodexTurnId,
    pub thread_id: AgentThreadId,
    pub execution: ExecutionWindow,
    pub input_item_ids: Vec<ConversationItemId>,
}

This is a useful corrective when debugging agent runs. A single user-visible turn may contain several model requests, tool lifecycles, compactions, pending inputs, and child agent interactions. Conversely, a protocol turn lifecycle event does not equal one line in a chat transcript.
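The difference between causal order and display order is easy to show with a small sketch. `Observation` and both functions are hypothetical; the only idea taken from the source is that sequence numbers, not timestamps, carry causality and break same-millisecond ties.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub wall_time_unix_ms: i64,
    pub seq: u64,
    pub label: &'static str,
}

/// Order for causal reasoning: sequence only, so clock adjustments
/// cannot reorder events.
pub fn causal_order(mut observations: Vec<Observation>) -> Vec<Observation> {
    observations.sort_by_key(|o| o.seq);
    observations
}

/// Order for display: wall clock first, with sequence breaking
/// same-millisecond ties.
pub fn display_order(mut observations: Vec<Observation>) -> Vec<Observation> {
    observations.sort_by_key(|o| (o.wall_time_unix_ms, o.seq));
    observations
}
```

If the system clock steps backwards mid-run, the two orders disagree, and only the sequence-based one still reflects what the writer saw first.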

4.3 Replay Is Deterministic and Deferred

replay_bundle loads the manifest, initializes an empty RolloutTrace, reads trace.jsonl line by line, parses each RawTraceEvent, applies it, and only then resolves pending spawn-edge fallbacks:

pub fn replay_bundle(bundle_dir: impl AsRef<Path>) -> Result<RolloutTrace> {
    let manifest: TraceBundleManifest =
        serde_json::from_reader(File::open(bundle_dir.join(MANIFEST_FILE_NAME))?)?;
    let mut reducer = TraceReducer {
        rollout: RolloutTrace::new(
            REDUCED_TRACE_SCHEMA_VERSION,
            manifest.trace_id,
            manifest.rollout_id,
            manifest.root_thread_id,
            manifest.started_at_unix_ms,
        ),
        pending_code_cell_starts: BTreeMap::new(),
        pending_code_cell_lifecycle_events: BTreeMap::new(),
        pending_agent_interaction_edges: Vec::new(),
        // ...
    };

    // Excerpt note: `event_log` is the opened trace.jsonl handle; opening it is elided here.
    for (line_index, line) in BufReader::new(event_log).lines().enumerate() {
        let event: RawTraceEvent = serde_json::from_str(&line?)?;
        reducer.apply_event(event)?;
    }
    reducer.resolve_pending_spawn_edge_fallbacks()?;
    Ok(reducer.rollout)
}

The pending queues are not leniency. The comments around TraceReducer explain real ordering gaps. Core can begin executing tools before the stream completion hook records the response payload that requested them. Fast code cells can return before the inference response payload that proves their model-visible source item. Agent tool deliveries can arrive before the recipient thread’s transcript materializes the mailbox item. The reducer queues those facts, then attaches them to the precise owner once replay reveals it.
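A pending queue of this kind can be sketched in a few lines. `Fact` and `SketchReducer` are hypothetical and much simpler than `TraceReducer`. In particular, the strict `finish` step is an assumption of this sketch; the real reducer's end-of-replay handling (such as its spawn-edge fallbacks) is not modeled.

```rust
use std::collections::BTreeMap;

pub enum Fact {
    /// The owner becomes known (for example, a response payload is reduced).
    OwnerSeen { owner_id: u32 },
    /// A child fact that names its owner, possibly before the owner is seen.
    ChildSeen { owner_id: u32, child: &'static str },
}

#[derive(Default)]
pub struct SketchReducer {
    pub owners: BTreeMap<u32, Vec<&'static str>>,
    pending: BTreeMap<u32, Vec<&'static str>>,
}

impl SketchReducer {
    pub fn apply(&mut self, fact: Fact) {
        match fact {
            Fact::OwnerSeen { owner_id } => {
                // Attach anything that arrived early, in arrival order.
                let early = self.pending.remove(&owner_id).unwrap_or_default();
                self.owners.entry(owner_id).or_default().extend(early);
            }
            Fact::ChildSeen { owner_id, child } => match self.owners.get_mut(&owner_id) {
                Some(children) => children.push(child),
                // Owner not seen yet: queue the fact instead of inventing an owner.
                None => self.pending.entry(owner_id).or_default().push(child),
            },
        }
    }

    /// Assumed strict finish: leftover queued facts are reported as an error.
    pub fn finish(self) -> Result<BTreeMap<u32, Vec<&'static str>>, String> {
        if let Some((owner_id, children)) = self.pending.iter().next() {
            return Err(format!(
                "owner {owner_id} never appeared for {} queued fact(s)",
                children.len()
            ));
        }
        Ok(self.owners)
    }
}
```

Queueing tolerates reordering; it does not tolerate absence. That is the distinction between a pending queue and leniency.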

The apply path keeps raw payload references reducer-wide before typed interpretation:

fn apply_event(&mut self, event: RawTraceEvent) -> Result<()> {
    for payload in event.payload.raw_payload_refs() {
        self.insert_raw_payload(payload);
    }

    match event.payload {
        RawTraceEventPayload::RolloutStarted { trace_id, root_thread_id } => {
            self.rollout.trace_id = trace_id;
            self.rollout.root_thread_id = root_thread_id;
        }
        RawTraceEventPayload::InferenceStarted { inference_call_id, thread_id,
            codex_turn_id, model, provider_name, request_payload } => {
            self.start_inference_call(
                event.seq,
                event.wall_time_unix_ms,
                StartedInferenceCall {
                    inference_call_id,
                    thread_id,
                    codex_turn_id,
                    model,
                    provider_name,
                    request_payload,
                },
            )?;
        }
        RawTraceEventPayload::ProtocolEventObserved { .. } => {
            // Protocol wrappers are raw debug breadcrumbs.
        }
        // ...
    }
    Ok(())
}

This is why strict typed replay errors are valuable. They tell you a semantic owner is missing, a producer emitted an event that cannot be attached to the known graph, a payload file is unavailable when typed reduction reads it, or the reducer does not yet understand a new event shape. The source does not promise a separate global sequence-gap or payload-existence pass; the discipline is that the reducer should fail at the point where evidence is required rather than quietly inventing an owner.

5. Model-Visible Conversation Is Only One Projection

The reduced graph has conversation_items, but it still does not equate every runtime byte with model-visible history. The public event mapper in codex-rs/core/src/event_mapping.rs filters contextual messages, image wrappers, system messages, and unsupported items while converting ResponseItems into client turn items.

pub fn parse_turn_item(item: &ResponseItem) -> Option<TurnItem> {
    match item {
        ResponseItem::Message { role, content, id, phase, .. } => match role.as_str() {
            "user" => parse_visible_hook_prompt_message(id.as_ref(), content)
                .map(TurnItem::HookPrompt)
                .or_else(|| parse_user_message(content).map(TurnItem::UserMessage)),
            "assistant" => Some(TurnItem::AgentMessage(parse_agent_message(
                id.as_ref(),
                content,
                phase.clone(),
            ))),
            "system" => None,
            _ => None,
        },
        ResponseItem::Reasoning { id, summary, content, .. } => {
            let summary_text = summary
                .iter()
                .map(|entry| match entry {
                    ReasoningItemReasoningSummary::SummaryText { text } => text.clone(),
                })
                .collect();
            let raw_content = content
                .clone()
                .unwrap_or_default()
                .into_iter()
                .map(|entry| match entry {
                    ReasoningItemContent::ReasoningText { text }
                    | ReasoningItemContent::Text { text } => text,
                })
                .collect();
            Some(TurnItem::Reasoning(ReasoningItem {
                id: id.clone(),
                summary_text,
                raw_content,
            }))
        }
        ResponseItem::WebSearchCall { id, action, .. } => { /* ... */ }
        ResponseItem::ImageGenerationCall { id, result, .. } => { /* ... */ }
        _ => None,
    }
}

That function is not the rollout trace reducer, but it illustrates the same source habit: convert broad runtime/API input into an audience-specific view. The model may have seen a tool call. The client may render a card. The trace may retain the raw protocol payload. The analytics reducer may emit a track event. Those projections are related, but each owner decides what is meaningful and safe for its audience.

The most common debugging mistake is to ask the transcript to explain runtime state it was never meant to own. If a terminal operation fails, the transcript may contain only the summarized observation. The trace graph can still point to raw payloads, runtime lifecycle, terminal operation IDs, and sequence order. If a product metric is missing, the analytics reducer may have intentionally dropped it because thread metadata was unavailable. Those are different failure classes.

6. Analytics and OTEL Branch From the Evidence Stream

Figure: Runtime facts branching into an analytics reducer with track events and missing context, and an OTEL path with spans, metrics, and logs. Analytics and OTEL are sibling projections. Analytics reduces product facts into track events; OTEL records spans, counters, histograms, and runtime timing.

6.1 Analytics Facts Are Product Inputs, Not Replay Truth

The analytics input vocabulary lives in codex-rs/analytics/src/facts.rs. It includes app-server JSON-RPC requests/responses, server notifications, and custom facts that do not naturally exist on the protocol surface.

pub(crate) enum AnalyticsFact {
    Initialize { connection_id: u64, params: InitializeParams, /* ... */ },
    ClientRequest { connection_id: u64, request_id: RequestId, request: Box<ClientRequest> },
    ClientResponse { connection_id: u64, request_id: RequestId, response: Box<ClientResponsePayload> },
    ErrorResponse { connection_id: u64, request_id: RequestId, error_type: Option<AnalyticsJsonRpcError>, /* ... */ },
    ServerRequest { connection_id: u64, request: Box<ServerRequest> },
    ServerResponse { completed_at_ms: u64, response: Box<ServerResponse> },
    Notification(Box<ServerNotification>),
    Custom(CustomAnalyticsFact),
}

pub(crate) enum CustomAnalyticsFact {
    SubAgentThreadStarted(SubAgentThreadStartedInput),
    Compaction(Box<CodexCompactionEvent>),
    GuardianReview(Box<GuardianReviewEventParams>),
    TurnResolvedConfig(Box<TurnResolvedConfigFact>),
    TurnTokenUsage(Box<TurnTokenUsageFact>),
    SkillInvoked(SkillInvokedInput),
    AppMentioned(AppMentionedInput),
    AppUsed(AppUsedInput),
    HookRun(HookRunInput),
    PluginUsed(PluginUsedInput),
    PluginStateChanged(PluginStateChangedInput),
}

TurnResolvedConfigFact shows why analytics is valuable but not replay truth: it carries model, provider, permission profile, approval policy, sandbox network access, collaboration mode, personality, and other resolved turn settings (facts.rs#L63-L84). Those facts are excellent for aggregate product questions. They should not be used to reconstruct an exact thread history or raw provider payload.

6.2 The Analytics Reducer Can Drop Events

AnalyticsReducer keeps request, turn, connection, thread, and tool-start state (reducer.rs#L114-L121). Its main ingest method dispatches facts into specialized reducers (reducer.rs#L283-L330).

The drop behavior is explicit. For tool item analytics, completion is ignored if the matching start notification is missing (reducer.rs#L758-L791):

let key = ToolItemKey {
    thread_id: notification.thread_id.clone(),
    turn_id: notification.turn_id.clone(),
    item_id: item_id.to_string(),
};
let Some(started_at_ms) = self.tool_items_started_at_ms.remove(&key) else {
    tracing::warn!(
        thread_id = %notification.thread_id,
        turn_id = %notification.turn_id,
        item_id,
        "dropping tool item analytics event: missing item started notification"
    );
    return;
};

The context helpers do the same for missing thread connection, connection state, or thread metadata (reducer.rs#L1071-L1131). That is correct for analytics: better to drop an aggregate event with a warning than fabricate missing context. It would be wrong for a replay reducer to use the same policy where evidence consistency is required.
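The drop-on-missing-start policy can be reduced to a small, self-contained sketch. `ItemKey` and `ToolTimings` are hypothetical names; the real reducer logs a warning through `tracing` where this sketch only increments a counter.

```rust
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone)]
pub struct ItemKey {
    pub thread_id: String,
    pub turn_id: String,
    pub item_id: String,
}

#[derive(Default)]
pub struct ToolTimings {
    started_at_ms: HashMap<ItemKey, u64>,
    /// Counts completions dropped for lack of a matching start.
    pub dropped: u32,
}

impl ToolTimings {
    pub fn on_started(&mut self, key: ItemKey, at_ms: u64) {
        self.started_at_ms.insert(key, at_ms);
    }

    /// Returns the duration when the start was seen; otherwise drops the
    /// completion rather than guessing a start time.
    pub fn on_completed(&mut self, key: &ItemKey, at_ms: u64) -> Option<u64> {
        let Some(started) = self.started_at_ms.remove(key) else {
            self.dropped += 1;
            return None;
        };
        Some(at_ms.saturating_sub(started))
    }
}
```

Returning `None` is the right outcome for an aggregate metric. The same situation in a replay reducer would call for an error, because there the missing start is itself the finding.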

6.3 OTEL Measures Runtime Operations

OTEL has another contract. In session_telemetry.rs, SessionTelemetryMetadata stores conversation ID, auth mode/env metadata, account hints, originator, session source, model, app version, and terminal type. The telemetry object holds optional metrics and knows whether metadata tags should be used.

The Responses event recorder writes span fields for event kind, function-call tool names, and token usage (session_telemetry.rs#L292-L329):

pub fn record_responses(&self, handle_responses_span: &Span, event: &ResponseEvent) {
    handle_responses_span.record("otel.name", SessionTelemetry::responses_type(event));

    match event {
        ResponseEvent::OutputItemDone(item) => {
            handle_responses_span.record("from", "output_item_done");
            if let ResponseItem::FunctionCall { name, .. } = item {
                handle_responses_span.record("tool_name", name.as_str());
            }
        }
        ResponseEvent::OutputItemAdded(item) => {
            handle_responses_span.record("from", "output_item_added");
            if let ResponseItem::FunctionCall { name, .. } = item {
                handle_responses_span.record("tool_name", name.as_str());
            }
        }
        ResponseEvent::Completed { token_usage: Some(token_usage), .. } => {
            handle_responses_span.record("gen_ai.usage.input_tokens", token_usage.input_tokens);
            handle_responses_span.record(
                "gen_ai.usage.cache_read.input_tokens",
                token_usage.cached_input(),
            );
            handle_responses_span.record("gen_ai.usage.output_tokens", token_usage.output_tokens);
            handle_responses_span.record(
                "codex.usage.reasoning_output_tokens",
                token_usage.reasoning_output_tokens,
            );
            handle_responses_span.record("codex.usage.total_tokens", token_usage.total_tokens);
        }
        _ => {}
    }
}

The API request recorder increments counters, records duration histograms, and logs/traces transport metadata such as status, duration, retry/auth recovery state, endpoint, request ID, Cloudflare ray, and auth error fields (session_telemetry.rs#L407-L468). The SSE logger validates special event shapes and records failures, including idle timeout waiting for SSE (session_telemetry.rs#L689-L736).

This plane is operational evidence. It tells engineers how a model call, WebSocket request, SSE event, or tool result behaved. It is not a durable conversation record, and it is not the analytics source of product truth.

7. Debug Context and Logs Stay Bounded

Figure: Response debug context extracting request id, cf-ray, and auth code while local logs enforce thread and process caps. Support evidence is intentionally narrow: response debug context extracts identifiers and sanitized status, while local logs enforce per-thread and per-process caps.

7.1 Response Debug Context Extracts Identity, Not Bodies

The response debug crate is small and clear. In response-debug-context/src/lib.rs, ResponseDebugContext stores request ID, Cloudflare ray, auth error, and auth error code. It only extracts them from TransportError::Http; other transport errors produce an empty context.

pub struct ResponseDebugContext {
    pub request_id: Option<String>,
    pub cf_ray: Option<String>,
    pub auth_error: Option<String>,
    pub auth_error_code: Option<String>,
}

pub fn extract_response_debug_context(transport: &TransportError) -> ResponseDebugContext {
    let mut context = ResponseDebugContext::default();

    let TransportError::Http { headers, body: _, .. } = transport else {
        return context;
    };

    let extract_header = |name: &str| {
        headers
            .as_ref()
            .and_then(|headers| headers.get(name))
            .and_then(|value| value.to_str().ok())
            .map(str::to_string)
    };

    context.request_id =
        extract_header(REQUEST_ID_HEADER).or_else(|| extract_header(OAI_REQUEST_ID_HEADER));
    context.cf_ray = extract_header(CF_RAY_HEADER);
    context.auth_error = extract_header(AUTH_ERROR_HEADER);
    context.auth_error_code = extract_header(X_ERROR_JSON_HEADER).and_then(|encoded| {
        let decoded = base64::engine::general_purpose::STANDARD
            .decode(encoded)
            .ok()?;
        let parsed = serde_json::from_slice::<serde_json::Value>(&decoded).ok()?;
        parsed
            .get("error")
            .and_then(|error| error.get("code"))
            .and_then(serde_json::Value::as_str)
            .map(str::to_string)
    });

    context
}

The telemetry error helpers are deliberately narrower than raw transport errors. HTTP transport errors become "http <status>" rather than serialized bodies (lib.rs#L63-L87). The test named telemetry_error_messages_omit_http_bodies constructs a body containing "secret token leaked" and asserts that telemetry still reports only "http 401".

That is a useful support boundary. You need enough identity to correlate a failure; you do not need to copy an upstream response body into telemetry.
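The boundary can be sketched without any HTTP library. Everything here is a simplified assumption: `TransportError` and `DebugContext` are hypothetical reductions of the real types, headers are a plain string map, and the header names are illustrative guesses because the values behind constants such as `REQUEST_ID_HEADER` are not shown in this chapter.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, PartialEq)]
pub struct DebugContext {
    pub request_id: Option<String>,
    pub cf_ray: Option<String>,
}

/// Hypothetical, reduced transport error (not the real enum).
pub enum TransportError {
    Http { status: u16, headers: HashMap<String, String>, body: String },
    Timeout,
}

/// Extracts correlation identifiers only; the body is never read.
pub fn extract_context(err: &TransportError) -> DebugContext {
    let TransportError::Http { headers, .. } = err else {
        return DebugContext::default();
    };
    let header = |name: &str| headers.get(name).cloned();
    DebugContext {
        // Assumed header names; the first one wins when both are present.
        request_id: header("x-request-id").or_else(|| header("x-oai-request-id")),
        cf_ray: header("cf-ray"),
    }
}

/// Telemetry-safe message: status only, never the response body.
pub fn telemetry_message(err: &TransportError) -> String {
    match err {
        TransportError::Http { status, .. } => format!("http {status}"),
        TransportError::Timeout => "timeout".to_string(),
    }
}
```

Both functions destructure the error without ever binding `body`, so the omission is structural rather than a matter of remembering to redact.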

7.2 Local Logs Are Reader-Visible but Capped

Local logs live under the state runtime. In state/src/runtime/logs.rs, insert_logs batches rows into a SQLite logs table with timestamp, level, target, feedback body, thread ID, process UUID, module path, file, line, and estimated byte count. The source comment says the runtime keeps about 10 MiB of reader-visible log content per partition, and both query_logs and /feedback read the persisted feedback_log_body.

pub async fn insert_logs(&self, entries: &[LogEntry]) -> anyhow::Result<()> {
    if entries.is_empty() {
        return Ok(());
    }

    let mut tx = self.logs_pool.begin().await?;
    let mut builder = QueryBuilder::<Sqlite>::new(
        "INSERT INTO logs (ts, ts_nanos, level, target, feedback_log_body, \
         thread_id, process_uuid, module_path, file, line, estimated_bytes) ",
    );
    builder.push_values(entries, |mut row, entry| {
        let feedback_log_body = entry.feedback_log_body.as_ref().or(entry.message.as_ref());
        let estimated_bytes = feedback_log_body.map_or(0, String::len) as i64
            + entry.level.len() as i64
            + entry.target.len() as i64
            + entry.module_path.as_ref().map_or(0, String::len) as i64
            + entry.file.as_ref().map_or(0, String::len) as i64;
        // ...
    });
    builder.build().execute(&mut *tx).await?;
    self.prune_logs_after_insert(entries, &mut tx).await?;
    tx.commit().await?;
    Ok(())
}

Pruning runs in the same transaction as insertion (logs.rs#L49-L73). Thread logs are capped per thread_id; threadless logs are capped per process_uuid; rows with no process UUID still form their own threadless partition (logs.rs#L49-L286). This follows the same design principle as the debug-context helper: keep enough local evidence to inspect a problem, but make the boundary explicit.
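
The partition rule is easier to see outside SQL. The following is an in-memory sketch under stated assumptions: `LogRow`, `Partition`, `partition_of`, and `prune` are hypothetical names, the real pruning runs as SQL inside the insert transaction, and the keep-newest policy is an assumption about how a byte cap would normally be enforced.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified log row; the real table has more columns.
#[derive(Debug, Clone)]
pub struct LogRow {
    pub thread_id: Option<String>,
    pub process_uuid: Option<String>,
    pub estimated_bytes: usize,
}

/// Partition key mirroring the rule described above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Partition {
    Thread(String),
    Process(String),
    ThreadlessNoProcess,
}

pub fn partition_of(row: &LogRow) -> Partition {
    match (&row.thread_id, &row.process_uuid) {
        (Some(thread), _) => Partition::Thread(thread.clone()),
        (None, Some(process)) => Partition::Process(process.clone()),
        (None, None) => Partition::ThreadlessNoProcess,
    }
}

/// Keep the newest rows of each partition that fit within `cap_bytes`.
/// `rows` is assumed to be ordered oldest to newest.
pub fn prune(rows: Vec<LogRow>, cap_bytes: usize) -> Vec<LogRow> {
    let mut used: HashMap<Partition, usize> = HashMap::new();
    let mut kept = Vec::new();
    // Walk newest to oldest so recent rows claim the budget first.
    for row in rows.into_iter().rev() {
        let total = used.entry(partition_of(&row)).or_insert(0);
        let next = (*total).saturating_add(row.estimated_bytes);
        if next <= cap_bytes {
            *total = next;
            kept.push(row);
        } else {
            // The partition is full: every older row in it is dropped too.
            *total = usize::MAX;
        }
    }
    kept.reverse(); // restore oldest-to-newest order
    kept
}
```

Because each partition has its own budget, a noisy thread cannot evict another thread's logs, and threadless rows cannot crowd out thread-scoped ones.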

8. How to Debug With the Right Owner

The practical habit is to route a question to the owner that can answer it.

Question	Start here	Reason
Did the thread resume with the wrong history?	rollout reconstruction and `RolloutTrace` graph	replay facts and graph state own durable history
Did a tool run without a model-visible source item?	trace reducer pending queues	runtime starts can precede response payload reduction
Did a provider stream stall?	OTEL SSE/WebSocket metrics and API request spans	transport timing is operational evidence
Did a product metric disappear?	analytics reducer warnings	analytics can drop missing-context events
Did an upstream request fail with auth identity?	response debug context	support context extracts request/ray/auth code
Did local logs grow too large?	state runtime log pruning	local logs are capped by thread/process partitions
Did the user-visible transcript omit details?	raw payload refs and reduced graph	transcript is a projection, not the raw evidence store

This table is the architecture in miniature. Codex does not need one universal observability object. It needs stable capture points, strict reducers where self-consistency matters, and narrow projections where privacy, retention, or aggregation matters more than replay.
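
The routing habit in the table can also be written as an exhaustive match. This is an illustrative sketch only: `Question`, `Owner`, and `first_owner` are hypothetical names that do not exist in the Codex source, and the variants simply restate the table rows.

```rust
/// Hypothetical debugging questions, one per table row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Question {
    WrongResumeHistory,
    ToolWithoutSourceItem,
    ProviderStreamStall,
    MissingProductMetric,
    UpstreamAuthFailure,
    LocalLogGrowth,
    TranscriptOmission,
}

/// Hypothetical evidence owners, named after the "Start here" column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    RolloutReplay,
    TraceReducer,
    OtelTelemetry,
    AnalyticsReducer,
    ResponseDebugContext,
    StateRuntimeLogs,
    RawPayloadsAndGraph,
}

/// Route a question to the owner that holds the relevant evidence.
pub fn first_owner(question: Question) -> Owner {
    // No wildcard arm: a new question must be assigned an owner explicitly.
    match question {
        Question::WrongResumeHistory => Owner::RolloutReplay,
        Question::ToolWithoutSourceItem => Owner::TraceReducer,
        Question::ProviderStreamStall => Owner::OtelTelemetry,
        Question::MissingProductMetric => Owner::AnalyticsReducer,
        Question::UpstreamAuthFailure => Owner::ResponseDebugContext,
        Question::LocalLogGrowth => Owner::StateRuntimeLogs,
        Question::TranscriptOmission => Owner::RawPayloadsAndGraph,
    }
}
```

The absence of a catch-all arm is the point of the sketch: there is no default owner, so no question silently falls back to the transcript.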

Apply This

Capture ordered runtime facts before deriving transcripts, dashboards, or aggregate metrics.
Keep durable replay persistence separate from opt-in diagnostic trace bundles, especially when raw payloads can contain prompts, responses, paths, terminal output, or tool data.
Treat sequence numbers and payload references as causal evidence, and make trace reducers strict when typed replay needs an owner, payload body, or pending edge to materialize.
Let analytics and OTEL be sibling projections: analytics for product facts, OTEL for runtime operation and transport behavior.
Bound support/debug surfaces. Extract request identity and sanitized status, and cap local logs by thread or process instead of storing unbounded bodies.
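
The first and third points can be sketched together: capture ordered facts, then let a strict reducer derive views from them. The block below is a toy model, not the Codex `TraceReducer`. `Fact`, `Event`, `Reduced`, `ReduceError`, and `reduce` are hypothetical names, and the real reducer tracks far more state, including pending queues that tolerate runtime starts arriving before their model-visible items.

```rust
use std::collections::HashSet;

/// Hypothetical runtime facts, recorded before any view is derived.
#[derive(Debug, Clone, PartialEq)]
pub enum Fact {
    UserMessage(String),
    ToolStarted { call_id: String },
    ToolEnded { call_id: String },
}

/// A fact plus the sequence number that fixes its position in the run.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub seq: u64,
    pub fact: Fact,
}

#[derive(Debug, PartialEq)]
pub enum ReduceError {
    OutOfOrder { seq: u64 },
    EndWithoutStart { call_id: String },
}

/// Two projections and the pending edges, all built from the same facts.
#[derive(Debug, Default)]
pub struct Reduced {
    pub transcript: Vec<String>,
    pub completed_tools: usize,
    pub pending_tools: HashSet<String>,
}

/// Replay events strictly: reject anything that cannot be explained.
pub fn reduce(events: &[Event]) -> Result<Reduced, ReduceError> {
    let mut out = Reduced::default();
    let mut last_seq = None;
    for event in events {
        // Sequence numbers must strictly increase to count as causal order.
        if let Some(prev) = last_seq {
            if event.seq <= prev {
                return Err(ReduceError::OutOfOrder { seq: event.seq });
            }
        }
        last_seq = Some(event.seq);
        match &event.fact {
            Fact::UserMessage(text) => out.transcript.push(text.clone()),
            Fact::ToolStarted { call_id } => {
                out.pending_tools.insert(call_id.clone());
            }
            Fact::ToolEnded { call_id } => {
                // An end needs a pending start; otherwise replay fails loudly.
                if !out.pending_tools.remove(call_id) {
                    return Err(ReduceError::EndWithoutStart { call_id: call_id.clone() });
                }
                out.completed_tools += 1;
            }
        }
    }
    Ok(out)
}
```

In this sketch the transcript never sees tool events and the tool counter never sees message text, yet both come from one ordered list. A tool end without a matching start is an error rather than a silently dropped row, which is the strictness the third point asks for.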

Closing

Part II has built the runtime core: durable threads, live sessions, the turn loop, provider streams, backend boundaries, and observation planes. The common pattern is now visible. Codex first records facts with explicit owners; then it builds audience-specific views from those facts. That is why one run can be resumable, debuggable, measurable, supportable, and still bounded.

Part III moves from evidence to side effects: how Codex exposes tools, executes commands, applies patches, requests approval, and keeps risky work inside explicit authority boundaries.

Source Map

Concept	Source anchor
Trace bundle layout	`bundle.rs`
Hot-path trace writer	`TraceWriter`
Payload-before-event write rule	`write_json_payload`
Raw event envelope	`RawTraceEvent`
Raw payload variants	`RawTraceEventPayload`
Protocol-to-trace mapping rationale	`protocol_event.rs`
Tool runtime payload capture	`ToolRuntimePayload`
Reduced graph model	`RolloutTrace`
Trace session model	`AgentThread`, `ExecutionWindow`, `CodexTurn`
Deterministic replay	`replay_bundle`
Reducer pending queues	`TraceReducer`
Reducer event application	`apply_event`
Client turn item projection	`parse_turn_item`
Analytics fact vocabulary	`AnalyticsFact`
Turn resolved config facts	`TurnResolvedConfigFact`
Analytics reducer state	`AnalyticsReducer`
Analytics missing-context drops	`thread_context_or_warn`
OTEL session metadata	`SessionTelemetryMetadata`
Response event telemetry	`record_responses`
API request telemetry	`record_api_request`
SSE event telemetry	`log_sse_event`
Response debug context	`extract_response_debug_context`
HTTP body omission test	`telemetry_error_messages_omit_http_bodies`
Local log insertion and pruning	`state/src/runtime/logs.rs`
Thread and process log caps	`prune_logs_after_insert`