ContextManager: History as Prompt-Ready State

Reading Contract: Use this chapter to follow the history ledger. Track how items are recorded, normalized, trimmed, and prepared for a provider without losing the protocol shape needed for replay.

The ledger records structured evidence first, bounds large outputs, normalizes provider invariants, and then derives a provider-safe prompt view. Rollback and compaction can clear stale baselines without corrupting the durable record.

Chapter 2 described the turn envelope. The next problem is the durable side: what happens to all the model-visible items accumulated over a thread? Codex does not keep history as an opaque transcript. It uses ContextManager as a ledger of response items ordered oldest to newest, with token information, history versioning, and a reference context baseline for future settings diffs.

The job of the history ledger is deceptively hard. It must record only the items that belong in API history, apply truncation policy to large outputs, preserve function-call invariants, strip unsupported modalities before sampling, estimate token usage, and survive replacement during compaction or rollback.

By the end of this chapter, you should understand the ledger as the bridge between durable thread evidence and prompt-ready model input.

This chapter is grounded in ContextManager fields, record_items, for_prompt, paired removal, and rollback-aware turn dropping.

Ledger Shape

The core shape is small:

Field	Purpose
`items`	Oldest-first response items that are candidates for model-visible history.
`history_version`	A monotonic marker bumped when history is rewritten.
`token_info`	Latest token usage facts or estimates.
`reference_context_item`	Baseline turn context used when injecting settings diffs.

The surprising field is the reference context item. It means history is not only past conversation; it is also the baseline for deciding which runtime facts must be reintroduced on the next turn. When compaction or rollback invalidates that baseline, Codex clears it and falls back to full reinjection.

That four-field shape is compact enough to remember, but the fourth field changes the meaning of the whole object. Three of the four fields look ordinary. The reference baseline is the one that quietly turns the ledger into more than a vector. It is the link between Chapter 2’s envelope and the diff fragments of Chapter 4.

The ledger's lifecycle is best read as a conceptual state machine. The code mostly uses cloning and mutation, but the lifecycle is real: record raw-enough items, normalize for the target model, then record new evidence.
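The four fields and the rewrite step can be sketched as a small data structure. This is a minimal illustration, not the real ContextManager: the `Ledger` class, its method names, and the `keep_baseline` flag are assumptions made for the example, and items are modelled as plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Ledger:
    """Illustrative stand-in for the four-field shape (hypothetical type)."""
    items: list = field(default_factory=list)        # oldest first
    history_version: int = 0                         # bumped when history is rewritten
    token_info: Optional[dict] = None                # latest usage facts or estimates
    reference_context_item: Optional[Any] = None     # baseline for settings diffs

    def record(self, item: dict) -> None:
        # Appending new evidence extends history; it is not a rewrite,
        # so the version and the baseline are left alone.
        self.items.append(item)

    def replace_history(self, new_items: list, keep_baseline: bool = False) -> None:
        # Compaction or rollback rewrites history. The version is bumped,
        # and unless the caller vouches for the baseline it is cleared so
        # the next turn falls back to full reinjection.
        self.items = list(new_items)
        self.history_version += 1
        if not keep_baseline:
            self.reference_context_item = None
```

Defaulting `keep_baseline` to `False` mirrors the conservative stance described above: a rewrite has to opt in to keeping the baseline rather than opt out.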

Recording Is Filtered

record_items accepts ordered items and records only API-message items. That filter is essential. The rollout can contain events, UI facts, token counts, and context checkpoints. Not all of those belong in the next model request. The ledger stores the subset that should participate in prompt history.

Before pushing an item, the manager processes it under the active truncation policy. Tool outputs are the classic danger: they can be huge, binary-ish, or image-bearing. Codex lets truncation helpers turn them into bounded history instead of allowing one command to consume the whole context window.

The pattern looks like this:

# Pseudocode: illustrates filtered ledger recording.
for item in incomingItems:
    if not modelHistoryItem(item):
        continue
    bounded = applyOutputPolicy(item, activeTruncation)
    ledger.append(bounded)

This is the right abstraction because it records policy-shaped evidence, not raw side effects. The raw side effect may still exist in the rollout or UI, but the model-visible ledger stays bounded.

The truncation policy itself is a small state machine driven by the active output limits. That single policy entry point is the reason chapters later in the book can talk about “tool output as a budgeted plane”: tool output enters history through one governed path, not through ad-hoc truncation in every call site.
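To make the single governed path concrete, here is a runnable sketch of recording with one truncation entry point. The item type names, the `truncate_middle` helper, and the head-and-tail strategy are all assumptions for illustration. The chapter does not specify how Codex actually shortens outputs.

```python
# Hypothetical set of item types that belong in API history.
API_ITEM_TYPES = {"message", "function_call", "function_call_output"}


def truncate_middle(text: str, max_bytes: int) -> str:
    """Keep the head and tail of an oversized output and elide the middle.

    The marker line is added on top of the kept bytes, so the result can
    exceed max_bytes by the length of the marker.
    """
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    half = max_bytes // 2
    # errors="ignore" drops any multi-byte character split by the cut.
    head = raw[:half].decode("utf-8", errors="ignore")
    tail = raw[len(raw) - half:].decode("utf-8", errors="ignore")
    omitted = len(raw) - 2 * half
    return f"{head}\n[... {omitted} bytes omitted ...]\n{tail}"


def record_items(ledger: list, incoming: list, max_output_bytes: int) -> None:
    """Append only API-history items, bounding tool outputs on the way in."""
    for item in incoming:
        if item.get("type") not in API_ITEM_TYPES:
            continue  # events, UI facts, and token counts stay out
        if item["type"] == "function_call_output":
            # Copy rather than mutate, so the caller's raw item is untouched.
            item = {**item, "output": truncate_middle(item["output"], max_output_bytes)}
        ledger.append(item)
```

Because every tool output passes through `record_items`, changing the budget or the strategy is a one-place edit rather than a search across call sites.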

Normalization Protects Invariants

Function calls and function outputs are paired. Removing one without the other can create a prompt shape the model API rejects or misinterprets. The history manager delegates paired removal to normalization helpers when dropping oldest or newest items. That is why remove_first_item removes a corresponding counterpart when needed.
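Paired removal can be sketched as follows. This is an illustration under stated assumptions, not the actual normalization helper: items are dictionaries, a call and its output share a `call_id`, and the type names are hypothetical stand-ins.

```python
# Hypothetical mapping from each half of a pair to its counterpart type.
PARTNER = {
    "function_call": "function_call_output",
    "function_call_output": "function_call",
}


def remove_first_item(items: list) -> list:
    """Drop the oldest item and, if it is half of a pair, its counterpart.

    Returns a new list; the input is not modified.
    """
    if not items:
        return []
    first, rest = items[0], items[1:]
    partner_type = PARTNER.get(first.get("type"))
    if partner_type is None:
        return rest  # plain message: nothing else to remove
    call_id = first.get("call_id")
    return [
        item for item in rest
        if not (item.get("type") == partner_type and item.get("call_id") == call_id)
    ]
```

Removing the counterpart wherever it sits in the remaining history, rather than only when it is adjacent, keeps an interleaved assistant message from leaving an orphaned output behind.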

The same idea appears before sampling. for_prompt clones the manager, applies normalization, and strips items unsuitable for the active model modalities. If a model does not accept images, image content is removed from messages and tool outputs. The original ledger remains able to hold richer history, while the prompt projection respects the model contract.

The trade-off is visible: Codex chooses safe prompt shape over maximal fidelity for every provider. If an item cannot be represented safely for the model, the projection changes rather than corrupting the ledger.
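The clone-then-project step can be illustrated with a small sketch. It covers only modality stripping and omits call/output normalization. The content-part type names (`input_text`, `input_image`) and the `accepts_images` flag are assumptions chosen for the example.

```python
import copy


def for_prompt(items: list, accepts_images: bool) -> list:
    """Return a prompt view of the ledger; the ledger itself is never mutated."""
    view = copy.deepcopy(items)  # project on a clone, not on durable history
    if accepts_images:
        return view
    for item in view:
        content = item.get("content")
        if isinstance(content, list):
            # Drop image parts the target model cannot accept.
            item["content"] = [
                part for part in content if part.get("type") != "input_image"
            ]
    return view
```

The deep copy is what keeps provider-specific cleanup from leaking back into the record: the same ledger can later be projected for a model that does accept images.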

Token Estimates Are Coarse by Design

The manager estimates tokens from base instructions and item estimates using byte-based heuristics. The source explicitly treats this as a coarse lower bound, not tokenizer-perfect accounting. That is a pragmatic choice. Exact tokenization across providers and modalities would be expensive and brittle. Codex needs a signal good enough for compaction thresholds, UI feedback, and budgeting.

The estimate is best understood as a floor: the real token count can be higher than the estimate, but it is not expected to be lower. Consumers therefore need conservative thresholds, because each one inherits the same direction of error:

Consumer	Decision driven by the estimate	Risk if the estimate is too low
Compaction threshold	Trigger pre-sampling compaction.	Compaction triggers slightly late but never early.
Skill budget	Decide whether to truncate descriptions.	Slightly more material than ideal slips through.
Memory write	Truncate rollout payload before writing.	Memory generation receives a larger payload than intended.
UI display	Show remaining context bar.	UI underreports usage; the user sees a more comfortable margin than reality.

The exact token count arrives when the model response reports usage. Until then, the estimate prevents the runtime from flying blind.
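A byte-based estimate of this kind might look like the sketch below. The four-bytes-per-token ratio, the JSON serialization of items, and the 0.8 compaction threshold are assumptions made for illustration. The chapter states only that the heuristic is byte-based and coarse.

```python
import json

BYTES_PER_TOKEN = 4  # assumed ratio, for illustration only


def estimate_tokens(base_instructions: str, items: list) -> int:
    """Coarse estimate: total UTF-8 bytes divided by an assumed ratio."""
    total_bytes = len(base_instructions.encode("utf-8"))
    for item in items:
        total_bytes += len(json.dumps(item, ensure_ascii=False).encode("utf-8"))
    # Integer division rounds down, so the heuristic leans low.
    return total_bytes // BYTES_PER_TOKEN


def should_compact(estimate: int, context_window: int, threshold: float = 0.8) -> bool:
    """Trigger early enough that a low estimate still leaves headroom."""
    return estimate >= context_window * threshold
```

Setting the threshold below 1.0 is how a consumer compensates for an estimate that may undercount: the gap between the threshold and the full window absorbs the error.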

Rollback and the Reference Baseline

Rollback is where the ledger proves it is more than a vector. Dropping the last N user turns must preserve pre-user material, handle no-op cases, respect assistant inter-agent boundaries, and clear the reference context baseline when the surviving history no longer contains the initial context bundle that established it.

That last behavior is subtle. If Codex kept diffing against a baseline whose source text was removed, future turns could omit important context. Clearing the baseline makes the next regular turn reinject full context instead of trusting a stale diff.

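The choice is easiest to read as a conservative rule. When the surviving history still contains the context bundle that established the baseline, the baseline stays and diffing continues; that is the happy path. When that bundle has been removed, the baseline is cleared, which is what prevents silent context drift. The decision rule is conservative on purpose: when in doubt, reinject.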
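The rollback rule can be sketched under some loudly stated assumptions: user turns are identified by `role == "user"`, context bundles are items with an `id`, and the baseline records the `source_id` of the bundle that established it. None of these field names come from the source, and inter-agent boundaries are left out.

```python
def drop_last_user_turns(items: list, baseline, n: int):
    """Roll back n user turns; return (surviving_items, surviving_baseline)."""
    user_positions = [i for i, item in enumerate(items) if item.get("role") == "user"]
    if n <= 0 or not user_positions:
        return list(items), baseline  # no-op cases
    # Cut at the n-th user message from the end, or at the first one
    # when n exceeds the number of user turns.
    cut = user_positions[-n] if n <= len(user_positions) else user_positions[0]
    kept = items[:cut]  # material before the first dropped user turn survives
    if baseline is not None:
        source_alive = any(item.get("id") == baseline["source_id"] for item in kept)
        if not source_alive:
            baseline = None  # stale baseline: force full reinjection next turn
    return kept, baseline
```

Returning the baseline alongside the items makes the invalidation decision part of the rollback itself, so a caller cannot rewrite history and forget to re-check the baseline.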

Apply This

Prompt Ledger. Store model-visible history as structured items. Filter non-prompt events at insertion time, and keep UI events out of model history.
Normalize on Projection. Repair provider-facing invariants when building the prompt view. Clone before normalization, and keep durable evidence from being rewritten by provider-specific cleanup.
Paired Deletion. Delete tool calls and outputs as a unit. Apply the rule to any request/response protocol, and avoid truncation that leaves orphaned protocol frames.
Baseline Clearing. Invalidate diff baselines when rollback or compaction removes their source. Store explicit baseline metadata, and clear stale context diffs after history rewrites.
Coarse Budget Signal. Use cheap estimates for live decisions and exact counts when available. Choose conservative thresholds, and do not treat estimates as billing-grade truth.