
In the previous iteration we added --paged-attn/--cuda-graph/--nvtx to the streaming benchmark, and quickly ran into a result that looks counterintuitive at first glance:

  • With paged-attn + CUDA Graph enabled, ITL/TPOT/throughput improve noticeably
  • TTFT p99, however, gets much worse (routinely 800 ms+)

The system is not actually slower. The first run pays one-off costs for:

  • Triton kernel JIT/compilation (and possibly autotuning)
  • CUDA Graph capture (the first time a given batch size is captured)

and all of that lands in TTFT and throughput. If we use the streaming benchmark for before/after comparisons this way, the conclusions are polluted by cold-start noise.
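This pattern is easy to reproduce outside this project. Below is a minimal sketch in plain PyTorch, with torch.compile standing in for the Triton JIT / graph-capture cost; none of this is roseinfer code, it only illustrates why the first timed call absorbs the one-off cost while a warmed-up call does not:

import time
import torch

def timed(fn, *args):
    # Synchronize before and after so the GPU work is attributed to this call.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, (time.perf_counter() - start) * 1000.0  # ms

# torch.compile JIT-compiles on the first call, much like the first benchmark run
# pays for Triton compilation and CUDA Graph capture.
model = torch.compile(torch.nn.Linear(1024, 1024).cuda())
x = torch.randn(64, 1024, device="cuda")

_, cold_ms = timed(model, x)  # first call: compile + run
_, warm_ms = timed(model, x)  # steady state: run only
print(f"cold={cold_ms:.1f} ms  warm={warm_ms:.1f} ms")

--warmup-runs plays the role of the first call here: it absorbs the one-off cost so the measured runs only ever see steady state.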

The goal of this iteration is simple: add warmup-runs + repeat-runs to benchmark_streaming.py and separate cold start from steady state.

Code changes

roseinfer/benchmark_streaming.py

Two new arguments:

  • --warmup-runs N: run N warmup passes first (no stats reported), to trigger JIT / graph capture
  • --repeat-runs M: then run M measured passes, reporting TTFT/TPOT/ITL/latency p99 for each

The "run the streaming benchmark once" logic is also factored out into run_once(), and main() calls it from two loops: warmup, then measurement.

Core diff:

diff --git a/rosellm/roseinfer/benchmark_streaming.py b/rosellm/roseinfer/benchmark_streaming.py
@@
     parser.add_argument("--num-requests", type=int, default=16)
+    parser.add_argument("--warmup-runs", type=int, default=0)
+    parser.add_argument("--repeat-runs", type=int, default=1)
@@
+def run_once(
+    *,
+    engine: InferenceEngine,
+    prompts: list[str],
+    prompt_lens: list[int],
+    args: argparse.Namespace,
+    record_token_timestamps: bool,
+    print_summary: bool,
+):
+    mgr = SchedulerManager(..., record_token_timestamps=record_token_timestamps)
+    ...
+    if not print_summary:
+        return
+    print("=== streaming benchmark ===")
+    ...
@@
 def main() -> None:
     args = parse_args()
+    if args.warmup_runs < 0:
+        raise ValueError("--warmup-runs must be >= 0")
+    if args.repeat_runs <= 0:
+        raise ValueError("--repeat-runs must be >= 1")
@@
+    for i in range(int(args.warmup_runs)):
+        print(f"=== warmup {i + 1}/{int(args.warmup_runs)} ===")
+        run_once(..., record_token_timestamps=False, print_summary=False)
+
+    for i in range(int(args.repeat_runs)):
+        if int(args.repeat_runs) > 1:
+            print(f"=== run {i + 1}/{int(args.repeat_runs)} ===")
+        run_once(..., record_token_timestamps=True, print_summary=True)
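Filled in a little, the two-phase structure in main() looks roughly like this. This is only a sketch based on the diff above; run_once's body and how the engine and prompts are built are elided there, so setup() below is a hypothetical placeholder:

def main() -> None:
    args = parse_args()
    if args.warmup_runs < 0:
        raise ValueError("--warmup-runs must be >= 0")
    if args.repeat_runs <= 0:
        raise ValueError("--repeat-runs must be >= 1")

    # Hypothetical helper: build the engine and prompt set once, reuse across runs.
    engine, prompts, prompt_lens = setup(args)

    # Warmup passes: trigger Triton JIT / CUDA Graph capture, report nothing.
    for i in range(args.warmup_runs):
        print(f"=== warmup {i + 1}/{args.warmup_runs} ===")
        run_once(engine=engine, prompts=prompts, prompt_lens=prompt_lens,
                 args=args, record_token_timestamps=False, print_summary=False)

    # Measured passes: record per-token timestamps and print one summary per run.
    for i in range(args.repeat_runs):
        if args.repeat_runs > 1:
            print(f"=== run {i + 1}/{args.repeat_runs} ===")
        run_once(engine=engine, prompts=prompts, prompt_lens=prompt_lens,
                 args=args, record_token_timestamps=True, print_summary=True)

Staying in the same process (and, presumably, reusing the same engine) across warmup and measured runs is what keeps the JIT / graph-capture cost out of the measured TTFT.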

Tests

pytest -q
.......................                                                  [100%]
=============================== warnings summary ===============================
../anaconda3/lib/python3.10/site-packages/urllib3/util/ssl_.py:260
  /data/projects/anaconda3/lib/python3.10/site-packages/urllib3/util/ssl_.py:260: DeprecationWarning: ssl.PROTOCOL_TLS is deprecated
    context = SSLContext(ssl_version or PROTOCOL_TLS)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
23 passed, 1 warning in 2.11s

Benchmark (HF GPT-2 / streaming)

Configuration (decode-heavy, forcing decode onto the fast path):

  • B=64: --num-requests 64 --max-batch-size 64
  • T=256: --max-new-tokens 256
  • submit interval = 0: all requests submitted at once

Command:

python -m rosellm.roseinfer.benchmark_streaming \
  --hf-model-id gpt2 --device cuda \
  --prompt "Hello" --unique-prompts \
  --num-requests 64 --max-new-tokens 256 --max-batch-size 64 \
  --submit-interval-ms 0 --no-stop-on-eos --no-prefix-cache \
  --paged-attn --cuda-graph

Before (no warmup)

=== streaming benchmark ===
Model: gpt2
Device: cuda
Paged attention: True
CUDA Graph: True
NVTX: False
Requests: 64
Prompt tokens (total): 256
Completion tokens (total): 16384
Submit wall: 0.084546 s
add_request latency p50/p95/p99: 0.03/0.05/28.68 ms
TTFT p50/p95/p99: 880.48/883.14/883.77 ms
TPOT p50/p95/p99: 5.68/5.68/6.61 ms/token
ITL p50/p95/p99: 4.83/5.81/16.12 ms
Latency p50/p95/p99: 2332.72/2334.55/2335.08 ms
Throughput (completion,total): 6782.33 tokens/s

After (warmup + repeat=3)

python -m rosellm.roseinfer.benchmark_streaming \
  --hf-model-id gpt2 --device cuda \
  --prompt "Hello" --unique-prompts \
  --num-requests 64 --max-new-tokens 256 --max-batch-size 64 \
  --submit-interval-ms 0 --no-stop-on-eos --no-prefix-cache \
  --paged-attn --cuda-graph \
  --warmup-runs 1 --repeat-runs 3
=== warmup 1/1 ===
=== run 1/3 ===
=== streaming benchmark ===
Model: gpt2
Device: cuda
Paged attention: True
CUDA Graph: True
NVTX: False
Requests: 64
Prompt tokens (total): 256
Completion tokens (total): 16384
Submit wall: 0.006119 s
add_request latency p50/p95/p99: 0.04/0.06/0.11 ms
TTFT p50/p95/p99: 31.81/34.16/34.44 ms
TPOT p50/p95/p99: 5.20/5.20/5.23 ms/token
ITL p50/p95/p99: 4.94/5.47/15.84 ms
Latency p50/p95/p99: 1361.67/1362.68/1363.08 ms
Throughput (completion,total): 11991.59 tokens/s
=== run 2/3 ===
=== streaming benchmark ===
Model: gpt2
Device: cuda
Paged attention: True
CUDA Graph: True
NVTX: False
Requests: 64
Prompt tokens (total): 256
Completion tokens (total): 16384
Submit wall: 0.006467 s
add_request latency p50/p95/p99: 0.04/0.07/0.15 ms
TTFT p50/p95/p99: 31.40/34.19/34.44 ms
TPOT p50/p95/p99: 5.12/5.12/5.15 ms/token
ITL p50/p95/p99: 4.85/5.54/15.65 ms
Latency p50/p95/p99: 1337.98/1340.40/1341.19 ms
Throughput (completion,total): 12196.65 tokens/s
=== run 3/3 ===
=== streaming benchmark ===
Model: gpt2
Device: cuda
Paged attention: True
CUDA Graph: True
NVTX: False
Requests: 64
Prompt tokens (total): 256
Completion tokens (total): 16384
Submit wall: 0.006207 s
add_request latency p50/p95/p99: 0.04/0.08/0.11 ms
TTFT p50/p95/p99: 31.29/33.73/34.06 ms
TPOT p50/p95/p99: 5.30/5.30/5.33 ms/token
ITL p50/p95/p99: 5.05/5.83/15.36 ms
Latency p50/p95/p99: 1385.70/1386.90/1387.26 ms
Throughput (completion,total): 11783.53 tokens/s

Conclusion

This iteration does not change any inference logic; it only splits the one-off cold-start cost out of the benchmark, giving a more trustworthy basis for comparison:

  • TTFT p99: 883.77 ms -> ~34 ms (about 26x; essentially the Triton JIT / graph capture cost is moved out of TTFT)
  • Throughput: 6782 -> ~12000 tokens/s (about 1.7x; the time swallowed by cold start no longer counts)
  • The three repeat=3 runs are very close to each other: this is steady-state data, suitable for quantitative before/after comparisons of optimizations (see the aggregation sketch below)
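
For before/after comparisons, aggregating the repeated runs is more robust than eyeballing them. A small sketch, with the p99 TTFT and throughput values copied by hand from the three measured runs above:

from statistics import mean, median

# Per-run values from the three measured runs above (run 1..3).
ttft_p99_ms = [34.44, 34.44, 34.06]
throughput_tok_s = [11991.59, 12196.65, 11783.53]

print(f"TTFT p99:   median={median(ttft_p99_ms):.2f} ms  mean={mean(ttft_p99_ms):.2f} ms")
print(f"throughput: median={median(throughput_tok_s):.1f} tok/s  mean={mean(throughput_tok_s):.1f} tok/s")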

From now on, whenever I compare the gains from paged-attn / CUDA Graph / Triton kernels, I will default to:

--warmup-runs 1 --repeat-runs 3