Algorithm from codetop

LLM Inference from Scratch: 065. Array-backed KV metadata (_block_infos / _block_refcounts)

1 minute read

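KVBlockManager used to keep its block metadata in dicts mapping global_id -> info/refcount, and the decode hot path looks these tables up frequently. This post replaces both tables with fixed-length lists indexed directly by global_id, which reduces Python dict overhead.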
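As a rough illustration of the idea, here is a minimal sketch of list-backed metadata. The class name `BlockMetadataTable`, its methods, and the choice to store only a block length as the "info" are assumptions made for this example; only the field names `_block_infos` and `_block_refcounts` come from the post title.

```python
from typing import List, Optional


class BlockMetadataTable:
    """Hypothetical stand-in for the metadata part of a KV block manager."""

    def __init__(self, num_blocks: int) -> None:
        # One slot per global_id. None marks a block that is not allocated.
        # Assumption: the "info" here is just the block's current length.
        self._block_infos: List[Optional[int]] = [None] * num_blocks
        self._block_refcounts: List[int] = [0] * num_blocks

    def allocate(self, global_id: int) -> None:
        # Direct index instead of a dict insert.
        self._block_infos[global_id] = 0
        self._block_refcounts[global_id] = 1

    def incref(self, global_id: int) -> None:
        self._block_refcounts[global_id] += 1

    def decref(self, global_id: int) -> bool:
        """Drop one reference; return True if the block became free."""
        self._block_refcounts[global_id] -= 1
        if self._block_refcounts[global_id] == 0:
            self._block_infos[global_id] = None
            return True
        return False
```

The trade-off in this sketch is that memory is reserved for every possible global_id up front, which is usually acceptable because the number of KV blocks is fixed when the pool is created.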

LLM Inference from Scratch: 064. Making KVBlockInfo mutable (cutting per-token KV metadata overhead)

1 minute read

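On the KV append hot path, every layer updates a block length once per token. The previous NamedTuple forced the code to create a new object each time and write it back to the dict; this version switches to a slots dataclass whose length is incremented in place, reducing Python allocations and repeated lookups.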

LLM Inference from Scratch: 063. Triton batched KV COW clone (replacing index_select/index_copy)

2 minute read

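After a prefix cache hit, multiple sessions share the same set of KV blocks. If the last block is not yet full at the first decode write, copy-on-write (COW) is triggered: the block is cloned first, and the token is then appended. The original implementation moved whole KV blocks with index_select/index_copy; this version adds...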
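The clone-then-append step can be sketched in plain Python, without Triton or tensors. Everything here (`TinyKVPool`, `BLOCK_SIZE`, the list-of-lists storage) is a hypothetical toy model of the COW logic only; it does not reproduce the post's batched kernel, and it does not handle a block that is already full.

```python
from typing import Any, List

BLOCK_SIZE = 4  # hypothetical number of token slots per block


class TinyKVPool:
    """Toy block pool that illustrates copy-on-write on append."""

    def __init__(self, num_blocks: int) -> None:
        self.data: List[List[Any]] = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.lengths: List[int] = [0] * num_blocks
        self.refcounts: List[int] = [0] * num_blocks
        # Free list ordered so that pop() hands out block 0 first.
        self.free: List[int] = list(range(num_blocks - 1, -1, -1))

    def alloc(self) -> int:
        gid = self.free.pop()
        self.lengths[gid] = 0
        self.refcounts[gid] = 1
        return gid

    def share(self, gid: int) -> int:
        # A second session starts referencing the same block.
        self.refcounts[gid] += 1
        return gid

    def append(self, gid: int, token_kv: Any) -> int:
        """Append one token; clone first if the block is shared.

        Returns the id of the block that was actually written.
        """
        if self.refcounts[gid] > 1:
            new_gid = self.alloc()
            n = self.lengths[gid]
            # Copy only the filled prefix of the shared block.
            self.data[new_gid][:n] = self.data[gid][:n]
            self.lengths[new_gid] = n
            self.refcounts[gid] -= 1
            gid = new_gid
        self.data[gid][self.lengths[gid]] = token_kv
        self.lengths[gid] += 1
        return gid
```

In this sketch the writer that triggers COW gets a private copy, while the other session keeps the original block untouched and can later append to it without cloning once it is the sole owner.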

LLM Inference from Scratch: 062. Triton KV append identity-pos (full batch, but pos differs per request)

2 minute read

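The previous full-batch KV-append only covered the steady state in which pos is constant. As soon as prompt lengths differ, however, the decode pos diverges per request, and index tensors such as batch_idx/pos still have to be built. This version adds a Triton kernel for the identity batch: ...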
