Wine & Chord

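Implementing LLM Inference from Scratch: 063. Triton batched KV COW clone (replacing index_select/index_copy)

2 min read

After a prefix cache hit, multiple sessions share the same set of KV blocks. If the last block is not yet full when the first decode write arrives, copy-on-write (COW) is triggered: the block is cloned first, then the token is appended. The original implementation moved whole KV blocks with index_select/index_copy; this version adds...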
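
The clone-then-append order described in this excerpt can be illustrated with a minimal pure-Python sketch. It is not the post's Triton kernel: `Block`, `BLOCK_SIZE`, `ref_count`, and `append_with_cow` are hypothetical names introduced here, and plain lists stand in for KV tensors.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical block size, kept small for illustration


@dataclass
class Block:
    tokens: list = field(default_factory=list)  # stand-in for the KV slots of one block
    ref_count: int = 1                          # number of sessions sharing this block


def append_with_cow(session_blocks, token):
    """Append a token to the session's last block, cloning it first if shared."""
    last = session_blocks[-1]
    if len(last.tokens) >= BLOCK_SIZE:
        # A full last block needs a fresh block (rollover), not COW.
        raise ValueError("last block is full; rollover is out of scope here")
    if last.ref_count > 1:
        # Shared and partially filled: clone before writing so the other
        # sessions keep seeing the original contents.
        last.ref_count -= 1
        last = Block(tokens=list(last.tokens), ref_count=1)
        session_blocks[-1] = last
    last.tokens.append(token)
    return last


# Two sessions share one partially filled block after a prefix cache hit.
shared = Block(tokens=[1, 2], ref_count=2)
session_a, session_b = [shared], [shared]
append_with_cow(session_a, 9)  # clones, then appends to the private copy
append_with_cow(session_b, 7)  # now the sole owner, so it appends in place
```

In this sketch only the first writer pays for the clone; once the reference count drops back to one, the remaining session writes in place.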

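Implementing LLM Inference from Scratch: 062. Triton KV append identity-pos (full batch, but pos differs per request)

1 min read

The previous full-batch KV append only covered the steady state where pos is constant. As soon as prompt lengths differ, however, the decode pos diverges per request, and index tensors such as batch_idx/pos still have to be built. This version adds a Triton kernel for the identity batch: ...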

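Implementing LLM Inference from Scratch: 061. Triton KV append full-batch fast path (fewer index tensor allocations)

2 min read

In steady-state batch decode, fast_batch_idx is often [0..B-1] and pos is constant across the whole batch within a step; yet the Triton KV append still builds small tensors such as batch_idx/pos on every step. This PR adds a full-batch Triton ke...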

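Implementing LLM Inference from Scratch: 060. KV append identity fast path (one fewer index_select)

1 min read

In append_token_batch, fast_batch_idx is very often just [0..B-1]. Previously, every layer ran index_select to copy key_new/value_new again and also built pos_t. This small change reuses key_new/value_new directly when the batch is an identity batch, and...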
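
As a rough illustration of the identity-batch check this excerpt describes, the sketch below skips the gather when the index is exactly [0..B-1]. It uses plain Python lists rather than tensors, and `is_identity_batch` and `gather_new_keys` are hypothetical helper names, not functions from the post.

```python
def is_identity_batch(batch_idx, batch_size):
    """True when batch_idx is exactly [0, 1, ..., batch_size - 1]."""
    return len(batch_idx) == batch_size and all(
        i == b for i, b in enumerate(batch_idx)
    )


def gather_new_keys(key_new, batch_idx):
    """Return the rows of key_new selected by batch_idx.

    For an identity batch the input is returned as-is (no copy), which
    mirrors skipping the index_select; otherwise a gathered copy is built.
    """
    if is_identity_batch(batch_idx, len(key_new)):
        return key_new
    return [key_new[b] for b in batch_idx]


key_new = [[0.1], [0.2], [0.3]]
same = gather_new_keys(key_new, [0, 1, 2])  # identity batch: reuses key_new
picked = gather_new_keys(key_new, [2, 0])   # general case: gathers rows 2 and 0
```

In a tensor setting the identity check itself should be cheap (for example, known from how the batch was built) so that it does not cost more than the copy it avoids; that is an assumption of this sketch, not something the excerpt states.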

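Implementing LLM Inference from Scratch: 059. Batched KV rollover (don't fall back to a for-loop when a block fills up)

1 min read

append_token_batch previously covered only the len<block_size fast path; once the last block was full (len==block_size), it fell back to per-request append_token calls. That fallback produces spikes in ITL. This version folds rollover back into the batch path: first...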
