Wine & Chord

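LLM Inference from Scratch: 058. Prefix Cache Logits Resident on Device (fewer CPU/GPU copies on a hit)

1 min read

On a prefix cache hit we were still copying last_logits from CPU back to GPU; this version stores each entry's last_logits as a small clone that lives on the device, so a hit becomes truly zero-copy.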

LLM Inference from Scratch: 057. Batched KV COW (append_token_batch fast path for shared blocks)

3 min read

Prefix cache reuse leaves the last KV block used by decode with refcount>1; previously append_token_batch fell back to a per-request append_token + copy-on-write, with noticeable CPU and GPU overhead. In this version the COW clone...

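LLM Inference from Scratch: 056. Prefix Cache Radix Tree (faster longest-prefix lookup)

3 min read

055 added longest-prefix reuse, but the longest-prefix query was still an O(N) scan; this version replaces the scan with a token trie, bringing longest-prefix lookup on a cache miss down from milliseconds to microseconds and reducing scheduler CPU overhead.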
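
To make the token trie idea concrete, here is a minimal sketch of a longest-prefix lookup over token ids. It is not the code from the post: `TrieNode`, `PrefixTrie`, and the string `entry` payload are hypothetical names chosen for illustration, and a real cache would presumably store KV block handles instead of strings.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


class TrieNode:
    """One node per token id; `entry` is set only where a cached prompt ends."""

    __slots__ = ("children", "entry")

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.entry: Optional[Any] = None


class PrefixTrie:
    """Hypothetical token trie mapping cached prompts to cache entries."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, token_ids: Sequence[int], entry: Any) -> None:
        node = self.root
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                child = TrieNode()
                node.children[tok] = child
            node = child
        node.entry = entry

    def longest_prefix(self, token_ids: Sequence[int]) -> Tuple[int, Optional[Any]]:
        """Return (matched_length, entry) for the longest cached prompt that is
        a prefix of token_ids, or (0, None) if there is none."""
        node = self.root
        best_len, best_entry = 0, None
        for i, tok in enumerate(token_ids):
            node = node.children.get(tok)
            if node is None:
                break
            if node.entry is not None:
                best_len, best_entry = i + 1, node.entry
        return best_len, best_entry


trie = PrefixTrie()
trie.insert((1, 2), "entry_a")
trie.insert((1, 2, 3, 4), "entry_b")
hit = trie.longest_prefix([1, 2, 3, 9])
```

The walk touches at most one node per token of the new prompt, so its cost depends on the prompt length rather than on the number of cached entries, which is where the saving over a linear scan would come from.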

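LLM Inference from Scratch: 055. Prefix Cache Longest Prefix Reuse (longest-prefix KV reuse)

4 min read

The prefix cache previously supported only exact hits; this version adds longest-prefix reuse: when a cached prompt is a prefix of the new prompt, it attaches the cached KV blocks directly and then fills in the suffix with decode(T=1) teacher-forcing. Along the way, the paged-attn Triton ...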

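LLM Inference from Scratch: 054. Prefix Cache Token Key (prompt string -> token ids)

3 min read

The prefix cache previously used the prompt string as its key; in the pretok scenario, prompts whose text differs but whose token ids are identical cause a cache miss. The key now prefers the prompt_token_ids tuple, and a benchmark knob was added to reproduce and quantify the gain.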
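
As a rough illustration of the key change, the sketch below prefers a token-id tuple and falls back to the prompt text. The helper name `make_cache_key` and the tagged-tuple layout are assumptions made for this example, not the post's actual implementation; only the `prompt_token_ids` name comes from the summary above.

```python
from typing import Hashable, Optional, Sequence


def make_cache_key(prompt: str,
                   prompt_token_ids: Optional[Sequence[int]] = None) -> Hashable:
    """Hypothetical helper: key on token ids when available, else on text."""
    if prompt_token_ids is not None:
        # Tuples are hashable, so they can be used directly as dict keys.
        return ("ids", tuple(prompt_token_ids))
    return ("text", prompt)


cache = {}

# Two pre-tokenized requests whose text differs but whose token ids match.
key_a = make_cache_key("prompt text A", [101, 7, 42])
key_b = make_cache_key("prompt text B", [101, 7, 42])

cache[key_a] = "cached entry"
hit = cache.get(key_b)  # found, because both keys are built from the same ids
```

Tagging the key with "ids" or "text" keeps the two kinds of key from colliding in the same dictionary; with string keys alone, the second request here would have been a miss.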
