
    Algorithm from codetop

    less than 1 minute read


    Tags: codetop

    Categories: coding

    Updated: July 20, 2022


    You may also enjoy

    LLM Inference from Scratch: 077. roseinfer vs vLLM vs SGLang vs TensorRT-LLM (baseline)

    7 minute read

    Run one baseline round with the repo's built-in online/offline serving benchmarks, aligning parameters and environment, and compare roseinfer against vLLM / SGLang / TensorRT-LLM on online latency (P50/P90/P99) and offline throughput.
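
    The P50/P90/P99 figures that post compares reduce to percentiles over raw per-request latencies. A minimal sketch with the standard library; the `latencies_ms` samples are made up, and the series uses the repo's own benchmark scripts for the real numbers:

    ```python
    # Minimal sketch: reduce raw per-request online latencies to P50/P90/P99.
    import statistics

    latencies_ms = [112.4, 98.7, 130.2, 101.5, 742.9, 105.0, 99.3, 118.8]

    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    q = statistics.quantiles(latencies_ms, n=100)
    p50, p90, p99 = q[49], q[89], q[98]
    print(f"P50={p50:.1f} ms  P90={p90:.1f} ms  P99={p99:.1f} ms")
    ```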

    LLM Inference from Scratch: 076. Still Chasing Performance: Where Exactly Is the Gap to SGLang / TensorRT-LLM?

    10 minute read

    With the 074/075 multiprocess serving stabilized, we keep digging into the online/offline performance gap: roseinfer's decode (TPOT/ITL) can already beat vLLM, but its TTFT still clearly trails SGLang/TRT-LLM. This post first breaks the symptoms down, then runs two incremental experiments: mpasync...
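
    For reference, the metrics named in that summary are simple functions of per-token timestamps. A minimal sketch, assuming we recorded each request's send time and the arrival time of every streamed token; `request_metrics` is a hypothetical helper, not code from the series:

    ```python
    # Minimal sketch of the serving metrics named above, for one streamed request:
    #   TTFT = time from sending the request to the first token (prefill-bound)
    #   ITL  = gaps between consecutive output tokens
    #   TPOT = mean ITL, i.e. decode time per output token after the first
    def request_metrics(send_ts: float, token_ts: list[float]) -> dict:
        ttft = token_ts[0] - send_ts
        itl = [b - a for a, b in zip(token_ts, token_ts[1:])]
        tpot = sum(itl) / len(itl) if itl else 0.0
        return {"ttft_s": ttft, "tpot_s": tpot, "itl_s": itl}

    # Example: tokens streaming in every ~110 ms after an 0.8 s prefill.
    print(request_metrics(0.0, [0.80, 0.91, 1.02, 1.13]))
    ```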

    LLM Inference from Scratch: 075. Online P99 Tail Latency: From a 700ms Tail Down to vLLM Level

    4 minute read

    After 071–074 tamed the engineering tax of multiprocess serving, the online-side P99 still had a badly blown-up tail. Pulling the blame apart along traces/metrics, the culprit turned out to be mundane: Python GC jitter. Following vLLM/SGLang's approach, we add a GC freeze, then take the OpenAI SSE ...
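
    The GC-freeze trick mentioned there relies on the standard-library `gc` module. A minimal sketch, assuming a warmup phase that exercises the steady-state allocation paths; the `engine` calls are hypothetical placeholders, while `gc.freeze()` itself is the technique vLLM/SGLang apply in their engines:

    ```python
    # Minimal sketch of the GC-freeze idea: after warmup, move every object
    # that is still alive into the permanent generation so the cyclic
    # collector stops re-scanning them and stops injecting pauses into the
    # serving loop.
    import gc

    def warmup_and_freeze(engine) -> None:
        engine.load_model()       # placeholder: allocate weights, KV caches, ...
        engine.run_dummy_batch()  # placeholder: exercise steady-state code paths
        gc.collect()              # sweep the garbage produced during warmup
        gc.freeze()               # exempt all surviving objects from future scans
        # Optionally make the remaining collections rarer as well.
        gc.set_threshold(50_000, 10, 10)
    ```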

    LLM Inference from Scratch: 074. Multiprocess Serving Extreme Optimization: Ablation / Stability / Topology-Aware Core Pinning / Async Streaming / Profiling

    10 minute read

    Keep squeezing 071's API/engine multiprocess split: thread cap, topology-aware affinity, cmd budget, pipe-bytes IPC, flat events, fast offline counting, async streaming... every tweak has a switch and is enabled by default only where it shows a positive gain; then a set of onlin...
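
    Topology-aware affinity of the kind listed there can be expressed with `os.sched_setaffinity` on Linux. A minimal sketch; the core sets are placeholders, and real choices would come from the machine's topology (lscpu / hwloc):

    ```python
    # Minimal sketch of topology-aware pinning on Linux: give the API process
    # and the engine process disjoint core sets so they stop contending,
    # e.g. keeping both sets on one NUMA node.
    import os

    API_CORES = {0, 1}           # placeholder cores for the API/frontend process
    ENGINE_CORES = {2, 3, 4, 5}  # placeholder cores for the engine process

    def pin_current_process(cores: set[int]) -> None:
        os.sched_setaffinity(0, cores)  # pid 0 means "the calling process"

    # e.g. called at the top of the engine worker's entry point:
    pin_current_process(ENGINE_CORES)
    ```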
