A high-throughput and memory-efficient inference and serving engine for LLMs
Stars: 75.2k
Forks: 15.1k
Watchers: 75.2k
Open Issues: 4.1k
Overall repository health assessment: no package.json found, which is expected — vLLM is a Python project, not a Node.js one.
Top 10 contributors by commit count: 885, 773, 498, 469, 464, 391, 345, 272, 257, 204 commits
Recent commits:
062f1a2  [Bug] Fix compile error for `swap_blocks_batch` in CUDA 13 (#38915)
81994e1  [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927)
121ea5a  Removed GPU state confirmation and cleanup steps. (#38238)
5f1de2b  [Model Runner V2] Add config validation for not-yet-supported features (#38758)
a5a623d  [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859)
f8c3af2  [vLLM IR] add `import_ir_kernels()` to support OOT platforms (#38807)
50cd567  Fix invalid logprobs with MTP enabled and sync scheduling (#38711)
7b1a742  [Frontend] new online quantization frontend (#38138)
97f92c6  [KVConnector] Skip `register_kv_caches` on profiling (#38558)
46f02e0  [Bugfix] Fix AWQ models batch invariance issues (#38670)
6b48722  [XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342)