Found 58 repositories (showing 30)
jjiantong
[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
SUSTechBruce
[EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference"
psmarter
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
xuyang-liu16
[ICLR 2026] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
ydyhello
Official implementation of "TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization" (Findings of ACL 2025).
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.
Chelsi-create
This project develops a high-performance KV-cache management framework for multi-document RAG tasks. It focuses on reducing time-per-output-token (TPOT) and improving throughput through adaptive cache scheduling, GPU–CPU offloading, and reuse of cross-document attention states.
naksshhh
Dynamic batching • KV-cache optimization • Token streaming • Observability
pdscomp
🦙 Docker template for running llama.cpp llama-server in router mode with NVIDIA CUDA and AMD Vulkan GPU acceleration. Features TurboQuant KV cache optimization, long context support (up to 256K tokens), and optimized configurations for 24GB+ VRAM cards.
mohamedAtoui
No description available
cs-wangkang
A PyTorch-based vLLM project that includes inference optimization mechanisms such as KV cache and PagedAttention.
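As background for entries like this one, a minimal pure-Python sketch of the block-table idea behind PagedAttention (class and parameter names here are illustrative, not this repo's API; vLLM's real default block size is 16): the KV cache lives in fixed-size physical blocks, and a per-sequence table maps logical token positions to (block, offset) slots, so sequences grow without contiguous preallocation.

```python
BLOCK_SIZE = 4  # tokens per physical block (toy value for illustration)

class PagedKVCache:
    """Toy block-table allocator: logical token positions -> (block, offset) slots."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of free physical blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block is full: grab a new one
            table.append(self.free.pop())
        block, offset = table[-1], n % BLOCK_SIZE
        self.lengths[seq_id] = n + 1
        return block, offset                  # slot where this token's K/V would go

cache = PagedKVCache(num_blocks=8)
slots = [cache.append("seq0") for _ in range(6)]   # 6 tokens -> 2 blocks
```

The point of the indirection is that freeing a finished sequence just returns its blocks to the pool, and no sequence ever needs a contiguous region sized for its maximum length.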
hua-zi
[AAAI2026] AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
Keyvanhardani
Automatic KV-Cache optimization for HuggingFace Transformers. Find the optimal cache strategy, attention backend, and dtype for your LLM inference workload.
kotgire58
Benchmarking framework for evaluating LLM inference performance before and after optimization techniques like KV-cache, speculative decoding, batching, paged attention, and token throughput improvements.
Alperen012
Ultra-Low Bit KV-Cache Compression optimization layer built on top of llama.cpp for LLM inference. Reduces VRAM overhead by ~75-80% using custom CUDA kernels.
mredencom
Go implementation of TurboQuant (arXiv:2504.19874) — data-oblivious 2/3/4-bit vector quantization via random orthogonal rotation and Lloyd-Max optimization. 7x-14x compression with >0.98 cosine similarity. Built for LLM KV Cache compression and vector databases.
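The TurboQuant description above references Lloyd-Max optimization. As an illustration only (not the Go repo's code), here is a minimal 1-D Lloyd-Max quantizer sketch in Python: it alternates nearest-centroid assignment with centroid updates to minimize mean squared quantization error over the sample distribution.

```python
def lloyd_max(samples, levels, iters=20):
    """Toy 1-D Lloyd-Max quantizer: returns `levels` reconstruction points."""
    lo, hi = min(samples), max(samples)
    # Start with centroids spread uniformly over the data range.
    centroids = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            # Assign each sample to its nearest centroid.
            j = min(range(levels), key=lambda i: (x - centroids[i]) ** 2)
            buckets[j].append(x)
        # Move each centroid to the mean of its assigned samples.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

cents = lloyd_max([0, 0, 1, 1, 10, 10, 11, 11], levels=2)
```

With two well-separated clusters, the two reconstruction points converge to the cluster means, which is the MSE-optimal 1-bit codebook for this data.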
rajveer100704
Production-grade Transformer inference engine with Triton-based FlashAttention, static KV cache for O(1) decoding, and async dynamic batching. Achieves 5.7× speedup over PyTorch baseline with real-time TTFT, latency, and VRAM profiling, showcasing ML + systems + kernel optimization.
adityakamat24
A curated workshop of Jupyter notebooks and deep-dive PDFs that break down how large language models work, from KV-cache internals and fine-tuning to optimization strategies and skill paths. Ideal for engineers who want to go beyond APIs and really understand the guts of LLMs.
Srindot
Implementing all KV caching optimizations presented in the paper
RaviTejGuntuku
KV cache management optimization research under Prof. Mootaz Elnozahy
No description available
myProjectsRavi
Type-Aware Quantization Gap (TAQG): Phase-aware KV cache quantization for reasoning LLMs. Shows that uniform quantization is suboptimal; 58% distortion reduction on DeepSeek-R1-Distill-1.5B
Transformer Inference Optimization using KV-Cache
soumyadipsarkar2
No description available
djura2001
Cloned Andrej Karpathy's minGPT repo and implemented the KV cache technique
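For readers unfamiliar with the technique these repos implement, a minimal pure-Python sketch of KV-cached decoding (toy single-head attention; all names are illustrative, not minGPT's code): each step appends the new token's key/value to the cache and attends the new query over all cached entries, so per-step cost grows linearly in sequence length instead of recomputing full quadratic attention.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode_step(q, k_new, v_new, cache):
    # KV cache: keep every past token's key/value so this step only
    # scores the newest query against cached keys (O(t), not O(t^2)).
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in cache["k"]]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, cache["v"]))
            for i in range(d)]

cache = {"k": [], "v": []}
out = None
for t in range(4):
    vec = [float(t + 1)] * 2        # toy 2-dim "embeddings"
    out = decode_step(vec, vec, vec, cache)
# cache now holds one K/V pair per generated token
```

Without the cache, every decode step would have to re-run attention over all previous tokens' keys and values from scratch; the cache trades memory (one K/V pair per token per layer) for that recomputation.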
DeepOpt-com
BitNet model inference engine with KV caching optimization
RSHVR
Claude Code skill for LLM KV caching and prompt caching optimization
Implements and analyzes a block-gather CUDA kernel for paged KV cache in long-context decoding with large models, validating that fused attention reduces intermediate GPU memory traffic and improves decode throughput.
calivision
Fibonacci Hashing for KV Cache Optimization with LLM Text Generation
LLM Optimization - KV Cache Flash Attention MQA GQA | Hugging Face Explained