Found 58 repositories (showing 30)
jjiantong
[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
SUSTechBruce
[EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference"
psmarter
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
xuyang-liu16
[ICLR 2026] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
ydyhello
Official implementation of "TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization" (Findings of ACL 2025).
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.
Chelsi-create
This project develops a high-performance KV-cache management framework for multi-document RAG tasks. It focuses on reducing time-per-output-token (TPOT) and improving throughput through adaptive cache scheduling, GPU–CPU offloading, and reuse of cross-document attention states.
naksshhh
Dynamic batching • KV-cache optimization • Token streaming • Observability
pdscomp
🦙 Docker template for running llama.cpp llama-server in router mode with NVIDIA CUDA and AMD Vulkan GPU acceleration. Features TurboQuant KV cache optimization, long context support (up to 256K tokens), and optimized configurations for 24GB+ VRAM cards.
mohamedAtoui
No description available
cs-wangkang
A PyTorch-based vLLM project that includes inference optimization mechanisms such as KV cache and PagedAttention.
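As background for entries like this one, a minimal pure-Python sketch of the block-table idea behind PagedAttention (class and parameter names here are illustrative, not this repo's API; vLLM's real default block size is 16): the KV cache lives in fixed-size physical blocks, and a per-sequence table maps logical token positions to (block, offset) slots, so sequences grow without contiguous preallocation.

```python
BLOCK_SIZE = 4  # tokens per physical block (toy value for illustration)

class PagedKVCache:
    """Toy block-table allocator: logical token positions -> (block, offset) slots."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of free physical blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block is full: grab a new one
            table.append(self.free.pop())
        block, offset = table[-1], n % BLOCK_SIZE
        self.lengths[seq_id] = n + 1
        return block, offset                  # slot where this token's K/V would go

cache = PagedKVCache(num_blocks=8)
slots = [cache.append("seq0") for _ in range(6)]   # 6 tokens -> 2 blocks
```

The point of the indirection is that freeing a finished sequence just returns its blocks to the pool, and no sequence ever needs a contiguous region sized for its maximum length.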
hua-zi
[AAAI2026] AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
Keyvanhardani
Automatic KV-Cache optimization for HuggingFace Transformers. Find the optimal cache strategy, attention backend, and dtype for your LLM inference workload.
kotgire58
Benchmarking framework for evaluating LLM inference performance before and after optimization techniques like KV-cache, speculative decoding, batching, paged attention, and token throughput improvements.
Alperen012
Ultra-Low Bit KV-Cache Compression optimization layer built on top of llama.cpp for LLM inference. Reduces VRAM overhead by ~75-80% using custom CUDA kernels.
mredencom
Go implementation of TurboQuant (arXiv:2504.19874) — data-oblivious 2/3/4-bit vector quantization via random orthogonal rotation and Lloyd-Max optimization. 7x-14x compression with >0.98 cosine similarity. Built for LLM KV Cache compression and vector databases.
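The TurboQuant description above references Lloyd-Max optimization. As an illustration only (not the Go repo's code), here is a minimal 1-D Lloyd-Max quantizer sketch in Python: it alternates nearest-centroid assignment with centroid updates to minimize mean squared quantization error over the sample distribution.

```python
def lloyd_max(samples, levels, iters=20):
    """Toy 1-D Lloyd-Max quantizer: returns `levels` reconstruction points."""
    lo, hi = min(samples), max(samples)
    # Start with centroids spread uniformly over the data range.
    centroids = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            # Assign each sample to its nearest centroid.
            j = min(range(levels), key=lambda i: (x - centroids[i]) ** 2)
            buckets[j].append(x)
        # Move each centroid to the mean of its assigned samples.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

cents = lloyd_max([0, 0, 1, 1, 10, 10, 11, 11], levels=2)
```

With two well-separated clusters, the two reconstruction points converge to the cluster means, which is the MSE-optimal 1-bit codebook for this data.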
rajveer100704
Production-grade Transformer inference engine with Triton-based FlashAttention, static KV cache for O(1) decoding, and async dynamic batching. Achieves 5.7× speedup over PyTorch baseline with real-time TTFT, latency, and VRAM profiling, showcasing ML + systems + kernel optimization.
adityakamat24
A curated workshop of Jupyter notebooks and deep-dive PDFs that break down how large language models work, from KV-cache internals and fine-tuning to optimization strategies and skill paths. Ideal for engineers who want to go beyond APIs and really understand the guts of LLMs.
Srindot
Implementing all KV caching optimizations presented in the paper
RaviTejGuntuku
KV cache management optimization research under Prof. Mootaz Elnozahy
No description available
myProjectsRavi
Type-Aware Quantization Gap (TAQG): Phase-aware KV cache quantization for reasoning LLMs. Shows that uniform quantization is suboptimal; 58% distortion reduction on DeepSeek-R1-Distill-1.5B
Transformer Inference Optimization using KV-Cache
soumyadipsarkar2
No description available
djura2001
Cloned Andrej Karpathy's minGPT repo and implemented the KV cache technique
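For readers unfamiliar with the technique these repos implement, a minimal pure-Python sketch of KV-cached decoding (toy single-head attention; all names are illustrative, not minGPT's code): each step appends the new token's key/value to the cache and attends the new query over all cached entries, so per-step cost grows linearly in sequence length instead of recomputing full quadratic attention.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode_step(q, k_new, v_new, cache):
    # KV cache: keep every past token's key/value so this step only
    # scores the newest query against cached keys (O(t), not O(t^2)).
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in cache["k"]]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, cache["v"]))
            for i in range(d)]

cache = {"k": [], "v": []}
out = None
for t in range(4):
    vec = [float(t + 1)] * 2        # toy 2-dim "embeddings"
    out = decode_step(vec, vec, vec, cache)
# cache now holds one K/V pair per generated token
```

Without the cache, every decode step would have to re-run attention over all previous tokens' keys and values from scratch; the cache trades memory (one K/V pair per token per layer) for that recomputation.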
DeepOpt-com
BitNet model inference engine with KV caching optimization
RSHVR
Claude Code skill for LLM KV caching and prompt caching optimization
Implements and analyzes a block-gather CUDA kernel for paged KV cache in long-context decoding with large models, validating that fused attention reduces intermediate GPU memory traffic and improves decode throughput.
calivision
Fibonacci Hashing for KV Cache Optimization with LLM Text Generation
LLM Optimization - KV Cache Flash Attention MQA GQA | Hugging Face Explained