Found 149 repositories (showing 30)
tonbistudio
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity.
0xSero
TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration
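A minimal sketch of what asymmetric K/V quantization can look like, assuming simple per-token uniform (min/max) scalar quantization; the shapes and the `quantize` helper are illustrative, not this repo's Triton kernels:

```python
# Sketch of asymmetric K/V quantization: keys get 3 bits, values 2.
# Per-token uniform (min/max) quantization is an illustrative assumption;
# the repo's Triton kernels and storage layouts are not reproduced here.
import torch

def quantize(x: torch.Tensor, bits: int):
    """Quantize the last dim of x to `bits` bits with per-token min/max."""
    qmax = 2**bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    scale = (x.amax(dim=-1, keepdim=True) - lo).clamp_min(1e-8) / qmax
    codes = torch.round((x - lo) / scale).clamp_(0, qmax).to(torch.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.float() * scale + lo

k = torch.randn(1, 8, 512, 128)   # (batch, kv_heads, tokens, head_dim)
v = torch.randn(1, 8, 512, 128)
k_q = quantize(k, bits=3)         # keys are more sensitive -> more bits
v_q = quantize(v, bits=2)         # values tolerate coarser codes
print((dequantize(*k_q) - k).abs().mean().item())
```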
scrya-com
KV cache compression via block-diagonal rotation. Beats TurboQuant: lower perplexity (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44x fewer parameters. Drop-in llama.cpp integration.
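A minimal sketch of the block-diagonal rotation idea: each slice of the head dimension is rotated by a shared orthogonal block to spread outliers before quantization, and the inverse rotation is applied after dequantization. Random orthogonal blocks and block size 16 are illustrative assumptions:

```python
# Sketch of block-diagonal rotation before quantization. The rotation is
# orthogonal, so attention-relevant geometry is preserved exactly; only
# the quantization step after it loses information.
import torch

def block_diag_rotation(head_dim: int, block: int, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    blocks = []
    for _ in range(head_dim // block):
        q, _ = torch.linalg.qr(torch.randn(block, block, generator=g))
        blocks.append(q)                 # orthogonal factor of a Gaussian
    return torch.block_diag(*blocks)     # (head_dim, head_dim)

R = block_diag_rotation(head_dim=128, block=16)
k = torch.randn(1024, 128)               # (tokens, head_dim)
k_rot = k @ R                             # rotate, then quantize k_rot
k_back = k_rot @ R.T                      # exact inverse: R is orthogonal
print(torch.allclose(k, k_back, atol=1e-4))
```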
SharpAI
Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, and an iOS app.
quantumaikr
Embeddable LLM inference in pure C. 33K LOC, zero dependencies. Delta KV compression – 4x longer context. Inspired by TurboQuant (ICLR 2026).
arozanov
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
OnlyTerp
First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
OmarHory
Open-source implementation of Google's TurboQuant (ICLR 2026) – KV cache compression to 2.5–4 bits with near-zero quality loss. 3.8–5.7x memory reduction on Mistral-7B, no training required.
hackimov
Open-source PyTorch implementation of Google TurboQuant (ICLR 2026) – extreme KV-cache quantization to ~3 bits with zero accuracy loss. 6x less memory, up to 8x faster inference.
DevTechJr
TurboQuant-based compression engine for the LLM KV cache
animehacker
TurboQuant for GGML: 4.57x KV Cache Compression with 72K+ Context for Llama-3.3-70B on Consumer GPUs.
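Headline numbers like these can be sanity-checked with back-of-the-envelope KV cache arithmetic, assuming the published Llama-3 70B configuration (80 layers, 8 GQA KV heads, head_dim 128) and an FP16 baseline; the figures below are illustrative only:

```python
# Back-of-the-envelope KV cache size for a 70B Llama-3-family model at
# 72K context. Config values assume the published Llama-3 70B architecture.
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
tokens = 72_000

bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V
fp16_gb = tokens * bytes_per_token / 1e9
print(f"FP16 KV cache:        {fp16_gb:.1f} GB")          # ~23.6 GB
print(f"At 4.57x compression: {fp16_gb / 4.57:.1f} GB")   # ~5.2 GB
```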
Alberto-Codes
TurboQuant KV cache compression plugin for vLLM – asymmetric K/V, 8 models validated, consumer GPUs
helgklaizar
Extreme KV Cache Compression (1-3 bit) for LLMs natively on Apple Silicon (MLX). Features TurboQuant, asymmetric PolarQuant caching, and OpenAI server compatibility.
nisten
1-bit llama.cpp GGUF weights paired with a TurboQuant 4-bit KV cache
AmesianX
TurboQuant KV Cache Compression for llama.cpp – 5.2x memory reduction with near-lossless quality | Implementation of Google DeepMind's TurboQuant (ICLR 2026)
onur-gokyildiz-bhi
Pure Rust implementation of Google's TurboQuant (ICLR 2026) – KV cache compression for LLMs
back2matching
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
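For orientation, "drop-in for HuggingFace" usually means a `Cache`-compatible object passed to `generate`. The sketch below shows only that shape; the `TurboQuantCache` name and its `bits` argument are hypothetical, not this package's documented API:

```python
# Hypothetical usage shape for a HuggingFace drop-in KV cache.
# `TurboQuantCache` and `bits` are illustrative names, not the package's
# documented API; the transformers calls themselves are standard.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tok("The KV cache grows with", return_tensors="pt")

# from turboquant import TurboQuantCache          # hypothetical import
# out = model.generate(**inputs, max_new_tokens=32,
#                      past_key_values=TurboQuantCache(bits=3))
out = model.generate(**inputs, max_new_tokens=32)  # uncompressed baseline
print(tok.decode(out[0]))
```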
DeadByDawn101
First MLX implementation of TurboQuant KV cache compression for Apple Silicon
RecursiveIntell
Rust implementation of TurboQuant, PolarQuant, and QJL – zero-overhead vector quantization for semantic search and KV cache compression (ICLR 2026)
mindtro
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core – optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
jhammant
Turbo1Bit: Combining 1-bit LLM weights (Bonsai) with TurboQuant KV cache compression for maximum inference efficiency. 4.2x KV cache compression + 16x weight compression = ~10x total memory reduction.
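The "~10x total" figure depends on how baseline memory splits between weights and KV cache. A quick sketch of how the two factors compose; only the 16x and 4.2x come from the description above, and the split values are assumptions:

```python
# Combined memory reduction from independently compressing weights and
# KV cache. The weight/KV split of baseline memory is an assumption.
def combined_reduction(weight_frac: float, w_x=16.0, kv_x=4.2) -> float:
    kv_frac = 1.0 - weight_frac
    return 1.0 / (weight_frac / w_x + kv_frac / kv_x)

for wf in (0.9, 0.8, 0.7):
    print(f"weights {wf:.0%} of baseline -> {combined_reduction(wf):.1f}x")
# weights 90% -> ~12.5x, 80% -> ~10.2x, 70% -> ~8.7x
```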
varjoranta
TurboQuant+ KV cache compression for vLLM. 3.8x smaller KV cache, same conversation quality. Fused CUDA kernels with automatic PyTorch fallback.
yashkc2025
Python implementation of TurboQuant (arXiv 2504.19874). Data-oblivious, near-optimal 1–4 bit vector quantization for streaming KV-caches and databases.
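"Data-oblivious" means the encoder is fixed before any data arrives, so vectors can be coded one at a time as they stream in. A minimal NumPy sketch under that reading, with a fixed random rotation and per-vector symmetric codes (both simplifying assumptions):

```python
# Minimal sketch of a data-oblivious streaming quantizer: the rotation is
# fixed up front (no calibration pass), and each incoming vector is coded
# independently, which suits append-only KV caches.
import numpy as np

class StreamingQuantizer:
    def __init__(self, dim: int, bits: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.R, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        self.half = 2**(bits - 1) - 1        # 3-bit -> levels in [-3, 3]
        self.codes, self.scales = [], []

    def append(self, v: np.ndarray):
        r = self.R @ v                       # data-oblivious rotation
        s = max(np.abs(r).max() / self.half, 1e-8)
        self.codes.append(np.clip(np.round(r / s), -self.half, self.half)
                          .astype(np.int8))
        self.scales.append(s)

    def get(self, i: int) -> np.ndarray:
        return self.R.T @ (self.codes[i].astype(np.float64) * self.scales[i])

q = StreamingQuantizer(dim=128, bits=3)
v = np.random.default_rng(1).standard_normal(128)
q.append(v)
print(np.abs(q.get(0) - v).mean())           # small reconstruction error
```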
Arclabs001
Yet Another TurboQuant (YATQ): a PyTorch implementation of TurboQuant for KV cache compression, following the paper "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026). Includes a HuggingFace interface.
codepawl
PyTorch implementation of TurboQuant. Near-optimal vector quantization for KV cache compression and vector search. 3-bit with zero accuracy loss.
artalis-io
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
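The LRU expert cache named here can be sketched in a few lines; the capacity, loader, and Python form below are illustrative assumptions, not this engine's C11 internals:

```python
# Sketch of an LRU expert cache for MoE offloading: experts are read from
# disk on demand (pread in the C engine) and only the most recently used
# stay resident in memory.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    def __init__(self, capacity: int, load_fn):
        self.capacity, self.load_fn = capacity, load_fn
        self.resident = OrderedDict()             # expert_id -> weights

    def get(self, expert_id: int):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        w = self.load_fn(expert_id)               # stands in for pread()
        self.resident[expert_id] = w
        return w

cache = ExpertCache(capacity=2, load_fn=lambda i: np.full((4, 4), i))
for eid in (0, 1, 0, 2):   # re-touching 0 makes 1 the eviction victim
    cache.get(eid)
print(list(cache.resident))                       # [0, 2]
```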
r13xr13
Nexora Code - AI coding harness with TurboQuant KV cache optimizations
rachittshah
TurboQuant KV cache compression for MLX (Apple Silicon)
RemizovDenis
TurboQuant: KV-cache compression for faster and cheaper LLM inference.
scos-lab
TurboQuant reference implementation – KV cache compression with engineering insights (ICLR 2026 paper reproduction)