Found 68 repositories (showing 30)
scrya-com
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44x fewer params. Drop-in llama.cpp integration.
AmesianX
TurboQuant KV Cache Compression for llama.cpp – 5.2x memory reduction with near-lossless quality | Implementation of Google DeepMind's TurboQuant (ICLR 2026)
unixsysdev
No description available
animehacker
TurboQuant for GGML: 4.57x KV Cache Compression with 72K+ Context for Llama-3.3-70B on Consumer GPUs.
gamogestionweb
No description available
nisten
1bit llama.cpp gguf weights paired with turboquant 4 bit kv cache
domvox
TurboQuant KV cache compression for llama.cpp – HIP/ROCm port for AMD RDNA3 (gfx1100)
test1111111111111112
TurboQuant llama.cpp fork with optimized turbo4 kernels for Gemma 4 D=256/512 heads – lazy K/V, batch decode, warp-cooperative write. 120 t/s with 3.8x KV compression on RTX 3090.
jamesarslan
Complete local AI coding pipeline: Qwen3.5-35B-A3B + llama-server + TurboQuant + OpenCode + Context7 MCP + Chrome DevTools. 188 t/s on RTX 5090, zero cloud APIs.
M-Baraa-Mardini
No description available
md-exitcode0
One-click LLM server with TurboQuant Llama CPP engine
AI-Engineering-at
Practical guide: TurboQuant KV-cache quantization for llama.cpp. Run 122B models on consumer GPUs.
Argonaut790
Fused Triton kernels for TurboQuant KV cache compression – 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
Simple all-in-one build script for llama-cpp-turboquant on Windows 11.
MartinCrespoC
Run any LLM on any hardware. 130% faster MoE inference with ExpertFlow + TurboQuant KV compression. Ollama-compatible API. Built on llama.cpp.
jimliddle
A TurboQuant implementation with Llama.cpp for AMD with Vulkan runtime
jagsan-cyber
World's first TurboQuant KV cache compression for llama.cpp on AMD ROCm (RX 9070 / gfx1201)
pdscomp
Docker template for running llama.cpp llama-server in router mode with NVIDIA CUDA and AMD Vulkan GPU acceleration. Features TurboQuant KV cache optimization, long context support (up to 256K tokens), and optimized configurations for 24GB+ VRAM cards.
JoelHJames1
NEXUS: Production C++ inference engine for Apple Silicon. Run 400B+ LLMs on your Mac via layer streaming, Metal GPU compute, TurboQuant KV compression, NXF format, MoE routing, and Neural Engine speculative decoding. Faster than AirLLM, more capable than llama.cpp.
pp1840
Experimental TurboQuant implementation and llama.cpp-style integration path for long-context inference
AylaTheTanuki
Pre-compiled Windows binaries and CMake fixes for the experimental TurboQuant branch (with Gemma 4 support)
WaveboSF
Model Switcher & Benchmark Tool for llama-server with TurboQuant KV-Cache
CarapaceUDE
llama.cpp fork: Qwen 3.5 hybrid GGUF + loader fixes; syncs with ggml-org/llama.cpp
selmand
TurboQuant: run larger AI models with longer context on your GPU, powered by Google's TurboQuant KV cache compression.
atomicmilkshake
llama.cpp fork with TurboQuant quantization (turbo2/3/4) and TriAttention GPU-accelerated KV cache pruning. 75 tok/s on Qwen3-8B / RTX 3080.
gotrendwise-com
Run Large Language Models on CPU with up to 8× less RAM using advanced KV cache compression.
benardayim
A llama.cpp fork combining PrismML's 1-bit kernels with TurboQuant KV cache compression.
Clifford-Swartz
Pre-built llama-server with pmem-tier + TurboQuant KV cache compression for JAC
ahmaddarwesh
A lightweight desktop application for managing and interacting with llama.cpp models through a clean, modern interface, with TurboQuant support.
tsuyu122
TurboQuant Vulkan: 3-bit KV cache quantization for llama.cpp using Lloyd-Max Gaussian codebooks. 4.57x compression, Vulkan GPU support (AMD/Intel/NVIDIA). Hobby project.