Found 51 repositories (showing 30)
scrya-com
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44x fewer params. Drop-in llama.cpp integration.
animehacker
TurboQuant for GGML: 4.57x KV Cache Compression with 72K+ Context for Llama-3.3-70B on Consumer GPUs.
unixsysdev
No description available
AmesianX
TurboQuant KV Cache Compression for llama.cpp – 5.2x memory reduction with near-lossless quality | Implementation of Google DeepMind's TurboQuant (ICLR 2026)
gamogestionweb
No description available
nisten
1-bit llama.cpp GGUF weights paired with TurboQuant 4-bit KV cache
jamesarslan
Complete local AI coding pipeline: Qwen3.5-35B-A3B + llama-server + TurboQuant + OpenCode + Context7 MCP + Chrome DevTools. 188 t/s on RTX 5090, zero cloud APIs.
M-Baraa-Mardini
No description available
Argonaut790
Fused Triton kernels for TurboQuant KV cache compression – 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
rookiemann
Native Windows build of vLLM v0.17.1 with Triton support and TurboQuant KV cache compression – Qwen 3.5, Llama 4, and more. No WSL, no Docker. Pre-built wheel + patchset for MSVC 2022 + CUDA 12.6.
pp1840
Experimental TurboQuant implementation and llama.cpp-style integration path for long-context inference
jagsan-cyber
World's first TurboQuant KV cache compression for llama.cpp on AMD ROCm (RX 9070 / gfx1201)
AI-Engineerings-at
Practical guide: TurboQuant KV-cache quantization on consumer hardware (RTX 3090) – 100K context, 4.3× compression, ICLR 2026
MartinCrespoC
Run any LLM on any hardware. 130% faster MoE inference with ExpertFlow + TurboQuant KV compression. Ollama-compatible API. Built on llama.cpp.
Simple all-in-one build script for llama-cpp-turboquant on Windows 11.
WaveboSF
Model Switcher & Benchmark Tool for llama-server with TurboQuant KV-Cache
CarapaceUDE
llama.cpp fork: Qwen 3.5 hybrid GGUF + loader fixes; syncs with ggml-org/llama.cpp
gotrendwise-com
Run Large Language Models on CPU with up to 8× less RAM using advanced KV cache compression.
test1111111111111112
TurboQuant llama.cpp fork with optimized turbo4 kernels for Gemma 4 D=256/512 heads – lazy K/V, batch decode, warp-cooperative write. 120 t/s with 3.8x KV compression on RTX 3090.
jimliddle
A TurboQuant implementation for llama.cpp on AMD with the Vulkan runtime
ahmaddarwesh
A lightweight desktop application for managing and interacting with llama.cpp models through a clean, modern interface, with TurboQuant support
ProTekk
llama.cpp-turboquant
JohnnyDillinger-hub
No description available
Matt-Adroited
TurboQuant + KDA (Kimi Delta Attention) fork of llama.cpp – novel state matrix quantization for linear attention models
smurz
Improved TurboQuant quantization for llama.cpp – adds QJL residual, a residual window, and asymmetric K/V to the turbo-tan fork
zacpr
No description available
thepradip
No description available
thekozugroup
llama-server with TurboQuant (TQ3_0) KV cache compression – DGX Spark ARM64 build
guanyuch
No description available
Ascendism
llama.cpp + TurboQuant CUDA; syncs with ggml-org/llama.cpp