Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
Stars: 10 · Forks: 2 · Watchers: 10 · Open issues: 0
166 commits
GPU performance: Q4 matvec rewrite, compute COPY, flash attention, zero-copy weights (a9bfc5a)
Enable TQ on draft model, update CLAUDE.md for prompt cache + TQ docs (222e973)
Fix TQ cache reset, add full-turn prompt caching, verify WASM TQ path (1fbef8e)
TQ-aware prompt cache + audit fixes: bn_alloc, size_t casts, warning cleanup (6790097)
Q4_0 AVX2: inline dot_i8_float, eliminate redundant integer accumulator (b44bb05)
Q4_K GPU matvec: integer accumulation for NEON-matched precision (b313320)
Fix all audit findings: C1-C2 critical, H1-H4 high, M1-M7 medium (f340670)
Fix SSM GPU shaders: add per-layer state and conv_state offsets (87c5cb1)
Q4_0 AVX2: float-domain accumulation eliminates per-block integer hsum (92abd82)
Revert Q-gated debug changes, keep CPU fallback for correctness (0d27068)