A hands-on fork of NanoGPT with FlashAttention-2 CUDA kernels, INT8/INT4 GPTQ quantization, paged KV-cache reuse, and continuous batching, turning a tiny Shakespeare model into a full-speed GPU LLM inference demo.
Stars: 1
Forks: 0
Watchers: 1
Open Issues: 0
Commits: 4