Found 4 repositories (showing 4)
microsoft
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, attention is computed with approximate, dynamic sparsity, which reduces pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
amanb2000
Minimal LLM inference servers for researchers
TomtheCodeBot
No description available
WZRP
Stable Diffusion Minimal Inference
All 4 repositories loaded