[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up inference for long-context LLMs, the attention is computed approximately with dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
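The description points at dynamic sparse attention for the pre-filling stage. As a rough, hypothetical sketch of the general idea only (not the repository's actual method or kernels; the function name and the `block_size` and `keep_ratio` parameters are illustrative, and causal masking is omitted for brevity), a block-sparse pattern can be estimated cheaply and then only the selected blocks computed exactly:

```python
# Hypothetical sketch of dynamic block-sparse attention -- NOT the
# repository's implementation, which uses optimized GPU kernels.
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.1):
    """q, k, v: (seq_len, head_dim); assumes seq_len % block_size == 0."""
    seq_len, head_dim = q.shape
    nb = seq_len // block_size

    # 1) Cheap approximation: mean-pool each block, score block pairs.
    qb = q.view(nb, block_size, head_dim).mean(dim=1)  # (nb, head_dim)
    kb = k.view(nb, block_size, head_dim).mean(dim=1)  # (nb, head_dim)
    block_scores = qb @ kb.T / head_dim ** 0.5         # (nb, nb)

    # 2) Dynamically keep only the top-k key blocks per query block.
    k_keep = max(1, int(nb * keep_ratio))
    top_blocks = block_scores.topk(k_keep, dim=-1).indices  # (nb, k_keep)

    # 3) Exact attention restricted to the selected blocks.
    out = torch.zeros_like(q)
    for i in range(nb):
        q_blk = q[i * block_size:(i + 1) * block_size]
        idx = top_blocks[i].tolist()
        k_sel = torch.cat([k[j * block_size:(j + 1) * block_size] for j in idx])
        v_sel = torch.cat([v[j * block_size:(j + 1) * block_size] for j in idx])
        attn = F.softmax(q_blk @ k_sel.T / head_dim ** 0.5, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ v_sel
    return out

# Example: a 4k-token prefill for a single head.
q = torch.randn(4096, 64); k = torch.randn(4096, 64); v = torch.randn(4096, 64)
out = dynamic_block_sparse_attention(q, k, v)
```

With `keep_ratio=0.1`, each query block attends to roughly 10% of the key blocks, which is where this style of approach gets its pre-filling speedup.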
Stars: 1.2k
Forks: 78
Watchers: 1.2k
Open Issues: 90
Overall repository health assessment
No package.json found; this might not be a Node.js project.
Top contributors by commit count: 119, 37, 25, 5, 3, 2, 1, 1, 1, 1.