A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit and 4-bit floating-point (FP8 and FP4) precision on Hopper, Ada, and Blackwell GPUs, providing better performance with lower memory utilization in both training and inference.
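As context for the FP8 support described above, below is a minimal usage sketch with the library's PyTorch API (te.Linear, fp8_autocast, and a DelayedScaling recipe). The layer sizes and recipe settings are illustrative assumptions, not taken from this page, and FP8 execution requires a supported GPU (Hopper, Ada, or Blackwell).

```python
# Illustrative sketch: run a Transformer Engine linear layer with FP8.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support.
# Sizes here (768 -> 3072) are arbitrary examples.
layer = te.Linear(768, 3072, bias=True).cuda()
x = torch.randn(16, 768, device="cuda")

# Delayed-scaling FP8 recipe; HYBRID uses E4M3 for forward tensors and
# E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# GEMMs executed inside the autocast region run in FP8.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```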
Stars: 3.3k
Forks: 686
Watchers: 3.3k
Open Issues: 351
Recent commits
5abadf4  [FSDP2/Megatron-FSDP/DCP] If model parameters are DTensors, optimizer states should also be DTensors. (#2795)
a88fdc1  [PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error re… (#2802)
85f5a84  Refactor Amax Kernel ldmatrix loads, TMA/compute barriers, swizzle_idx (#2820)
8cf3c16  [PyT][Test] Add xfailing FSDP2 memory leak detection tests (#2803)
29a8c2f  GEMM + Swiglu fused Grouped MLP for MXFP8 (#2769)
9d77dcb  [JAX] Fix: Use jitted kernels for generating THD (and BSHD) segment pos (#2823)
42267ec  [Common] Persistent Grouped MXFP8 quantization kernel (#2738)
3af8792  Pass input_output_alias to TritonAutotunedKernelCall (#2814)
bce4181  [JAX] Grouped GEMM Refactor to use first_dims and last_dims (#2749)
f4debf6  [JAX] Add warning if using BSHD and max_segments_per_seq > 1 (#2796)