A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit and 4-bit floating-point (FP8 and FP4) precision on Hopper, Ada, and Blackwell GPUs, providing better performance with lower memory utilization in both training and inference.
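As context for the FP8 support described above, below is a minimal usage sketch with the library's PyTorch API (te.Linear, fp8_autocast, and a DelayedScaling recipe). The layer sizes and recipe settings are illustrative assumptions, not taken from this page, and FP8 execution requires a supported GPU (Hopper, Ada, or Blackwell).

```python
# Illustrative sketch: run a Transformer Engine linear layer with FP8.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support.
# Sizes here (768 -> 3072) are arbitrary examples.
layer = te.Linear(768, 3072, bias=True).cuda()
x = torch.randn(16, 768, device="cuda")

# Delayed-scaling FP8 recipe; HYBRID uses E4M3 for forward tensors and
# E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# GEMMs executed inside the autocast region run in FP8.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```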
Stars: 3.3k
Forks: 686
Watchers: 3.3k
Open Issues: 351
Recent commits
5abadf4  [FSDP2/Megatron-FSDP/DCP] If model parameters are DTensors, optimizer states should also be DTensors. (#2795)
a88fdc1  [PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error re… (#2802)
85f5a84  Refactor Amax Kernel ldmatrix loads, TMA/compute barriers, swizzle_idx (#2820)
8cf3c16  [PyT][Test] Add xfailing FSDP2 memory leak detection tests (#2803)
29a8c2f  GEMM + Swiglu fused Grouped MLP for MXFP8 (#2769)
9d77dcb  [JAX] Fix: Use jitted kernels for generating THD (and BSHD) segment pos (#2823)
42267ec  [Common] Persistent Grouped MXFP8 quantization kernel (#2738)
3af8792  Pass input_output_alias to TritonAutotunedKernelCall (#2814)
bce4181  [JAX] Grouped GEMM Refactor to use first_dims and last_dims (#2749)
f4debf6  [JAX] Add warning if using BSHD and max_segments_per_seq > 1 (#2796)