Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
Stars: 408
Forks: 181
Watchers: 408
Open Issues: 32
Commits per top contributor: 328, 241, 133, 76, 49, 47, 38, 28, 25, 24
66e7860 feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS (g5.12xlarge) (#1053)
311109b Remove deprecated update_neuron_sdk.sh from HyperPod lifecycle scripts (#1055)
966e80d Enhance instance validation and visualization script for network topo… (#1051)
a3e2166 Add NCCL send/recv ring benchmark for multi-GPU testing (#1013)
6b896c4 Update 0.create-venv.sh for pining setuptools version (#1049)
edeb016 Add ParallelCluster compute node monitoring support (#1043)
04d3f3d Fix FSDP2 train.py: add missing torch.cuda.set_device() call (#1046)
cbdc4ee Pin PyTorch 2.10.0 and add NPROC_PER_NODE variable for stable DDP container training (#1048)
949703c fix: use object-level ARN for s3:GetObject in HyperPod execution role policy (#1045)
0264dec Bump requests from 2.32.4 to 2.33.0 in /3.test_cases/megatron/bionemo (#1037)
7334940 fix(ddp/slurm): fix container image missing mlflow and wrong argument name (#1039)
d8f0f32 Bump filelock from 3.16.1 to 3.20.3 in /3.test_cases/pytorch/nvrx (#1027)
249c476 Slinky Slurm on HyperPod EKS — Deployment Automation & Infrastructure Updates (#1020)
73f6b35 fix: update ParallelCluster version from 3.13.1 to 3.14.2 (#1032)