Found 333 repositories (showing 30)
horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
NVIDIA
A PyTorch extension: tools for easy mixed precision and distributed training in PyTorch
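As an aside, the mixed-precision pattern apex popularized is now built into PyTorch as `torch.autocast`. A minimal sketch of that pattern, using a toy model and bfloat16 on CPU purely so it runs without a GPU (on CUDA you would use `device_type="cuda"` plus a `GradScaler`):

```python
import torch

# Toy model and data, purely for illustration.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(2, 16)

# autocast runs eligible ops (e.g. the linear layer) in a lower-precision
# dtype; parameters and gradients stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()  # gradients are produced in the parameters' float32 dtype
opt.step()
```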
determined-ai
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
uber
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
tczhangzhi
A quickstart and benchmark for PyTorch distributed training.
volcengine
Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs
flink-extended
Deep Learning on Flink aims to integrate Flink and deep learning frameworks (e.g. TensorFlow, PyTorch, etc) to enable distributed deep learning training and inference on a Flink cluster.
LambdaLabsML
Best practices & guides on how to write distributed PyTorch training code
krasserm
A PyTorch implementation of Perceiver, Perceiver IO and Perceiver AR with PyTorch Lightning scripts for distributed training
PersiaML
High performance distributed framework for training deep learning recommendation models based on PyTorch.
NVIDIA-NeMo
PyTorch Distributed-native training library for LLMs/VLMs with out-of-the-box Hugging Face support
facebookincubator
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
facebookresearch
A library for distributed ML training with PyTorch
rentainhe
Simple tutorials on PyTorch DDP training
BIGBALLON
The pure and clear PyTorch Distributed Training Framework.
IDDM (industrial, landscape, anime, latent diffusion), supporting LDM, DDPM, DDIM, PLMS, a web UI, and distributed training. A PyTorch implementation of diffusion models, generative models, and distributed training.
Bluefog-Lib
Distributed and decentralized training framework for PyTorch over graph
PyTorch implementation of over 30 realtime semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, Optuna, etc.
meta-pytorch
A library that contains a rich collection of performant PyTorch model metrics, a simple interface for creating new metrics, a toolkit to facilitate metric computation in distributed training, and tools for PyTorch model evaluation.
antoine77340
PyTorch GPU distributed training code for MIL-NCE HowTo100M
Janspiry
A seed project for distributed PyTorch training, built so you can customize your network quickly
Modalities
Modalities, a PyTorch-native framework for distributed and reproducible foundation model training.
A PyTorch tutorial on class-incremental learning | a distributed training template for CIL with fewer than 100 lines of core code.
NoteDance
Machine learning library, Distributed training, Deep learning, Reinforcement learning, Models, TensorFlow, PyTorch
phonism
GenRec: Generative Recommender Systems with RQ-VAE semantic IDs, Transformer-based retrieval, and LLM integration. Built on PyTorch with distributed training support.
facebookresearch
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large scales
A guide that integrates PyTorch DistributedDataParallel, Apex, warmup, and a learning-rate scheduler, and also covers setting up early stopping and random seeds.
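The linear-warmup idea such guides describe can be sketched with PyTorch's built-in `LambdaLR`; the warmup length, model, and step count below are illustrative assumptions, not taken from the guide:

```python
import torch

# Toy model and optimizer, purely for illustration.
model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps = 5  # hypothetical warmup length

# Linear warmup: scale the base LR from 1/warmup_steps up to 1.0,
# then hold it constant. Often combined with a decay schedule.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

lrs = []
for _ in range(8):
    opt.step()       # in a real loop: forward, backward, then step
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
```

After `warmup_steps` steps the learning rate reaches and stays at the base value of 0.1.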
lesliejackson
Example of PyTorch DistributedDataParallel
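The core `DistributedDataParallel` pattern such examples demonstrate, reduced to a single-process, CPU-only sketch (gloo backend, world size 1, toy model; real jobs launch one process per GPU via `torchrun`, which sets the rendezvous variables itself):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous info normally provided by torchrun; set manually here
# because this sketch runs as a single process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # toy model
ddp_model = DDP(model)         # wraps the model; gradients are all-reduced across ranks

opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
x = torch.randn(8, 4)
loss = ddp_model(x).sum()
loss.backward()                # backward triggers the gradient all-reduce
opt.step()

dist.destroy_process_group()
```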
IDEA-Research
Official PyTorch implementation of the paper "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training".
AlibabaPAI
PyTorch distributed training acceleration framework