Found 32 repositories (showing 30)
alpa-projects
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23)
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
Constructed an attention-based deep multiple-instance learning model in PyTorch and trained it on 624 whole-slide images of digitized H&E-stained prostate biopsies using AWS SageMaker's data parallelism toolkit.
Wodlfvllf
QuintNet is a research-oriented PyTorch framework designed to explore and implement multi-dimensional parallelism strategies for distributed deep learning.
DreamingRaven
Generalised and highly customisable hybrid-parallelism, database-based deep learning framework.
Who doesn't dream of a new FPGA family that provides embedded hard neurons in its silicon fabric instead of the conventional DSP and multiplier blocks? An optimized hard-neuron design would let software and hardware designers create and test deep learning network architectures, especially convolutional neural networks (CNNs), more easily and faster than with any FPGA family currently on the market. The revolutionary idea behind this project is to open the gate of creativity for a precisely tailored new generation of FPGA families that avoid the wasted logic resources and oversized bus widths of today's conventional DSP blocks. The project focuses on the anchor point of any deep learning architecture: designing an optimized, high-speed neuron block to replace the conventional DSP blocks and avoid the drawbacks designers face when fitting a CNN architecture onto them. The proposed neuron takes parallel operation as its primary keystone, alongside minimizing the logic elements used to construct the neuron cell. The targeted resource usage is no more than 500 ALMs per neuron, with an expected maximum operating frequency of 834.03 MHz. In this project, ultra-fast, adaptive, parallel modules such as parallel multiplier-accumulators (MACs) and a ReLU activation function are designed as soft blocks in VHDL, opening a new horizon for FPGA designers to build their own CNNs. We cannot stop imagining Intel Altera leading the market by adopting the proposed CNN block into a new FPGA architecture fabric as a separate logic family soon.
Users of the proposed CNN blocks will be amazed by the operations per second available to them while designing their own CNN architectures. For instance, according to the first coding trial, a single MAC unit can reach 3.5 giga-operations per second (GOPS) and can multiply up to 4 different inputs by a common weight value. Because the blocks can also operate in parallel, the aggregate throughput of the proposed design can grow to about 16 tera-operations per second (TOPS), a step change in FPGA capability for the era of deep learning algorithms. Finally, we believe this proposed FPGA CNN block is only the first step toward leaving no room for competition from conventional CPUs and GPUs, given the massive speed it provides and the flexible scalability achievable through the parallel operation of such FPGA-based CNN blocks.
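The throughput figures quoted above can be sanity-checked with back-of-envelope arithmetic. The sketch below is a hypothetical check based only on the numbers in the description (834.03 MHz, 4 parallel inputs, 3.5 GOPS per MAC, 16 TOPS aggregate); it is not part of the project's code.

```python
# Back-of-envelope check of the quoted throughput figures.
# Assumption (hypothetical): each MAC runs at ~834 MHz and processes
# 4 inputs per cycle, counting each multiply-accumulate as one operation.

freq_hz = 834.03e6          # claimed max operating frequency per neuron
inputs_per_cycle = 4        # parallel inputs sharing one common weight

mac_gops = freq_hz * inputs_per_cycle / 1e9
print(f"single MAC: {mac_gops:.2f} GOPS")        # ≈ 3.34, close to the claimed 3.5

target_tops = 16.0
macs_needed = target_tops * 1e12 / 3.5e9         # using the claimed 3.5 GOPS/MAC
print(f"MACs for {target_tops} TOPS: {macs_needed:.0f}")  # ≈ 4571 MACs in parallel
```

Under these assumptions the per-MAC claim is roughly consistent, and the 16 TOPS aggregate implies on the order of several thousand MACs operating in parallel.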
OSU-Nowlab
Distributed deep learning parallelism framework written in PyTorch
ekzhang
Experiments in multi-architecture parallelism for deep learning with JAX
NiuHuangxiaozi
This repository outlines a comprehensive guide for training a distributed deep learning model.
RavenbornJB
A low-level deep learning framework that leverages both CPU and GPU parallelism.
dbgannon
This is the notebook to accompany "Accelerating Deep Learning Inference with Hardware and Software Parallelism".
AstroDnerd
A spatiotemporal deep learning framework for forecasting high-dimensional chaotic systems. Efficiently processes multi-terabyte 3D volumetric data using Distributed Data Parallelism (DDP) and Custom HDF5 Data Loaders. Cleaned and partially forked from nbisht_core_analysis
geraldzakwan
This repository is for my final project in COMS 6998: Practical Deep Learning System Performance, which I took at Columbia (https://www.cs.columbia.edu/education/ms/fall-2020-topics-courses/#e6998010). In this project, my teammate and I investigated parallelism in NLP: we experimented with how parallelism (e.g. using multi-head attention instead of recurrent connections, and splitting input for inference) affects model performance in both accuracy and speed. More details: http://bit.ly/pract-dl-final-report.
RadhaGulhane13
No description available
NVIDIA course that covers training deep learning models on multiple GPUs using PyTorch’s DistributedDataParallel (DDP), focusing on data parallelism concepts, multi-GPU setup, scalable model implementation, and optimization techniques for efficient large-scale training.
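The core idea the course teaches, data parallelism, can be illustrated without GPUs: each worker computes gradients on its own shard of the batch, the gradients are averaged (what DDP's all-reduce does), and every replica applies the same update. A minimal pure-Python sketch under those assumptions, with all names hypothetical:

```python
# Data parallelism in miniature: shard the batch across "workers",
# compute per-shard gradients, average them (the all-reduce step),
# and apply one identical update, as PyTorch's DDP does across GPUs.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, batch_x, batch_y, n_workers=2, lr=0.1):
    shard = len(batch_x) // n_workers
    grads = []
    for k in range(n_workers):            # each "worker" sees one shard
        xs = batch_x[k * shard:(k + 1) * shard]
        ys = batch_y[k * shard:(k + 1) * shard]
        grads.append(grad_mse(w, xs, ys))
    g = sum(grads) / n_workers            # all-reduce: average the gradients
    return w - lr * g                     # same update on every replica

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                 # data generated by the true w = 2
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, xs, ys)
print(round(w, 3))                        # → 2.0
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, which is why data-parallel training converges to the same solution as single-device training.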
No description available
b0tShaman
A data-parallel, zero-allocation deep learning framework in Go.
BrandonXue
Deep learning with convolutional neural networks, from scratch, using parallelism (CUDA).
MALAY-21
Cataract is a common eye condition characterized by clouding of the lens; timely detection and intervention are crucial for effective management. In this project, we propose a novel approach to cataract detection that leverages deep learning methods and data-parallelism techniques.
shawnsihyunlee
Apple Silicon acceleration and data parallelism for Needle, a home-brewed deep learning framework
Ava4wonder
My solution for ASPLOS-2025-Contest #1: Intra-Operator Parallelism for Distributed Deep Learning
ti2-group
Source code for the Contest on Intra-Operator Parallelism for Distributed Deep Learning (IOPDDL).
subashreevs
Course content from "Data Parallelism: How to Train Deep Learning Models on Multiple GPUs" by NVIDIA.
asifrahaman13
My learning journey into deeper Python concurrency. Contains code for multiprocessing, multithreading, async operations, and a few other concurrency and parallelism concepts.
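As a taste of the topics this kind of repository covers, the sketch below contrasts thread-based and async concurrency in Python. It is a hypothetical illustration, not code from the repository:

```python
import asyncio
import threading
import time

# Thread-based concurrency: two blocking "I/O waits" overlap,
# so total wall time is ~0.2 s rather than ~0.4 s.
def blocking_io(results, i):
    time.sleep(0.2)               # stands in for a network or disk call
    results[i] = i * i

results = {}
threads = [threading.Thread(target=blocking_io, args=(results, i)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                    # → {1: 1, 2: 4}

# Async concurrency: the same overlap on a single thread via an event loop.
async def async_io(i):
    await asyncio.sleep(0.2)      # non-blocking wait yields to other tasks
    return i * i

async def main():
    return await asyncio.gather(async_io(1), async_io(2))

print(asyncio.run(main()))        # → [1, 4]
```

Threads suit blocking I/O, `asyncio` suits many concurrent waits on one thread, and CPU-bound work needs `multiprocessing` for true parallelism because of the GIL.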