Found 37 repositories (showing 30)
EMI-Group
A GPU-accelerated library for Tree-based Genetic Programming, leveraging PyTorch and custom CUDA kernels for high-performance evolutionary computation. It supports symbolic regression, classification, and policy optimization with advanced features like multi-output trees and benchmark tools.
kentstone84
Unlock full RTX 5080 performance in PyTorch! PyTorch does not support RTX 5080 (sm_120) natively, so I built custom CUDA 12.8 drivers and PyTorch binaries to make it work. This repo contains build scripts, benchmarks, and installation guides for running AI models at max efficiency on the RTX 5080.
viai957
Flash Attention from First Principles: Triton & CUDA implementations with handwritten derivations, notebooks, and Colab benchmarks comparing PyTorch and Triton versions.
sriharshapy
CUDA C reduction kernels benchmarking with Triton, PyTorch and CUB primitives
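For context, the access pattern such reduction kernels implement can be sketched in pure Python: each step combines pairs of elements `stride` apart, halving the number of active elements, so n values reduce in about log2(n) steps. This is illustrative only, not the repo's CUDA code:

```python
def tree_reduce(xs):
    """Pairwise tree reduction, mimicking a CUDA reduction kernel's schedule:
    each pass adds elements `stride` apart, then doubles the stride, so the
    number of active elements halves every step."""
    xs = list(xs)
    n = len(xs)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            xs[i] += xs[i + stride]
        stride *= 2
    return xs[0]

print(tree_reduce([1, 2, 3, 4, 5]))  # same result as sum(), different schedule
```

On a GPU the inner loop runs in parallel across threads; the sequential sketch only shows which pairs get combined at each step.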
detker
CUDA Flash Attention 2 implementation. Includes forward/backward passes, Python benchmarking framework, and detailed comparisons vs PyTorch.
LottoLottoLotto
A high-performance, CUDA-fused Sinkhorn layer for PyTorch—built with OpenAI Triton and delivering up to 7.8x speedup with ~10–22% memory savings in recent benchmarks.
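The layer's fused implementation lives in the repo; the underlying algorithm is the standard Sinkhorn iteration, which alternately rescales rows and columns of a kernel matrix until both marginals match. A generic, dense pure-Python sketch with uniform marginals (function name and defaults are mine, not the repo's API):

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Plain Sinkhorn iterations on a small square cost matrix: form
    K = exp(-cost/reg), then alternately rescale rows (u) and columns (v)
    toward uniform marginals. Returns the transport plan diag(u) K diag(v)."""
    n = len(cost)
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    r = [1.0 / n] * n          # uniform row/column marginals
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [r[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

P = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
print([[round(x, 3) for x in row] for row in P])
```

A fused CUDA version gains its speedup by keeping K and the u/v updates on-chip rather than materializing intermediates, which is where the quoted memory savings come from.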
addhyay
CUDA implementation of an optimized softmax kernel, benchmarked against PyTorch's softmax operation.
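For reference, the numerical trick any softmax kernel must get right (subtracting the row max before exponentiating, so `exp` never overflows) can be sketched in pure Python; this is illustrative, not the repo's CUDA code:

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by the max before exponentiating,
    the same trick GPU softmax kernels and PyTorch rely on."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A naive exp() would overflow on these inputs; the shifted version is fine.
print([round(p, 4) for p in softmax([1000.0, 1001.0, 1002.0])])
```

A CUDA kernel does the same in three passes over a row (max, sum of exponentials, normalize), typically fused and parallelized with warp reductions.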
dadosnapratica
Production-grade system for benchmarking CPU vs GPU deep learning performance. Train identical CNN models on CIFAR-10, capture comprehensive metrics (time, accuracy, GPU utilization, power, energy), and visualize results through an interactive Streamlit dashboard. Built with PyTorch, CUDA, and NVML for academic research and ML systems engineering.
DandinPower
This repository provides a simple benchmarking suite for measuring the latency of a single forward pass through a HuggingFace Transformer model on CUDA GPUs. It supports both the standard PyTorch implementation and an optional Liger kernel acceleration for causal-language models.
saurabh-singh-rajput
No description available
namanadep
GPU/HPC systems — cuda-pytorch-optimization-benchmarks
philipnickel
MLS-MPM benchmark: JAX vs PyTorch vs CUDA
wattanapong
Benchmarks numerical operations using CUDA-based PyTorch.
HankWang-WL
CUDA batched matmul with a PyTorch baseline, benchmarked for AI acceleration.
sankalphegde
CUDA C reduction kernels benchmarking with Triton, PyTorch and CUB primitives
CoralLeiCN
Benchmarking GEMM TFLOPS across different arithmetic intensities using PyTorch (CUDA and MPS).
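The TFLOPS and arithmetic-intensity figures such a benchmark reports come from simple counting. A sketch of that arithmetic, assuming fp32 and ideal memory traffic (the function and the timing value are mine, for illustration):

```python
def gemm_stats(m, n, k, seconds, bytes_per_elem=4):
    """Throughput and arithmetic intensity for C[m,n] = A[m,k] @ B[k,n].
    Each output element needs k multiplies and k adds -> 2*m*n*k FLOPs.
    Minimum traffic: read A and B once, write C once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    tflops = flops / seconds / 1e12
    intensity = flops / bytes_moved          # FLOPs per byte moved
    return tflops, intensity

# Hypothetical run: a 4096^3 fp32 GEMM finishing in 2 ms
tflops, ai = gemm_stats(4096, 4096, 4096, 0.002)
print(f"{tflops:.1f} TFLOPS, intensity {ai:.0f} FLOP/byte")
```

Varying m, n, k shifts the intensity, which is exactly how a benchmark sweeps from memory-bound to compute-bound GEMMs on a roofline plot.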
arvinsingh
A comprehensive CLI tool for benchmarking GPU performance across CUDA, Triton, and PyTorch implementations.
somya2703
High-performance fused ResNet50 benchmarking and analysis project leveraging PyTorch, CUDA, and NVIDIA GPU acceleration.
aleksandarmihajlovic1
PyTorch C++/CUDA extension implementing naive and tiled shared-memory matrix multiplication kernels, with autograd support and benchmarking
wstern1234
Distributed GPU benchmarking suite for PyTorch that measures transformer and CNN performance with profiling, scaling, and visualization across CUDA.
Developed a benchmarking framework using Python and CUDA to evaluate PyTorch Fully Sharded Data Parallel (FSDP) on large-scale models.
YashovardhanReddy001
HPC–AI workflow optimization framework analyzing CPU–GPU crossover behavior and benchmarking CUDA kernel implementations across PyTorch, CuPy, and Numba.
tangefly
A collection of modern attention mechanisms with PyTorch reference implementations and high-performance Triton/CUDA kernels for research, verification, and benchmarking.
maido-39
A simple, one-click script to check PyTorch, CUDA versions, verify GPU operations, and run a quick CPU vs GPU benchmark.
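Such a check typically boils down to a few `torch` calls; a minimal sketch that degrades gracefully when PyTorch or a GPU is absent (the function name is mine, not the repo's):

```python
def env_report():
    """Collect a short PyTorch/CUDA environment report as lines of text,
    handling missing PyTorch and missing GPUs without crashing."""
    lines = []
    try:
        import torch
        lines.append(f"PyTorch: {torch.__version__}")
        if torch.cuda.is_available():
            lines.append(f"CUDA: {torch.version.cuda} - {torch.cuda.get_device_name(0)}")
        else:
            lines.append("CUDA: not available, falling back to CPU")
    except ImportError:
        lines.append("PyTorch: not installed")
    return lines

print("\n".join(env_report()))
```

A CPU-vs-GPU micro-benchmark then just times the same matmul on each device, remembering to call `torch.cuda.synchronize()` before reading the GPU timer.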
HunainMaqbool
PyTorch benchmarking project comparing CPU vs NVIDIA CUDA training using an ultra-deep neural network, showing major speedups from GPU hardware acceleration.
DennySORA
PyTorch implementation of HomeAdam/HomeAdamW optimizers (Algorithms 1-3) from arXiv:2603.02649 with CUDA/CPU benchmarks and automated CI/CD release.
kuttivicky
A PyTorch CUDA extension that implements FlashAttention-style tiled, numerically-stable attention from scratch, with Nsight/NVTX profiling, correctness tests, and performance benchmarks.
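The "tiled, numerically-stable" part refers to the online-softmax recurrence at the heart of FlashAttention: process scores one tile at a time while tracking a running max and a rescaled running sum, so no full row of scores is ever materialized. A pure-Python sketch of just that recurrence (illustrative, not the repo's kernel):

```python
import math

def online_softmax_denominator(scores, tile=2):
    """One pass over `scores` in tiles, keeping a running max m and a
    running, rescaled sum of exponentials l -- the recurrence that makes
    tiled softmax numerically stable in FlashAttention-style kernels."""
    m, l = float("-inf"), 0.0
    for i in range(0, len(scores), tile):
        block = scores[i:i + tile]
        m_new = max(m, max(block))
        # rescale the old sum to the new max, then add this tile's terms
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in block)
        m = m_new
    return m, l

m, l = online_softmax_denominator([3.0, 1.0, 4.0, 1.0, 5.0])
print(m, l)
```

The full kernel applies the same rescaling to the running weighted sum of V tiles; the denominator recurrence above is the part that guarantees the tiled result matches a one-shot stable softmax.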
JonSnow1807
CUDA implementation of Multi-Query Attention achieving 97% KV-cache memory reduction for LLM inference, enabling 32x larger batch sizes. Educational project demonstrating CUDA kernel development with PyTorch integration and Llama model benchmarks.
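The quoted ~97% reduction is consistent with simple head-count arithmetic: MQA keeps a single K/V head instead of one per query head, so the cache shrinks by a factor of the head count. A sketch assuming 32 heads and a Llama-style configuration (all parameter values here are my illustrative assumptions):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size per sequence: two tensors (K and V) per layer, each
    [n_kv_heads, seq_len, head_dim], in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

n_heads = 32  # multi-head attention keeps 32 KV heads; MQA keeps just 1
mha = kv_cache_bytes(n_layers=32, n_kv_heads=n_heads, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
reduction = 1 - mqa / mha
print(f"reduction: {reduction:.1%}, batch-size headroom: {mha // mqa}x")
```

With 32 heads the reduction is 31/32 ≈ 96.9%, and the freed memory admits roughly 32x more sequences in the cache, matching the figures in the description.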
SergiuDeveloper
Benchmarking hand-written CUDA C, Numba, and Triton self-attention kernels against PyTorch's SDPA - how fast can you go depending on the tool?