Found 37 repositories (showing 30)
EMI-Group
A GPU-accelerated library for Tree-based Genetic Programming, leveraging PyTorch and custom CUDA kernels for high-performance evolutionary computation. It supports symbolic regression, classification, and policy optimization with advanced features like multi-output trees and benchmark tools.
kentstone84
Unlock full RTX 5080 performance in PyTorch! PyTorch does not support RTX 5080 (sm_120) natively, so I built custom CUDA 12.8 drivers and PyTorch binaries to make it work. This repo contains build scripts, benchmarks, and installation guides for running AI models at max efficiency on the RTX 5080.
viai957
Flash Attention from First Principles: Triton & CUDA implementations with handwritten derivations, notebooks, and Colab benchmarks comparing PyTorch and Triton versions.
sriharshapy
CUDA C reduction kernels benchmarking with Triton, PyTorch and CUB primitives
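For context, the access pattern such reduction kernels implement can be sketched in pure Python: each step combines pairs of elements `stride` apart, halving the number of active elements, so n values reduce in about log2(n) steps. This is illustrative only, not the repo's CUDA code:

```python
def tree_reduce(xs):
    """Pairwise tree reduction, mimicking a CUDA reduction kernel's schedule:
    each pass adds elements `stride` apart, then doubles the stride, so the
    number of active elements halves every step."""
    xs = list(xs)
    n = len(xs)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            xs[i] += xs[i + stride]
        stride *= 2
    return xs[0]

print(tree_reduce([1, 2, 3, 4, 5]))  # same result as sum(), different schedule
```

On a GPU the inner loop runs in parallel across threads; the sequential sketch only shows which pairs get combined at each step.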
detker
CUDA Flash Attention 2 implementation. Includes forward/backward passes, Python benchmarking framework, and detailed comparisons vs PyTorch.
LottoLottoLotto
A high-performance, CUDA-fused Sinkhorn layer for PyTorch—built with OpenAI Triton and delivering up to 7.8x speedup with ~10–22% memory savings in recent benchmarks.
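The layer's fused implementation lives in the repo; the underlying algorithm is the standard Sinkhorn iteration, which alternately rescales rows and columns of a kernel matrix until both marginals match. A generic, dense pure-Python sketch with uniform marginals (function name and defaults are mine, not the repo's API):

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Plain Sinkhorn iterations on a small square cost matrix: form
    K = exp(-cost/reg), then alternately rescale rows (u) and columns (v)
    toward uniform marginals. Returns the transport plan diag(u) K diag(v)."""
    n = len(cost)
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    r = [1.0 / n] * n          # uniform row/column marginals
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [r[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

P = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
print([[round(x, 3) for x in row] for row in P])
```

A fused CUDA version gains its speedup by keeping K and the u/v updates on-chip rather than materializing intermediates, which is where the quoted memory savings come from.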
addhyay
CUDA implementation of an optimized softmax kernel, benchmarked against PyTorch's softmax operation.
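For reference, the numerical trick any softmax kernel must get right (subtracting the row max before exponentiating, so `exp` never overflows) can be sketched in pure Python; this is illustrative, not the repo's CUDA code:

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by the max before exponentiating,
    the same trick GPU softmax kernels and PyTorch rely on."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A naive exp() would overflow on these inputs; the shifted version is fine.
print([round(p, 4) for p in softmax([1000.0, 1001.0, 1002.0])])
```

A CUDA kernel does the same in three passes over a row (max, sum of exponentials, normalize), typically fused and parallelized with warp reductions.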
dadosnapratica
Production-grade system for benchmarking CPU vs GPU deep learning performance. Train identical CNN models on CIFAR-10, capture comprehensive metrics (time, accuracy, GPU utilization, power, energy), and visualize results through an interactive Streamlit dashboard. Built with PyTorch, CUDA, and NVML for academic research and ML systems engineering.
DandinPower
This repository provides a simple benchmarking suite for measuring the latency of a single forward pass through a HuggingFace Transformer model on CUDA GPUs. It supports both the standard PyTorch implementation and an optional Liger kernel acceleration for causal-language models.
saurabh-singh-rajput
No description available
namanadep
GPU/HPC systems — cuda-pytorch-optimization-benchmarks
philipnickel
MLS-MPM benchmark: JAX vs PyTorch vs CUDA
wattanapong
Benchmarks numerical operations using CUDA-based PyTorch.
HankWang-WL
CUDA batched matmul with a PyTorch baseline, benchmarked for AI acceleration.
sankalphegde
CUDA C reduction kernels benchmarking with Triton, PyTorch and CUB primitives
CoralLeiCN
Benchmarking GEMM TFLOPS across different arithmetic intensities using PyTorch (CUDA and MPS).
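The TFLOPS and arithmetic-intensity figures such a benchmark reports come from simple counting. A sketch of that arithmetic, assuming fp32 and ideal memory traffic (the function and the timing value are mine, for illustration):

```python
def gemm_stats(m, n, k, seconds, bytes_per_elem=4):
    """Throughput and arithmetic intensity for C[m,n] = A[m,k] @ B[k,n].
    Each output element needs k multiplies and k adds -> 2*m*n*k FLOPs.
    Minimum traffic: read A and B once, write C once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    tflops = flops / seconds / 1e12
    intensity = flops / bytes_moved          # FLOPs per byte moved
    return tflops, intensity

# Hypothetical run: a 4096^3 fp32 GEMM finishing in 2 ms
tflops, ai = gemm_stats(4096, 4096, 4096, 0.002)
print(f"{tflops:.1f} TFLOPS, intensity {ai:.0f} FLOP/byte")
```

Varying m, n, k shifts the intensity, which is exactly how a benchmark sweeps from memory-bound to compute-bound GEMMs on a roofline plot.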
arvinsingh
A comprehensive CLI tool for benchmarking GPU performance across CUDA, Triton, and PyTorch implementations.
somya2703
High-performance fused ResNet50 benchmarking and analysis project leveraging PyTorch, CUDA, and NVIDIA GPU acceleration.
aleksandarmihajlovic1
PyTorch C++/CUDA extension implementing naive and tiled shared-memory matrix multiplication kernels, with autograd support and benchmarking
wstern1234
Distributed GPU benchmarking suite for PyTorch that measures transformer and CNN performance with profiling, scaling, and visualization across CUDA.
Developed a benchmarking framework using Python and CUDA to evaluate PyTorch Fully Sharded Data Parallel (FSDP) on large-scale models.
YashovardhanReddy001
HPC–AI workflow optimization framework analyzing CPU–GPU crossover behavior and benchmarking CUDA kernel implementations across PyTorch, CuPy, and Numba.
tangefly
A collection of modern attention mechanisms with PyTorch reference implementations and high-performance Triton/CUDA kernels for research, verification, and benchmarking.
maido-39
A simple, one-click script to check PyTorch, CUDA versions, verify GPU operations, and run a quick CPU vs GPU benchmark.
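Such a check typically boils down to a few `torch` calls; a minimal sketch that degrades gracefully when PyTorch or a GPU is absent (the function name is mine, not the repo's):

```python
def env_report():
    """Collect a short PyTorch/CUDA environment report as lines of text,
    handling missing PyTorch and missing GPUs without crashing."""
    lines = []
    try:
        import torch
        lines.append(f"PyTorch: {torch.__version__}")
        if torch.cuda.is_available():
            lines.append(f"CUDA: {torch.version.cuda} - {torch.cuda.get_device_name(0)}")
        else:
            lines.append("CUDA: not available, falling back to CPU")
    except ImportError:
        lines.append("PyTorch: not installed")
    return lines

print("\n".join(env_report()))
```

A CPU-vs-GPU micro-benchmark then just times the same matmul on each device, remembering to call `torch.cuda.synchronize()` before reading the GPU timer.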
HunainMaqbool
PyTorch benchmarking project comparing CPU vs NVIDIA CUDA training using an ultra-deep neural network, showing major speedups from GPU hardware acceleration.
DennySORA
PyTorch implementation of HomeAdam/HomeAdamW optimizers (Algorithms 1-3) from arXiv:2603.02649 with CUDA/CPU benchmarks and automated CI/CD release.
kuttivicky
A PyTorch CUDA extension that implements FlashAttention-style tiled, numerically-stable attention from scratch, with Nsight/NVTX profiling, correctness tests, and performance benchmarks.
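The "tiled, numerically-stable" part refers to the online-softmax recurrence at the heart of FlashAttention: process scores one tile at a time while tracking a running max and a rescaled running sum, so no full row of scores is ever materialized. A pure-Python sketch of just that recurrence (illustrative, not the repo's kernel):

```python
import math

def online_softmax_denominator(scores, tile=2):
    """One pass over `scores` in tiles, keeping a running max m and a
    running, rescaled sum of exponentials l -- the recurrence that makes
    tiled softmax numerically stable in FlashAttention-style kernels."""
    m, l = float("-inf"), 0.0
    for i in range(0, len(scores), tile):
        block = scores[i:i + tile]
        m_new = max(m, max(block))
        # rescale the old sum to the new max, then add this tile's terms
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in block)
        m = m_new
    return m, l

m, l = online_softmax_denominator([3.0, 1.0, 4.0, 1.0, 5.0])
print(m, l)
```

The full kernel applies the same rescaling to the running weighted sum of V tiles; the denominator recurrence above is the part that guarantees the tiled result matches a one-shot stable softmax.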
JonSnow1807
CUDA implementation of Multi-Query Attention achieving 97% KV-cache memory reduction for LLM inference, enabling 32x larger batch sizes. Educational project demonstrating CUDA kernel development with PyTorch integration and Llama model benchmarks.
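The quoted ~97% reduction is consistent with simple head-count arithmetic: MQA keeps a single K/V head instead of one per query head, so the cache shrinks by a factor of the head count. A sketch assuming 32 heads and a Llama-style configuration (all parameter values here are my illustrative assumptions):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size per sequence: two tensors (K and V) per layer, each
    [n_kv_heads, seq_len, head_dim], in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

n_heads = 32  # multi-head attention keeps 32 KV heads; MQA keeps just 1
mha = kv_cache_bytes(n_layers=32, n_kv_heads=n_heads, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
reduction = 1 - mqa / mha
print(f"reduction: {reduction:.1%}, batch-size headroom: {mha // mqa}x")
```

With 32 heads the reduction is 31/32 ≈ 96.9%, and the freed memory admits roughly 32x more sequences in the cache, matching the figures in the description.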
SergiuDeveloper
Benchmarking hand-written CUDA C, Numba, and Triton self-attention kernels against PyTorch's SDPA - how fast can you go depending on the tool?