Found 63 repositories (showing 30)
SzymonOzog
Step by step implementation of a fast softmax kernel in CUDA
prasannakotyal
Flash Attention implementation. A minimal CUDA implementation of Flash Attention with tiled computation and online softmax; educational implementation based on Dao et al., 2022.
Flash Attention from scratch: tiled CUDA forward kernel, online softmax with running max and correction factor, recomputation trick in the backward pass, O(N) memory; full forward and backward verified against PyTorch autograd to 1e-6.
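The "online softmax with running max and correction factor" mentioned in the entries above can be sketched in NumPy (an illustration of the single-pass algorithm only, not code from either repository):

```python
import numpy as np

def online_softmax(x):
    """Single-pass softmax: maintain a running max and a running
    denominator, rescaling the denominator by a correction factor
    exp(m_old - m_new) whenever the running max grows."""
    m = -np.inf   # running max
    d = 0.0       # running denominator
    for v in x:
        m_new = max(m, v)
        d = d * np.exp(m - m_new) + np.exp(v - m_new)  # correction + new term
        m = m_new
    return np.exp(np.asarray(x) - m) / d
```

In a CUDA kernel this lets each tile of scores be processed as it arrives, which is what makes Flash Attention's O(N) memory possible: the full score row never has to be materialized before normalization.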
fattorib
Softmax CUDA kernel :)
waefrebeorn
A Vulkan-based backend for PyTorch-like tensor operations, leveraging GLSL shaders for high-performance compute tasks like addition, matrix multiplication, ReLU, softmax, 2D convolution, and pooling. This project demonstrates how Vulkan can emulate CUDA-like functionality with dynamic pipelines and SPIR-V shader execution for deep learning tasks.
alpha0422
High performance implementation of CUDA label smoothing with softmax cross entropy loss.
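The loss that the entry above fuses can be written out in NumPy as a reference (a hypothetical sketch of the standard label-smoothing formulation, not the repository's fused CUDA kernel): the target distribution puts 1 - eps on the true class plus eps/K spread uniformly over all K classes.

```python
import numpy as np

def smoothed_softmax_xent(logits, target, eps=0.1):
    """Softmax cross entropy with label smoothing over K classes."""
    K = logits.shape[-1]
    # numerically stable log-softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # smoothed target: eps/K everywhere, plus (1 - eps) on the true class
    smooth = np.full(K, eps / K)
    smooth[target] += 1.0 - eps
    return -(smooth * log_p).sum()
```

A fused kernel computes the log-softmax and the smoothed inner product in one pass, avoiding a separate K-wide probability tensor.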
forec1
My implementation of softmax splatting for pytorch in cuda
william20001120
An example repository for learning and practicing CUDA kernel optimization, covering matrix multiplication (SGEMM), matrix transpose, various reductions (sum/max/softmax/matrix softmax), GEMV, elementwise operators, LayerNorm, plus cuBLAS comparisons and several introductory examples. The goal is to break down typical optimization techniques step by step and provide reproducible experiments.
liaomingg
weighted_softmax_loss_layer for Caffe; includes both CPU and CUDA versions.
Fused causal scaled dot product attention in a single CUDA kernel using CuPy RawKernel. QK dot products, causal mask, softmax, and AV weighted sum all computed inside one block. No attention matrix written to global memory. Up to 11.6x faster than CuPy at short sequence lengths, breakeven at T=128.
Mog9
Fused KV cache attention for single-token decode in one CUDA kernel using CuPy RawKernel. One query attending over the full KV cache, dot products, softmax, and weighted V sum computed entirely in shared memory with no score vector written to global memory. 8.5x faster than CuPy at short cache lengths, 2.5x at T_cache=1024.
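The math the single-token decode entry above fuses — one query against the full KV cache — can be sketched in NumPy (an illustration of the attention computation only; the repository keeps all of these steps in shared memory inside one CUDA kernel):

```python
import numpy as np

def decode_attention(q, K, V):
    """One query vector attending over a KV cache of T_cache entries:
    dot products, softmax over cache positions, weighted sum of values."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)      # (T_cache,) attention scores
    scores -= scores.max()           # numerical stability
    w = np.exp(scores)
    w /= w.sum()                     # softmax over the cache
    return w @ V                     # weighted sum of value rows
```

Fusing these steps matters because the score vector is tiny but the round-trips to global memory between separate kernels dominate at decode time.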
ajagtapdev
CUDA matrix multiplication, reduction, and softmax kernels optimized for my RTX 4070 in C++17
Abishek-Chakravarthy
Parallelized Transformer models using OpenMP, CUDA, and MPI. Achieved up to 42x GPU speedup and 17.68x MPI speedup by optimizing matrix ops, self-attention, GELU, and softmax. Used shared memory, thread tuning, and distributed communication for efficient computation.
qixuxiang
https://zhuanlan.zhihu.com/p/341059988
addhyay
Implementing an optimized softmax operation in CUDA and benchmarking it against PyTorch's softmax.
xyz-zy
CS 378 Concurrency: Final Project
intelav
CUDA Softmax Benchmark Suite — compares global memory, GPU-resident, and Unified Memory + Prefetch variants using Nsight profiling and event timing.
VVinstonSmith
Here we introduce several basic CUDA kernel optimizations, including Reduce, GEMM, GEMV, SpMV, and Softmax.
This repository contains custom CUDA kernels for a linear layer, softmax, and ReLU, integrated with Python to build a neural network.
Fused masked softmax + dropout in a single CUDA kernel using CuPy RawKernel. 3–5.7x faster than a standard CuPy multi-op baseline across sequence lengths 128–2048.
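The semantics of the fused masked-softmax + dropout step above can be written as a NumPy reference (a sketch of the standard operation with inverted dropout scaling; the function name and signature are hypothetical, and the repository performs all of this in one CUDA kernel):

```python
import numpy as np

def masked_softmax_dropout(scores, mask, p=0.1, rng=None):
    """Masked softmax followed by inverted dropout.
    `mask` is True where positions may be attended to."""
    rng = rng or np.random.default_rng(0)
    s = np.where(mask, scores, -np.inf)        # masked positions get 0 weight
    s = s - s.max(axis=-1, keepdims=True)      # numerical stability
    p_attn = np.exp(s)
    p_attn /= p_attn.sum(axis=-1, keepdims=True)
    keep = rng.random(p_attn.shape) >= p
    return np.where(keep, p_attn / (1.0 - p), 0.0)  # scale kept entries by 1/(1-p)
```

Fusing avoids writing the intermediate probability matrix to global memory between the softmax and dropout kernels, which is where the 3–5.7x speedup over the multi-op baseline comes from.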
codingwithshawnyt
This repository contains a highly optimized, from-scratch implementation of the FlashAttention algorithm in CUDA. Designed for maximum performance on NVIDIA GPUs, this kernel demonstrates advanced memory hierarchy management, tiling strategies, and numerical stability techniques (Online Softmax).
srinidhi9659
Implemented softmax functionality in CUDA
Kyle-Lewis
Implementation of the Softmax regression algorithm using CUDA and generating visualizations.
bryanzhang
No description available
qiyueyuanwei
Implement matrix Softmax using CUDA
sumantrad
Softmax Kernel using CUDA
yyq0210
No description available
eitanturok
No description available
MeghanaShanthappa
No description available
Shiv22Wabale
No description available