Found 63 repositories (showing 30)
SzymonOzog
Step by step implementation of a fast softmax kernel in CUDA
prasannakotyal
Flash Attention implementation. A minimal CUDA implementation of Flash Attention with tiled computation and online softmax; educational implementation based on Dao et al., 2022.
Flash Attention from scratch: tiled CUDA forward kernel, online softmax with running max and correction factor, recomputation trick in the backward pass, O(N) memory; full forward and backward verified against PyTorch autograd to 1e-6.
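The "online softmax with running max and correction factor" mentioned in the entries above can be sketched in NumPy (an illustration of the single-pass algorithm only, not code from either repository):

```python
import numpy as np

def online_softmax(x):
    """Single-pass softmax: maintain a running max and a running
    denominator, rescaling the denominator by a correction factor
    exp(m_old - m_new) whenever the running max grows."""
    m = -np.inf   # running max
    d = 0.0       # running denominator
    for v in x:
        m_new = max(m, v)
        d = d * np.exp(m - m_new) + np.exp(v - m_new)  # correction + new term
        m = m_new
    return np.exp(np.asarray(x) - m) / d
```

In a CUDA kernel this lets each tile of scores be processed as it arrives, which is what makes Flash Attention's O(N) memory possible: the full score row never has to be materialized before normalization.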
fattorib
Softmax CUDA kernel :)
waefrebeorn
A Vulkan-based backend for PyTorch-like tensor operations, leveraging GLSL shaders for high-performance compute tasks like addition, matrix multiplication, ReLU, softmax, 2D convolution, and pooling. This project demonstrates how Vulkan can emulate CUDA-like functionality with dynamic pipelines and SPIR-V shader execution for deep learning tasks.
alpha0422
High performance implementation of CUDA label smoothing with softmax cross entropy loss.
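The loss that the entry above fuses can be written out in NumPy as a reference (a hypothetical sketch of the standard label-smoothing formulation, not the repository's fused CUDA kernel): the target distribution puts 1 - eps on the true class plus eps/K spread uniformly over all K classes.

```python
import numpy as np

def smoothed_softmax_xent(logits, target, eps=0.1):
    """Softmax cross entropy with label smoothing over K classes."""
    K = logits.shape[-1]
    # numerically stable log-softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # smoothed target: eps/K everywhere, plus (1 - eps) on the true class
    smooth = np.full(K, eps / K)
    smooth[target] += 1.0 - eps
    return -(smooth * log_p).sum()
```

A fused kernel computes the log-softmax and the smoothed inner product in one pass, avoiding a separate K-wide probability tensor.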
forec1
My implementation of softmax splatting for pytorch in cuda
william20001120
An example repository for learning and practicing CUDA kernel optimization, covering matrix multiplication (SGEMM), matrix transpose, various reductions (sum/max/softmax/matrix softmax), GEMV, elementwise operators, LayerNorm, plus cuBLAS comparisons and several introductory examples. The goal is to break down typical optimization techniques step by step and provide reproducible experiments.
liaomingg
weighted_softmax_loss_layer for Caffe; includes both CPU and CUDA versions.
Fused causal scaled dot product attention in a single CUDA kernel using CuPy RawKernel. QK dot products, causal mask, softmax, and AV weighted sum all computed inside one block. No attention matrix written to global memory. Up to 11.6x faster than CuPy at short sequence lengths, breakeven at T=128.
Mog9
Fused KV cache attention for single-token decode in one CUDA kernel using CuPy RawKernel. One query attending over the full KV cache, dot products, softmax, and weighted V sum computed entirely in shared memory with no score vector written to global memory. 8.5x faster than CuPy at short cache lengths, 2.5x at T_cache=1024.
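The math the single-token decode entry above fuses — one query against the full KV cache — can be sketched in NumPy (an illustration of the attention computation only; the repository keeps all of these steps in shared memory inside one CUDA kernel):

```python
import numpy as np

def decode_attention(q, K, V):
    """One query vector attending over a KV cache of T_cache entries:
    dot products, softmax over cache positions, weighted sum of values."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)      # (T_cache,) attention scores
    scores -= scores.max()           # numerical stability
    w = np.exp(scores)
    w /= w.sum()                     # softmax over the cache
    return w @ V                     # weighted sum of value rows
```

Fusing these steps matters because the score vector is tiny but the round-trips to global memory between separate kernels dominate at decode time.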
ajagtapdev
CUDA matrix multiplication, reduction, and softmax kernels optimized for my RTX 4070 in C++17
Abishek-Chakravarthy
Parallelized Transformer models using OpenMP, CUDA, and MPI. Achieved up to 42x GPU speedup and 17.68x MPI speedup by optimizing matrix ops, self-attention, GELU, and softmax. Used shared memory, thread tuning, and distributed communication for efficient computation.
qixuxiang
https://zhuanlan.zhihu.com/p/341059988
addhyay
Implementing an optimized softmax operation in CUDA and benchmarking it against PyTorch's softmax.
xyz-zy
CS 378 Concurrency: Final Project
intelav
CUDA Softmax Benchmark Suite — compares global memory, GPU-resident, and Unified Memory + Prefetch variants using Nsight profiling and event timing.
VVinstonSmith
Here we introduce several basic CUDA kernel optimizations, including Reduce, GEMM, GEMV, SpMV, and Softmax.
This repository contains custom CUDA kernels for a linear layer, softmax, and ReLU, integrated with Python to build a neural network.
Fused masked softmax + dropout in a single CUDA kernel using CuPy RawKernel. 3–5.7x faster than a standard CuPy multi-op baseline across sequence lengths 128–2048.
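The semantics of the fused masked-softmax + dropout step above can be written as a NumPy reference (a sketch of the standard operation with inverted dropout scaling; the function name and signature are hypothetical, and the repository performs all of this in one CUDA kernel):

```python
import numpy as np

def masked_softmax_dropout(scores, mask, p=0.1, rng=None):
    """Masked softmax followed by inverted dropout.
    `mask` is True where positions may be attended to."""
    rng = rng or np.random.default_rng(0)
    s = np.where(mask, scores, -np.inf)        # masked positions get 0 weight
    s = s - s.max(axis=-1, keepdims=True)      # numerical stability
    p_attn = np.exp(s)
    p_attn /= p_attn.sum(axis=-1, keepdims=True)
    keep = rng.random(p_attn.shape) >= p
    return np.where(keep, p_attn / (1.0 - p), 0.0)  # scale kept entries by 1/(1-p)
```

Fusing avoids writing the intermediate probability matrix to global memory between the softmax and dropout kernels, which is where the 3–5.7x speedup over the multi-op baseline comes from.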
codingwithshawnyt
This repository contains a highly optimized, from-scratch implementation of the FlashAttention algorithm in CUDA. Designed for maximum performance on NVIDIA GPUs, this kernel demonstrates advanced memory hierarchy management, tiling strategies, and numerical stability techniques (Online Softmax).
srinidhi9659
Implemented softmax functionality in CUDA
Kyle-Lewis
Implementation of the Softmax regression algorithm using CUDA and generating visualizations.
bryanzhang
No description available
qiyueyuanwei
Implement matrix Softmax using CUDA
sumantrad
Softmax Kernel using CUDA
yyq0210
No description available
eitanturok
No description available
MeghanaShanthappa
No description available
Shiv22Wabale
No description available