Found 25 repositories (showing 25)
ByteDance-Seed
Distributed Compiler based on Triton for Parallel Systems
Harly-1506
End-to-end MLOps architecture built for polyp segmentation — featuring distributed Ray training, MLflow experiment tracking, and automated CI/CD with Kubeflow Pipelines and KServe (Triton) deployment on Google Kubernetes Engine.
This system operates in a distributed environment using NVIDIA Triton.
caroline430
Everything-in-one-place GPU kernel programming for ML engineers. Covers CUDA, Triton, Flash Attention 2/3, paged attention, Mamba, speculative decoding, FP8/AWQ/GPTQ quantization, cuBLAS/cuDNN/CUTLASS, Nsight profiling, PyTorch custom ops, FSDP, tensor parallelism, and distributed training. SOTA 2026.
WaffleBits
Distributed Inference Benchmarking Tool for NVIDIA Triton Server
666keke
Production-ready distributed YOLO inference pipeline powered by NVIDIA Triton Inference Server. Supports Kubernetes orchestration and Docker deployment.
piotrm-nvidia
No description available
Irving1113
triton-distributed-tutorial
triton-inference
No description available
xuzhao9
Benchmark for Triton-Distributed
tongxili
No description available
Triton distributed inference SLA simulator
blockneural
C++-based distributed AI inference system using NVIDIA Triton with gateway, scheduler, and blockchain-based payment integration on Ethereum.
Echo13Bear
No description available
A simple YouTube-like video hosting platform made scalable with consistent-hash-based distributed storage. Built using Go, gRPC, and SQLite.
parth-shettiwar
Flash Attention 2 implemented in Triton, plus distributed data-parallel training.
sriramgkn
Code for my blog on distributed training (CUDA, ONNX, TensorRT, Triton)
theBeginner86
A distributed performance benchmark engine for ASR workloads on Triton Inference Servers
blockneural
Agent service for managing Triton inference containers, coordinating with gateway and scheduler for distributed AI workloads.
kennethvuongcode
Optimized Triton-based matrix multiplication kernel with ReLU and addition, plus MPI-based tensor and data parallel communication for distributed training.
sasi-chappidi
Built an end-to-end LLM infrastructure project with PyTorch, distributed training, FastAPI serving, benchmarking, ONNX export, and Triton-compatible deployment structure.
This project implements systems-level optimizations for transformer training, including custom Triton kernels, PyTorch distributed training, optimizer state sharding, and memory/latency benchmarking tools.
milasd
Implementation of the Byte-Pair Encoding Tokenizer, RoPE Embeddings, Transformer LLM distributed training & inference from scratch w/ PyTorch (and MLX), with a Flash Attention 2 Triton kernel.
nguyenhuyenkiohna
A high-performance, distributed video analytics framework for Smart City traffic monitoring. Optimized with YOLOv11, TensorRT, and ByteTrack. Architecture powered by Apache Kafka and Triton Inference Server for scalable, real-time vehicle Re-ID and analytics.
dagc-ai
Hands-on AI infrastructure from the ground up: GPU memory hierarchy, CUDA kernel optimization, Triton, distributed training, and inference serving. Real benchmarks across the full compute stack, from naive kernels to Groq LPUs, Tenstorrent, AMD MI300X, and Google TPU
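One entry above (Echo13Bear) routes video data with consistent-hash-based distributed storage. As background, a minimal sketch of that technique — a sorted hash ring with virtual nodes — is shown below; this is a generic illustration and makes no assumptions about that repository's actual implementation, which uses Go rather than Python.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring: each key maps to the first node
    clockwise from its hash; virtual nodes smooth the distribution so
    removing one node only remaps the keys it owned."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Place `vnodes` replicas of this node around the ring.
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, (h, node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping at the end.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]
```

The defining property: when a node leaves, only keys that hashed to that node move, while all other key-to-node assignments stay fixed — which is what makes the scheme attractive for distributed storage.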