Found 20 repositories (showing 20)
MingyuJ666
[ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe that massive values are concentrated in low-frequency dimensions across different attention heads, appearing exclusively in attention queries (Q) and keys (K) while absent from values (V).
s-chh
Simple and easy-to-understand PyTorch implementation of Large Language Models (LLMs) GPT and LLaMA from scratch with detailed steps. Implemented: Byte-Pair Encoding tokenizer, Rotary Positional Embedding (RoPE), SwiGLU, RMSNorm, Mixture of Experts (MoE). Tested on a Taylor Swift song-lyrics dataset.
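Several of these from-scratch repos implement RoPE. A minimal sketch of the idea, assuming a (seq_len, dim) tensor with even dim (the function name and base are illustrative, not from any listed repo):

```python
import torch

def rope(x, base=10000.0):
    # x: (seq_len, dim) with even dim; rotate channel pairs by position-dependent angles
    seq_len, dim = x.shape
    half = dim // 2
    # one inverse frequency per channel pair, geometrically spaced
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, inv_freq)            # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1_i, x2_i) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because each pair is only rotated, vector norms are preserved and position 0 is left unchanged; applied to queries and keys, the rotation makes attention scores depend on relative position.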
merterbak
Train LLM from scratch with SOTA techniques like RoPE, GQA and KV caching.
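KV caching, mentioned here, avoids recomputing keys and values for already-generated tokens during decoding. A minimal single-head sketch (the cache layout and function name are assumptions for illustration):

```python
import torch

def attend_with_cache(q_t, k_t, v_t, cache):
    # q_t, k_t, v_t: (d,) projections for the newest token only
    # cache: dict of lists holding K/V for all previous tokens; we append, never recompute
    cache["k"].append(k_t)
    cache["v"].append(v_t)
    K = torch.stack(cache["k"])            # (t, d) — all keys so far
    V = torch.stack(cache["v"])            # (t, d) — all values so far
    scores = (K @ q_t) / q_t.numel() ** 0.5
    return scores.softmax(dim=0) @ V       # (d,) attention output for the new token
```

Per step this costs O(t·d) instead of recomputing the full O(t²·d) attention over the sequence.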
petermartens98
Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full training pipeline with transformer components including RMSNorm, Rotary Position Embeddings (RoPE), Grouped-Query Attention (GQA), and SwiGLU layers. Trained with hybrid Muon + AdamW optimizer, causal masking, efficient batching, and evaluation tools.
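Grouped-Query Attention (GQA), used in this and several other entries, lets groups of query heads share one K/V head to shrink the KV cache. A minimal sketch with causal masking (tensor layout is an assumption, not this repo's actual code):

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d)
    # each group of n_heads // n_kv_heads query heads shares one K/V head
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=0)   # expand K/V heads to match query heads
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # causal mask: position i attends only to positions <= i
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v       # (n_heads, seq, d)
```

With n_kv_heads == n_heads this reduces to standard multi-head attention; with n_kv_heads == 1 it is multi-query attention (MQA).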
sudhanshukuumar
📚 Enhance your interview preparation for LLM algorithm internships with insights on DeepSeek, PPO, RoPE, and RLHF core concepts.
Hasin-Al
An LLM built from scratch using decoupled RoPE, Multi-Head Latent Attention, and Transformer blocks with both pre- and post-normalization, plus a Mixture of Experts (MoE).
AhriCat
A new type of LLM / machine-learning model, based on RoPE, Hymba, and the Kronecker transform, combined with a ternary tokenizer using a [-1, 1] token space.
Built a Qwen3-style large language model from scratch in Python, implementing transformer architecture with GQA, SwiGLU activations, RoPE embeddings, and a custom Muon optimizer, gaining hands-on experience in LLM training, optimization, and dataset handling.
Build an end-to-end Large Language Model from scratch: implement transformers, train a tiny LLM, modernize it with RoPE and RMSNorm, scale training, add Mixture-of-Experts, perform Supervised Fine-Tuning, train a Reward Model, and apply RLHF with both PPO and GRPO for alignment.
gauravkumarsl
No description available
Shumatsurontek
Vision-LLM integration with RoPE for arbitrary resolution support and temporal downsampling
sealsnipe
🚀 Complete LLM Training System - GPU-optimized with torch.compile, GQA, RoPE, SwiGLU, and production-ready inference for consumer hardware (RTX 4070 Ti optimized)
milasd
Implementation of a Byte-Pair Encoding tokenizer, RoPE embeddings, and Transformer LLM distributed training & inference from scratch with PyTorch (and MLX), plus a Flash Attention 2 Triton kernel.
VidyasagarDudekula
An end-to-end framework for analyzing LLM behavior. Implements a Llama-style architecture with Grouped Query Attention (GQA) and RoPE, coupled with a comparative analysis suite for deterministic vs. stochastic sampling algorithms.
brianmeyer
I built a tiny LLM from scratch to understand how GPT-4 and LLaMA actually work. 10M params, trained on Shakespeare, modernized with RMSNorm + SwiGLU + RoPE + KV cache. Every mistake documented.
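The "RMSNorm + SwiGLU" modernization mentioned here and in several other entries is small enough to sketch directly (weight shapes and function names are illustrative assumptions):

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # normalize by root-mean-square over the last dim; no mean subtraction, no bias
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ w_gate) gates (x @ w_up), then project back down
    return (torch.nn.functional.silu(x @ w_gate) * (x @ w_up)) @ w_down
```

RMSNorm drops LayerNorm's mean-centering and bias, which is cheaper and works well in practice; SwiGLU replaces the plain two-layer MLP with a gated variant.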
kirsten-1
High-performance Triton kernel library for LLM training with 12 fused operators (AttnRes, RMSNorm, RoPE, CrossEntropy, GRPO, JSD, FusedLinear, etc.) — up to 24x faster than PyTorch with 78% memory savings, outperforming Liger-Kernel on RTX 5090
NguyenQuangTrung19
Deep Learning final project exploring advanced attention mechanisms in LLMs (self-attention, MQA, GQA, Flash/linear/sparse attention, RoPE) with PyTorch demos, plus a CNN + Transformer-Decoder OCR model for image-to-text with evaluation on test data.
subramanyasrevankar
GPT-OSS 270B is a 270B-parameter open-source LLM built with transformer architecture, using token embeddings, RMSNorm, sliding/full attention, RoPE positional encodings, and feed-forward layers. Optimized for efficient training, inference, and high-quality next-token prediction.
ralolooafanxyaiml
A from-scratch PyTorch LLM implementing Sparse Mixture-of-Experts (MoE) with Top-2 gating. Integrates modern Llama-3 components (RMSNorm, SwiGLU, RoPE, GQA) and a custom-coded Byte-Level BPE tokenizer. Pre-trained on a curated corpus of existential & dark philosophical literature.
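The Top-2 gating this entry describes routes each token to its two highest-scoring experts and mixes their outputs by renormalized gate weights. A minimal dense-loop sketch (real MoE layers batch this more efficiently; names are illustrative):

```python
import torch

def moe_top2(x, gate_w, experts):
    # x: (tokens, d); gate_w: (d, n_experts); experts: list of callables (tokens, d) -> (tokens, d)
    logits = x @ gate_w
    top_vals, top_idx = logits.topk(2, dim=-1)   # pick the 2 best experts per token
    weights = top_vals.softmax(dim=-1)           # renormalize gate scores over the top-2
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            sel = top_idx[:, slot] == e          # tokens whose slot-th choice is expert e
            if sel.any():
                out[sel] += weights[sel, slot, None] * expert(x[sel])
    return out
```

Because the two gate weights sum to 1 per token, identity experts reproduce the input exactly; in a real layer each expert is its own feed-forward block and an auxiliary loss balances the routing.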