🧠 LLM from Scratch — GPT-Style Transformer in PyTorch

Pranesh Nikhar · Mar 25, 2026 · 3 min read

 󰈤

A complete GPT-style decoder-only transformer built from scratch in PyTorch — no transformers library, no HF. BPE tokenizer, training loop, FastAPI server, and React chat frontend.

🎯 What It Does

LLM from Scratch is a complete GPT-style decoder-only transformer language model (~5.2M parameters) built entirely in PyTorch without the Hugging Face transformers library. It includes a BPE tokenizer trained from scratch, a training loop (AdamW, cosine learning-rate schedule, gradient accumulation, gradient clipping), a FastAPI REST API for serving, and a React chat frontend.

$ python train.py
Step 1000 | loss 3.42 | lr 1.2e-4 | 12.5 it/s
Step 2000 | loss 2.89 | lr 8.5e-5 | 12.5 it/s
...
Generated: "shall I compare thee to a summer's day..."

🧱 Tech Stack

Component	Technology
Backend	Python 3.10+, PyTorch 2+, NumPy
API	FastAPI + uvicorn
Frontend	React 19, Vite 8, Tailwind CSS 4, Axios
Tokenizer	Custom BPE (byte-level, 5000 merges)

The model and tokenizer pull in no modeling or tokenization libraries beyond PyTorch itself: no transformers, no SentencePiece, no tiktoken.

🏗️ Architecture

BACKEND/
├── tokenizer.py         # BPE tokenizer (byte-level, configurable merges)
├── attention.py         # Multi-head causal self-attention (optional flash attn)
├── transformer_block.py # Pre-norm: LN → Attn → residual → LN → FFN → residual
├── model.py             # GPT: embeddings → stacked blocks → final LN → LM head
├── dataset.py           # Sliding-window dataset (predict shifted-by-1)
├── train.py             # AdamW, cosine LR, gradient accumulation, clipping
├── generate.py          # CLI text generation
├── server.py            # FastAPI endpoints: /generate, /health
└── config.py            # Hyperparameters + device detection

FRONTEND/
├── App.jsx              # Main React app
├── Chat.jsx             # Chat interface
└── api.js               # Axios API client

🧬 Model Architecture

Parameter	Value
Embedding dim	192
Layers	6
Attention heads	6 (head dim 32)
FFN hidden dim	768
Total params	~5.2M
Vocabulary	5000 BPE tokens
Context length	256 tokens

The architecture follows the GPT-2 pattern:

Token embeddings + learned positional embeddings
Pre-norm transformer blocks (LayerNorm before attention and FFN)
Causal multi-head self-attention with optional Flash Attention
Feed-forward with GELU activation
Final LayerNorm followed by linear LM head
Weight tying between embedding and LM head
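
To make the causal part of the attention concrete, here is a minimal single-head sketch in plain Python lists, so it runs without PyTorch. It is an illustration rather than the code in attention.py, which would presumably do the same computation with batched tensors across 6 heads; the function name causal_attention is hypothetical.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal self-attention over plain Python lists.

    q, k, v: lists of T vectors (one per position), each of length d.
    Position t may attend only to positions 0..t.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out
```

Because the inner loop stops at position t, changing a later token can never alter an earlier output. That property is what lets the model train on every position of a window at once.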

📝 BPE Tokenizer

The BPE tokenizer is implemented from scratch with:

Byte-level encoding (handles any UTF-8 input)
Configurable number of merge operations (default: 5000)
Trainable on any text corpus
Serialization via pickle for reuse after training

This avoids any dependency on Hugging Face tokenizers or tiktoken, while still giving reasonable tokenization quality on the Shakespeare corpus.
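
As a rough illustration of byte-level BPE, the sketch below learns merges over raw UTF-8 bytes and round-trips text through encode and decode. The function names (train_bpe, merge_pair, encode, decode) are hypothetical and not taken from tokenizer.py. It also skips refinements a fuller tokenizer might have, such as regex pre-splitting or special tokens.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))   # base vocabulary: byte values 0..255
    merges = {}                        # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no pair repeats, so stop early
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges

def encode(text, merges):
    """Convert text to token ids by replaying merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Convert token ids back to text by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Starting from the 256 byte values means any UTF-8 string can be encoded, even if it contains characters never seen during training.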

🏋️ Training Setup

Setting	Value
Optimizer	AdamW
Learning rate	Cosine schedule with linear warmup
Weight decay	Applied to weights only (not biases/norms)
Gradient accumulation	4 steps
Gradient clipping	1.0
Batch size	64 sequences
Device	Auto-detect (CUDA / MPS / CPU)
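
The learning-rate schedule in the table can be sketched as a pure function of the step number. All numeric defaults below (max_lr, min_lr, warmup_steps, total_steps) are illustrative assumptions, not the values in config.py, and the name lr_at is hypothetical.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, total_steps=5000):
    """Learning rate for a given optimizer step (all defaults are illustrative)."""
    if step < warmup_steps:
        # Linear warmup from near zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Hold at the floor once the schedule is finished.
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop, a function like this would typically be called once per optimizer step and its result written into each parameter group's learning rate before the update.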

The dataset (dataset.py) implements sliding-window chunking: each training example is a contiguous 256-token window, and the target is the same window shifted by one token (next-token prediction).
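
A minimal sketch of that chunking over a flat list of token ids follows. The function name sliding_windows and the stride parameter are assumptions made for illustration; dataset.py may step through the corpus differently.

```python
def sliding_windows(token_ids, block_size=256, stride=1):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one token."""
    # Each example needs block_size + 1 tokens:
    # block_size inputs plus one extra token for the final target.
    for start in range(0, len(token_ids) - block_size, stride):
        chunk = token_ids[start:start + block_size + 1]
        yield chunk[:-1], chunk[1:]
```

For example, with token ids 0 through 9 and block_size=4, the first pair is ([0, 1, 2, 3], [1, 2, 3, 4]), so the target at each position is the next token in the corpus.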

🌐 Serving

The FastAPI server exposes two endpoints:

GET  /health    # Health check
POST /generate  # Generate text: {prompt, max_tokens, temperature, top_k, top_p}
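
A Python client for the /generate endpoint might look like the sketch below, which uses only the standard library. The request fields come from the endpoint description above and the base URL is the local address used in the Quick Start. The default parameter values and the helper names are assumptions, and the response is returned as parsed JSON because its exact shape is not specified here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # local dev server address

def build_generate_request(prompt, max_tokens=100, temperature=0.8,
                           top_k=50, top_p=0.95):
    """Build a POST request for /generate (default values are illustrative)."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **params):
    """Send the request and return the parsed JSON response as-is."""
    req = build_generate_request(prompt, **params)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, generate("To be or not to be", temperature=0.8) would return whatever JSON object the server sends back.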

Generation parameters include temperature, top-k sampling, and top-p (nucleus) sampling — all implemented manually in model.py.
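
The three sampling controls compose in a fixed order: scale by temperature, truncate to the top k, then truncate to the nucleus. Below is a plain-Python sketch over a list of logits. It is one common way to combine them and not necessarily how model.py does it; the name sample_next is hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token id from a list of logits (temperature must be > 0)."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    # If top-k already cut the list below that mass, everything left is kept.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats weights as relative, so no renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Setting top_k=1 reduces this to greedy decoding, while top_k=0 and top_p=1.0 sample from the full temperature-scaled distribution.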

The React frontend provides a ChatGPT-like chat interface styled with Tailwind CSS 4.

🚀 Quick Start

# Train
cd BACKEND
python train.py  # Trains on Shakespeare, saves to checkpoints/

# Serve
python server.py
# → API running at http://localhost:8000

# Frontend (separate terminal)
cd FRONTEND
npm install
npm run dev
# → UI at http://localhost:5173

# Or generate from the CLI (run from BACKEND/)
python generate.py --prompt "To be or not to be" --temperature 0.8

💡 Why It’s Interesting

This project is a complete LLM implementation from the ground up, with no black boxes and no library magic. Every component is hand-coded: the transformer blocks, the attention mechanism, the BPE tokenizer, the training loop (AdamW, cosine schedule, gradient accumulation, clipping), and the sampling strategies. It serves as a reference for anyone who wants to see how GPT-style models work under the hood, from tokenization to generation.
