LLM from Scratch: GPT-Style Transformer in PyTorch
A complete GPT-style decoder-only transformer built from scratch in PyTorch, with no transformers library and no other Hugging Face dependencies. It includes a BPE tokenizer, a training loop, a FastAPI server, and a React chat frontend.
What It Does
LLM from Scratch is a complete GPT-style decoder-only transformer language model (~5.2M parameters) built entirely in PyTorch without using the HuggingFace transformers library. It includes a BPE tokenizer trained from scratch, a training loop with modern optimizer settings, a FastAPI REST API for serving, and a React chat frontend.
$ python train.py
Step 1000 | loss 3.42 | lr 1.2e-4 | 12.5 it/s
Step 2000 | loss 2.89 | lr 8.5e-5 | 12.5 it/s
...
Generated: "shall I compare thee to a summer's day..."
Tech Stack
| Component | Technology |
|---|---|
| Backend | Python 3.10+, PyTorch 2+, NumPy |
| API | FastAPI + uvicorn |
| Frontend | React 19, Vite 8, Tailwind CSS 4, Axios |
| Tokenizer | Custom BPE (byte-level, 5000 merges) |
The model and tokenizer themselves rely on no pretrained-model or tokenizer libraries: no transformers, no SentencePiece, no tiktoken.
Architecture
BACKEND/
├── tokenizer.py # BPE tokenizer (byte-level, configurable merges)
├── attention.py # Multi-head causal self-attention (optional flash attn)
├── transformer_block.py # Pre-norm: LN → Attn → residual → LN → FFN → residual
├── model.py # GPT: embeddings → stacked blocks → final LN → LM head
├── dataset.py # Sliding-window dataset (predict shifted-by-1)
├── train.py # AdamW, cosine LR, gradient accumulation, clipping
├── generate.py # CLI text generation
├── server.py # FastAPI endpoints: /generate, /health
└── config.py # Hyperparameters + device detection
FRONTEND/
├── App.jsx # Main React app
├── Chat.jsx # Chat interface
└── api.js # Axios API client
Model Architecture
| Parameter | Value |
|---|---|
| Embedding dim | 192 |
| Layers | 6 |
| Attention heads | 6 (head dim 32) |
| FFN hidden dim | 768 |
| Total params | ~5.2M |
| Vocabulary | 5000 BPE tokens |
| Context length | 256 tokens |
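As a rough cross-check of the table, the sketch below counts parameters for a standard GPT-2-style layout. It assumes bias terms on every linear layer, LayerNorm with scale and shift vectors, learned positional embeddings, and an LM head tied to the token embedding. `count_gpt_params` is a hypothetical helper, not a function from this repo. The total is sensitive to those choices and to the exact vocabulary size, so it may not reproduce the ~5.2M figure exactly.

```python
def count_gpt_params(vocab_size, d_model, n_layers, d_ff, context_len, tie_weights=True):
    """Count parameters for an assumed GPT-2-style layout (hypothetical helper)."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model

    # Attention: Q, K, V projections plus the output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to d_ff, then project back to d_model, each with bias.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    block = attn + ffn + norms

    final_norm = 2 * d_model
    # With weight tying the LM head reuses the token embedding matrix.
    lm_head = 0 if tie_weights else vocab_size * d_model

    return token_emb + pos_emb + n_layers * block + final_norm + lm_head


# Values taken from the table above.
total = count_gpt_params(vocab_size=5000, d_model=192, n_layers=6, d_ff=768, context_len=256)
print(f"{total / 1e6:.2f}M parameters under these assumptions")
```

Passing `tie_weights=False` shows how much the separate LM head matrix would add.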
The architecture follows the GPT-2 pattern:
- Token embeddings + learned positional embeddings
- Pre-norm transformer blocks (LayerNorm before attention and FFN)
- Causal multi-head self-attention with optional Flash Attention
- Feed-forward with GELU activation
- Final LayerNorm followed by linear LM head
- Weight tying between embedding and LM head
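To make the causal part of the attention concrete, here is a minimal single-head sketch in plain Python. It is not the code in attention.py, which presumably works on batched tensors with multiple heads; the function name `causal_attention` and the list-of-lists representation are assumptions chosen for readability.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal self-attention over lists of vectors (illustrative only).

    q, k, v are lists of equal length; q[i], k[i], v[i] belong to position i.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Causal mask: position i only sees positions 0..i, never the future.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(causal_attention(x, x, x))
```

Because each position only attends to itself and earlier positions, changing a later token never changes an earlier output, which is what makes next-token training on a whole window valid.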
BPE Tokenizer
The BPE tokenizer is implemented from scratch with:
- Byte-level encoding (handles any UTF-8 input)
- Configurable number of merge operations (default: 5000)
- Trainable on any text corpus
- Serialization via pickle for reuse after training
This avoids any dependency on HuggingFace tokenizers or tiktoken, while still getting reasonable tokenization quality for the Shakespeare domain.
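The core of byte-level BPE training fits in a few functions. The following is a minimal sketch of the general technique, not the code in tokenizer.py: the names `train_bpe`, `merge_pair`, `encode`, and `decode` are hypothetical, and details such as pre-tokenization and pickle serialization are left out.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns a list of ((left, right), new_id)."""
    ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
    merges = []
    while len(merges) < num_merges:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        new_id = 256 + len(merges)
        ids = merge_pair(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges


def encode(text, merges):
    """Encode text by applying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Rebuild the byte string for each token id and decode it as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe("to be or not to be, that is the question", num_merges=20)
ids = encode("to be or not", merges)
print(len(ids), decode(ids, merges))
```

Starting from raw bytes means any UTF-8 input can be encoded, even characters that never appeared in the training corpus; they simply fall back to their individual bytes.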
Training Setup
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | Cosine schedule with linear warmup |
| Weight decay | Applied to weights only (not biases/norms) |
| Gradient accumulation | 4 steps |
| Gradient clipping | 1.0 |
| Batch size | 64 sequences |
| Device | Auto-detect (CUDA / MPS / CPU) |
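A cosine schedule with linear warmup can be written as a single function of the step number. This is a sketch of the usual formulation rather than the code in train.py; `lr_at` and all numeric values in the example (peak rate, floor, warmup length, total steps) are illustrative assumptions.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp up linearly, reaching max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress goes from 0 at the end of warmup to 1 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the curve.
for step in (0, 50, 100, 500, 1000):
    print(step, lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000))
```

In a training loop, the returned value would typically be written into each optimizer parameter group before the optimizer step.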
The dataset (dataset.py) implements sliding-window chunking: each training example is a contiguous 256-token window, and the target is the same window shifted by one token (next-token prediction).
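The windowing described above can be sketched in a few lines. `make_windows` and its `stride` parameter are hypothetical, and the real dataset.py presumably returns tensors rather than lists; the point here is only the one-token shift between input and target.

```python
def make_windows(token_ids, block_size, stride=1):
    """Build (input, target) pairs where the target is the input shifted by one."""
    examples = []
    # Each example needs block_size + 1 tokens: the inputs plus the final target.
    for start in range(0, len(token_ids) - block_size, stride):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        examples.append((x, y))
    return examples


tokens = list(range(10))  # stand-in for real token ids
for x, y in make_windows(tokens, block_size=4, stride=3):
    print(x, "->", y)
```

With a block size of 256, every position in a window contributes one next-token prediction, so a single example yields 256 training targets.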
Serving
The FastAPI server exposes two endpoints:
GET /health # Health check
POST /generate # Generate text: {prompt, max_tokens, temperature, top_k, top_p}
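A client for the /generate endpoint can be sketched with the standard library alone. The field names come from the endpoint description above and the base URL from the Quick Start; the helper names, the default values, and the assumption that the server replies with JSON are mine, so check them against server.py.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=100, temperature=0.8, top_k=40,
                           top_p=0.9, base_url="http://localhost:8000"):
    """Build a POST request for /generate (defaults here are assumptions)."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
    }
    return urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the parsed JSON body, whatever its shape."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `generate("To be or not to be", temperature=0.8)` would return the decoded response body; the sketch deliberately does not assume which keys it contains.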
Generation parameters include temperature, top-k sampling, and top-p (nucleus) sampling, all implemented manually in model.py.
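The three sampling controls compose in a fixed order: scale the logits, keep the top k, then keep the smallest prefix whose probability mass reaches p. The sketch below shows one common way to do this in plain Python. It is not the code from model.py, and `sample_next_token` along with its `rng` argument is a hypothetical interface.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits (temperature > 0, top_k >= 1)."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [value / temperature for value in logits]

    # Candidate token ids, most likely first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # top-k: keep only the k highest logits

    # Softmax over the surviving candidates (max subtracted for stability).
    peak = scaled[order[0]]
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]

    if top_p is not None:
        # Nucleus: keep the smallest prefix whose cumulative probability >= top_p.
        cumulative, cutoff = 0.0, len(order)
        for count, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = count
                break
        order, probs = order[:cutoff], probs[:cutoff]

    # random.choices accepts unnormalized weights, so no renormalization is needed.
    return rng.choices(order, weights=probs, k=1)[0]


print(sample_next_token([1.0, 5.0, 2.0], temperature=0.8, top_k=2, top_p=0.95))
```

Setting `top_k=1` reduces this to greedy decoding, which is a handy sanity check when debugging generation.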
The React frontend provides a ChatGPT-like chat interface styled with Tailwind CSS 4.
Quick Start
# Train
cd BACKEND
python train.py # Trains on Shakespeare, saves to checkpoints/
# Serve
python server.py
# → API running at http://localhost:8000
# Frontend (separate terminal)
cd FRONTEND
npm install
npm run dev
# → UI at http://localhost:5173
# Or generate from CLI
python generate.py --prompt "To be or not to be" --temperature 0.8
Why It's Interesting
This project is a complete LLM implementation from the ground up: no black boxes, no library magic. Every component is hand-coded: the transformer blocks, the attention mechanism, the BPE tokenizer, the training loop with modern optimizer settings, and the sampling strategies. It's a reference for anyone who wants to understand how GPT-style models actually work under the hood, from tokenization to generation.