# GPT from Scratch
A minimal, from-scratch implementation of a GPT-style transformer trained on the Tiny Shakespeare corpus. Every component — tokenization, attention heads, transformer blocks, training loop — is implemented without relying on high-level abstractions, making the internals fully transparent.
## Architecture
The model stacks 6 transformer blocks, each containing:
- Multi-Head Self-Attention — 6 parallel attention heads with scaled dot-product attention and causal masking, ensuring tokens only attend to past context
- FeedForward Network — two-layer MLP with 4× hidden expansion and ReLU activation
- Pre-norm LayerNorm — applied before each sub-layer, following modern transformer convention
Hyperparameters: 6 layers · 6 heads · 384 embedding dim · 256 context length · ~10M parameters
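The structure of one block can be sketched in plain NumPy. This is a minimal illustration of the pre-norm layout described above, not the repo's code: for brevity it skips the learned Q/K/V and output projections and just reuses the input for all three roles, so only the data flow (norm → sub-layer → residual) and the causal mask are faithful.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_self_attention(x, n_heads):
    # x: (T, C); split the channel dim into n_heads heads of size C // n_heads
    T, C = x.shape
    hs = C // n_heads
    # Illustrative shortcut: reuse x as Q, K, V (a real block applies
    # learned linear projections here)
    q = k = v = x.reshape(T, n_heads, hs).transpose(1, 0, 2)   # (H, T, hs)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hs)               # (H, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)           # future positions
    att = np.where(mask, -np.inf, att)                         # causal mask
    att = np.exp(att - att.max(-1, keepdims=True))             # stable softmax
    att /= att.sum(-1, keepdims=True)
    out = att @ v                                              # (H, T, hs)
    return out.transpose(1, 0, 2).reshape(T, C)                # merge heads

def feed_forward(x, W1, W2):
    # Two-layer MLP with ReLU; W1 expands C -> 4C, W2 projects back
    return np.maximum(x @ W1, 0) @ W2

def block(x, W1, W2, n_heads=6):
    # Pre-norm: LayerNorm before each sub-layer, residual connection after
    x = x + causal_self_attention(layer_norm(x), n_heads)
    x = x + feed_forward(layer_norm(x), W1, W2)
    return x
```

Because of the causal mask, perturbing a later token leaves the outputs at all earlier positions unchanged — exactly the property autoregressive generation relies on.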
## Key Concepts
- Causal self-attention and why masking is necessary for autoregressive generation
- How residual connections and layer normalization stabilize deep networks
- Character-level tokenization vs. subword BPE (explored in the GPT-2 project)
- The full transformer training loop: batching, loss estimation, and gradient descent
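Two of the concepts above — character-level tokenization and batching — fit in a few lines. A sketch under assumed names (`stoi`, `itos`, `get_batch` are illustrative, not necessarily the repo's identifiers):

```python
import numpy as np

# Character-level tokenization: every distinct character in the corpus
# becomes its own token ID, so the vocabulary stays tiny.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))                     # corpus-derived vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

def get_batch(data, block_size, batch_size, rng):
    # Sample random windows of the encoded corpus; the targets are the same
    # windows shifted one character to the right (next-token prediction).
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix])
    y = np.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
```

Round-tripping `decode(encode(text))` recovers the input exactly, and each target row is its input row shifted by one position — the supervision signal for the whole training loop.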
