# GPT-2: Reproducing OpenAI’s Architecture from Scratch
A faithful reproduction of OpenAI’s GPT-2 architecture, scaling up from the character-level nanoGPT to a 124M parameter model with real Byte Pair Encoding tokenization. The implementation includes a from_pretrained method that loads OpenAI’s released weights, allowing direct verification that the architecture is correct.
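The 124M figure can be checked by hand from the GPT-2 small configuration (12 layers, 768-dim embeddings, 1024-token context, 50,257-token vocabulary). The sketch below is a back-of-the-envelope count under those assumptions; note that weight tying means the LM head adds no parameters of its own:

```python
n_layer, n_embd, n_ctx, vocab = 12, 768, 1024, 50257

wte = vocab * n_embd   # token embedding (shared with the LM head via weight tying)
wpe = n_ctx * n_embd   # learned position embedding
per_block = (
    2 * n_embd                          # ln_1 (gain + bias)
    + n_embd * 3 * n_embd + 3 * n_embd  # fused QKV projection
    + n_embd * n_embd + n_embd          # attention output projection
    + 2 * n_embd                        # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd  # MLP up-projection
    + 4 * n_embd * n_embd + n_embd      # MLP down-projection
)
total = wte + wpe + n_layer * per_block + 2 * n_embd  # + final LayerNorm
print(f"{total:,}")  # 124,439,808
```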
## How This Differs from My Other From-Scratch GPT Implementation
| | GPT from scratch | GPT-2 |
|---|---|---|
| Tokenization | Character-level (65 tokens) | BPE (50,257 tokens) |
| Activation | ReLU | GELU |
| Layers / Params | 6 layers · ~10M | 12 layers · ~124M |
| Attention | Explicit per-head Q/K/V | Fused QKV projection |
| Weight tying | No | Embedding and LM head share weights |
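To make the tokenization contrast in the table concrete, here is a toy sketch of the core BPE merge loop: count adjacent pairs, replace the most frequent pair with a new token ID, repeat. New IDs start at 256 because the base vocabulary is raw bytes, as in GPT-2's byte-level BPE. This is only an illustration; OpenAI's actual tokenizer also applies a regex pre-split and uses learned merge ranks at encode time.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
for step in range(2):                       # perform two merges
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)      # new tokens start after the 256 byte values
print(len(ids))  # sequence shrinks from 11 byte IDs to 7 token IDs
```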
## Key Concepts
- Byte Pair Encoding (BPE) tokenization and why subword vocabularies outperform character-level ones at scale
- Fused QKV projections and flash attention for memory-efficient attention
- Weight tying between token embeddings and the output projection
- Loading and transposing pretrained weights from OpenAI’s `Conv1D` format into standard `nn.Linear`
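The fused QKV idea can be sketched in NumPy: a single weight of shape (n_embd, 3·n_embd) replaces three separate projections, so one matmul yields Q, K, and V together. The sizes below are toy values (GPT-2 small uses n_embd=768 and 12 heads), and the masked softmax here is the plain attention computation that flash attention reproduces more memory-efficiently:

```python
import numpy as np

T, n_embd, n_head = 5, 16, 4          # toy sizes; GPT-2 small: 1024, 768, 12
head_dim = n_embd // n_head
rng = np.random.default_rng(0)

x = rng.standard_normal((T, n_embd))

# One fused (n_embd, 3*n_embd) weight instead of three (n_embd, n_embd) weights.
W_qkv = rng.standard_normal((n_embd, 3 * n_embd)) / np.sqrt(n_embd)
q, k, v = np.split(x @ W_qkv, 3, axis=-1)   # single matmul, then split

# Reshape each into (n_head, T, head_dim).
q = q.reshape(T, n_head, head_dim).transpose(1, 0, 2)
k = k.reshape(T, n_head, head_dim).transpose(1, 0, 2)
v = v.reshape(T, n_head, head_dim).transpose(1, 0, 2)

# Causal self-attention per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (n_head, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)         # mask future positions
scores = np.where(mask, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

y = (weights @ v).transpose(1, 0, 2).reshape(T, n_embd)  # back to (T, n_embd)
print(y.shape)  # (5, 16)
```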
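The weight-transposition point comes from a layout mismatch: OpenAI's checkpoints use a Conv1D module whose weight is stored as (in_features, out_features) and applied as y = x @ W + b, whereas PyTorch's `nn.Linear` stores (out_features, in_features) and computes y = x @ Wᵀ + b. A NumPy sketch of why the transpose makes the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f = 8, 24

# OpenAI's Conv1D: weight stored as (in_features, out_features), y = x @ W + b.
W_conv1d = rng.standard_normal((in_f, out_f))
b = rng.standard_normal(out_f)
x = rng.standard_normal((3, in_f))
y_conv1d = x @ W_conv1d + b

# nn.Linear semantics: weight stored as (out_features, in_features), y = x @ W.T + b.
# Copying the *transposed* Conv1D weight into the Linear layer makes them match.
W_linear = W_conv1d.T
y_linear = x @ W_linear.T + b

print(np.allclose(y_conv1d, y_linear))  # True
```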
