GPT-2: Reproducing OpenAI’s Architecture from Scratch

A faithful reproduction of OpenAI’s GPT-2 architecture, scaling up from the character-level nanoGPT to a 124M parameter model with real Byte Pair Encoding tokenization. The implementation includes a from_pretrained method that loads OpenAI’s released weights, allowing direct verification that the architecture is correct.
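The 124M figure follows directly from the GPT-2 small configuration. A back-of-the-envelope count in pure Python, assuming the standard hyperparameters (50,257-token vocabulary, 1,024-position context, 768-dim embeddings, 12 layers, weight-tied LM head):

```python
# Back-of-the-envelope parameter count for GPT-2 small
# (assumed config: vocab 50,257, context 1,024, n_embd 768, n_layer 12).
V, T, C, L = 50257, 1024, 768, 12

wte = V * C                      # token embedding (shared with the LM head)
wpe = T * C                      # position embedding
per_layer = (
    2 * C                        # ln_1 (weight + bias)
    + C * 3 * C + 3 * C          # fused QKV projection
    + C * C + C                  # attention output projection
    + 2 * C                      # ln_2
    + C * 4 * C + 4 * C          # MLP up-projection
    + 4 * C * C + C              # MLP down-projection
)
ln_f = 2 * C                     # final layer norm
total = wte + wpe + L * per_layer + ln_f
print(total)  # prints 124439808 — ~124.4M; weight tying means no separate LM head matrix
```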

How This Differs from My Other GPT-from-Scratch Implementation

|                 | GPT from scratch         | GPT-2                                |
|-----------------|--------------------------|--------------------------------------|
| Tokenization    | Character-level (65 tokens) | BPE (50,257 tokens)               |
| Activation      | ReLU                     | GELU                                 |
| Layers / Params | 6 layers · ~10M          | 12 layers · ~124M                    |
| Attention       | Explicit per-head Q/K/V  | Fused QKV projection                 |
| Weight tying    | No                       | Embedding and LM head share weights  |
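The fused QKV approach can be sketched in a few lines of PyTorch: instead of separate per-head projections, a single `nn.Linear` produces Q, K, and V in one matmul, and `scaled_dot_product_attention` dispatches to a flash-attention kernel where available. A minimal sketch, not the repo's exact module (class and argument names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSelfAttention(nn.Module):
    """Multi-head causal self-attention with a single fused QKV projection."""
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # one matmul yields Q, K, V
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)     # slice the fused projection
        # reshape each to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # uses a flash-attention kernel when the backend supports it
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))
```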

Key Concepts

  • Byte Pair Encoding (BPE) tokenization and why subword vocabularies outperform character-level ones at scale
  • Fused QKV projections and flash attention for memory-efficient attention
  • Weight tying between token embeddings and the output projection
  • Loading and transposing pretrained weights from OpenAI’s Conv1D format into standard nn.Linear
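The Conv1D point is the one that most often trips up weight loading: OpenAI's checkpoint stores each linear layer's weight with shape `(in_features, out_features)`, the transpose of `nn.Linear`'s layout. A minimal illustration of the transpose (hypothetical shapes, not the repo's actual loader):

```python
import torch
import torch.nn as nn

# OpenAI's GPT-2 "Conv1D" layers store weights as (in_features, out_features);
# nn.Linear expects (out_features, in_features), so transpose on load.
in_f, out_f = 8, 16
conv1d_weight = torch.randn(in_f, out_f)   # layout as found in the checkpoint
bias = torch.randn(out_f)

linear = nn.Linear(in_f, out_f)
with torch.no_grad():
    linear.weight.copy_(conv1d_weight.t())  # transpose into Linear's layout
    linear.bias.copy_(bias)

x = torch.randn(3, in_f)
# Conv1D computes x @ W + b; after the transpose, nn.Linear matches it exactly
assert torch.allclose(linear(x), x @ conv1d_weight + bias, atol=1e-5)
```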

GitHub: github.com/srushtii-m/gpt2-from-scratch