# GPT-2: Reproducing OpenAI’s Architecture from Scratch
A faithful reproduction of OpenAI’s GPT-2 architecture, scaling up from the character-level nanoGPT to a 124M parameter model with real Byte Pair Encoding tokenization. The implementation includes a from_pretrained method that loads OpenAI’s released weights, allowing direct verification that the architecture is correct.
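The 124M figure can be checked by hand from the GPT-2 small configuration (12 layers, 768-dim embeddings, 1024-token context, 50,257-token vocabulary). The sketch below is a back-of-the-envelope count under those assumptions; note that weight tying means the LM head adds no parameters of its own:

```python
n_layer, n_embd, n_ctx, vocab = 12, 768, 1024, 50257

wte = vocab * n_embd   # token embedding (shared with the LM head via weight tying)
wpe = n_ctx * n_embd   # learned position embedding
per_block = (
    2 * n_embd                          # ln_1 (gain + bias)
    + n_embd * 3 * n_embd + 3 * n_embd  # fused QKV projection
    + n_embd * n_embd + n_embd          # attention output projection
    + 2 * n_embd                        # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd  # MLP up-projection
    + 4 * n_embd * n_embd + n_embd      # MLP down-projection
)
total = wte + wpe + n_layer * per_block + 2 * n_embd  # + final LayerNorm
print(f"{total:,}")  # 124,439,808
```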
## How This Differs from My Other From-Scratch GPT Implementation
| | GPT from scratch | GPT-2 |
|---|---|---|
| Tokenization | Character-level (65 tokens) | BPE (50,257 tokens) |
| Activation | ReLU | GELU |
| Layers / Params | 6 layers · ~10M | 12 layers · ~124M |
| Attention | Explicit per-head Q/K/V | Fused QKV projection |
| Weight tying | No | Embedding and LM head share weights |
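To make the tokenization contrast in the table concrete, here is a toy sketch of the core BPE merge loop: count adjacent pairs, replace the most frequent pair with a new token ID, repeat. New IDs start at 256 because the base vocabulary is raw bytes, as in GPT-2's byte-level BPE. This is only an illustration; OpenAI's actual tokenizer also applies a regex pre-split and uses learned merge ranks at encode time.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
for step in range(2):                       # perform two merges
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)      # new tokens start after the 256 byte values
print(len(ids))  # sequence shrinks from 11 byte IDs to 7 token IDs
```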
## Key Concepts
- Byte Pair Encoding (BPE) tokenization and why subword vocabularies outperform character-level ones at scale
- Fused QKV projections and flash attention for memory-efficient attention
- Weight tying between token embeddings and the output projection
- Loading and transposing pretrained weights from OpenAI’s `Conv1D` format into standard `nn.Linear`
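The fused QKV idea can be sketched in NumPy: a single weight of shape (n_embd, 3·n_embd) replaces three separate projections, so one matmul yields Q, K, and V together. The sizes below are toy values (GPT-2 small uses n_embd=768 and 12 heads), and the masked softmax here is the plain attention computation that flash attention reproduces more memory-efficiently:

```python
import numpy as np

T, n_embd, n_head = 5, 16, 4          # toy sizes; GPT-2 small: 1024, 768, 12
head_dim = n_embd // n_head
rng = np.random.default_rng(0)

x = rng.standard_normal((T, n_embd))

# One fused (n_embd, 3*n_embd) weight instead of three (n_embd, n_embd) weights.
W_qkv = rng.standard_normal((n_embd, 3 * n_embd)) / np.sqrt(n_embd)
q, k, v = np.split(x @ W_qkv, 3, axis=-1)   # single matmul, then split

# Reshape each into (n_head, T, head_dim).
q = q.reshape(T, n_head, head_dim).transpose(1, 0, 2)
k = k.reshape(T, n_head, head_dim).transpose(1, 0, 2)
v = v.reshape(T, n_head, head_dim).transpose(1, 0, 2)

# Causal self-attention per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (n_head, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)         # mask future positions
scores = np.where(mask, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

y = (weights @ v).transpose(1, 0, 2).reshape(T, n_embd)  # back to (T, n_embd)
print(y.shape)  # (5, 16)
```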
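The weight-transposition point comes from a layout mismatch: OpenAI's checkpoints use a Conv1D module whose weight is stored as (in_features, out_features) and applied as y = x @ W + b, whereas PyTorch's `nn.Linear` stores (out_features, in_features) and computes y = x @ Wᵀ + b. A NumPy sketch of why the transpose makes the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f = 8, 24

# OpenAI's Conv1D: weight stored as (in_features, out_features), y = x @ W + b.
W_conv1d = rng.standard_normal((in_f, out_f))
b = rng.standard_normal(out_f)
x = rng.standard_normal((3, in_f))
y_conv1d = x @ W_conv1d + b

# nn.Linear semantics: weight stored as (out_features, in_features), y = x @ W.T + b.
# Copying the *transposed* Conv1D weight into the Linear layer makes them match.
W_linear = W_conv1d.T
y_linear = x @ W_linear.T + b

print(np.allclose(y_conv1d, y_linear))  # True
```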
