# GPT from Scratch
A minimal, from-scratch implementation of a GPT-style transformer trained on the Tiny Shakespeare corpus. Every component — tokenization, attention heads, transformer blocks, training loop — is implemented without relying on high-level abstractions, making the internals fully transparent.
## Architecture
The model stacks 6 transformer blocks, each containing:
- Multi-Head Self-Attention — 6 parallel attention heads with scaled dot-product attention and causal masking, ensuring tokens only attend to past context
- FeedForward Network — two-layer MLP with 4× hidden expansion and ReLU activation
- Pre-norm LayerNorm — applied before each sub-layer, following modern transformer convention
Hyperparameters: 6 layers · 6 heads · 384 embedding dim · 256 context length · ~10M parameters
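The structure of one block can be sketched in plain NumPy. This is a minimal illustration of the pre-norm layout described above, not the repo's code: for brevity it skips the learned Q/K/V and output projections and just reuses the input for all three roles, so only the data flow (norm → sub-layer → residual) and the causal mask are faithful.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_self_attention(x, n_heads):
    # x: (T, C); split the channel dim into n_heads heads of size C // n_heads
    T, C = x.shape
    hs = C // n_heads
    # Illustrative shortcut: reuse x as Q, K, V (a real block applies
    # learned linear projections here)
    q = k = v = x.reshape(T, n_heads, hs).transpose(1, 0, 2)   # (H, T, hs)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hs)               # (H, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)           # future positions
    att = np.where(mask, -np.inf, att)                         # causal mask
    att = np.exp(att - att.max(-1, keepdims=True))             # stable softmax
    att /= att.sum(-1, keepdims=True)
    out = att @ v                                              # (H, T, hs)
    return out.transpose(1, 0, 2).reshape(T, C)                # merge heads

def feed_forward(x, W1, W2):
    # Two-layer MLP with ReLU; W1 expands C -> 4C, W2 projects back
    return np.maximum(x @ W1, 0) @ W2

def block(x, W1, W2, n_heads=6):
    # Pre-norm: LayerNorm before each sub-layer, residual connection after
    x = x + causal_self_attention(layer_norm(x), n_heads)
    x = x + feed_forward(layer_norm(x), W1, W2)
    return x
```

Because of the causal mask, perturbing a later token leaves the outputs at all earlier positions unchanged — exactly the property autoregressive generation relies on.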
## Key Concepts
- Causal self-attention and why masking is necessary for autoregressive generation
- How residual connections and layer normalization stabilize deep networks
- Character-level tokenization vs. subword BPE (explored in the GPT-2 project)
- The full transformer training loop: batching, loss estimation, and gradient descent
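Two of the concepts above — character-level tokenization and batching — fit in a few lines. A sketch under assumed names (`stoi`, `itos`, `get_batch` are illustrative, not necessarily the repo's identifiers):

```python
import numpy as np

# Character-level tokenization: every distinct character in the corpus
# becomes its own token ID, so the vocabulary stays tiny.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))                     # corpus-derived vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

def get_batch(data, block_size, batch_size, rng):
    # Sample random windows of the encoded corpus; the targets are the same
    # windows shifted one character to the right (next-token prediction).
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix])
    y = np.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
```

Round-tripping `decode(encode(text))` recovers the input exactly, and each target row is its input row shifted by one position — the supervision signal for the whole training loop.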
