LLaMA 2: Inference Architecture from Scratch

A from-scratch implementation of Meta’s LLaMA 2 inference pipeline, focused on understanding the architectural decisions that make large language models practical at scale. Rather than training, this project implements the full forward pass and inference loop, loading pre-trained 7B weights for text generation.

Architectural Improvements over GPT-2

| Component           | GPT-2               | LLaMA 2                         |
|---------------------|---------------------|---------------------------------|
| Positional encoding | Learned absolute    | RoPE (rotary)                   |
| Normalization       | LayerNorm           | RMSNorm                         |
| Activation          | GELU                | SwiGLU                          |
| Attention           | Standard multi-head | Grouped-query (shared KV heads) |
| Inference           | Full recomputation  | KV cache                        |

Key Concepts

  • RoPE — encodes position by rotating query/key vectors in complex space, generalizing better to longer sequences than learned absolute embeddings
  • RMSNorm — normalizes by root mean square without mean-centering; simpler and faster than LayerNorm
  • SwiGLU — feedforward computes W₂(SiLU(W₁x) ⊙ W₃x), a gated variant that empirically outperforms GELU at scale
  • Grouped-Query Attention — each key/value head is shared across a group of query heads (used in the larger LLaMA 2 variants), shrinking the KV cache and reducing memory bandwidth at inference time
  • KV Cache — caches the key/value tensors of past tokens so each decoding step only projects and attends for the one new token, instead of recomputing the full context; per-token cost drops from quadratic to linear in sequence length
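
The rotation at the heart of RoPE can be sketched in a few lines. This is an illustrative pure-Python version (function name and the pair-wise layout are simplifications of the real interleaved implementation, and `theta=10000.0` is the base used in the LLaMA papers): each consecutive pair of vector components is treated as a complex number and rotated by a position-dependent angle.

```python
import cmath

def rope(vec, pos, theta=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles.

    Pair i is treated as a complex number and multiplied by
    exp(j * pos * theta^(-i/d)), so lower dimensions rotate fastest.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * freq)
        out.extend([z.real, z.imag])
    return out
```

Because pure rotations preserve norms and angles, the dot product between a rotated query and key depends only on their relative offset, which is why RoPE generalizes to positions beyond those seen in training better than learned absolute embeddings.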
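
RMSNorm is simple enough to show directly. A minimal sketch (hypothetical helper, not the repo's code) of normalizing by root mean square with a learned per-dimension gain and no mean subtraction or bias:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root mean square of the vector; no mean-centering,
    # no bias term -- this is what makes it cheaper than LayerNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

With unit weights the output always has RMS ≈ 1, regardless of the input scale.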
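
The SwiGLU feedforward block can likewise be sketched in pure Python (the `matvec` helper and weight names `W1`/`W2`/`W3` are illustrative; real implementations use batched matrix multiplies): a SiLU-gated branch is multiplied elementwise with a linear "up" branch, then projected back down.

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    # gate = SiLU(W1 x), up = W3 x; output = W2 (gate * up)
    gate = [silu(g) for g in matvec(W1, x)]
    up = matvec(W3, x)
    return matvec(W2, [g * u for g, u in zip(gate, up)])
```

The gating is the point: the `W3 x` branch is scaled elementwise by a learned, input-dependent gate rather than passed through a fixed nonlinearity.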
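
The query-to-KV-head mapping in grouped-query attention reduces to integer division. A sketch under the assumption of consecutive grouping (the function name is hypothetical; head counts below are just an example):

```python
def kv_head_index(q_head, n_q_heads, n_kv_heads):
    # Consecutive groups of (n_q_heads // n_kv_heads) query heads
    # share one key/value head, shrinking the KV cache by that factor.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

With `n_kv_heads == n_q_heads` this degenerates to standard multi-head attention; with `n_kv_heads == 1` it becomes multi-query attention.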
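
The KV cache idea can be shown with a single attention head. This is a schematic pure-Python sketch (class and function names are hypothetical, and real implementations preallocate tensors rather than growing lists): each decoding step appends one new key/value pair and attends over the accumulated cache.

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for one query over cached keys/values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                       # subtract max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Per-layer cache: each step appends the new token's key/value
    instead of re-projecting the whole context."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Without the cache, every step would recompute keys and values for all previous tokens; with it, each step does one projection and one attention pass over stored tensors.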

GitHub: github.com/srushtii-m/llama2