LLaMA 2: Inference Architecture from Scratch

A from-scratch implementation of Meta’s LLaMA 2 inference pipeline, focused on understanding the architectural decisions that make large language models practical at scale. Rather than training, this project implements the full forward pass and inference loop, loading pre-trained 7B weights for text generation.

Architectural Improvements over GPT-2

| Component           | GPT-2               | LLaMA 2                         |
|---------------------|---------------------|---------------------------------|
| Positional encoding | Learned absolute    | RoPE (rotary)                   |
| Normalization       | LayerNorm           | RMSNorm                         |
| Activation          | GELU                | SwiGLU                          |
| Attention           | Standard multi-head | Grouped-query (shared KV heads) |
| Inference           | Full recomputation  | KV cache                        |

Key Concepts

  • RoPE — encodes position by rotating query/key vectors in complex space, generalizing better to longer sequences than learned absolute embeddings
  • RMSNorm — normalizes by root mean square without mean-centering; simpler and faster than LayerNorm
  • SwiGLU — feedforward computes W₂(SiLU(W₁x) ⊙ W₃x), a gated variant that empirically outperforms GELU at scale
  • Grouped-Query Attention — each key/value head is shared across a group of query heads (used in the larger LLaMA 2 variants), shrinking the KV cache and reducing memory bandwidth at inference time
  • KV Cache — caches the key/value tensors of past tokens so each decoding step only projects and attends for the one new token, instead of recomputing the full context; per-token cost drops from quadratic to linear in sequence length
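
The rotation at the heart of RoPE can be sketched in a few lines. This is an illustrative pure-Python version (function name and the pair-wise layout are simplifications of the real interleaved implementation, and `theta=10000.0` is the base used in the LLaMA papers): each consecutive pair of vector components is treated as a complex number and rotated by a position-dependent angle.

```python
import cmath

def rope(vec, pos, theta=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles.

    Pair i is treated as a complex number and multiplied by
    exp(j * pos * theta^(-i/d)), so lower dimensions rotate fastest.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * freq)
        out.extend([z.real, z.imag])
    return out
```

Because pure rotations preserve norms and angles, the dot product between a rotated query and key depends only on their relative offset, which is why RoPE generalizes to positions beyond those seen in training better than learned absolute embeddings.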
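
RMSNorm is simple enough to show directly. A minimal sketch (hypothetical helper, not the repo's code) of normalizing by root mean square with a learned per-dimension gain and no mean subtraction or bias:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root mean square of the vector; no mean-centering,
    # no bias term -- this is what makes it cheaper than LayerNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

With unit weights the output always has RMS ≈ 1, regardless of the input scale.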
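
The SwiGLU feedforward block can likewise be sketched in pure Python (the `matvec` helper and weight names `W1`/`W2`/`W3` are illustrative; real implementations use batched matrix multiplies): a SiLU-gated branch is multiplied elementwise with a linear "up" branch, then projected back down.

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    # gate = SiLU(W1 x), up = W3 x; output = W2 (gate * up)
    gate = [silu(g) for g in matvec(W1, x)]
    up = matvec(W3, x)
    return matvec(W2, [g * u for g, u in zip(gate, up)])
```

The gating is the point: the `W3 x` branch is scaled elementwise by a learned, input-dependent gate rather than passed through a fixed nonlinearity.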
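
The query-to-KV-head mapping in grouped-query attention reduces to integer division. A sketch under the assumption of consecutive grouping (the function name is hypothetical; head counts below are just an example):

```python
def kv_head_index(q_head, n_q_heads, n_kv_heads):
    # Consecutive groups of (n_q_heads // n_kv_heads) query heads
    # share one key/value head, shrinking the KV cache by that factor.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

With `n_kv_heads == n_q_heads` this degenerates to standard multi-head attention; with `n_kv_heads == 1` it becomes multi-query attention.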
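
The KV cache idea can be shown with a single attention head. This is a schematic pure-Python sketch (class and function names are hypothetical, and real implementations preallocate tensors rather than growing lists): each decoding step appends one new key/value pair and attends over the accumulated cache.

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for one query over cached keys/values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                       # subtract max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Per-layer cache: each step appends the new token's key/value
    instead of re-projecting the whole context."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Without the cache, every step would recompute keys and values for all previous tokens; with it, each step does one projection and one attention pass over stored tensors.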

GitHub: github.com/srushtii-m/llama2