LLMs & Alignment

LLM Serving API

A production-style REST API for serving large language models with FastAPI, featuring token streaming via Server-Sent Events, asyncio request batching, and sliding-window rate limiting, backed by Ollama.
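
The sliding-window limiter is the piece that fits in a few lines. Below is a minimal single-process sketch with an in-memory store; the `RateLimiter` name and the 60-requests-per-minute default are illustrative, not taken from the project.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `limit` requests per sliding `window` seconds per client key."""

    def __init__(self, limit: int = 60, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = {}

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        q = self.hits.setdefault(key, deque())
        while q and now - q[0] > self.window:
            q.popleft()  # drop timestamps that slid out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```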

Financial Reasoning with SFT + GRPO

Fine-tuned Gemma-3-270M for structured financial sentiment reasoning using a two-phase pipeline: SFT to teach the output format, followed by GRPO with a multi-component reward function that includes a FinBERT teacher model for sentiment alignment.
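
A hedged sketch of what such a reward might look like: a format term plus a FinBERT-agreement term. The `<sentiment>` tag convention, the 0.5 weights, and the ProsusAI/finbert checkpoint are illustrative assumptions, not the project's exact configuration.

```python
import re
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")  # teacher

def reward(prompt: str, completion: str) -> float:
    score = 0.0
    m = re.search(r"<sentiment>(positive|negative|neutral)</sentiment>", completion)
    if m:
        score += 0.5  # format component: the tag is present and well-formed
        teacher_label = finbert(prompt)[0]["label"].lower()
        if m.group(1) == teacher_label:
            score += 0.5  # alignment component: agrees with the FinBERT teacher
    return score
```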

VLM Fine-Tuning: SmolVLM-256M on ChartQA

Fine-tuned SmolVLM-256M on the ChartQA chart question-answering dataset using streamed, lazy data loading and LoRA/DoRA adapters, completing a full training run in under 25 minutes on a 16GB GPU with less than 2GB of peak VRAM.
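
A minimal sketch of the setup, assuming the HuggingFaceM4/ChartQA dataset id, the HuggingFaceTB/SmolVLM-256M-Instruct checkpoint, and standard peft arguments; the rank and target modules are illustrative.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Streaming yields examples lazily instead of downloading the full dataset.
train = load_dataset("HuggingFaceM4/ChartQA", split="train", streaming=True)

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # DoRA: weight-decomposed low-rank adaptation
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the adapter weights are trainable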

GRPO Fine-Tuning with LoRA

Fine-tunes large language models using GRPO (Group Relative Policy Optimization) with LoRA adapters and 4-bit quantization, supporting any HuggingFace model and dataset with automatic field detection and a multi-component reward function.
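
With trl, the wiring might look like the sketch below; the model id, the toy length-based reward, and the hyperparameters are placeholders, not the project's actual values.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy single-component reward: prefer completions near 100 characters.
    return [-abs(100 - len(c)) / 100 for c in completions]

train = load_dataset("trl-lib/tldr", split="train")

args = GRPOConfig(
    output_dir="grpo-lora",
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any HuggingFace causal LM id works here
    reward_funcs=reward_len,
    args=args,
    train_dataset=train,
    peft_config=LoraConfig(r=16, lora_alpha=32),  # LoRA adapters on the 4-bit base
)
trainer.train()
```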

LLaMA 2: Inference Architecture from Scratch

Implements the LLaMA 2 inference pipeline from scratch in PyTorch, covering rotary positional embeddings, RMSNorm, SwiGLU activations, grouped-query attention, and KV caching: the production techniques that distinguish modern LLMs from research transformers.
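
As one example of those components, RMSNorm is compact enough to show whole. This is a minimal sketch of the standard formulation (scale by the reciprocal root-mean-square of the last dimension, with no mean-centering), not the project's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, RMSNorm skips the mean subtraction entirely.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```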

GPT-2: Reproducing OpenAI’s Architecture from Scratch

Reproduces the GPT-2 architecture from scratch in PyTorch with BPE tokenization, GELU activations, and flash attention, including a weight-loading pipeline to verify the implementation against OpenAI’s pretrained GPT-2 checkpoints.
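
In modern PyTorch the flash-attention path reduces to a single fused kernel call. A sketch under the usual (batch, heads, sequence, head_dim) layout; the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # Fused scaled dot-product attention; dispatches to Flash Attention when
    # hardware and dtypes allow. is_causal applies the autoregressive mask
    # without materializing it.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, H, T, D = 2, 12, 64, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
out = causal_attention(q, k, v)  # (B, H, T, D)
```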

GPT from Scratch

A character-level GPT transformer built entirely from scratch in PyTorch, trained on the Tiny Shakespeare dataset, implementing multi-head self-attention, transformer blocks, and autoregressive generation, with every component written by hand.
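
The autoregressive loop is what ties those components together: sample one character, append it, repeat. A hedged sketch; `model` (returning logits of shape (B, T, vocab)) and `block_size` are assumed to come from the training code.

```python
import torch

@torch.no_grad()
def generate(model, idx: torch.Tensor, max_new_tokens: int, block_size: int):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                    # crop to context window
        logits = model(idx_cond)                           # (B, T, vocab)
        probs = torch.softmax(logits[:, -1, :], dim=-1)    # next-char distribution
        next_id = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        idx = torch.cat([idx, next_id], dim=1)             # feed prediction back in
    return idx
```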

Makemore: Character-Level Language Model

Character-level language modeling with a multi-layer perceptron: experiments on activations, gradients, and BatchNorm, manual backpropagation through every layer, and a WaveNet-style dilated architecture, all predicting the next character in a sequence.
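
The MLP at the heart of the series can be sketched as below (Bengio-style: embed a fixed context of characters, concatenate, predict the next one). The layer sizes are illustrative, not the notebook's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200  # assumed sizes

class CharMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.fc1 = nn.Linear(block_size * n_embd, n_hidden)
        self.bn = nn.BatchNorm1d(n_hidden)  # the BatchNorm the experiments probe
        self.fc2 = nn.Linear(n_hidden, vocab_size)

    def forward(self, idx):                       # idx: (B, block_size)
        x = self.emb(idx).view(idx.shape[0], -1)  # concatenate context embeddings
        x = torch.tanh(self.bn(self.fc1(x)))
        return self.fc2(x)                        # next-character logits

model = CharMLP()
logits = model(torch.randint(0, vocab_size, (32, block_size)))
loss = F.cross_entropy(logits, torch.randint(0, vocab_size, (32,)))
```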