LLM Serving API
A production-style REST API for serving large language models, built with FastAPI and backed by Ollama, featuring token streaming via Server-Sent Events, asyncio request batching, and sliding-window rate limiting.
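A minimal sketch of the sliding-window limiter described above, assuming a per-client in-memory window (class, parameter names, and limits are illustrative, not the project's actual code):

```python
import time
from collections import deque

class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per `window_seconds` per client."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque[float]] = {}

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        q = self.hits.setdefault(client_id, deque())
        # Evict timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```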
Fine-tuned Gemma-3-270M for structured financial sentiment reasoning using a two-phase pipeline: SFT to teach the output format, followed by GRPO with a multi-component reward function that uses a FinBERT teacher model for sentiment alignment.
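A hypothetical sketch of such a multi-component reward; the FinBERT teacher's label is assumed precomputed here (the real pipeline would query the teacher model), and the tags and weights are illustrative:

```python
import re

def reward(completion: str, teacher_label: str) -> float:
    """Combine two components: output-format compliance and agreement
    with a FinBERT teacher's sentiment label. Weights are illustrative."""
    m = re.search(r"<sentiment>(positive|negative|neutral)</sentiment>", completion)
    format_ok = 1.0 if m else 0.0
    agrees = 1.0 if m and m.group(1) == teacher_label else 0.0
    return 0.3 * format_ok + 0.7 * agrees

# reward("...<sentiment>positive</sentiment>", "positive") -> 1.0
```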
Fine-tuned SmolVLM-256M on the ChartQA chart question-answering dataset using streaming (lazy) dataset loading and LoRA/DoRA adapters, achieving full training in under 25 minutes on a 16GB GPU with less than 2GB of peak VRAM.
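A minimal sketch of that setup with `datasets` and `peft`; the Hub ids below are my assumptions for the model and dataset, and the adapter hyperparameters are illustrative:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Stream ChartQA so examples are fetched lazily instead of materialized on disk.
train = load_dataset("HuggingFaceM4/ChartQA", split="train", streaming=True)

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # DoRA: decompose adapter weights into magnitude and direction
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```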
Fine-tunes large language models using GRPO (Group Relative Policy Optimization) with LoRA adapters and 4-bit quantization, supporting any HuggingFace model and dataset with automatic field detection and a multi-component reward function.
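A sketch of how GRPO with LoRA and 4-bit quantization typically wires together using TRL and PEFT; the model id, dataset, reward function, and hyperparameters are placeholders, not the project's defaults:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit NF4 so GRPO's multiple rollouts fit in memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    quantization_config=bnb,
)

def length_reward(completions, **kwargs):
    # Placeholder single reward component; the project combines several.
    return [min(len(c) / 200.0, 1.0) for c in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=length_reward,
    args=GRPOConfig(output_dir="grpo-out", num_generations=4),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()
```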
Implements the LLaMA 2 inference pipeline from scratch in PyTorch, covering rotary positional embeddings, RMSNorm, SwiGLU activations, grouped-query attention, and KV caching: the production techniques that distinguish modern LLMs from research transformers.
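Of those components, RMSNorm is compact enough to show inline; a from-scratch version matching the LLaMA formulation (the project's own class may differ in detail):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in LLaMA: scale by the reciprocal RMS, no mean-centering."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```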
Reproduces the GPT-2 architecture from scratch in PyTorch with BPE tokenization, GELU activations, and flash attention, including a weight-loading pipeline to verify the implementation against OpenAI’s pretrained GPT-2 checkpoints.
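Two of the named pieces are small enough to sketch inline: GPT-2's tanh-approximated GELU, and flash attention via PyTorch's fused kernel (the original GPT-2 used an explicit masked softmax, so the fused call is the modern substitute this kind of reproduction typically swaps in):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# GPT-2's activation is the tanh approximation of GELU.
gelu = nn.GELU(approximate="tanh")

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, n_head, seq_len, head_dim). PyTorch dispatches to a
    fused flash-attention kernel when available; is_causal applies the
    autoregressive mask without materializing it."""
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```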
A character-level GPT transformer built entirely from scratch in PyTorch and trained on the Tiny Shakespeare dataset, implementing multi-head self-attention, transformer blocks, and autoregressive generation with every component written by hand.
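The core of such a model is a single causal self-attention head; a from-scratch sketch in the same spirit (sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head for a character-level GPT."""

    def __init__(self, n_embd: int, head_size: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so each position attends only to the past.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled dot-product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v
```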
Character-level language modeling with a multilayer perceptron, exploring activations, gradients, and BatchNorm, manual backpropagation through every tensor, and WaveNet-style dilated-convolution experiments to predict the next character in a sequence.
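For the MLP stage, a forward pass in this style (dimensions are illustrative); the manual-backpropagation experiments then differentiate through these same tensors by hand rather than calling autograd:

```python
import torch

# Embed a fixed context of characters, concatenate, pass through one hidden layer.
vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200
g = torch.Generator().manual_seed(42)
C = torch.randn((vocab_size, n_embd), generator=g)            # character embeddings
W1 = torch.randn((block_size * n_embd, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)

def forward(Xb: torch.Tensor) -> torch.Tensor:
    """Xb: (batch, block_size) integer character indices -> next-char logits."""
    emb = C[Xb]                                     # (batch, block_size, n_embd)
    h = torch.tanh(emb.view(Xb.shape[0], -1) @ W1 + b1)
    return h @ W2 + b2                              # (batch, vocab_size)
```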