LLM Serving API

A production-style LLM serving API built to demonstrate the infrastructure layer behind deployed AI systems. It handles concurrent requests with an asyncio queue-based batcher, streams token-by-token responses via Server-Sent Events (SSE), enforces per-IP rate limits, and exposes health and readiness probes for deployment environments. It is backed by Ollama for local model inference, which can be swapped for vLLM on GPU deployments.

Endpoints

  • POST /generate: full completion, routed through the request batcher
  • POST /stream: token-by-token generation via Server-Sent Events
  • GET /health: liveness probe, returns ok if the API process is running
  • GET /ready: readiness probe, checks whether the configured model is loaded in Ollama
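
The source does not say which Ollama endpoint the readiness probe queries. As a rough sketch only, assuming the probe inspects a model listing shaped like the response of Ollama's GET /api/tags (a "models" array whose entries carry a "name"), the check could reduce to a small pure function. The helper names normalize_model_name and is_model_ready are hypothetical.

```python
def normalize_model_name(name: str) -> str:
    # Ollama resolves a model name without a tag to "<name>:latest".
    return name if ":" in name else f"{name}:latest"


def is_model_ready(tags_payload: dict, configured_model: str) -> bool:
    """Return True if the configured model appears in the listing.

    Assumption: tags_payload looks like {"models": [{"name": "..."}, ...]}.
    """
    wanted = normalize_model_name(configured_model)
    available = {
        normalize_model_name(entry.get("name", ""))
        for entry in tags_payload.get("models", [])
    }
    return wanted in available
```

A /ready handler built on this would return a success status when the function returns True and a 503 otherwise, so an orchestrator keeps traffic away until the model is available.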

Key Components

Request Batching

Incoming requests are placed in an asyncio queue. A background loop collects them into batches of up to a configurable maximum size within a short collection window, then dispatches all requests in the batch concurrently via asyncio.gather. Each caller awaits its own Future and receives its result independently, so requests in the same batch are not coupled to one another.
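
The batching loop described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the class name RequestBatcher, the parameters max_batch_size and window_s, and the fake_model handler are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, List


class RequestBatcher:
    """Collects queued prompts into small batches and runs each batch concurrently."""

    def __init__(self, handler: Callable[[str], Awaitable[str]],
                 max_batch_size: int = 4, window_s: float = 0.02) -> None:
        self.handler = handler              # hypothetical async call into the model backend
        self.max_batch_size = max_batch_size
        self.window_s = window_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        # Each caller gets its own Future, so results and errors stay independent.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until one request arrives, then open the collection window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.window_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Run the whole batch concurrently; failures come back as values.
            results = await asyncio.gather(
                *(self.handler(prompt) for prompt, _ in batch),
                return_exceptions=True,
            )
            for (_, future), result in zip(batch, results):
                if future.cancelled():
                    continue
                if isinstance(result, BaseException):
                    future.set_exception(result)
                else:
                    future.set_result(result)


async def demo() -> List[str]:
    async def fake_model(prompt: str) -> str:
        await asyncio.sleep(0.01)           # stand-in for inference latency
        return prompt.upper()

    batcher = RequestBatcher(fake_model)
    worker = asyncio.create_task(batcher.run())
    try:
        return await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    finally:
        worker.cancel()
```

Passing return_exceptions=True to asyncio.gather is what keeps one failed request from failing the rest of its batch: the exception is delivered only to the Future of the caller that triggered it.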

Sliding Window Rate Limiter

Per-IP rate limiting uses an in-memory deque of request timestamps for each client. On each request, timestamps outside the current window are dropped. If the number of requests still inside the window has reached the limit, the request is rejected with HTTP 429, along with X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers. No Redis is required, since all state lives in process memory.
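
A minimal sketch of that sliding-window check is shown below. The class name SlidingWindowLimiter, the injectable clock, and the choice to report X-RateLimit-Reset as whole seconds until the oldest request leaves the window are assumptions for illustration; the source does not specify the header semantics.

```python
import math
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock                  # injectable so the window can be tested
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, client_ip: str) -> Tuple[bool, Dict[str, str]]:
        """Return (allowed, headers); the caller sends HTTP 429 when allowed is False."""
        now = self.clock()
        timestamps = self.hits[client_ip]
        # Drop timestamps that have slid out of the window (oldest are at the left).
        while timestamps and timestamps[0] <= now - self.window_s:
            timestamps.popleft()
        allowed = len(timestamps) < self.limit
        if allowed:
            timestamps.append(now)
        # Assumed semantics: seconds until the oldest recorded request expires.
        reset_in = (timestamps[0] + self.window_s - now) if timestamps else 0.0
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(self.limit - len(timestamps), 0)),
            "X-RateLimit-Reset": str(math.ceil(reset_in)),
        }
        return allowed, headers
```

Each timestamp is appended once and popped at most once, which is why the cleanup cost is amortized O(1) per request. Because the state is held in one process, limits are not shared across multiple API replicas.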

SSE Streaming

The /stream endpoint uses FastAPI's StreamingResponse with the text/event-stream media type. The Ollama client streams responses line by line via httpx.AsyncClient, yielding each token as a formatted SSE event. The stream ends with a [DONE] sentinel.
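
The event framing can be illustrated with a small async generator. This is a sketch under assumptions: the JSON payload shape {"token": ...} and the names format_sse, sse_events, and fake_tokens are invented for the example, and fake_tokens stands in for the real Ollama stream.

```python
import asyncio
import json
from typing import AsyncIterator, List


def format_sse(data: str) -> str:
    # One SSE event: a "data:" field terminated by a blank line.
    return f"data: {data}\n\n"


async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    async for token in tokens:
        # JSON-encode so a newline inside a token cannot break event framing.
        yield format_sse(json.dumps({"token": token}))
    # Sentinel telling the client that generation has finished.
    yield format_sse("[DONE]")


async def fake_tokens() -> AsyncIterator[str]:
    # Stand-in for tokens parsed from the model backend's streamed lines.
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)
        yield token


async def collect() -> List[str]:
    return [event async for event in sse_events(fake_tokens())]
```

In a FastAPI handler, a generator like sse_events(...) would be passed to StreamingResponse with media_type="text/event-stream".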

Observability

Request logging middleware records method, path, status code, client IP, and latency for every request. An X-Response-Time-Ms header is injected into every response. Health and readiness probes are designed for Kubernetes liveness and readiness checks.
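
One way to get both the log line and the timing header is a pure ASGI middleware, sketched below. The source does not show how its middleware is written, so the class name RequestTimingMiddleware, the logger name, and the log format are assumptions. Note that this version measures time until the response starts, which for streamed responses is time to first byte rather than total duration.

```python
import asyncio
import logging
import time

logger = logging.getLogger("api.requests")  # hypothetical logger name


class RequestTimingMiddleware:
    """Pure ASGI middleware: logs each HTTP request and adds X-Response-Time-Ms."""

    def __init__(self, app) -> None:
        self.app = app

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        client = scope.get("client")        # (host, port) tuple or None
        client_ip = client[0] if client else "unknown"

        async def send_with_timing(message) -> None:
            if message["type"] == "http.response.start":
                elapsed_ms = (time.perf_counter() - start) * 1000
                headers = list(message.get("headers", []))
                headers.append((b"x-response-time-ms", f"{elapsed_ms:.2f}".encode()))
                message = {**message, "headers": headers}
                logger.info("%s %s %s %s %.2fms", scope["method"], scope["path"],
                            message["status"], client_ip, elapsed_ms)
            await send(message)

        await self.app(scope, receive, send_with_timing)


async def demo() -> dict:
    async def app(scope, receive, send):    # tiny stand-in ASGI application
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})

    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health",
             "client": ("127.0.0.1", 50000)}
    await RequestTimingMiddleware(app)(scope, receive, send)
    return dict(sent[0]["headers"])
```

A class of this shape can be registered on a FastAPI application with app.add_middleware(RequestTimingMiddleware).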

What I Learned

  • How to build an async LLM API end to end: request validation with Pydantic, async routing with FastAPI, and token streaming with SSE
  • Request batching with asyncio queues and Futures: collecting concurrent requests and resolving them as a group without blocking individual callers
  • Sliding-window rate limiting without external dependencies: an in-memory deque per IP with amortized O(1) cleanup
  • SSE for streaming token generation: a significantly better user experience than waiting for the full completion, and simpler than WebSockets for one-directional streams
  • Docker Compose service dependencies and health checks for multi-container deployments where the API must wait for the model backend to be ready

GitHub: github.com/srushtii-m/llm-serving-api