Compliance Report Generator

A full-stack compliance agent that processes regulatory documents (GDPR, internal policies) and makes them queryable through multiple retrieval strategies. Users upload PDFs, ask questions, and receive answers grounded in the document. Every query passes PII detection and prompt safety checks before any LLM call, and every answer requires human approval before it is saved to memory.

Pipeline

  1. PDF uploaded and parsed into chunks using pdfplumber (text + tables)
  2. Chunks embedded and stored in Qdrant for semantic vector search
  3. Entities and relationships extracted into Neo4j knowledge graph
  4. User query routed to the selected agent mode
  5. Safety checks run (PII detection + prompt guardrails) before any LLM call
  6. Answer shown for human review — approved answers saved to Qdrant memory
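
As a rough illustration of step 1, the sketch below splits already-extracted page text into overlapping word windows. The function name chunk_text and the size/overlap defaults are assumptions made for this example, not the project's actual values; in the real pipeline the input text would come from pdfplumber (for example page.extract_text()).

```python
from typing import List


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words.

    Hypothetical helper: the window sizes are illustrative defaults only.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e f g", size=4, overlap=2):
        print(chunk)
```

Overlapping windows are one common way to keep a sentence that straddles a boundary retrievable from either neighbor; whether the project chunks by words, characters, or layout elements is not stated here.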

Key Components

Multi-Modal Retrieval — vector similarity (Qdrant) finds semantically relevant chunks; Neo4j graph queries surface entity relationships that keyword search misses.
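
The idea of pairing similarity hits with graph facts can be sketched with in-memory stand-ins. Nothing below talks to Qdrant or Neo4j: the toy vectors, the triples list, and the hybrid_retrieve function are all hypothetical and exist only to show how the two result sets might be combined.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Triple = Tuple[str, str, str]  # (subject, relation, object)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(
    query_vec: Vector,
    query_entities: List[str],
    chunks: List[Tuple[str, Vector]],
    triples: List[Triple],
    k: int = 2,
) -> Dict[str, list]:
    """Return the top-k chunks by similarity plus triples touching the query entities."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    top_chunks = [text for text, _ in ranked[:k]]
    wanted = {e.lower() for e in query_entities}
    facts = [t for t in triples if t[0].lower() in wanted or t[2].lower() in wanted]
    return {"chunks": top_chunks, "facts": facts}


if __name__ == "__main__":
    chunks = [
        ("Data subjects may request erasure.", [1.0, 0.0]),
        ("Invoices are archived yearly.", [0.0, 1.0]),
    ]
    triples = [
        ("Controller", "MUST_NOTIFY", "Supervisory Authority"),
        ("Invoice", "STORED_IN", "Archive"),
    ]
    print(hybrid_retrieve([0.9, 0.1], ["controller"], chunks, triples, k=1))
```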

Three Agent Modes

  • Memory-Aware QA: retrieves document context, injects Neo4j graph memory into the prompt, and applies role-specific instructions for a legal analyst, policy researcher, or compliance officer
  • Tool Agent: LangChain zero-shot ReAct agent with 5 tools: risk scoring, compliance lookup, live news, summarization, and compliance scoring
  • Context-Aware Chain: role-based QA chain with custom PromptTemplates per user type
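
A role-based prompt of the kind described above might look like the following sketch. It uses plain string formatting rather than LangChain's PromptTemplate so that it runs without dependencies; the instruction texts, the template layout, and the build_prompt helper are assumptions for illustration, not the project's actual prompts.

```python
# Hypothetical per-role instructions; the real wording is not given in this document.
ROLE_INSTRUCTIONS = {
    "legal analyst": "Cite the relevant article or clause for every claim.",
    "policy researcher": "Summarize the intent of the policy and note open questions.",
    "compliance officer": "State the obligation, who owns it, and the deadline if any.",
}

PROMPT = (
    "You are assisting a {role}. {instructions}\n\n"
    "Graph memory:\n{memory}\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(role: str, question: str, context: str, memory: str = "(none)") -> str:
    """Fill the shared template with the instructions for one user role."""
    key = role.strip().lower()
    if key not in ROLE_INSTRUCTIONS:
        raise ValueError(f"unknown role: {role!r}")
    return PROMPT.format(
        role=key,
        instructions=ROLE_INSTRUCTIONS[key],
        memory=memory,
        context=context,
        question=question,
    )


if __name__ == "__main__":
    print(build_prompt("Legal Analyst", "Who must report a breach?", "Art. 33 text..."))
```

Keeping the role text in one dictionary means a new user type is a one-line addition, and the same template can serve both the memory-aware and context-aware modes by leaving memory at its default.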

Security Layer

  • PII detection (emails, phone numbers, SSNs) before any LLM call
  • Keyword-based prompt guardrails blocking harmful or unethical queries
  • Ollama-based self-hosted safety classifier for air-gapped environments where prompts cannot leave the machine
  • Consistent enforcement across CLI, agents, and Streamlit UI
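
The first two checks in this list can be sketched with the standard re module. The patterns below are deliberately simple, US-centric approximations, and the blocked-term list is a made-up placeholder; neither is taken from the project, and a production filter would need broader coverage.

```python
import re
from typing import Dict, List, Union

# Illustrative patterns only: they will miss many real-world formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(
        r"(?<!\d)(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)"
    ),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical keyword list; matched on word boundaries, case-insensitively.
BLOCKED_TERMS = ("forge", "falsify", "evade")
BLOCKED = re.compile(r"\b(?:" + "|".join(BLOCKED_TERMS) + r")\b", re.IGNORECASE)


def check_query(query: str) -> Dict[str, Union[bool, List[str]]]:
    """Report which PII types and blocked terms appear in the query."""
    pii = sorted(name for name, pat in PII_PATTERNS.items() if pat.search(query))
    blocked = sorted({m.lower() for m in BLOCKED.findall(query)})
    return {"allowed": not pii and not blocked, "pii": pii, "blocked_terms": blocked}


if __name__ == "__main__":
    print(check_query("What does Article 17 of GDPR require?"))
    print(check_query("Contact jane.doe@example.com about SSN 123-45-6789"))
```

Running the check in one shared function, called before any LLM request, is one way to get the same behavior in the CLI, the agents, and the Streamlit UI.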

Human-in-the-Loop (HITL)

  • Streamlit UI: session state pauses execution, shows the answer and its source chunks, and waits for the user to approve or regenerate
  • CLI: programmatic HITL wrapper for batch workflows that runs the chain, prints the answer with its sources, and prompts for approval before saving to memory
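
A CLI approval loop along these lines could be sketched as follows. The function review_answer, its parameters, and the a/r/q prompt are hypothetical; run_chain and save_to_memory are placeholders for whatever the project's chain and Qdrant memory writer actually are.

```python
from typing import Callable, Dict


def review_answer(
    question: str,
    run_chain: Callable[[str], Dict[str, object]],
    save_to_memory: Callable[[str, str], None],
    ask: Callable[[str], str] = input,
    max_attempts: int = 3,
) -> bool:
    """Run the chain, show the result, and save it only after explicit approval.

    Returns True if an answer was approved and saved, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        result = run_chain(question)
        print(f"\nAttempt {attempt}: {result['answer']}")
        for source in result.get("sources", []):
            print(f"  source: {source}")
        choice = ask("Approve [a], regenerate [r], or quit [q]? ").strip().lower()
        if choice == "a":
            save_to_memory(question, str(result["answer"]))
            return True
        if choice == "q":
            break
        # Any other input is treated as a request to regenerate.
    return False
```

Passing ask as a parameter keeps the wrapper testable and lets a batch job substitute its own approval source, for example a queue of reviewer decisions.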

LangGraph Ingestion Workflow — checkpointable pipeline (ingest, embed, graph, memory) that resumes from any failed step without reprocessing earlier stages.
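
The resume behavior can be illustrated without LangGraph. The sketch below is a plain-Python stand-in, not LangGraph's API: the checkpoint dictionary, the run_pipeline function, and the stage functions are all assumptions chosen to show how completed stages are skipped on a retry.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Stage = Tuple[str, Callable[[State], State]]


def run_pipeline(stages: List[Stage], checkpoint: Dict[str, object]) -> State:
    """Run stages in order, skipping any already recorded in the checkpoint.

    `checkpoint` has two keys: "done" (list of finished stage names) and
    "state" (the state produced by the last finished stage). If a stage
    raises, the checkpoint still reflects everything completed before it.
    """
    state: State = checkpoint["state"]  # type: ignore[assignment]
    done: List[str] = checkpoint["done"]  # type: ignore[assignment]
    for name, fn in stages:
        if name in done:
            continue
        state = fn(state)
        done.append(name)
        checkpoint["state"] = state
    return state


if __name__ == "__main__":
    stages: List[Stage] = [
        ("ingest", lambda s: {**s, "chunks": 3}),
        ("embed", lambda s: {**s, "vectors": s["chunks"]}),
        ("graph", lambda s: {**s, "triples": 5}),
        ("memory", lambda s: {**s, "saved": True}),
    ]
    checkpoint = {"done": [], "state": {}}
    print(run_pipeline(stages, checkpoint))
```

In a real deployment the checkpoint would be persisted (to disk or a database) after each stage so that a new process can pick up where the failed one stopped.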

What I Learned

  • How to combine vector search and knowledge graphs for richer document retrieval — semantic similarity finds relevant chunks, graph queries find entity relationships that keyword search misses
  • LangGraph for stateful, fault-tolerant workflows — nodes can fail and resume without reprocessing earlier steps
  • Building practical guardrails: PII regex detection and prompt classification before any LLM call, with a self-hosted Ollama fallback for privacy-sensitive deployments
  • Human-in-the-loop patterns in Streamlit using session state to pause execution and wait for user approval before writing to memory
  • Multi-agent orchestration: separating memory-aware QA from tool-calling agents and designing a tool registry that makes agent capabilities inspectable and extensible
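
The tool registry mentioned in the last point could take a shape like the sketch below. ToolSpec, ToolRegistry, and the toy risk_score tool (including its word list) are hypothetical names and logic for illustration; the project's actual registry and scoring rules are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    func: Callable[[str], str]


class ToolRegistry:
    """Keeps tools in one place so they can be listed and called by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, name: str, description: str):
        """Decorator that adds a function to the registry under `name`."""
        def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = ToolSpec(name, description, func)
            return func
        return decorator

    def describe(self) -> List[str]:
        """One line per tool, in registration order."""
        return [f"{t.name}: {t.description}" for t in self._tools.values()]

    def call(self, name: str, arg: str) -> str:
        return self._tools[name].func(arg)


registry = ToolRegistry()


@registry.register("risk_score", "Counts risk-related words in a passage.")
def risk_score(text: str) -> str:
    # Toy heuristic with a made-up word list, for demonstration only.
    risky = {"breach", "penalty", "violation"}
    hits = sum(1 for w in text.lower().split() if w.strip(".,;") in risky)
    return f"risk={hits}"


if __name__ == "__main__":
    print(registry.describe())
    print(registry.call("risk_score", "A breach may lead to a penalty."))
```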

GitHub: github.com/srushtii-m/compliance-agent