Financial Reasoning with SFT + GRPO

A two-phase fine-tuning pipeline for financial sentiment analysis, targeting small language models (Gemma-3-270M) to produce structured, interpretable reasoning. Phase 1 (SFT) teaches the model to output structured tags; Phase 2 (GRPO) optimizes the quality of reasoning within that format using five reward signals including a FinBERT teacher model.

Two-Phase Pipeline

Phase 1 — Supervised Fine-Tuning (SFT). Trains the model on structured financial reasoning examples to enforce a strict output contract:

<REASONING> ... </REASONING>
<SENTIMENT> positive / negative / neutral </SENTIMENT>
<CONFIDENCE> 0.1 – 1.0 </CONFIDENCE>
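The output contract above can be checked mechanically. The sketch below is a hypothetical validator (the function name, regex, and the rule that confidence must fall in [0.1, 1.0] are illustrative assumptions, not the repo's actual code):

```python
import re

# Matches the three-tag contract; DOTALL lets reasoning span multiple lines.
TAG_PATTERN = re.compile(
    r"<REASONING>(.+?)</REASONING>\s*"
    r"<SENTIMENT>\s*(positive|negative|neutral)\s*</SENTIMENT>\s*"
    r"<CONFIDENCE>\s*(0\.\d+|1\.0)\s*</CONFIDENCE>",
    re.DOTALL,
)

def parse_completion(text: str):
    """Return (reasoning, sentiment, confidence) or None if the contract is violated."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    reasoning, sentiment, conf = m.group(1).strip(), m.group(2), float(m.group(3))
    # Assumed range check mirroring the 0.1 - 1.0 bound in the contract.
    if not (0.1 <= conf <= 1.0):
        return None
    return reasoning, sentiment, conf
```

A parser like this doubles as the binary format gate in Phase 2: any completion it rejects can simply earn zero reward.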

Phase 2 — GRPO with Multi-Level Rewards. Generates six completions per prompt and optimizes a weighted composite reward:

Component                 Weight   Description
Format gate               35%      Validates tag structure (binary gate)
Financial reasoning       25%      Scores domain terms, causal logic, context
FinBERT alignment         20%      Teacher-model sentiment agreement
Confidence calibration    15%      Brier-score-like accuracy of the confidence value
Directional consistency    5%      Reasoning-sentiment alignment
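The table above can be sketched as a weighted sum. The component keys are hypothetical, and treating the binary format gate as a hard zero for malformed outputs is an assumption for illustration:

```python
# Weights taken from the reward table.
WEIGHTS = {
    "format": 0.35,
    "financial_reasoning": 0.25,
    "finbert_alignment": 0.20,
    "confidence_calibration": 0.15,
    "directional_consistency": 0.05,
}

def composite_reward(scores: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1].

    Assumed gating behavior: if the binary format gate fails, the whole
    reward is zero, so malformed output earns no partial credit.
    """
    if scores["format"] < 1.0:  # binary gate: 0 or 1
        return 0.0
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

Decomposing the reward this way means each component can be logged separately during training, which is what makes the run debuggable.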

Data Sources

  • Financial PhraseBank — real financial NLP dataset
  • Synthetic examples — built-in diverse scenarios for cold-start training
  • Custom JSONL — plug in any domain-specific data

What I Learned

  • How SFT → GRPO sequencing works: SFT establishes the output structure, then GRPO improves quality within that structure without breaking the format
  • Using a teacher model (FinBERT) as a reward signal — aligning a small model with a domain-specialized model without direct label supervision
  • The importance of reward decomposition: a single score hides which behavior is improving; separate components make training interpretable and debuggable
  • KL penalty (--beta) as a stability control — prevents the GRPO policy from drifting too far from the SFT checkpoint
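The KL control in the last bullet can be illustrated with the penalty GRPO-style trainers typically subtract from the reward. This is a minimal sketch under assumptions: the function, the default beta, and the simple mean-log-ratio KL estimator are illustrative, not the repo's implementation:

```python
def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.04):
    """Subtract a KL estimate between the GRPO policy and the frozen SFT
    reference from the scalar reward, scaled by beta (assumed shaping;
    exact estimators vary by implementation)."""
    # Per-token KL estimate over tokens sampled from the policy:
    # mean of (log pi_policy - log pi_ref).
    kl = sum(p - r for p, r in zip(logprobs_policy, logprobs_ref)) / len(logprobs_policy)
    return reward - beta * kl
```

With beta = 0, the policy chases reward freely and can drift into reward hacking; a larger beta anchors it to the SFT checkpoint, trading improvement speed for format stability.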

GitHub: github.com/srushtii-m/financial-reasoning-sft-grpo