# Financial Reasoning with SFT + GRPO
A two-phase fine-tuning pipeline for financial sentiment analysis, targeting small language models (Gemma-3-270M) to produce structured, interpretable reasoning. Phase 1 (SFT) teaches the model to output structured tags; Phase 2 (GRPO) optimizes the quality of reasoning within that format using five reward signals including a FinBERT teacher model.
## Two-Phase Pipeline
### Phase 1 — Supervised Fine-Tuning (SFT)

Trains on structured financial reasoning examples to enforce a strict output contract:
<REASONING> ... </REASONING>
<SENTIMENT> positive / negative / neutral </SENTIMENT>
<CONFIDENCE> 0.1 – 1.0 </CONFIDENCE>
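The contract above can be checked mechanically. A minimal sketch of such a parser — the tag names and value ranges come from the contract, but the helper name and regex are my own, not the project's actual validator:

```python
import re

# Matches the three-tag contract in order; DOTALL lets reasoning span lines.
TAG_PATTERN = re.compile(
    r"<REASONING>(?P<reasoning>.*?)</REASONING>\s*"
    r"<SENTIMENT>\s*(?P<sentiment>positive|negative|neutral)\s*</SENTIMENT>\s*"
    r"<CONFIDENCE>\s*(?P<confidence>0?\.\d+|1\.0)\s*</CONFIDENCE>",
    re.DOTALL,
)

def parse_completion(text: str):
    """Return (reasoning, sentiment, confidence) if the contract holds, else None."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    conf = float(m.group("confidence"))
    if not 0.1 <= conf <= 1.0:  # enforce the stated confidence range
        return None
    return m.group("reasoning").strip(), m.group("sentiment"), conf
```

A parser like this doubles as the Phase 2 format gate: a completion that fails to parse gets no format credit.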
### Phase 2 — GRPO with Multi-Level Rewards

Generates 6 completions per prompt and optimizes them against a composite reward:
| Component | Weight | Description |
|---|---|---|
| Format gate | 35% | Validates tag structure — binary gate |
| Financial reasoning | 25% | Scores domain terms, causal logic, context |
| FinBERT alignment | 20% | Teacher model sentiment agreement |
| Confidence calibration | 15% | Brier score-like accuracy of confidence value |
| Directional consistency | 5% | Reasoning-sentiment alignment |
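The table above can be sketched as a weighted sum. This is a hedged sketch: only the weights and the Brier-style calibration term follow from the table; the financial-reasoning, FinBERT, and consistency scorers are placeholders, and I assume "binary gate" means a malformed completion simply scores 0 on the format component:

```python
# Weights from the reward table; they sum to 1.0.
WEIGHTS = {
    "format": 0.35,       # binary: 1.0 if tags parse, else 0.0
    "reasoning": 0.25,    # domain terms, causal logic, context
    "finbert": 0.20,      # agreement with the FinBERT teacher
    "calibration": 0.15,  # Brier-style confidence accuracy
    "consistency": 0.05,  # reasoning-sentiment alignment
}

def calibration_score(confidence: float, correct: bool) -> float:
    """Brier-style score: 1 - (confidence - outcome)^2, mapped into [0, 1]."""
    return 1.0 - (confidence - float(correct)) ** 2

def composite_reward(scores: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

Keeping the components separate (rather than returning only the final scalar) is what makes per-behavior training curves possible, as noted under "What I Learned".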
## Data Sources
- Financial PhraseBank — real financial NLP dataset
- Synthetic examples — built-in diverse scenarios for cold-start training
- Custom JSONL — plug in any domain-specific data
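For the custom JSONL path, each line is one JSON object. The field names below are an assumption for illustration — the actual schema depends on the pipeline's loader:

```python
import json

# Hypothetical record layout for a custom JSONL file (field names assumed).
EXAMPLE_LINE = '{"text": "Q3 operating margin widened to 14%.", "sentiment": "positive"}'

def load_jsonl(lines):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```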
## What I Learned
- How SFT → GRPO sequencing works: SFT establishes the output structure, then GRPO improves quality within that structure without breaking the format
- Using a teacher model (FinBERT) as a reward signal — aligning a small model with a domain-specialized model without direct label supervision
- The importance of reward decomposition: a single score hides which behavior is improving; separate components make training interpretable and debuggable
- KL penalty (`--beta`) as a stability control — prevents the GRPO policy from drifting too far from the SFT checkpoint
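The KL control can be made concrete. A sketch assuming the low-variance k3 estimator used by common GRPO implementations (e.g. TRL), applied per token against the frozen SFT reference; the function names and the exact placement of the penalty are my own:

```python
import math

def kl_k3(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate (k3 estimator): exp(r) - r - 1 with
    r = logp_ref - logp_policy. Always >= 0, and 0 when the models agree."""
    r = logp_ref - logp_policy
    return math.exp(r) - r - 1.0

def kl_penalized_reward(reward: float, logps_policy, logps_ref, beta: float) -> float:
    """Subtract the beta-weighted mean per-token KL from the scalar reward.
    (Implementations differ on where the penalty enters; this is one sketch.)"""
    kls = [kl_k3(p, q) for p, q in zip(logps_policy, logps_ref)]
    return reward - beta * sum(kls) / len(kls)
```

Raising `beta` makes drifting from the SFT checkpoint more expensive, which is why it acts as the stability knob described above.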
