# Financial Reasoning with SFT + GRPO
A two-phase fine-tuning pipeline for financial sentiment analysis, targeting small language models (Gemma-3-270M) to produce structured, interpretable reasoning. Phase 1 (SFT) teaches the model to output structured tags; Phase 2 (GRPO) optimizes the quality of reasoning within that format using five reward signals including a FinBERT teacher model.
## Two-Phase Pipeline
### Phase 1 — Supervised Fine-Tuning (SFT)

Trains on structured financial reasoning examples to enforce a strict output contract:
<REASONING> ... </REASONING>
<SENTIMENT> positive / negative / neutral </SENTIMENT>
<CONFIDENCE> 0.1 – 1.0 </CONFIDENCE>
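The contract above can be checked mechanically. A minimal sketch of such a parser — the tag names and value ranges come from the contract, but the helper name and regex are my own, not the project's actual validator:

```python
import re

# Matches the three-tag contract in order; DOTALL lets reasoning span lines.
TAG_PATTERN = re.compile(
    r"<REASONING>(?P<reasoning>.*?)</REASONING>\s*"
    r"<SENTIMENT>\s*(?P<sentiment>positive|negative|neutral)\s*</SENTIMENT>\s*"
    r"<CONFIDENCE>\s*(?P<confidence>0?\.\d+|1\.0)\s*</CONFIDENCE>",
    re.DOTALL,
)

def parse_completion(text: str):
    """Return (reasoning, sentiment, confidence) if the contract holds, else None."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    conf = float(m.group("confidence"))
    if not 0.1 <= conf <= 1.0:  # enforce the stated confidence range
        return None
    return m.group("reasoning").strip(), m.group("sentiment"), conf
```

A parser like this doubles as the Phase 2 format gate: a completion that fails to parse gets no format credit.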
### Phase 2 — GRPO with Multi-Level Rewards

Generates 6 completions per prompt and optimizes them against a composite reward:
| Component | Weight | Description |
|---|---|---|
| Format gate | 35% | Validates tag structure — binary gate |
| Financial reasoning | 25% | Scores domain terms, causal logic, context |
| FinBERT alignment | 20% | Teacher model sentiment agreement |
| Confidence calibration | 15% | Brier score-like accuracy of confidence value |
| Directional consistency | 5% | Reasoning-sentiment alignment |
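The table above can be sketched as a weighted sum. This is a hedged sketch: only the weights and the Brier-style calibration term follow from the table; the financial-reasoning, FinBERT, and consistency scorers are placeholders, and I assume "binary gate" means a malformed completion simply scores 0 on the format component:

```python
# Weights from the reward table; they sum to 1.0.
WEIGHTS = {
    "format": 0.35,       # binary: 1.0 if tags parse, else 0.0
    "reasoning": 0.25,    # domain terms, causal logic, context
    "finbert": 0.20,      # agreement with the FinBERT teacher
    "calibration": 0.15,  # Brier-style confidence accuracy
    "consistency": 0.05,  # reasoning-sentiment alignment
}

def calibration_score(confidence: float, correct: bool) -> float:
    """Brier-style score: 1 - (confidence - outcome)^2, mapped into [0, 1]."""
    return 1.0 - (confidence - float(correct)) ** 2

def composite_reward(scores: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

Keeping the components separate (rather than returning only the final scalar) is what makes per-behavior training curves possible, as noted under "What I Learned".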
## Data Sources
- Financial PhraseBank — real financial NLP dataset
- Synthetic examples — built-in diverse scenarios for cold-start training
- Custom JSONL — plug in any domain-specific data
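For the custom JSONL path, each line is one JSON object. The field names below are an assumption for illustration — the actual schema depends on the pipeline's loader:

```python
import json

# Hypothetical record layout for a custom JSONL file (field names assumed).
EXAMPLE_LINE = '{"text": "Q3 operating margin widened to 14%.", "sentiment": "positive"}'

def load_jsonl(lines):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```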
## What I Learned
- How SFT → GRPO sequencing works: SFT establishes the output structure, then GRPO improves quality within that structure without breaking the format
- Using a teacher model (FinBERT) as a reward signal — aligning a small model with a domain-specialized model without direct label supervision
- The importance of reward decomposition: a single score hides which behavior is improving; separate components make training interpretable and debuggable
- KL penalty (`--beta`) as a stability control — prevents the GRPO policy from drifting too far from the SFT checkpoint
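The KL control can be made concrete. A sketch assuming the low-variance k3 estimator used by common GRPO implementations (e.g. TRL), applied per token against the frozen SFT reference; the function names and the exact placement of the penalty are my own:

```python
import math

def kl_k3(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate (k3 estimator): exp(r) - r - 1 with
    r = logp_ref - logp_policy. Always >= 0, and 0 when the models agree."""
    r = logp_ref - logp_policy
    return math.exp(r) - r - 1.0

def kl_penalized_reward(reward: float, logps_policy, logps_ref, beta: float) -> float:
    """Subtract the beta-weighted mean per-token KL from the scalar reward.
    (Implementations differ on where the penalty enters; this is one sketch.)"""
    kls = [kl_k3(p, q) for p, q in zip(logps_policy, logps_ref)]
    return reward - beta * sum(kls) / len(kls)
```

Raising `beta` makes drifting from the SFT checkpoint more expensive, which is why it acts as the stability knob described above.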
