VLM Fine-Tuning: SmolVLM-256M on ChartQA
Ultra-efficient fine-tuning of a 256M-parameter vision-language model for chart understanding. The key innovation is streaming lazy loading: data is processed on demand during training rather than pre-loaded, keeping peak memory usage below 2GB regardless of dataset size. Achieves competitive chart QA results on consumer hardware.
Architecture
- Model: SmolVLM-256M (Idefics3ForConditionalGeneration) — images and text tokenized and fused through a shared transformer
- Adapter: LoRA with DoRA (Weight-Decomposed Low-Rank Adaptation) across all attention and MLP projections
- Dataset: HuggingFaceM4/ChartQA streamed on-demand
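The adapter setup can be sketched with `peft`; the rank and alpha values below are illustrative assumptions, not the project's actual hyperparameters.

```python
from peft import LoraConfig

# Illustrative sketch: r and lora_alpha are assumptions, not the repo's settings
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,  # DoRA: decompose updates into magnitude and direction
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)
```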
Key Innovation: Lazy Loading
```python
from datasets import load_dataset

# Stream the dataset instead of downloading it up front
full_dataset = load_dataset(DATASET_ID, streaming=True)
raw_train = full_dataset["train"].take(train_size)
# map() on a streaming (iterable) dataset is lazy, so samples are
# formatted on demand; it takes no keep_in_memory argument
train_ds = raw_train.map(format_sample)
```
Data is processed per-batch during training — peak VRAM stays under 2GB regardless of dataset size.
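The memory behavior can be illustrated with a plain Python generator, a simplified stand-in for `datasets` streaming: formatting happens only when a batch is pulled, so at most `batch_size` processed samples are resident at any time.

```python
def stream_samples(n):
    """Stand-in for a streaming dataset: yields raw samples one at a time."""
    for i in range(n):
        yield {"image_id": i, "question": f"Q{i}", "answer": f"A{i}"}

def format_sample(sample):
    """Stand-in for the real preprocessing (tokenization, image encoding)."""
    return {**sample, "formatted": True}

def lazy_batches(samples, batch_size):
    """Pull and format samples on demand; at most batch_size live at once."""
    batch = []
    for sample in samples:
        batch.append(format_sample(sample))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# A million-sample stream costs nothing until batches are consumed
batches = lazy_batches(stream_samples(1_000_000), batch_size=4)
first = next(batches)  # only 4 samples were ever materialized
```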
GPU Presets
| Configuration | VRAM | Training Time |
|---|---|---|
| High-performance (16GB) | 12–14GB | 15–25 min |
| Balanced (12GB) | 8–10GB | 20–30 min |
| Conservative (8GB) | 4–6GB | 25–35 min |
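A hypothetical sketch of how presets like these might be encoded and selected; the batch sizes and accumulation steps are assumptions, not the project's actual values.

```python
# Hypothetical preset table; batch/accumulation values are assumptions.
# Each pair keeps the same effective batch size (batch_size * grad_accum).
PRESETS = {
    "high-performance": {"vram_gb": 16, "batch_size": 8, "grad_accum": 2},
    "balanced":         {"vram_gb": 12, "batch_size": 4, "grad_accum": 4},
    "conservative":     {"vram_gb": 8,  "batch_size": 2, "grad_accum": 8},
}

def pick_preset(available_vram_gb):
    """Choose the largest preset whose VRAM requirement fits the GPU."""
    fitting = [p for p in PRESETS.values() if p["vram_gb"] <= available_vram_gb]
    return max(fitting, key=lambda p: p["vram_gb"])
```

Holding the effective batch size constant across presets means lower-VRAM runs trade speed, not convergence behavior, for memory.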
Results
Tested on 10 ChartQA validation samples:
- Accuracy: 40% (4/10 correct)
- Exact numerical matching works well; complex multi-series chart interpretation remains challenging
What I Learned
- How vision-language models process multimodal inputs — images and text are tokenized into a shared embedding space and fused through a transformer
- Streaming datasets for fine-tuning: lazy loading makes large-dataset fine-tuning feasible on hardware that couldn’t fit the full dataset in memory
- DoRA as an improvement over standard LoRA — decomposing weight updates into magnitude and direction components improves generalization
- Building VLM evaluation pipelines: running inference, extracting structured answers from free-form generation, and comparing against ground truth at scale
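A minimal sketch of the evaluation step described above, assuming answers are compared by normalizing the generated text first; the normalization rules here are illustrative, not the project's exact pipeline.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation/whitespace so '42%' matches '42 %'."""
    text = text.lower().strip()
    text = re.sub(r"[^\w.]+", "", text)  # keep digits, letters, decimal points
    return text

def exact_match(prediction, ground_truth):
    """Exact-match after normalization, as used for numerical chart answers."""
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions, ground_truths):
    correct = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```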
