VLM Fine-Tuning: SmolVLM-256M on ChartQA
Ultra-efficient fine-tuning of a 256M-parameter vision-language model for chart understanding. The key innovation is streaming lazy loading: data is processed on demand during training rather than pre-loaded, keeping peak memory usage below 2GB regardless of dataset size. Achieves competitive chart QA results on consumer hardware.
Architecture
- Model: SmolVLM-256M (Idefics3ForConditionalGeneration) — images and text tokenized and fused through a shared transformer
- Adapter: LoRA with DoRA (Weight-Decomposed Low-Rank Adaptation) across all attention and MLP projections
- Dataset: HuggingFaceM4/ChartQA streamed on-demand
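The adapter setup can be sketched with `peft`; the rank and alpha values below are illustrative assumptions, not the project's actual hyperparameters.

```python
from peft import LoraConfig

# Illustrative sketch: r and lora_alpha are assumptions, not the repo's settings
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,  # DoRA: decompose updates into magnitude and direction
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)
```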
Key Innovation: Lazy Loading
```python
from datasets import load_dataset

# Stream the dataset instead of downloading it up front
full_dataset = load_dataset(DATASET_ID, streaming=True)
raw_train = full_dataset["train"].take(train_size)
# map() on a streaming (iterable) dataset is lazy, so samples are
# formatted on demand; it takes no keep_in_memory argument
train_ds = raw_train.map(format_sample)
```
Data is processed per-batch during training — peak VRAM stays under 2GB regardless of dataset size.
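The memory behavior can be illustrated with a plain Python generator, a simplified stand-in for `datasets` streaming: formatting happens only when a batch is pulled, so at most `batch_size` processed samples are resident at any time.

```python
def stream_samples(n):
    """Stand-in for a streaming dataset: yields raw samples one at a time."""
    for i in range(n):
        yield {"image_id": i, "question": f"Q{i}", "answer": f"A{i}"}

def format_sample(sample):
    """Stand-in for the real preprocessing (tokenization, image encoding)."""
    return {**sample, "formatted": True}

def lazy_batches(samples, batch_size):
    """Pull and format samples on demand; at most batch_size live at once."""
    batch = []
    for sample in samples:
        batch.append(format_sample(sample))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# A million-sample stream costs nothing until batches are consumed
batches = lazy_batches(stream_samples(1_000_000), batch_size=4)
first = next(batches)  # only 4 samples were ever materialized
```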
GPU Presets
| Configuration | VRAM | Training Time |
|---|---|---|
| High-performance (16GB) | 12–14GB | 15–25 min |
| Balanced (12GB) | 8–10GB | 20–30 min |
| Conservative (8GB) | 4–6GB | 25–35 min |
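A hypothetical sketch of how presets like these might be encoded and selected; the batch sizes and accumulation steps are assumptions, not the project's actual values.

```python
# Hypothetical preset table; batch/accumulation values are assumptions.
# Each pair keeps the same effective batch size (batch_size * grad_accum).
PRESETS = {
    "high-performance": {"vram_gb": 16, "batch_size": 8, "grad_accum": 2},
    "balanced":         {"vram_gb": 12, "batch_size": 4, "grad_accum": 4},
    "conservative":     {"vram_gb": 8,  "batch_size": 2, "grad_accum": 8},
}

def pick_preset(available_vram_gb):
    """Choose the largest preset whose VRAM requirement fits the GPU."""
    fitting = [p for p in PRESETS.values() if p["vram_gb"] <= available_vram_gb]
    return max(fitting, key=lambda p: p["vram_gb"])
```

Holding the effective batch size constant across presets means lower-VRAM runs trade speed, not convergence behavior, for memory.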
Results
Tested on 10 ChartQA validation samples:
- Accuracy: 40% (4/10 correct)
- Exact numerical matching works well; complex multi-series chart interpretation remains challenging
What I Learned
- How vision-language models process multimodal inputs — images and text are tokenized into a shared embedding space and fused through a transformer
- Streaming datasets for fine-tuning: lazy loading makes large-dataset fine-tuning feasible on hardware that couldn’t fit the full dataset in memory
- DoRA as an improvement over standard LoRA — decomposing weight updates into magnitude and direction components improves generalization
- Building VLM evaluation pipelines: running inference, extracting structured answers from free-form generation, and comparing against ground truth at scale
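A minimal sketch of the evaluation step described above, assuming answers are compared by normalizing the generated text first; the normalization rules here are illustrative, not the project's exact pipeline.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation/whitespace so '42%' matches '42 %'."""
    text = text.lower().strip()
    text = re.sub(r"[^\w.]+", "", text)  # keep digits, letters, decimal points
    return text

def exact_match(prediction, ground_truth):
    """Exact-match after normalization, as used for numerical chart answers."""
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions, ground_truths):
    correct = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```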
