VLM Fine-Tuning: SmolVLM-256M on ChartQA

Ultra-efficient fine-tuning of a 256M-parameter vision-language model for chart understanding. The key innovation is streaming lazy loading: data is processed on demand during training rather than pre-loaded, keeping memory usage below 2GB regardless of dataset size. Achieves competitive chart QA results on consumer hardware.

Architecture

  • Model: SmolVLM-256M (Idefics3ForConditionalGeneration) — images and text tokenized and fused through a shared transformer
  • Adapter: LoRA with DoRA (Weight-Decomposed Low-Rank Adaptation) across all attention and MLP projections
  • Dataset: HuggingFaceM4/ChartQA streamed on-demand
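
The DoRA adapter mentioned above can be illustrated with a small numeric sketch (plain NumPy with illustrative shapes, not the actual SmolVLM projection sizes): DoRA decomposes each weight into a learnable per-column magnitude and a direction, and the low-rank LoRA update is applied to the direction before renormalizing.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2          # toy dimensions; real projections are much larger

W0 = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)   # learnable per-column magnitude

# Low-rank LoRA factors; B starts at zero, so the initial update is a no-op
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))

def dora_weight(W0, m, B, A):
    # Direction absorbs the low-rank update, then is rescaled by the magnitude
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

W = dora_weight(W0, m, B, A)
# With B = 0 the decomposition reconstructs W0 exactly: magnitude * unit direction
assert np.allclose(W, W0)
```

Separating magnitude from direction is what lets DoRA match full fine-tuning dynamics more closely than standard LoRA while training the same low-rank parameters.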

Key Innovation: Lazy Loading

from datasets import load_dataset

# Stream records from the Hub instead of downloading the dataset up front
full_dataset = load_dataset(DATASET_ID, streaming=True)
raw_train = full_dataset["train"].take(train_size)
# map on a streaming dataset is lazy: samples are formatted as batches are drawn
train_ds = raw_train.map(format_sample, keep_in_memory=False)

Data is processed per-batch during training — peak VRAM stays under 2GB regardless of dataset size.
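
The same on-demand behavior can be mimicked in plain Python (a sketch with stand-in generators; `stream_samples` and the record fields are illustrative, not the real ChartQA schema): nothing is materialized until iteration, so memory scales with the batch, not the dataset.

```python
def stream_samples(n):
    # Stands in for load_dataset(..., streaming=True): yields one record at a time
    for i in range(n):
        yield {"query": f"What is the value of bar {i}?", "label": [str(i)]}

def format_sample(sample):
    # Convert a raw record into a prompt/target training example
    return {"prompt": f"Question: {sample['query']}", "target": sample["label"][0]}

# Building the pipeline allocates nothing; samples are formatted only when pulled
lazy = map(format_sample, stream_samples(1_000_000))
first = next(lazy)
# first == {"prompt": "Question: What is the value of bar 0?", "target": "0"}
```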

GPU Presets

  Configuration             VRAM       Training Time
  High-performance (16GB)   12–14GB    15–25 min
  Balanced (12GB)           8–10GB     20–30 min
  Conservative (8GB)        4–6GB      25–35 min
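
A selection helper along these lines could map available VRAM to a preset (a hypothetical sketch: `pick_preset` and the per-preset batch sizes are illustrative, chosen so the effective batch size stays constant across presets):

```python
# Illustrative preset table; batch_size * grad_accum = 16 for every tier,
# so lower-VRAM presets trade speed for memory, not effective batch size
PRESETS = {
    "high": {"min_vram_gb": 16, "batch_size": 8, "grad_accum": 2},
    "balanced": {"min_vram_gb": 12, "batch_size": 4, "grad_accum": 4},
    "conservative": {"min_vram_gb": 8, "batch_size": 2, "grad_accum": 8},
}

def pick_preset(vram_gb):
    # Dicts preserve insertion order, so the highest matching tier wins
    for name, cfg in PRESETS.items():
        if vram_gb >= cfg["min_vram_gb"]:
            return name
    raise ValueError("Need at least 8GB of VRAM")

# pick_preset(24) -> "high"; pick_preset(12) -> "balanced"; pick_preset(9) -> "conservative"
```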

Results

Tested on 10 ChartQA validation samples:

  • Accuracy: 40% (4/10 correct)
  • Exact numerical matching works well; complex multi-series chart interpretation remains challenging

What I Learned

  • How vision-language models process multimodal inputs — images and text are tokenized into a shared embedding space and fused through a transformer
  • Streaming datasets for fine-tuning: lazy loading makes large-dataset fine-tuning feasible on hardware that couldn’t fit the full dataset in memory
  • DoRA as an improvement over standard LoRA — decomposing weight updates into magnitude and direction components improves generalization
  • Building VLM evaluation pipelines: running inference, extracting structured answers from free-form generation, and comparing against ground truth at scale
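
The evaluation step in the last bullet can be sketched as answer extraction followed by an exact-match check (a simplified stand-in for the project's pipeline; `extract_answer` and its regex heuristic are hypothetical):

```python
import re

def extract_answer(generation):
    # Pull a structured answer out of free-form text: prefer the last number,
    # since ChartQA answers are usually numeric; otherwise fall back to the
    # last non-empty line, lowercased
    nums = re.findall(r"-?\d+(?:\.\d+)?", generation)
    if nums:
        return nums[-1]
    return generation.strip().split("\n")[-1].strip().lower()

def exact_match(pred, gold):
    return extract_answer(pred) == str(gold).strip().lower()

# exact_match("The total revenue shown is 42.", "42") -> True
# exact_match("Hard to say from this chart.", "42")   -> False
```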

GitHub: github.com/srushtii-m/vlm-finetuning