Fixing Fine-Tuned JSON — Diagnosis, Mitigations, and Grammar-Aware Training
Your fine-tuned model produces valid JSON. Aggregate metrics look great. But is every field actually correct?
When you fine-tune a language model for structured JSON output, standard evaluation gives you one number, the aggregate loss, and that number usually improves. Grammar-constrained decoding (llguidance, Outlines, SGLang) guarantees valid syntax. Everything passes.
But aggregate metrics hide what’s happening at the field level. Consider fine-tuning a 32B-parameter model on a flight booking schema:
The model memorized the majority value for the refundable field instead of learning to predict it from context. Every output was valid JSON. Every output had the wrong refund policy.
This happens when:
slotloss is a free, open-source tool that reveals per-field regressions. See the Quick Start for a hands-on walkthrough.
pip install slotloss
If you find a regression, here are things you can try:
Skewed distributions for constrained fields (e.g., 80% “False”, 20% “True”) cause memorization:
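One way to spot this kind of skew before training is to count how often each value of a constrained field appears in the training set. The sketch below is a minimal illustration, not part of slotloss: the helper names `field_balance` and `is_skewed`, the 0.8 threshold, and the sample records are all assumptions chosen to mirror the 80/20 split mentioned above.

```python
import json
from collections import Counter


def field_balance(examples, field):
    """Return the share of each value of `field` across the examples.

    Values are keyed by their JSON encoding, so booleans appear as
    "true" and "false". Examples that lack the field are ignored.
    """
    counts = Counter(json.dumps(ex[field]) for ex in examples if field in ex)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def is_skewed(shares, threshold=0.8):
    """Flag a field when a single value reaches `threshold` of the examples."""
    return bool(shares) and max(shares.values()) >= threshold


# Hypothetical training records with an 80/20 split on `refundable`.
examples = [{"refundable": False}] * 8 + [{"refundable": True}] * 2

shares = field_balance(examples, "refundable")
print(shares, is_skewed(shares))
```

If a field is flagged, rebalancing or adding minority-value examples is the obvious next step, though the right threshold depends on the schema and on how the values are distributed in production traffic.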
Track per-role loss at each epoch:
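Per-role tracking can be sketched without any particular library, assuming you already have a per-token loss and a role label for each token. The role names (`structure`, `boolean`, `string`), the helper functions, and the loss values below are hypothetical and for illustration only; they are not the slotloss API.

```python
from collections import defaultdict


def per_role_loss(token_losses):
    """Average the loss per role from an iterable of (role, loss) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for role, loss in token_losses:
        sums[role] += loss
        counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}


def regressions(baseline, current, tolerance=0.0):
    """Return the roles whose mean loss rose above baseline by more than `tolerance`."""
    return sorted(
        role
        for role in current
        if role in baseline and current[role] - baseline[role] > tolerance
    )


# Made-up per-token losses before fine-tuning and after one epoch.
epoch0 = [("structure", 0.40), ("structure", 0.60), ("boolean", 0.30), ("string", 1.20)]
epoch1 = [("structure", 0.02), ("structure", 0.04), ("boolean", 0.90), ("string", 0.80)]

base = per_role_loss(epoch0)
now = per_role_loss(epoch1)
regressed = regressions(base, now)
print(regressed)
```

Running the comparison after every epoch makes it possible to stop training, or to roll back to an earlier checkpoint, as soon as one role starts to regress while the others are still improving.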
Larger pretrained models need gentler fine-tuning because they already know JSON structure:
More diverse examples reduce memorization. Scaling from 50-100 to 500+ examples often helps constrained fields.
The strategies above help but don’t solve the root cause. Standard fine-tuning treats every token equally:
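A small arithmetic sketch shows why equal weighting hides the problem. The numbers are invented for illustration: one output with 19 well-predicted structural tokens and a single badly predicted boolean token.

```python
def mean_loss(losses):
    """Uniform average over tokens, as in standard fine-tuning objectives."""
    return sum(losses) / len(losses)


# Hypothetical per-token losses for one JSON output.
structural = [0.01] * 19  # braces, quotes, keys: near-zero loss
boolean = [2.0]           # the one token that carries the decision

aggregate = mean_loss(structural + boolean)
boolean_share = sum(boolean) / sum(structural + boolean)

print(f"aggregate loss: {aggregate:.4f}")
print(f"share of total loss from the boolean token: {boolean_share:.1%}")
```

Under these assumed numbers the aggregate loss looks low even though a single token accounts for most of the total, which is why a per-role view is needed to see the failure.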
These mitigations detect or reduce the problem. They don’t prevent the model from treating boolean decisions as memorization targets rather than genuine context-dependent choices.
We offer fine-tuning that integrates structural awareness directly into training:
In experiments across three model scales (0.5B, 7B, 32B) and multiple schemas, grammar-aware fine-tuning eliminated the boolean regression that standard fine-tuning produces at 32B, with comparable aggregate loss and fewer trainable parameters.
Diagnostic assessment (free): We help you interpret slotloss results and identify which fields are at risk.
Fine-tuning service: Send your JSON Schema and training data. We return a model with per-role performance guarantees:
Consulting: Hands-on help with evaluation methodology, data quality, and training optimization for structured output.
Breck Baldwin breckbaldwin@gmail.com LinkedIn | GitHub
Breck Baldwin is an independent researcher specializing in structured output evaluation and grammar-aware language model training. Creator of LingPipe (2,700+ academic citations). Ph.D. Computer Science, University of Pennsylvania.
Paper: “Valid JSON, Wrong Answer” (2026)