Fixing Fine-Tuned JSON

Your fine-tuned model produces valid JSON. Aggregate metrics look great. But is every field actually correct?

The Hidden Problem

When you fine-tune a language model for structured JSON output, standard evaluation reports a single number: aggregate loss. Grammar-constrained decoding (llguidance, Outlines, SGLang) guarantees valid syntax. So the loss improves, every output parses, and everything appears to pass.

But aggregate metrics hide what’s happening at the field level. In one experiment, fine-tuning a 32B-parameter model on a flight booking schema:

Improved aggregate loss by 69%
Degraded boolean field prediction by 130%
Produced valid JSON on every test example

The model memorized the majority value for the refundable field instead of learning to predict it from context. Every output was valid JSON, yet the refund policy was wrong for every example whose true value was not the majority one.
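A field-level check makes this failure visible. The following is a minimal sketch, not the slotloss API: the helper name per_field_accuracy and the sample records are made up for illustration, and it only compares top-level fields.

```python
import json
from collections import defaultdict

def per_field_accuracy(predictions, references):
    """Return {field: share of examples where the predicted value matches the reference}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred_text, ref_text in zip(predictions, references):
        pred = json.loads(pred_text)  # parses fine: the output is valid JSON
        ref = json.loads(ref_text)
        for field, ref_value in ref.items():
            total[field] += 1
            if field in pred and pred[field] == ref_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Illustrative data: the model always emits the majority value refundable=false.
references = [
    '{"origin": "SFO", "destination": "JFK", "refundable": true}',
    '{"origin": "LAX", "destination": "ORD", "refundable": false}',
    '{"origin": "SEA", "destination": "BOS", "refundable": true}',
    '{"origin": "DEN", "destination": "MIA", "refundable": false}',
]
predictions = [
    '{"origin": "SFO", "destination": "JFK", "refundable": false}',
    '{"origin": "LAX", "destination": "ORD", "refundable": false}',
    '{"origin": "SEA", "destination": "BOS", "refundable": false}',
    '{"origin": "DEN", "destination": "MIA", "refundable": false}',
]

accuracy = per_field_accuracy(predictions, references)
for field, value in accuracy.items():
    print(f"{field}: {value:.2f}")
```

Every prediction parses and two of three fields are perfect, but the refundable field is only as good as always guessing the majority value.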

This happens when:

Training data has skewed distributions for constrained fields (booleans, enums)
The base model is already competent at those fields (larger models are more vulnerable)
Aggregate metrics are dominated by structural tokens that are trivially correct
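The first condition is easy to check before training. Here is a rough sketch that measures how skewed a constrained field is in your training set. The function names and the 0.75 cutoff are assumptions chosen for illustration, not values from the experiments.

```python
from collections import Counter

def value_shares(examples, field):
    """Return {value: share of examples} for one constrained field."""
    counts = Counter(example[field] for example in examples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def is_skewed(shares, max_majority_share=0.75):
    """Flag a field whose most common value exceeds the cutoff (hypothetical default)."""
    return max(shares.values()) > max_majority_share

# Illustrative training set: 80% False, 20% True.
training_examples = [{"refundable": False}] * 8 + [{"refundable": True}] * 2

shares = value_shares(training_examples, "refundable")
print(shares, is_skewed(shares))
```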

Diagnose with slotloss

slotloss is a free, open-source tool that reveals per-field regressions. See the Quick Start for a hands-on walkthrough.

pip install slotloss

General Mitigations

If you find a regression, here are things you can try:

Balance your training data

Skewed distributions for constrained fields (e.g., 80% “False”, 20% “True”) cause memorization:

Oversample minority values for boolean and enum fields
Aim for each enum value to appear in roughly 5-10% or more of training examples, where the number of values allows
Augment with synthetic examples that vary the constrained fields
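Oversampling can be sketched in a few lines. This is one simple approach under stated assumptions: examples are Python dicts, the helper name oversample_field is hypothetical, and minority examples are duplicated at random until every value of the field is equally frequent.

```python
import random
from collections import Counter, defaultdict

def oversample_field(examples, field, seed=0):
    """Duplicate minority-value examples until every value of `field` is equally frequent."""
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for example in examples:
        by_value[example[field]].append(example)
    target = max(len(group) for group in by_value.values())
    balanced = []
    for group in by_value.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the majority count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Illustrative data: 8 non-refundable fares, 2 refundable fares.
examples = (
    [{"fare": "basic", "refundable": False}] * 8
    + [{"fare": "flex", "refundable": True}] * 2
)
balanced = oversample_field(examples, "refundable")
print(Counter(example["refundable"] for example in balanced))
```

Exact duplicates add no new context, so this is best combined with synthetic examples that vary the surrounding fields.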

Monitor per-role loss during training

Track loss separately for each grammar role (keys, enums, booleans, free text) at each epoch:

Stop early if a specific role’s loss starts increasing while aggregate decreases
The optimal epoch for aggregate loss is often past the point where constrained roles start overfitting
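A stopping rule along these lines might look like the sketch below. It assumes you already log a validation loss per role per epoch. The function name, the tolerance parameter, and the loss values are illustrative, not output from a real run.

```python
def last_safe_epoch(per_role_loss, tolerance=0.0):
    """Return the index of the last epoch before any role's loss rises.

    per_role_loss maps each role to its list of losses, one entry per epoch.
    tolerance is how much a loss may rise before it counts as an increase.
    """
    num_epochs = min(len(losses) for losses in per_role_loss.values())
    for epoch in range(1, num_epochs):
        for losses in per_role_loss.values():
            if losses[epoch] > losses[epoch - 1] + tolerance:
                return epoch - 1
    return num_epochs - 1

# Made-up history: keys and enums keep improving, booleans start overfitting.
history = {
    "key":     [0.90, 0.40, 0.20, 0.10, 0.05],
    "enum":    [0.80, 0.60, 0.50, 0.45, 0.44],
    "boolean": [0.60, 0.45, 0.40, 0.55, 0.80],
}

best = last_safe_epoch(history)
print(f"keep the checkpoint from epoch index {best}")
```

In this made-up history, aggregate loss would still be falling at the last epoch, while the rule keeps the earlier checkpoint where the boolean role was at its best. On noisy curves a small positive tolerance avoids stopping on a single blip.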

Reduce training epochs and learning rate for larger models

Larger pretrained models need gentler fine-tuning because they already know JSON structure:

0.5B: lr = 1e-4, 10+ epochs
7B: lr = 5e-5, 5-7 epochs
32B+: lr = 1e-5 to 2e-5, 3-5 epochs, watch per-role metrics carefully

Increase training data

More diverse examples reduce memorization. Scaling from 50-100 to 500+ examples often helps constrained fields.

Why General Mitigations Have Limits

The strategies above help but don’t solve the root cause. Standard fine-tuning treats every token equally:

Structural tokens ({, }, :, ,) are trivially correct and typically outnumber the tokens that encode constrained field values
Constrained fields (booleans, enums) have few valid values and are easily memorized
Aggregate loss is an average over all tokens, so the numerous trivial tokens dominate it
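A small worked example shows how the averaging hides a field-level regression. The token counts and loss values below are made up for illustration and are not the experimental results. The role labels are assumed to come from whatever tokenizer-to-schema alignment you use.

```python
from collections import defaultdict
from statistics import mean

def summarize(token_losses):
    """Given (role, loss) pairs, one per output token, return (aggregate, per-role means)."""
    by_role = defaultdict(list)
    for role, loss in token_losses:
        by_role[role].append(loss)
    aggregate = mean(loss for _, loss in token_losses)
    per_role = {role: mean(losses) for role, losses in by_role.items()}
    return aggregate, per_role

# One JSON object: 40 structural tokens and a single boolean token.
before = [("structural", 0.50)] * 40 + [("boolean", 0.30)]
after = [("structural", 0.05)] * 40 + [("boolean", 0.70)]

agg_before, roles_before = summarize(before)
agg_after, roles_after = summarize(after)

print(f"aggregate: {agg_before:.3f} -> {agg_after:.3f}")
print(f"boolean:   {roles_before['boolean']:.2f} -> {roles_after['boolean']:.2f}")
```

The aggregate falls sharply because 40 of the 41 tokens improved, while the one token that carries the decision more than doubled its loss.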

These mitigations detect or reduce the problem. They don’t prevent the model from treating boolean decisions as memorization targets rather than genuine context-dependent choices.

Grammar-Aware Fine-Tuning

We offer fine-tuning that integrates structural awareness directly into training:

Prevents per-role regressions rather than detecting them after the fact
Every grammar role (keys, enums, booleans, free text) improves or holds steady
Works with any JSON Schema
Produces auditable per-role training metrics and deployment gates

In experiments across three model scales (0.5B, 7B, 32B) and multiple schemas, grammar-aware fine-tuning eliminated the boolean regression that standard fine-tuning produces at 32B, with comparable aggregate loss and fewer trainable parameters.

What We Offer

Diagnostic assessment (free): We help you interpret slotloss results and identify which fields are at risk.

Fine-tuning service: Send your JSON Schema and training data. We return a model with per-role performance guarantees:

Per-role training dashboard showing convergence for each field type
Deployment gates: configurable thresholds per grammar role
Regression detection across model versions
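To make the idea of gates and regression checks concrete, here is a minimal sketch. It is not the service's implementation: the function names, threshold values, and the 5% regression margin are all hypothetical.

```python
def check_gates(per_role_loss, thresholds):
    """Return {role: loss} for roles that are missing or exceed their threshold."""
    failures = {}
    for role, limit in thresholds.items():
        loss = per_role_loss.get(role)
        if loss is None or loss > limit:
            failures[role] = loss
    return failures

def find_regressions(baseline, candidate, max_relative_increase=0.05):
    """Return {role: (old, new)} where the candidate's loss rose by more than the margin."""
    regressions = {}
    for role, old in baseline.items():
        new = candidate.get(role)
        if new is not None and new > old * (1 + max_relative_increase):
            regressions[role] = (old, new)
    return regressions

# Made-up per-role losses for two model versions.
baseline = {"key": 0.50, "enum": 0.60, "boolean": 0.30}
candidate = {"key": 0.05, "enum": 0.44, "boolean": 0.70}
thresholds = {"key": 0.10, "enum": 0.50, "boolean": 0.40}

failed = check_gates(candidate, thresholds)
regressed = find_regressions(baseline, candidate)
print("blocked by gates:", failed)
print("regressed vs. baseline:", regressed)
```

A deployment script could refuse to ship whenever either result is non-empty.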

Consulting: Hands-on help with evaluation methodology, data quality, and training optimization for structured output.

Contact

Breck Baldwin breckbaldwin@gmail.com LinkedIn | GitHub

About

Breck Baldwin is an independent researcher specializing in structured output evaluation and grammar-aware language model training. Creator of LingPipe (2,700+ academic citations). Ph.D. Computer Science, University of Pennsylvania.

Paper: “Valid JSON, Wrong Answer” (2026)