slotloss

Fixing Fine-Tuned JSON — Diagnosis, Mitigations, and Grammar-Aware Training


Fixing Fine-Tuned JSON

Your fine-tuned model produces valid JSON. Aggregate metrics look great. But is every field actually correct?

The Hidden Problem

When you fine-tune a language model for structured JSON output, standard evaluation reports a single number: aggregate loss, which improved. Grammar-constrained decoding (llguidance, Outlines, SGLang) guarantees valid syntax. Everything passes.

But aggregate metrics hide what’s happening at the field level. Fine-tuning a 32B parameter model on a flight booking schema:

The model memorized the majority value for the refundable field instead of learning to predict it from context. Every output was valid JSON. Every output had the wrong refund policy.
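A minimal sketch of what a field-level check looks like, assuming predictions arrive as JSON strings and references as dictionaries. This is not the slotloss API; the function name `per_field_accuracy` and the sample records are hypothetical and only illustrate how a collapsed field can hide behind valid output.

```python
import json
from collections import defaultdict

def per_field_accuracy(predictions, references):
    """Return {field: fraction of examples whose predicted value equals the reference}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred_text, ref in zip(predictions, references):
        pred = json.loads(pred_text)  # assumes every output is already valid JSON
        for field, expected in ref.items():
            total[field] += 1
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Hypothetical data: the model copies "origin" correctly but always emits
# the same value for "refundable", regardless of context.
references = [
    {"origin": "SFO", "refundable": False},
    {"origin": "JFK", "refundable": True},
    {"origin": "LAX", "refundable": False},
    {"origin": "ORD", "refundable": True},
]
predictions = [json.dumps({**ref, "refundable": False}) for ref in references]

accuracy = per_field_accuracy(predictions, references)
print(accuracy)
```

Every prediction here parses, and `origin` scores perfectly, yet `refundable` is only right when the reference happens to match the constant the model emits.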

This happens when:

Diagnose with slotloss

slotloss is a free, open-source tool that reveals per-field regressions. See the Quick Start for a hands-on walkthrough.

pip install slotloss

General Mitigations

If you find a regression, here are things you can try:

Balance your training data

Skewed distributions for constrained fields (e.g., 80% “False”, 20% “True”) cause memorization:
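One way to check and even out a skewed field is sketched below, assuming training examples are plain dictionaries. The helper `oversample_field` is a hypothetical name, and oversampling is only one possible balancing strategy. Collecting more minority-value examples is usually preferable, since duplicated rows can themselves be memorized.

```python
import random
from collections import Counter, defaultdict

def oversample_field(examples, field, seed=0):
    """Duplicate minority-value examples until every value of `field` is equally frequent."""
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for ex in examples:
        by_value[ex[field]].append(ex)
    target = max(len(group) for group in by_value.values())
    balanced = []
    for group in by_value.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest group.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical dataset with the 80% / 20% skew described above.
examples = [{"id": i, "refundable": i % 5 == 0} for i in range(100)]
before = Counter(ex["refundable"] for ex in examples)
after = Counter(ex["refundable"] for ex in oversample_field(examples, "refundable"))
print(before, after)
```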

Monitor per-role loss during training

Track per-role loss at each epoch:
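A rough sketch of per-role tracking, assuming you already have a per-token loss and a role label for each target token. The role names (`structural`, `boolean`, `free_text`) and both helper functions are illustrative assumptions, not the slotloss interface.

```python
from collections import defaultdict

def per_role_loss(token_losses, token_roles):
    """Average the per-token losses separately for each role label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, role in zip(token_losses, token_roles):
        sums[role] += loss
        counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}

def regressed_roles(history):
    """Return roles whose latest-epoch loss is above the best loss of earlier epochs."""
    if len(history) < 2:
        return []
    *earlier, latest = history
    flagged = []
    for role, loss in latest.items():
        best = min(epoch[role] for epoch in earlier)
        if loss > best:
            flagged.append(role)
    return flagged

# Hypothetical per-epoch summaries, one dict per epoch.
history = [
    {"structural": 0.05, "boolean": 0.60, "free_text": 1.20},
    {"structural": 0.01, "boolean": 0.45, "free_text": 0.90},
    {"structural": 0.01, "boolean": 0.70, "free_text": 0.80},
]
flagged = regressed_roles(history)
print(flagged)
```

In this made-up run the boolean role gets worse in the last epoch while the other roles hold or improve, which is the pattern an aggregate curve would hide.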

Reduce training epochs and learning rate for larger models

Larger pretrained models need gentler fine-tuning — they already know JSON structure:

Increase training data

More diverse examples reduce memorization. Scaling from 50-100 to 500+ examples often helps constrained fields.

Why General Mitigations Have Limits

The strategies above help but don’t solve the root cause. Standard fine-tuning treats every token equally:

  1. Structural tokens ({, }, :, ,) are trivially correct, yet each one counts as much toward the loss as a field value
  2. Constrained fields (booleans, enums) have few valid values and are easily memorized
  3. Aggregate loss is an unweighted average over all tokens, so the many trivial tokens dominate it

These mitigations detect or reduce the problem. They don’t prevent the model from treating boolean decisions as memorization targets rather than genuine context-dependent choices.
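The way trivial tokens dominate the aggregate can be seen with a little arithmetic. The numbers below are invented for illustration: one 20-token output with 19 structural tokens and a single boolean token, averaged with equal weight per token.

```python
def mean_loss(losses):
    """Unweighted mean, the way a standard token-level loss is averaged."""
    return sum(losses) / len(losses)

def aggregate(per_role):
    """Pool every token from every role into one mean."""
    return mean_loss([loss for losses in per_role.values() for loss in losses])

# Hypothetical per-token losses before and after fine-tuning.
before = {"structural": [0.20] * 19, "boolean": [0.70]}
after = {"structural": [0.01] * 19, "boolean": [2.00]}

agg_before = aggregate(before)  # (19 * 0.20 + 0.70) / 20
agg_after = aggregate(after)    # (19 * 0.01 + 2.00) / 20
print(round(agg_before, 4), round(agg_after, 4))
```

The aggregate roughly halves even though the loss on the one token that carries a decision nearly triples, so an improved aggregate says little about the boolean field.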

Grammar-Aware Fine-Tuning

We offer fine-tuning that integrates structural awareness directly into training:

In experiments across three model scales (0.5B, 7B, 32B) and multiple schemas, grammar-aware fine-tuning eliminated the boolean regression that standard fine-tuning produces at 32B, with comparable aggregate loss and fewer trainable parameters.

What We Offer

Diagnostic assessment (free): We help you interpret slotloss results and identify which fields are at risk.

Fine-tuning service: Send your JSON Schema and training data. We return a model with per-role performance guarantees:

Consulting: Hands-on help with evaluation methodology, data quality, and training optimization for structured output.

Contact

Breck Baldwin breckbaldwin@gmail.com LinkedIn | GitHub

About

Breck Baldwin is an independent researcher specializing in structured output evaluation and grammar-aware language model training. Creator of LingPipe (2,700+ academic citations). Ph.D. Computer Science, University of Pennsylvania.

Paper: “Valid JSON, Wrong Answer” (2026)