Reliable JSON from Cheaper LLMs

I recently found myself working on another NYC apartment search app. It asks what you want in plain English, and finds matches for you. Users typically give sentences like:

“Williamsburg or Greenpoint, 2 bed, no basement, dog friendly, short walk to the L.”

which an LLM converts to something a query engine can understand:

{
  "areas": ["WILLIAMSBURG", "GREENPOINT"],
  "bedsMin": 2,
  "excludeBasement": true,
  "requireDogsAllowed": true,
  "subwayRequiredRoutes": ["L"]
}

In version 1, that LLM was Claude Haiku 4.5. It produces structured (JSON) output through Vercel AI SDK's jsonTool mode. The trick is to hand the model a fake tool to call, one whose inputs match your schema. It thinks it's invoking a tool, but really it's just filling out a form. Then Zod checks that form against the same schema. It works. It was 100% valid across my benchmarks, at $8.51 per thousand extractions. But Haiku 4.5 is the "fast and cheap" model everyone reaches for, and I wanted to know if it was really that fast and really that cheap.

Valid JSON, wrong answer

Getting parseable JSON out of a model stopped being the hard part a while ago. The hard part is that a model can produce valid JSON that is silently wrong. Say a user writes “short walk to the F” and the model returns:

{ "maxWalkToSubwayMin": 30, "subwayRequiredRoutes": ["F"] }

It parses, the field is real, the type is right. But thirty minutes is not a short walk. The schema only constrains maxWalkToSubwayMin to a number; the model has to infer that “short” should land near five. So I graded each model not just on whether its output was valid, but on whether it made sense in the real world.
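A minimal sketch of what grading for "makes sense" could look like, assuming each vague phrase in a test prompt is labelled with a plausible range. The Expectation type, the isPlausible helper, and the 3-to-10-minute window are illustrative assumptions, not the rubric used in this benchmark.

```typescript
// Hypothetical semantic check: the range below is an illustrative assumption,
// not the rubric used in the benchmark.
type Extraction = {
  maxWalkToSubwayMin?: number;
  subwayRequiredRoutes?: string[];
};

type Expectation = {
  field: keyof Extraction;
  min: number; // inclusive lower bound for a sensible answer
  max: number; // inclusive upper bound for a sensible answer
};

// "short walk" should land near five minutes, so accept a small window around it.
const shortWalk: Expectation = { field: "maxWalkToSubwayMin", min: 3, max: 10 };

function isPlausible(output: Extraction, expected: Expectation): boolean {
  const value = output[expected.field];
  if (typeof value !== "number") return false; // missing or wrong type
  return value >= expected.min && value <= expected.max;
}

// Both objects pass a schema that only says "number"; only one makes sense.
const schemaValidButWrong: Extraction = { maxWalkToSubwayMin: 30, subwayRequiredRoutes: ["F"] };
const sensible: Extraction = { maxWalkToSubwayMin: 5, subwayRequiredRoutes: ["F"] };

console.log(isPlausible(schemaValidButWrong, shortWalk)); // false
console.log(isPlausible(sensible, shortWalk)); // true
```

The schema check and the plausibility check answer different questions, which is why they are worth scoring separately.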

The provider refused the schema

The clean way to get structured data is to hand the provider your schema and let it constrain decoding to match: the schema compiles into a grammar, and the model may only emit tokens that grammar allows. That works beautifully on a small, tidy schema. Mine is neither. The preferences schema has 34 fields: beds, price, pets, subway routes, outdoor space, building age, move-in date, etc. Most are optional, because most users don't mention most things. Grammar-constrained decoding does not love that shape; Anthropic's structured-output docs are candid about how fast optional and union fields inflate the grammar.

Google's Gemini models also support structured JSON outputs. They're well-priced and fast, but pass one of them this 34-field schema and you get an error instead of tokens:

The specified schema produces a constraint that has too much branching for serving

I would have to get the JSON some other way.

Plain JSON and a repair loop

I stopped asking the provider to enforce anything. I turned structured outputs off (structuredOutputs: false on the Google provider), asked for a plain JSON object as text, and made my own code responsible for enforcing the shape. Parsing is cheap, so trivial failures get fixed locally: if the model wraps the object in markdown fences, I strip them rather than pay a round trip to complain. Then Zod validates. Only a real parse or schema failure earns a repair retry, where the model gets its own output and the error back, up to three attempts.

import { generateText } from "ai"

// model, preferencesSchema (Zod), stripFences, safeJsonParse, and format are
// defined elsewhere in the app.
const MAX_REPAIRS = 3

async function repair({ input, previous, error, attempt }) {
  // Hand the model its rejected output, the validator's error, and the
  // original request; ask for corrected JSON only. The repaired text loops
  // back through parseValidateOrRepair to be re-checked.
  const { text } = await generateText({
    model,
    prompt: `Your previous output failed validation. Return corrected JSON only.

<output>${previous}</output>
<error>${error}</error>
<request>${input}</request>`,
  })
  return parseValidateOrRepair(text, input, attempt)
}

async function parseValidateOrRepair(text, input, attempt = 0) {
  const parsed = safeJsonParse(stripFences(text))
  const result = preferencesSchema.safeParse(parsed)
  if (result.success) return result.data
  // Stop after three repair attempts instead of recursing forever.
  if (attempt >= MAX_REPAIRS) throw new Error(`Still invalid after ${MAX_REPAIRS} repairs`)
  return repair({ input, previous: text, error: format(result), attempt: attempt + 1 })
}
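The two local helpers named in that code, stripFences and safeJsonParse, are not shown in the post. Here is one possible version of each, as a sketch: the regular expression and the choice to return undefined on a parse failure are assumptions, not the app's actual code.

```typescript
// Assumed implementations of the two helpers; the originals are not shown.

// The three-backtick markdown fence marker, built at runtime.
const FENCE = "`".repeat(3);

// Matches an entire string wrapped in one fence, with an optional language tag.
const fencePattern = new RegExp(
  "^" + FENCE + "[a-zA-Z]*[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?[ \\t]*" + FENCE + "$",
);

// Remove a single wrapping markdown fence if present; otherwise just trim.
function stripFences(text: string): string {
  const trimmed = text.trim();
  const match = trimmed.match(fencePattern);
  return match ? match[1].trim() : trimmed;
}

// Parse JSON without throwing. JSON.parse never returns undefined for valid
// input, so undefined is an unambiguous "parse failed" signal for the validator.
function safeJsonParse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

const fenced = FENCE + "json\n{\"bedsMin\": 2}\n" + FENCE;
console.log(safeJsonParse(stripFences(fenced))); // { bedsMin: 2 }
```

Handing undefined to the schema check means a parse failure and a shape failure both flow into the same repair path.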

The useful thing about this setup is that it has nothing to do with any one provider. Any model that can write JSON as text can run it, which is what let me put the same plain-JSON path in front of every challenger through OpenRouter and compare them on even terms.

The benchmark

I wrote 13 apartment-search prompts, labelled all 34 fields by hand, and ran every prompt three times through eight models. The repeats measure how steady a model is. Claude Haiku 4.5 stayed on its production jsonTool path; the other seven ran the plain-JSON contract through OpenRouter. Those seven are Gemini 3.1 Flash-Lite, two of OpenAI's cheaper models, and one model each from four Chinese labs: DeepSeek, Tencent, MiniMax, and Z-ai. A case passes “strict” only if the whole object is right; one real field wrong fails it.
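To make the two scores concrete, here is a sketch of per-case scoring under some assumptions: fields are compared one by one against the hand label, arrays are treated as order-insensitive, and field accuracy is the share of schema fields that match. The names scoreCase and sameValue are hypothetical, and the field list is shortened from 34 to six for the example.

```typescript
type Prefs = Record<string, unknown>;

// Order-insensitive for arrays (an assumption), exact match otherwise.
// Two absent fields compare equal, so omitting an unlabelled field is correct.
function sameValue(a: unknown, b: unknown): boolean {
  if (Array.isArray(a) && Array.isArray(b)) {
    const left = a.map((v) => JSON.stringify(v)).sort();
    const right = b.map((v) => JSON.stringify(v)).sort();
    return left.length === right.length && left.every((v, i) => v === right[i]);
  }
  return JSON.stringify(a) === JSON.stringify(b);
}

// Score one output against its label across every schema field.
function scoreCase(output: Prefs, label: Prefs, schemaFields: string[]) {
  const wrong = schemaFields.filter((f) => !sameValue(output[f], label[f]));
  return {
    strict: wrong.length === 0, // the whole object must be right
    fieldAccuracy: (schemaFields.length - wrong.length) / schemaFields.length,
    wrong,
  };
}

const schemaFields = [
  "areas", "bedsMin", "excludeBasement",
  "requireDogsAllowed", "subwayRequiredRoutes", "maxWalkToSubwayMin",
];
const label: Prefs = {
  areas: ["WILLIAMSBURG", "GREENPOINT"], bedsMin: 2, excludeBasement: true,
  requireDogsAllowed: true, subwayRequiredRoutes: ["L"],
};
// Same answer, different area order, plus one field the user never asked for.
const output: Prefs = {
  areas: ["GREENPOINT", "WILLIAMSBURG"], bedsMin: 2, excludeBasement: true,
  requireDogsAllowed: true, subwayRequiredRoutes: ["L"], maxWalkToSubwayMin: 30,
};

console.log(scoreCase(output, label, schemaFields)); // strict: false, one wrong field
```

A single invented field fails the strict score while leaving field accuracy high, which is the pattern behind a low strict number sitting next to a mean semantic score near 100.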

Figure: Eight models on one extraction schema. A radar chart with one polygon per model across five axes. Farther from the center is better on each axis, but the shapes cross, so a larger polygon is not the overall winner. Each spoke uses its own fixed, labeled scale; Speed and Cost are inverted log scores, so cheaper and faster point outward. Consistency measures agreement across valid repeated outputs. Per-model values, including seconds and dollars, are in the table below.

Table: per-model results for the July 2026 benchmark. Columns are first-pass, final, and strict validity; mean semantic accuracy; consistency across valid repeated outputs; p95 latency; and cost per 1,000 extractions.
Model	First-pass	Final	Strict	Mean semantic (%)	Consistency	p95	$/1k
DeepSeek V4 Flash†	97.4%	97.4%	89.7%	97.1	84.6%	96s	$0.43
Tencent Hy3	100%	100%	94.9%	99.9	94.9%	18.7s	$0.44
MiniMax M3	100%	100%	100%	100	87.2%	30s	$0.76
GPT-5.6 Luna	100%	100%	97.4%	99.9	94.9%	3.79s	$1.02
Gemini 3.1 Flash-Lite	100%	100%	92.3%	99.8	100%	2.22s	$1.04
GPT-5 Mini	100%	100%	100%	100	92.3%	32.9s	$1.58
GLM 5.2‡	87.2%	100%	100%	100	92.3%	32.3s	$2.92
Claude Haiku 4.5	100%	100%	84.6%	99.6	94.9%	4.13s	$8.51*


† DeepSeek: 97.4% valid after repairs, and the run’s only hard timeout (96s p95). Its mean cost covers the 38 of 39 calls with a known price.

‡ GLM 5.2: 87.2% valid on the first pass, repaired to 100%.

July 2026 run: 13 labelled prompts × 3 repeats per model; the repeats measure stability. Haiku runs the production jsonTool path; all others run plain JSON via OpenRouter with local Zod validation and repair, under a policy-adjusted rubric.

Start with the money. Gemini 3.1 Flash-Lite came back 100% valid, 92.3% strict, a 0.96-second median, and $1.04 per thousand, against Haiku's $8.51 — about an eighth of the cost for comparable quality and speed. That gap is part prompt size, part price: Haiku's jsonTool call carries roughly 7,600 input tokens against the plain-JSON path's 3,600, and its output tokens are priced higher. I can't cleanly separate the two, since provider, tokenizer, and price all shift together between the native and OpenRouter paths, but the direction is not subtle.
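The arithmetic behind "about an eighth" can be sketched as below, along with the usual per-call cost formula. The two $/1k figures and the two input-token counts come from this post; the per-million-token prices and the 200 output tokens are made-up placeholders, used only to show how prompt size alone moves the bill when price is held fixed (which the real comparison does not do).

```typescript
// Cost of 1,000 extractions from per-call token counts and prices per million tokens.
function costPerThousand(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  const perCall = (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
  return perCall * 1000;
}

// Measured figures from the results table.
const haikuPer1k = 8.51;
const flashLitePer1k = 1.04;
const costRatio = flashLitePer1k / haikuPer1k; // about 0.12, roughly one eighth

// Prompt-size effect in isolation: identical placeholder prices, different input sizes.
// The 1.0 and 5.0 prices and the 200 output tokens are NOT real rates or measurements.
const bigPrompt = costPerThousand(7600, 200, 1.0, 5.0);
const smallPrompt = costPerThousand(3600, 200, 1.0, 5.0);

console.log(costRatio.toFixed(3), (bigPrompt / smallPrompt).toFixed(2));
```

Under any single price, the larger prompt costs more but nowhere near eight times more, which is consistent with the gap being part prompt size and part price.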

Now the harsh part. On strict scoring Haiku is the worst of the eight, at 84.6%. That reads worse than it is: its mean semantic accuracy is 99.55%, so those are one-field slips, not broken extractions. The most repeated one: on the prompt that named no constraints at all, it invented a tastePrompt field, three times in a row. Flash-Lite's only strict miss was the same shape: it invented a bedsMin on one fixture, three times out of three.

Then the trap the medians hide. The three models that hit 100% strict (MiniMax M3, GPT-5 Mini, and GLM 5.2) carry p95 latencies of 30, 33, and 32 seconds. Their medians look fine, but their tails would blow a ten-second deadline. DeepSeek is the blunt version of the lesson: cheapest of everyone at $0.43 per thousand, and the only model in the run to hit a hard timeout, with a 96-second p95. Cheapest is not the same as shippable.
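A small sketch of why the median and the tail can disagree. It uses the nearest-rank definition of a percentile (interpolating definitions give slightly different values), and the latency samples are invented for illustration, not taken from the benchmark.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Invented latencies in seconds: eighteen quick calls and two very slow ones.
const latencies = [2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 31, 33];

const p50 = percentile(latencies, 50);
const p95 = percentile(latencies, 95);
const deadlineSeconds = 10;

console.log(p50, p95); // 3 31
console.log(p95 <= deadlineSeconds); // false: the median is fine, the tail is not
```

Two slow calls out of twenty leave the median untouched and still push p95 past the deadline.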

One more honesty note about that chart. The first time I scored the run, the biggest source of failures wasn't any model. It was a product decision I hadn't made. When a user says nothing about building size, should the extractor omit the field or return all three size buckets? I never decided, and my first rubric treated the omission as wrong: 115 of the 195 cases in that first confirmation set failed on that one expectation. So I rescored offline, with no reruns, against a rubric that accepts either reasonable reading while still failing a genuinely bad answer, like narrowing an unspecified field to one restrictive bucket.
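The policy-adjusted rule for that one field can be sketched as a small predicate. The field name buildingSizes and the three bucket names are hypothetical, since the post does not give the real ones; only the logic (accept omission or all buckets, fail anything narrower) comes from the text.

```typescript
// Hypothetical bucket names; the post does not list the real ones.
const ALL_SIZES = ["SMALL", "MEDIUM", "LARGE"] as const;
type SizeBucket = (typeof ALL_SIZES)[number];

// For prompts that say nothing about building size, accept either reasonable
// reading: the field omitted, or every bucket returned. A narrower subset
// restricts results the user never asked to restrict, so it fails.
function passesUnspecifiedSize(buildingSizes: SizeBucket[] | undefined): boolean {
  if (buildingSizes === undefined) return true;
  return ALL_SIZES.every((size) => buildingSizes.indexOf(size) !== -1);
}

console.log(passesUnspecifiedSize(undefined)); // true
console.log(passesUnspecifiedSize(["SMALL", "MEDIUM", "LARGE"])); // true
console.log(passesUnspecifiedSize(["SMALL"])); // false
```

Because the rule only reads stored outputs, it can be applied offline without rerunning any model.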

The YAML experiment

There is an idea I kept wanting to be true. JSON spends a lot of characters on syntax: braces, quotes, commas. And YAML spends far fewer! So a model writing YAML should have less to emit and less to get wrong. I tested it in an earlier, smaller run back in June: 5 prompts, 2 repeats, straight against the Gemini API. The July benchmark never retested it, so treat this as a side note.

The first half of the hunch held. Across the three stronger Gemini models, YAML used about 7% fewer output tokens on the first attempt. But it was first-pass valid far less often: 100% for JSON against 88% for YAML, and only 80% for Flash-Lite. YAML is permissive, and it will happily parse into a type you didn't want, so the syntax I saved came back as validation and repair work. Once the repairs were counted, the token advantage evened out: Flash-Lite spent 202 output tokens on JSON against 186 on YAML, while Gemini 2.5 Flash spent 828 on JSON against 840 on YAML. YAML came out cheaper for one model and dearer for the next, and across the set the two formats landed level. The stricter syntax was buying first-pass reliability, and that was worth more than the characters it cost. (One weaker model, Gemini 2.5 Flash-Lite, was erratic enough on YAML to settle nothing either way.)

What I'd ship

If I were optimizing for speed and cost, I'd ship Gemini 3.1 Flash-Lite as the default and keep GPT-5.6 Luna behind it. Luna costs about the same, $1.02 per thousand, and gives up only a second and a half at the tail (3.79s p95 vs 2.22s), but it scores higher on strict accuracy, 97.4% vs. 92.3%. So the rule I'd write is plain: default to Flash-Lite for the latency, and if a wrong extraction would cost you more than that second and a half, make Luna the default instead. All of this is thirteen development prompts against one schema, not a week of production traffic, so I'd pilot it before I trusted it.

In production I'd also hedge the request so it never hangs: default to Flash-Lite, and if nothing's back in four seconds, fire the same extraction at a second model and then a third, taking the first valid response under a hard deadline. This guarantees a response (not necessarily a correct one!). The fastest model quietly becomes your answer, and at least you don't leave users hanging on an error.
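Here is a minimal sketch of that hedging pattern, assuming each attempt is a function that resolves only with an already-validated result and rejects otherwise. The hedged function and its parameters are hypothetical names. It does not cancel the losing requests; a real version would probably pass an AbortSignal to each call.

```typescript
type Attempt<T> = () => Promise<T>;

// Start attempts[0] now and attempts[i] after i * staggerMs if nothing has won yet.
// Resolves with the first success; rejects if every attempt fails or the deadline passes.
function hedged<T>(attempts: Attempt<T>[], staggerMs: number, deadlineMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    let failures = 0;
    const timers: ReturnType<typeof setTimeout>[] = [];

    // Settle exactly once and clear any attempts that have not started yet.
    const finish = (settle: () => void) => {
      if (settled) return;
      settled = true;
      timers.forEach((t) => clearTimeout(t));
      settle();
    };

    attempts.forEach((attempt, i) => {
      timers.push(setTimeout(() => {
        if (settled) return;
        Promise.resolve().then(attempt).then(
          (value) => finish(() => resolve(value)),
          () => {
            failures += 1;
            if (failures === attempts.length) {
              finish(() => reject(new Error("all attempts failed")));
            }
          },
        );
      }, i * staggerMs));
    });

    // Hard deadline: give up even if some attempt is still in flight.
    timers.push(setTimeout(() => finish(() => reject(new Error("deadline exceeded"))), deadlineMs));
  });
}

// Usage sketch with a hypothetical extractWith(modelId, text) helper, a four-second
// stagger, and an assumed ten-second deadline:
// hedged([() => extractWith("primary", q), () => extractWith("backup", q)], 4000, 10000)
```

Note that the promise can still reject, so the caller needs a fallback for the case where the deadline passes with no valid result.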

Tangent: I recently asked NanoBanana for a photo of me beneath a Coca-Cola beach umbrella. It looked real, until I noticed the logo on the inside of the canopy: legible from left to right. Logos are printed on the outside of beach umbrellas. From beneath, it should have appeared reversed: "aloC-acoC". The model knew to draw pixel-perfect sand, sweat, waves, and shimmers. But it didn't understand the geometry of the scene.

A thirty-minute “short walk” is the same kind of mistake. The output looks right until you ask whether its details make sense. Valid JSON is the floor. The goal is an object that makes the app behave correctly.