Reliable JSON from Cheaper LLMs
How plain JSON, Zod validation, and a repair loop let Gemini 3.1 Flash-Lite run 80% cheaper and 2× faster than our Claude Haiku baseline.
Note: Small local benchmark: 5 prompts, 2 repeats, rerun once.
A very common LLM use case is turning messy human text into structured data.
In my case, the messy text was apartment-search preferences:
“Looking in Williamsburg or Greenpoint, 2 bed, no basement, private balcony, dog friendly, short walk to the L, ideally no fee.”
That sentence is useful to a human. It is much less useful to an app.
The app wants something more like:
{
  "areas": ["WILLIAMSBURG", "GREENPOINT"],
  "bedsMin": 2,
  "bedsMax": 2,
  "excludeBasement": true,
  "requireOutdoorSpace": true,
  "outdoorPrivateOnly": true,
  "requireDogsAllowed": true,
  "subwayRequiredRoutes": ["L"]
}
Once the preference is structured, the app can filter listings, compare tradeoffs, compute features, and avoid losing meaning inside a blob of prose.
The obvious first approach is to tell the model:
Extract the preferences. Return JSON only.
That works surprisingly often.
It also stops being enough pretty quickly.
Valid JSON is not the same as a correct extraction
There are several levels of “working.”
The model might return invalid JSON.
It might return valid JSON wrapped in markdown fences.
It might return valid JSON with the wrong fields.
It might return valid JSON with the right fields and bad judgment.
That last one is the most interesting.
If a user says “short walk to the F,” and the model outputs:
{
  "maxWalkToSubwayMin": 30,
  "subwayRequiredRoutes": ["F"]
}
the JSON is valid. The extraction is probably bad.
This is why structured output is not one problem. It is three problems:
Prompt = how to interpret the user’s words
Schema = what shape the output may have
Output path = how the answer gets back into my code
I’m using “output path” carefully here. I do not mean HTTP, SSE, or networking transport. I mean whether the answer comes back through native structured output, tool-call arguments, or plain text that my code parses.
The schema can tell the model that maxWalkToSubwayMin is a number.
It cannot, by itself, teach the model that “short walk” should probably be closer to 5 minutes than 20.
The schema is not the prompt
I originally had a blurry mental model here.
I thought: if I pass a Zod schema, and the provider has a structured-output mode, maybe the model just fills out the schema.
That is partly true, but it misses the important part.
A schema can say:
// preferences-schema.example.ts
import { z } from 'zod'

// areaEnum is assumed to be a Zod enum of supported neighborhoods, defined elsewhere.
const preferencesSchema = z.object({
  priceMax: z.number().optional(),
  areas: z.array(areaEnum).optional(),
  requireDoorman: z.boolean().optional(),
  maxWalkToSubwayMin: z.number().optional(),
})
That describes the shape.
It does not explain policy:
- “Around 4k” should become priceMax: 4000.
- “Doorman would be nice” should not become requireDoorman: true.
- “No basement” should become excludeBasement: true.
- “Short bike to the G” may need to become a walk-equivalent distance if the app only stores walk-time constraints.
Those rules live in the prompt, tests, and product behavior.
The schema is the type contract. The prompt is the extraction contract.
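One way to keep the extraction contract visible is to store the interpretation rules as data and render them into the system prompt. The sketch below is only an illustration: `PolicyRule`, `policyRules`, and `buildSystemPrompt` are hypothetical names, and the rules are the examples listed above, not the production prompt.

```typescript
// extraction-policy.example.ts
// Hypothetical sketch: policy rules live next to the prompt, not in the schema.
interface PolicyRule {
  phrase: string // example user wording
  rule: string // how the extractor should treat that wording
}

const policyRules: PolicyRule[] = [
  { phrase: 'around 4k', rule: 'set priceMax to 4000' },
  { phrase: 'doorman would be nice', rule: 'do NOT set requireDoorman; it is a soft preference' },
  { phrase: 'no basement', rule: 'set excludeBasement to true' },
  { phrase: 'short walk to the F', rule: 'use a small maxWalkToSubwayMin, closer to 5 than 20' },
]

// Render the rules as a bullet list under a fixed instruction header.
function buildSystemPrompt(rules: PolicyRule[]): string {
  const lines = rules.map((r) => `- "${r.phrase}": ${r.rule}`)
  return ['Extract apartment preferences.', 'Interpretation rules:', ...lines].join('\n')
}

const systemPrompt = buildSystemPrompt(policyRules)
```

Keeping the rules in a list like this also makes it easy to reuse the same phrases as benchmark inputs, so the prompt and the tests do not drift apart.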
Four ways to get JSON from an LLM
There are a few different output paths, and they are easy to confuse.
1. “Please return JSON”
The simplest approach is just prompt engineering:
Extract the user’s apartment preferences.
Return only valid JSON.
Do not include markdown.
This is cheap to implement and portable across models.
The downside is that the model is still just writing text. It can add commentary, wrap the object in fenced code blocks, omit fields, invent fields, or produce JSON that parses but does not match your app’s schema.
You can improve this a lot with local validation:
// parse-json-output.example.ts
const parsed = JSON.parse(stripJsonFences(modelText))
const result = preferencesSchema.safeParse(parsed)
if (!result.success) {
  // Ask the model to repair its previous answer.
}
Some failures are too cheap to send back to the model. If the response is valid JSON wrapped in markdown fences, strip the fences locally and keep going. Save repair retries for real parse failures or schema failures.
That is not fancy, but it is a real strategy.
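To see which of the failure levels a given response falls into, a small classifier can help when reading benchmark logs. This is a dependency-free sketch, not the code used in the benchmark: `classifyOutput`, `allowedFields`, and the level names are made up for illustration, and the field list is only a subset. It also reports invented fields explicitly, which matters because Zod object schemas drop unrecognized keys by default, so a plain `safeParse` can succeed while hiding them.

```typescript
// classify-output.example.ts
type OutputLevel = 'invalid_json' | 'not_an_object' | 'unknown_fields' | 'ok'

interface Classification {
  level: OutputLevel
  wasFenced: boolean // true when the model wrapped its answer in a markdown fence
  unknownFields: string[] // keys the app does not know about
}

// Built with repeat() so this example does not embed a literal fence.
const FENCE = '`'.repeat(3)

// Illustrative subset of the app's field names (assumption, not the full schema).
const allowedFields = new Set(['areas', 'bedsMin', 'bedsMax', 'priceMax', 'excludeBasement'])

function classifyOutput(modelText: string, allowed: Set<string>): Classification {
  const trimmed = modelText.trim()
  const wasFenced = trimmed.startsWith(FENCE)
  const body = trimmed
    .replace(new RegExp('^' + FENCE + '(?:json)?\\s*', 'i'), '')
    .replace(new RegExp('\\s*' + FENCE + '$'), '')
    .trim()

  let parsed: unknown
  try {
    parsed = JSON.parse(body)
  } catch {
    return { level: 'invalid_json', wasFenced, unknownFields: [] }
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { level: 'not_an_object', wasFenced, unknownFields: [] }
  }
  const unknownFields = Object.keys(parsed).filter((key) => !allowed.has(key))
  return { level: unknownFields.length > 0 ? 'unknown_fields' : 'ok', wasFenced, unknownFields }
}
```

A response with leading commentary lands in `invalid_json`, a fenced but otherwise fine object lands in `ok` with `wasFenced` set, and an object with an invented key lands in `unknown_fields`.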
2. Native structured output
Native structured output means the provider receives a schema and constrains the model’s response to match it.
Anthropic’s current structured-output docs describe JSON outputs through output_config.format, strict tool use with strict: true, and schema-constrained generation for valid, parseable downstream output. They also explain that structured outputs compile schemas into grammars, which can add first-request latency and are cached for 24 hours. (Claude)
This is the clean path when it works.
But wide schemas can get expensive for the provider to enforce. Anthropic documents explicit schema-complexity limits, including 20 strict tools per request, 24 optional parameters total across strict tool schemas and JSON output schemas, and 16 parameters with union types. It also says optional parameters roughly double part of the grammar’s state space. (Claude)
That matters for extraction schemas, because extraction schemas often want many optional fields.
Apartment preferences are a perfect example. A user may mention pets, floor, neighborhood, budget, concessions, move-in date, subway routes, parks, amenities, building age, building size, and more. Most fields are optional because most users do not mention most fields.
That is a natural app schema.
It is also a branchy structured-output schema.
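A rough pre-flight check can show how close a schema is to that kind of limit before sending it to a provider. The sketch below counts properties that are not listed in `required` in a JSON-Schema-like object. It is an estimate under stated assumptions: `JsonSchemaNode` and `countOptionalProperties` are hypothetical helpers, the example schema is invented, and the way a provider actually counts optional parameters may differ from this simple walk.

```typescript
// count-optional-fields.example.ts
// Minimal JSON-Schema-like shape; only the keywords this sketch inspects.
interface JsonSchemaNode {
  type?: string
  properties?: Record<string, JsonSchemaNode>
  required?: string[]
  items?: JsonSchemaNode
}

// Count every property that is not listed in its parent's `required` array,
// recursing into nested objects and array items.
function countOptionalProperties(node: JsonSchemaNode): number {
  let count = 0
  const required = new Set(node.required ?? [])
  for (const [name, child] of Object.entries(node.properties ?? {})) {
    if (!required.has(name)) count += 1
    count += countOptionalProperties(child)
  }
  if (node.items) count += countOptionalProperties(node.items)
  return count
}

// Invented example schema, much smaller than a real extraction schema.
const exampleSchema: JsonSchemaNode = {
  type: 'object',
  required: ['areas'],
  properties: {
    areas: { type: 'array', items: { type: 'string' } },
    priceMax: { type: 'number' },
    requireDoorman: { type: 'boolean' },
    transit: {
      type: 'object',
      required: ['route'],
      properties: {
        route: { type: 'string' },
        maxWalkEquivalentMinutes: { type: 'number' },
      },
    },
  },
}

// The limit quoted above for optional parameters across strict schemas.
const OPTIONAL_PARAMETER_LIMIT = 24
const optionalCount = countOptionalProperties(exampleSchema)
const withinLimit = optionalCount <= OPTIONAL_PARAMETER_LIMIT
```

An extraction schema where nearly every field is optional can exceed a budget like this quickly, which is the practical reason a natural app schema and a provider-friendly schema diverge.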
3. jsonTool
The Vercel AI SDK’s Anthropic provider has a structuredOutputMode option. One mode, jsonTool, returns the object through a special "json" tool call instead of plain text. (AI SDK)
Conceptually, this is like giving the model a form-submission channel.
The prompt says how to fill the form.
The schema says what fields the form allows.
The jsonTool path gives the model a structured way to return the completed form.
This is not magic. It still depends on a schema, and it still needs a good prompt. But for our wide extraction schema, Anthropic’s jsonTool path was the production baseline that worked.
4. Plain JSON text plus local validation
The surprising result from my benchmark was that the best Gemini path was not native structured output.
It was:
Generate plain JSON text.
Parse it locally.
Validate it with Zod.
If parsing or validation fails, ask the model to repair it.
In the AI SDK Google provider, structured outputs are enabled by default, and the docs say they are required for tool calling. The provider also exposes structuredOutputs: false as a workaround for object generation when the schema contains elements Google’s OpenAPI-schema subset does not support. (AI SDK)
That workaround was the key.
Instead of asking Gemini’s native structured-output machinery to accept a large, branchy schema, we asked Gemini to write JSON as plain text and made our own code responsible for enforcement.
A minimal version looks like this:
// gemini-json-extraction.example.ts
import { generateText } from 'ai'

// `model` is assumed to be a Google provider model instance configured elsewhere.
async function extractPreferences(text: string) {
  const first = await generateText({
    model,
    system: `
Extract apartment preferences.
Return ONLY a JSON object.
Use the supported camelCase field names.
Omit uncertain fields instead of inventing values.
Do not use markdown fences.
`,
    prompt: text,
    providerOptions: {
      google: {
        // Skip native structured output; enforcement happens in our own code.
        structuredOutputs: false,
      },
    },
  })
  return parseValidateOrRepair(first.text, text)
}
The boring but effective part is to clean up trivial wrappers locally, then validate, then repair only the real failures:
// repair-loop.example.ts
function stripJsonFences(text: string) {
  return text
    .trim()
    .replace(/^```(?:json)?\s*/i, '')
    .replace(/\s*```$/i, '')
    .trim()
}
async function parseValidateOrRepair(modelText: string, originalText: string) {
  const stripped = stripJsonFences(modelText)
  try {
    const parsed = JSON.parse(stripped)
    const result = preferencesSchema.safeParse(parsed)
    if (result.success) return result.data
    return repairJson({
      originalText,
      previousOutput: modelText,
      error: result.error.message,
    })
  } catch (error) {
    return repairJson({
      originalText,
      previousOutput: modelText,
      error: String(error),
    })
  }
}
This path is less elegant than provider-native structured output.
It is also portable, debuggable, and surprisingly effective.
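The snippet above leaves the repair step itself undefined. One possible shape for it, sketched here under assumptions, is a loop that re-validates each repaired answer and gives up after a fixed number of attempts so a stubborn failure cannot retry forever. `validateWithRepairs`, `RepairOptions`, `askModelToRepair`, and `validateBeds` are hypothetical names, the validator is hand-rolled so the example runs without Zod, and the model call is stubbed.

```typescript
// bounded-repair.example.ts
type Validation<T> = { ok: true; data: T } | { ok: false; error: string }

interface RepairOptions<T> {
  firstOutput: string
  validate: (modelText: string) => Validation<T>
  // Hypothetical hook: sends the previous output and the error back to the model.
  askModelToRepair: (previousOutput: string, error: string) => Promise<string>
  maxRepairs: number
}

// Validate, and on failure ask for a repair, up to maxRepairs times.
async function validateWithRepairs<T>(
  options: RepairOptions<T>,
): Promise<{ data: T; repairsUsed: number }> {
  let output = options.firstOutput
  for (let repairsUsed = 0; ; repairsUsed += 1) {
    const result = options.validate(output)
    if (result.ok) return { data: result.data, repairsUsed }
    if (repairsUsed >= options.maxRepairs) {
      throw new Error(`Still invalid after ${repairsUsed} repair(s): ${result.error}`)
    }
    output = await options.askModelToRepair(output, result.error)
  }
}

// Hand-rolled validator standing in for a schema check.
function validateBeds(modelText: string): Validation<{ bedsMin: number }> {
  try {
    const parsed: unknown = JSON.parse(modelText)
    if (typeof parsed === 'object' && parsed !== null) {
      const bedsMin = (parsed as { bedsMin?: unknown }).bedsMin
      if (typeof bedsMin === 'number') return { ok: true, data: { bedsMin } }
    }
    return { ok: false, error: 'bedsMin must be a number' }
  } catch (error) {
    return { ok: false, error: String(error) }
  }
}

// Usage with a stubbed "model": the first answer is wrong, the repair is valid.
const demo = validateWithRepairs({
  firstOutput: '{"bedsMin": "two"}',
  validate: validateBeds,
  askModelToRepair: async () => '{"bedsMin": 2}',
  maxRepairs: 2,
})
```

Returning `repairsUsed` alongside the data makes it cheap to track first-pass versus final validity, the two columns a benchmark of this path needs.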
Why not build our own jsonTool for Gemini?
At first, the obvious question is:
If Anthropic’s jsonTool works, why not generate a fake Gemini tool from the Zod schema and force the model to call it?
Something like:
// fake-tool.example.ts
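const submitPreferences = tool({
  // The tool name is not set here; in the AI SDK it comes from the key in the
  // tools object, e.g. tools: { submit_preferences: submitPreferences }.
  description: 'Submit the extracted apartment preferences.',
  inputSchema: preferencesSchema,
})
In theory, you can try this.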
But turning a schema into a tool does not make the schema disappear.
The provider still has to accept the tool parameter schema. The model still has to produce arguments that match it. And in the AI SDK Google provider, tool calling still requires structured outputs. (AI SDK)
So a DIY submit_preferences(...) tool might land right back in the same schema-complexity problem.
That does not mean it can never work. It means I would treat it as an experiment, not as the obvious replacement for plain JSON plus validation.
The benchmark has to test the real contract
A toy benchmark is easy:
Input: "I want a 1BR under $4,000"
Expected: { bedsMin: 1, bedsMax: 1, priceMax: 4000 }
That is useful, but it does not test the interesting part.
The interesting part is when the model has to map human language into app semantics.
To make the judgment problem concrete, imagine the app stores transit constraints as walk-equivalent minutes:
// benchmark-fixture.example.ts
const fixture = {
  input:
    'Park Slope or Cobble Hill, around $4,000. Short walk to the F, short bike to the G.',
  expected: {
    exact: {
      areas: ['PARK_SLOPE', 'COBBLE_HILL'],
      priceMax: 4000,
    },
    transit: [
      {
        route: 'F',
        maxWalkEquivalentMinutes: 5,
      },
      {
        route: 'G',
        maxWalkEquivalentMinutes: 15,
      },
    ],
    omitted: ['requireDoorman', 'priceMin'],
  },
}
The user did not say maxWalkEquivalentMinutes: 15.
The model has to infer that “short bike” covers a larger area than “short walk,” then translate that into the schema’s coordinate system.
That is where LLM extraction is actually useful. It is not just copying substrings into fields. It is making judgment calls that would be annoying to encode as brittle regexes.
That bike-versus-walk example is illustrative, not one of the benchmark fixtures. In the real benchmark, the same kind of fuzzy conversion showed up in maxWalkToParkMiles, where “10-minute walk to a park” had to become a numeric radius.
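The time-to-radius conversion is simple arithmetic once a walking speed is assumed. The sketch below is only a worked example: `walkMinutesToMiles` is a hypothetical helper, and the speeds are illustrative assumptions, not values from the benchmark. Speeds of roughly 2.1 to 3.6 mph happen to reproduce the 0.35 to 0.6 mile range used in the scoring example later in this post, though the post does not say the range was derived this way.

```typescript
// walk-radius.example.ts
// distance = speed * time, with time converted from minutes to hours.
function walkMinutesToMiles(minutes: number, speedMph: number): number {
  return (minutes * speedMph) / 60
}

// Assumed walking speeds (illustrative only).
const slowRadius = walkMinutesToMiles(10, 2.1) // about 0.35 miles
const typicalRadius = walkMinutesToMiles(10, 3) // 0.5 miles
const briskRadius = walkMinutesToMiles(10, 3.6) // about 0.6 miles
```

Because reasonable speeds give noticeably different radii, a field like this is better scored against a range than against a single exact value.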
But judgment calls complicate evaluation.
Some fields should be exact. If the user says $4,000, priceMax should probably be 4000.
Some fields should be ranges. If the user says “10-minute walk to a park,” multiple mile cutoffs might be reasonable.
Some fields should merely be present.
Some fields should be omitted, because treating a soft preference as a hard requirement can hide good listings.
A useful benchmark needs all of those buckets.
// scoring-buckets.example.ts
const expected = {
  exact: {
    priceMax: 4000,
    requireOutdoorSpace: true,
  },
  range: {
    maxWalkToParkMiles: {
      min: 0.35,
      max: 0.6,
    },
  },
  set: ['tastePrompt'],
  omitted: ['requireDoorman', 'priceMin'],
}
The goal is not to punish harmless variation.
The goal is to catch outputs that would make the app behave badly.
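A scorer for those four buckets can stay small. The following is a minimal sketch, not the benchmark's actual scoring code: `scoreExtraction`, `Expected`, and `BucketFailure` are hypothetical names, the sample output is invented, and exact matching uses a `JSON.stringify` comparison, which is a simplification that is sensitive to array and key order.

```typescript
// score-buckets.example.ts
interface Expected {
  exact?: Record<string, unknown>
  range?: Record<string, { min: number; max: number }>
  set?: string[]
  omitted?: string[]
}

interface BucketFailure {
  bucket: 'exact' | 'range' | 'set' | 'omitted'
  field: string
  detail: string
}

function scoreExtraction(actual: Record<string, unknown>, expected: Expected): BucketFailure[] {
  const failures: BucketFailure[] = []

  // exact: the value must match (order-sensitive comparison, see lead-in).
  for (const [field, want] of Object.entries(expected.exact ?? {})) {
    if (JSON.stringify(actual[field]) !== JSON.stringify(want)) {
      failures.push({ bucket: 'exact', field, detail: `expected ${JSON.stringify(want)}` })
    }
  }
  // range: the value must be a number inside [min, max].
  for (const [field, { min, max }] of Object.entries(expected.range ?? {})) {
    const value = actual[field]
    if (typeof value !== 'number' || value < min || value > max) {
      failures.push({ bucket: 'range', field, detail: `expected ${min}..${max}` })
    }
  }
  // set: the field only has to be present with some value.
  for (const field of expected.set ?? []) {
    if (actual[field] === undefined || actual[field] === null) {
      failures.push({ bucket: 'set', field, detail: 'expected a value' })
    }
  }
  // omitted: the field must be absent, e.g. a soft preference not promoted to a requirement.
  for (const field of expected.omitted ?? []) {
    if (field in actual) {
      failures.push({ bucket: 'omitted', field, detail: 'expected the field to be absent' })
    }
  }
  return failures
}

// Invented model output: everything is fine except a soft preference turned hard.
const sampleFailures = scoreExtraction(
  {
    priceMax: 4000,
    requireOutdoorSpace: true,
    maxWalkToParkMiles: 0.5,
    tastePrompt: 'quiet, leafy block',
    requireDoorman: true,
  },
  {
    exact: { priceMax: 4000, requireOutdoorSpace: true },
    range: { maxWalkToParkMiles: { min: 0.35, max: 0.6 } },
    set: ['tastePrompt'],
    omitted: ['requireDoorman', 'priceMin'],
  },
)
```

Here the only reported failure is in the `omitted` bucket for `requireDoorman`, which is exactly the kind of shape-valid output that would hide good listings.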
What the benchmark found
For this extraction schema, native Gemini structured JSON was not the path to chase.
The schema was too wide and branchy. The serious Gemini contenders were fallback paths: plain JSON text plus local parse/validation/repair, and YAML plus local parse/validation/repair.
Native Gemini structured output did not make the comparison table below. It failed before generation on this schema with:
The specified schema produces a constraint that has too much branching for serving
Main run shown; the rerun kept the broad ranking intact.
| Path | Model | First-pass | Final | Exact match | Median latency | Mean cost | Read |
|---|---|---|---|---|---|---|---|
| Anthropic jsonTool | Claude Haiku 4.5 | 100% | 100% | 96.8% | 2.6 s | 0.94¢ | stable baseline |
| Plain JSON + Zod + repair | Gemini 2.5 Flash | 100% | 100% | 98.1% | 2.9 s | 0.30¢ | quality control |
| Plain JSON + Zod + repair | Gemini 3.1 Flash-Lite | 100% | 100% | 97.5% | 0.93 s | 0.12¢ | strongest low-cost candidate |
The best practical migration candidate was gemini-3.1-flash-lite with plain JSON text, local Zod validation, and repair retries. It matched Anthropic’s validity, stayed close on exact-match quality, and was much faster and cheaper. The caveat: it introduced a small extra-field rate, so I would pilot it rather than blindly swap it in.
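For reference, "much faster and cheaper" can be read straight off the table. The sketch below just redoes that arithmetic with the Haiku and Flash-Lite rows; `Row`, `costReductionPercent`, and `speedup` are hypothetical helper names, and the inputs are the median latency and mean cost values shown above.

```typescript
// table-deltas.example.ts
interface Row {
  medianLatencySeconds: number
  meanCostCents: number
}

// Values copied from the comparison table.
const haikuRow: Row = { medianLatencySeconds: 2.6, meanCostCents: 0.94 }
const flashLiteRow: Row = { medianLatencySeconds: 0.93, meanCostCents: 0.12 }

// Percentage by which the candidate's mean cost is lower than the baseline's.
function costReductionPercent(baseline: Row, candidate: Row): number {
  return (1 - candidate.meanCostCents / baseline.meanCostCents) * 100
}

// How many times lower the candidate's median latency is.
function speedup(baseline: Row, candidate: Row): number {
  return baseline.medianLatencySeconds / candidate.medianLatencySeconds
}

const costReduction = costReductionPercent(haikuRow, flashLiteRow) // roughly 87
const latencySpeedup = speedup(haikuRow, flashLiteRow) // roughly 2.8
```

On these numbers the mean cost is roughly 87% lower and the median latency roughly 2.8 times lower, so the headline figures of 80% and 2× are conservative roundings of this run.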
I would not overgeneralize that.
This does not prove Gemini should replace Anthropic everywhere.
It does not prove plain JSON beats structured output in general.
It says something narrower and more useful:
For this wide, partial extraction schema, the less elegant path was the more practical Gemini path.
YAML was worth testing, but it was not compelling as the default. It sometimes matched JSON, but it needed repairs more often and did not clearly win.
The most useful failure was semantic
The benchmark also found a repeated issue with buildingSize.
That field was easy for models to mishandle. Sometimes the risky failure was not omitting the field. The risky failure was over-narrowing it.
For example, if the user does not specify a building size, downstream matching might treat omission or “all sizes allowed” as harmless. But if the extractor invents only ["luxury"], it can exclude otherwise valid listings.
That is not a JSON problem.
That is not even primarily a provider problem.
It is an extraction-behavior problem.
This is the kind of bug a good benchmark should catch: a valid object that would make the product worse.
What I’d do next time
I came away with a few practical rules.
First, separate shape from meaning.
The schema says what can be returned. The prompt says how to decide what should be returned.
Second, use provider-native structured output when it fits.
It is clean, and when the schema is modest, it can remove a lot of parsing and retry code.
Third, expect wide optional schemas to hurt.
Extraction schemas often look like flexible TypeScript interfaces. Structured-output engines often prefer stricter API contracts.
Fourth, do not dismiss plain JSON plus local validation.
It feels like the boring fallback. It may also be the most portable output path.
Fifth, benchmark the object your app actually needs.
Do not just test whether the model returns valid JSON. Test exact fields, fuzzy fields, omitted fields, retries, latency, cost, and the downstream consequences of being wrong.
The lesson here was not that plain JSON is always better than structured output.
It was narrower: for a wide optional extraction schema, Gemini’s cleaner structured-output path was not the practical one. The boring path was. Ask for JSON. Clean up trivial wrappers locally. Validate with Zod. Repair real failures. Benchmark the object your app actually needs.
The real floor is valid JSON. The real goal is a structured object that makes the app behave correctly.