Structured outputs vs JSON mode: what actually holds up in production
If your LLM isn't a chat assistant but a link in a pipeline, you need a guarantee that the output parses. Invalid JSON in production isn't a "model error," it's engineering negligence: there are three ways to solve this, and two of them give you 99.9%+.
Here's the difference between them, where each works, and the rake each one comes with.
Method 1: "ask nicely" (JSON prompting)
The most popular and most fragile. In the system prompt: "Return JSON strictly in this format: { ... }". Sometimes add "don't write anything except JSON."
Works 95% of the time. In the other 5% the model:
- Wraps the answer in ```json ... ``` fences
- Adds "Here's your answer:" before the JSON
- Puts a trailing comma
- Adds comments in the JSON (// this is the name)
- Gets cut off mid-string on long outputs due to max_tokens
In production 5% is a catastrophe. If you run 1,000 requests a day, 50 break silently, users see "something went wrong," and you don't know why.
When acceptable: prototypes, internal tools, tasks where retries are harmless.
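To see why 95% is the ceiling, here is what a defensive parser for this method ends up lookingking like. A sketch with a hypothetical helper (`tryParseLooseJson` is not part of any SDK), and note that it still can't save truncated output, trailing commas, or comments:

```typescript
// Hypothetical defensive parser for "ask nicely" outputs: strips markdown
// fences and any preamble before the first { or [. Still fragile by design.
function tryParseLooseJson(raw: string): unknown {
  const fence = "`".repeat(3);                        // ```
  const text = raw
    .split(fence + "json").join(fence)                // drop the json language tag
    .split(fence).join("")                            // drop the fences themselves
    .trim();
  const start = text.search(/[{[]/);                  // skip "Here's your answer:"
  if (start === -1) throw new Error("no JSON found in model output");
  // JSON.parse still throws on trailing commas, // comments, and truncation
  return JSON.parse(text.slice(start));
}

// A typical 5%-case output this recovers:
const fence = "`".repeat(3);
const sample = `Here is your answer:\n${fence}json\n{"name": "Ada"}\n${fence}`;
const contact = tryParseLooseJson(sample) as { name: string };
```

Every edge case you patch here is a case tool use handles for free, which is the real argument for Method 2.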
Method 2: tool use (function calling)
The real solution. Describe the desired structure as a tool schema; the model calls the tool; you get the arguments already as JSON.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [{
    name: "extract_contact",
    description: "Extract contact info from text",
    input_schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string", format: "email" },
        phone: { type: "string" }
      },
      required: ["name"]
    }
  }],
  tool_choice: { type: "tool", name: "extract_contact" },
  messages: [{ role: "user", content: userText }]
});
const args = resp.content.find(b => b.type === "tool_use").input;
// args is already a valid object with the right types
Key points:
- tool_choice: { type: "tool", name: "..." } forces the model to call this specific tool. Without it the model may decide not to call a tool and answer in plain text.
- The schema is validated at the API level, so invalid JSON simply won't come back.
- Works reliably on all Claude 4+ models.
In my production code, tool use hits 99.95%+ across thousands of requests. The remaining 0.05% are network timeouts, not format errors.
When to pick: always, when you need structured extraction or classification.
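Even with API-level schema enforcement, a semantic check on top is cheap. A minimal sketch using a plain type guard (a Zod schema would do the same job); the `Contact` shape mirrors the `extract_contact` schema above:

```typescript
// Hypothetical Contact type matching the extract_contact tool schema.
interface Contact { name: string; email?: string; phone?: string; }

// The API guarantees types, not semantics: it will happily return
// email: "not-an-email". A minimal post-parse guard:
function isValidContact(input: unknown): input is Contact {
  if (typeof input !== "object" || input === null) return false;
  const c = input as Record<string, unknown>;
  if (typeof c.name !== "string" || c.name.length === 0) return false;
  if (c.email !== undefined &&
      (typeof c.email !== "string" || !/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(c.email)))
    return false;
  if (c.phone !== undefined && typeof c.phone !== "string") return false;
  return true;
}
```

Usage: `if (!isValidContact(args)) { /* retry or reject */ }` right after extracting `args` from the tool_use block.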
Method 3: strict JSON schema (prefill + validation)
A hybrid for cases where tool use is awkward (e.g. you want a large array of objects).
The technique: prefill the response with { as the last assistant message, forcing the model to continue from that character.
const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [
    { role: "user", content: userPrompt },
    { role: "assistant", content: "{" }  // prefill: the model continues from here
  ]
});

const jsonText = "{" + resp.content[0].text;
const parsed = JSON.parse(jsonText);
Plus: guaranteed to start with {, no "Here's your JSON:" preamble.
Minus: no guarantee of valid inner structure. You need Zod / Pydantic validation on top.
Reliability ~99.7% with retry-on-parse-error. Slower than tool use and takes more code.
When to pick: when the tool schema can't express your shape (dynamic keys, very nested structures), or when you need to stream JSON.
The rakes that catch everyone
max_tokens. The most common cause of "broken JSON." The model honestly returns a valid prefix, then hits the limit and gets cut off. Fix: set max_tokens at 2x your expected size and log stop_reason. If stop_reason === "max_tokens" you didn't get the full answer, even if it parsed.
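That stop_reason check is worth making a hard failure rather than a log line. A minimal sketch of a guard, assuming the SDK's response shape (stop_reason is "end_turn", "max_tokens", "tool_use", etc.):

```typescript
// Completeness guard: a truncated response may still be a syntactically
// valid JSON prefix, so treat truncation as a hard error regardless.
function assertComplete(resp: { stop_reason: string | null }): void {
  if (resp.stop_reason === "max_tokens") {
    throw new Error("response truncated by max_tokens; raise the limit and retry");
  }
}
```

Call it before JSON.parse, not after: parsing first hides the truncation behind a confusing syntax error, or worse, behind no error at all.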
Empty strings where null should be. LLMs often put "" instead of null or vice versa. Schema fix: { type: ["string", "null"] } or strict validation with coercion.
Numbers as strings. "age": "30" instead of "age": 30. Tool use catches this via schema, JSON prompting does not.
Unicode and newlines. In content fields the model may include \n — valid JSON but your frontend may not handle it. Normalize server-side.
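The last three rakes can be handled in one normalization pass. A hypothetical sketch; note that blindly coercing numeric-looking strings is only safe when your schema says the field is numeric, so a real version would be schema-driven:

```typescript
// Illustrative normalizer for the rakes above: "" vs null, numbers
// as strings, and raw newlines in content fields.
function normalizeRecord(rec: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(rec)) {
    if (value === "") {
      out[key] = null;                                  // "" -> null
    } else if (typeof value === "string" && /^-?\d+(\.\d+)?$/.test(value)) {
      out[key] = Number(value);                         // "30" -> 30
    } else if (typeof value === "string") {
      out[key] = value.replace(/\r?\n/g, " ").trim();   // flatten newlines
    } else {
      out[key] = value;
    }
  }
  return out;
}
```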
Production strategy
- Default to tool use. 95% of tasks map cleanly here.
- Validate after parsing. Even tool use doesn't guarantee that email is actually an email. Run it through Zod/Pydantic/Joi.
- Retry on invalid. If parsing fails, retry with a message to the model: "Previous response failed validation: <error>. Try again." In 90% of cases the second try lands.
- Track the invalid-response rate. If you see > 0.5% parse errors, something is wrong with the prompt or schema.
- Never parse free text with regex. That's always an architectural smell.
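The validate-and-retry bullets above wire together into one loop. A minimal sketch, with `callModel` and `validate` as hypothetical injectable stand-ins for your actual LLM call and your Zod/Joi check:

```typescript
// Parse -> validate -> retry-with-feedback loop. callModel receives the
// feedback string to append to the conversation on the second attempt.
async function withRetry<T>(
  callModel: (feedback?: string) => Promise<string>,
  validate: (parsed: unknown) => string | null,  // null means valid
  maxAttempts = 2
): Promise<T> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(feedback);
    try {
      const parsed = JSON.parse(raw);
      const error = validate(parsed);
      if (error === null) return parsed as T;
      feedback = `Previous response failed validation: ${error}. Try again.`;
    } catch (e) {
      feedback = `Previous response was not valid JSON: ${(e as Error).message}. Try again.`;
    }
  }
  throw new Error("model output failed validation after retries");
}
```

The injectable `callModel` also makes the loop trivially unit-testable without burning tokens.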
What to do today
Find every place in your code where you call JSON.parse(response.text) or response.json() on LLM output without validation. Each one is a landmine. Migrate to tool use with Zod validation, or at least add validation with retry. A day's work, and no more on-call nights.
Need help with the LLM pipeline architecture for your case? Get in touch. Part of the "AI infrastructure audit" package.