

Structured outputs from LLMs: tool use, JSON mode, schemas

Three ways to make a model emit valid JSON, when each one wins, and the failure modes that surprise you in production.

Half the production LLM apps in 2026 need the model to emit JSON of a specific shape, not free-form text. Extracting fields from a resume, classifying customer intent into one of 12 categories, generating an action plan with named steps. Three years ago you had to do this with regex on free text and pray. Today you have three real options. Each has tradeoffs.

The three approaches

  1. JSON mode. Tell the model "output valid JSON" and let it figure out the shape. Cheap, low overhead, no schema enforcement.
  2. Tool use / function calling. Define a function with a JSON schema for its arguments. The model is forced to emit args matching that schema (with strict mode), even though the function is fictional.
  3. Constrained decoding / strict structured outputs. The provider enforces schema compliance during sampling — invalid tokens are masked out, so the model literally cannot emit invalid JSON.

In 2026, all three are available across major providers, but with different feature names and restrictions.

JSON mode: the simplest tool

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You output a JSON object with fields name, email, role."},
        {"role": "user", "content": "..."},
    ],
)

The model emits some JSON object. You parse it. If you described the schema clearly in the system prompt, it'll usually match.

Strengths: Cheapest. Universally supported. Easy to set up.

Weaknesses:

  • Schema enforcement is best-effort. The model might miss a field or add extras.
  • No type guarantees. "age" might come back as "forty" instead of 40.
  • You still have to write Pydantic / zod validation downstream.

Use JSON mode when the schema is simple, the cost matters, and you can validate downstream.

Tool use: schema-enforced JSON

Define a tool that's not actually a tool — it's just a way to constrain output:

tools = [{
    "type": "function",
    "function": {
        "name": "submit_extraction",
        "description": "Submit the extracted fields.",
        "strict": True,  # Enable strict schema enforcement
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string", "format": "email"},
                "age": {"type": "integer", "minimum": 0, "maximum": 120},
                "role": {"type": "string", "enum": ["engineer", "designer", "manager", "other"]}
            },
            "required": ["name", "email", "role"],
            "additionalProperties": False
        }
    }
}]

import json

response = client.chat.completions.create(
    model="gpt-5",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "submit_extraction"}},
    messages=[...],
)
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

Strengths:

  • With strict: True, OpenAI guarantees the output matches the schema. No invalid JSON, no missing fields, no wrong types. Anthropic and Gemini have analogous strict modes.
  • Enums constrain to specific values. role can never come back as "chief vibemaster."
  • The schema is part of the model's input — better adherence than "system prompt says output X."

Weaknesses:

  • Strict mode adds latency (the provider compiles the schema into a constrained-decoding state machine; the first request with a new schema pays the compilation cost, and token masking adds overhead on every request). Usually 10-20% slower.
  • Strict mode has subtle restrictions: no oneOf at the root, all fields must be required, recursive schemas are limited. Read the provider docs.
  • Forces a tool-call response shape; you have to dig the args out.

Tool use is the right default for production extraction in 2026.

Constrained decoding (strict structured outputs)

OpenAI's response_format={"type": "json_schema", "strict": true, ...} and its equivalents in Anthropic and Gemini are the cleaner version of tool-use-as-extraction:

response = client.chat.completions.create(
    model="gpt-5",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string", "enum": ["engineer", "designer", "manager"]}
                },
                "required": ["name", "role"],
                "additionalProperties": False
            }
        }
    },
    messages=[...],
)
result = json.loads(response.choices[0].message.content)

Same guarantees as tool use with strict, but the response is regular content (not a tool call). Cleaner if you don't conceptually need a "function" wrapper.

For self-hosted (vLLM, llama.cpp), this is implemented via libraries like Outlines, llguidance, or XGrammar that mask invalid tokens during sampling. Same guarantee, no API call to a provider.

What happens when strict is on

The model literally cannot emit invalid JSON. At every token step, the decoder checks which of the vocabulary's tokens (often 100,000 or more of them) would keep the output valid given the schema and the tokens emitted so far, and only allows sampling from those.

This means:

  • You will never get "age": "forty" when the schema says integer.
  • You will never get a missing required field.
  • You will never get a JSON syntax error.
  • You might still get semantically wrong data — "age": 999 is valid integer, just wrong.

Validation is structural, not factual. You still need to sanity-check ranges, business rules, and content quality.
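The masking loop can be sketched in miniature. This toy is entirely illustrative (real engines like Outlines and XGrammar compile the schema into a finite-state machine over the tokenizer's vocabulary); it hard-codes a "grammar" for the single-key object {"name": <string>}:

```python
# A tiny vocabulary where each "token" is a ready-made JSON fragment.
VOCAB = ['{', '}', '"name"', ':', '"Ada"', '42', 'banana']

def allowed_next(emitted: list[str]) -> list[str]:
    # Hand-rolled grammar for {"name": <string>}.
    if not emitted:
        return ['{']
    if emitted == ['{']:
        return ['"name"']
    if emitted == ['{', '"name"']:
        return [':']
    if emitted == ['{', '"name"', ':']:
        return ['"Ada"']  # schema says string: 42 and banana are masked out
    return ['}']

def constrained_decode(model_scores) -> list[str]:
    emitted = []
    while not (emitted and emitted[-1] == '}'):
        mask = allowed_next(emitted)
        # Pick the model's highest-scoring token among the allowed ones.
        emitted.append(max(mask, key=model_scores))
    return emitted

# Even a model that loves "banana" cannot emit it here.
tokens = constrained_decode(lambda t: {'banana': 9.0}.get(t, 1.0))
```

The per-step masking is the whole trick: the model's preferences only break ties among tokens the grammar already allows.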

The five surprises in production

  1. Strict mode fails on complex nested schemas. Some schemas the provider can't compile (deeply nested, recursive, with too many enum values). You'll see a 400 error. Either simplify or fall back to tool use without strict.
  2. Latency overhead is real. A simple extraction goes from 200ms to 280ms with strict on. For batch extraction, this matters.
  3. The model sometimes refuses. "I cannot determine X from the input" — it has to fit the schema, but if the schema requires a field the model can't infer, it might emit a placeholder or the call might fail. Add an "unknown" enum value or make optional fields actually optional.
  4. Schema drift between providers. OpenAI, Anthropic, Gemini all have slightly different schema dialects. A schema that works for OpenAI might fail for Anthropic. If you swap providers, test.
  5. Cost of long enums. A role field with 200 enum values inflates the schema and adds latency. If you have many options, consider a free-form string + downstream validation, or split into category + subcategory.
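For surprise 5, splitting one flat enum into category + subcategory keeps each enum short. A sketch, with a made-up taxonomy and helper names:

```python
# Illustrative taxonomy; real ones come from your product's domain.
TAXONOMY = {
    "engineering": ["backend", "frontend", "ml"],
    "design": ["product", "brand"],
}

def role_schema(taxonomy: dict) -> dict:
    # Two short enums instead of one huge flat one keeps the schema small.
    return {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": sorted(taxonomy)},
            "subcategory": {
                "type": "string",
                "enum": sorted({s for subs in taxonomy.values() for s in subs}),
            },
        },
        "required": ["category", "subcategory"],
        "additionalProperties": False,
    }

def is_consistent(result: dict, taxonomy: dict) -> bool:
    # The schema can't express "subcategory belongs to category";
    # enforce that pairing downstream.
    return result.get("subcategory") in taxonomy.get(result.get("category"), [])
```

The cross-field check still lives in your code, because JSON Schema enums can't express the category/subcategory pairing.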

Picking the right approach

  • One-off prototype, schema simple, you'll validate downstream: JSON mode. Fastest to ship.
  • Production extraction, must not break: Tool use with strict, or response_format json_schema with strict. The schema is your contract.
  • Schema is huge or complex (many enums, nested, recursive): Tool use without strict, validate downstream with Pydantic / zod. You'll get 99% adherence and avoid provider compilation errors.
  • Self-hosted, want guarantees: Outlines or XGrammar with vLLM. Strict at zero per-call cost.
  • Hierarchical task (plan + steps + per-step args): Tool use, where each "tool" represents a step type. The model emits a sequence of tool calls.

Library helpers

The popular wrappers in 2026:

  • Pydantic + Instructor (Python). instructor.from_openai(client) gives you client.chat.completions.create(response_model=MyModel, ...) and a typed Pydantic object back. Hides the strict-mode plumbing.
  • zod-to-json-schema (TypeScript). Convert a zod schema to JSON schema, pass to the API, parse the response back through zod. Type-safe both ways.
  • Outlines (Python). For local models or when you want full control over the constraining algorithm.

For most teams: Instructor (Python) or zod (TypeScript) on top of the strict API. Don't write your own validation layer unless you're doing something exotic.
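The Instructor flow from the first bullet looks roughly like this (a sketch following Instructor's documented pattern; exact keyword arguments can vary across versions, and `Person`/`extract_person` are illustrative names):

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    role: str

def extract_person(client, text: str) -> Person:
    # `client` is an Instructor-patched OpenAI client:
    #   import instructor; client = instructor.from_openai(OpenAI())
    # Instructor injects the schema, validates the response against
    # Person, and retries on validation failure.
    return client.chat.completions.create(
        model="gpt-5",
        response_model=Person,
        messages=[{"role": "user", "content": text}],
    )
```

The payoff is the return type: you get a validated Pydantic object, not a string you still have to parse and check.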

When NOT to use structured outputs

Structured outputs are great for extraction, classification, plan generation. They're bad for:

  • Long-form natural-language outputs. Don't shoehorn an essay into a {"text": "..."} schema; just use plain text.
  • Creative writing. Schema constraints reduce creativity; the effect is small but measurable. If prose quality matters, skip them.
  • Streaming user-visible UI. Strict structured outputs stream less smoothly. Pure text is friendlier for chat UI.

Further reading

  • OpenAI structured outputs docs.
  • Anthropic tool use docs.
  • Outlines library (open-source structured generation).
  • Efficient Guided Generation for Large Language Models (Willard & Louf, 2023).
  • Look up: JSON Schema, Pydantic, Instructor library, XGrammar.

Last updated: 2026-04-29
