Code or models that sit around an LLM to filter inputs/outputs, block unsafe content, enforce schemas, or stop the model from doing things it shouldn't.
Guardrails are the safety belt around an LLM. They sit between the user and the model, or between the model and downstream systems, checking content against explicit rules: blocking PII leaks, refusing harmful topics, enforcing JSON output structure, redacting secrets, or vetoing agent actions that violate policy.
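For intuition, here is a minimal Python sketch of an output guardrail along those lines. The schema keys, the regexes, and the function name are illustrative assumptions, not any particular library's API.

```python
import json
import re

# Naive patterns for content that should never reach the user (illustrative only).
CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

REQUIRED_KEYS = {"answer", "confidence"}  # assumed response schema


def guard_output(raw_reply: str) -> dict:
    """Validate structure and redact PII before the reply leaves the system."""
    # 1. Enforce JSON output structure: reject anything that doesn't parse
    #    or is missing required keys, rather than passing it downstream.
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        raise ValueError("model reply is not valid JSON")
    if not isinstance(reply, dict) or REQUIRED_KEYS - reply.keys():
        raise ValueError("model reply does not match the expected schema")

    # 2. Redact PII-looking strings in the answer text.
    text = reply["answer"]
    text = CREDIT_CARD.sub("[REDACTED CARD]", text)
    text = EMAIL.sub("[REDACTED EMAIL]", text)
    reply["answer"] = text
    return reply
```

The point of the sketch is that the check runs in ordinary code, after the model has answered, so it holds regardless of what the prompt said.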
They matter because LLMs are stochastic and you can't trust them to follow rules from prompt text alone. A determined user can always find a way to get the model to say something unwanted; guardrails are explicit checks, enforced outside the prompt, that catch what slips through. They're also where regulators and enterprises focus: "prompt-only safety" doesn't satisfy compliance teams.
A concrete example: a customer-support chatbot. Input guardrail: detect prompt injection attempts and refuse to process them. Output guardrail: scan the model's reply for credit card numbers, addresses, or off-policy promises ("I'll give you a refund") before it reaches the user. Action guardrail: if the agent calls a refund tool, require human approval above a dollar threshold.
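A rough Python sketch of how those three layers might look in code. Every name, pattern, and threshold here (the injection markers, the `issue_refund` tool name, the $100 limit) is a hypothetical stand-in, not a real policy or API.

```python
import re

# Input guardrail: crude markers of prompt-injection attempts (assumed patterns).
INJECTION_MARKERS = re.compile(
    r"ignore (all|previous) instructions|system prompt", re.IGNORECASE
)
# Output guardrail: card numbers and off-policy refund promises.
CARD_NUMBER = re.compile(r"\b(?:\d[ -]?){13,16}\b")
OFF_POLICY_PROMISE = re.compile(
    r"\bI('ll| will) (give|issue) you a refund\b", re.IGNORECASE
)

REFUND_APPROVAL_THRESHOLD = 100.00  # dollars; assumed policy limit


def check_input(user_message: str) -> None:
    """Input guardrail: refuse obvious prompt-injection attempts."""
    if INJECTION_MARKERS.search(user_message):
        raise PermissionError("possible prompt injection; refusing to process")


def check_output(model_reply: str) -> str:
    """Output guardrail: strip card numbers and block off-policy promises."""
    reply = CARD_NUMBER.sub("[REDACTED]", model_reply)
    if OFF_POLICY_PROMISE.search(reply):
        raise ValueError("reply makes an unauthorized refund promise")
    return reply


def check_action(tool_name: str, amount: float) -> bool:
    """Action guardrail: refunds above the threshold need human approval."""
    if tool_name == "issue_refund" and amount > REFUND_APPROVAL_THRESHOLD:
        return False  # hold the action for a human reviewer
    return True  # safe to execute automatically
```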
Popular tools: Guardrails AI, NVIDIA NeMo Guardrails, Llama Guard, OpenAI's Moderation API, and Anthropic's safety classifiers. Self-hosted options like Llama Guard 3 are common for builders who don't want a third-party dependency. Related: alignment, content moderation, prompt injection, agent security.