AI Guardrails
Guardrails are the checks that keep an AI system inside safe, intended behavior — filtering inputs, constraining what it can do, and validating outputs before they reach a user. They run outside the model, so they hold even when the model is wrong or manipulated.
Guardrails are the boundary around an AI system that catches bad behavior the model itself might produce. They operate at three points: on the input (blocking prompt injection and unsafe or out-of-scope requests), on the actions (scoping permissions and requiring approval for high-stakes tool calls), and on the output (filtering unsafe, off-policy, or ungrounded responses before they ship).
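As a rough illustration of the input and output checkpoints, here is a minimal Python sketch that wraps a model call with a check on each side. Everything named here (GuardResult, check_input, check_output, guarded_call, and the two regex lists) is hypothetical and chosen for the example; pattern matching stands in for whatever classifiers or policy checks a real system would use, and is not sufficient on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns, for illustration only; real checks need more than regexes.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
BLOCKED_OUTPUT_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # shaped like a US SSN


@dataclass
class GuardResult:
    allowed: bool
    text: str
    reason: str = ""


def check_input(prompt: str) -> GuardResult:
    """Input guardrail: reject prompts that match a known injection pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            return GuardResult(False, "", "input: possible prompt injection")
    return GuardResult(True, prompt)


def check_output(response: str) -> GuardResult:
    """Output guardrail: withhold responses that match a blocked pattern."""
    for pattern in BLOCKED_OUTPUT_PATTERNS:
        if pattern.search(response):
            return GuardResult(False, "", "output: blocked pattern")
    return GuardResult(True, response)


def guarded_call(model: Callable[[str], str], prompt: str) -> GuardResult:
    """Run the model only if the input passes, and release only checked output."""
    checked = check_input(prompt)
    if not checked.allowed:
        return checked  # the model is never called
    return check_output(model(checked.text))
```

Both checks run in ordinary code around the model, so a blocked input never reaches it and a blocked response never reaches the user, whatever the model itself decided.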
The key property is that real guardrails live outside the model, as validation code, permission systems, and approval gates, rather than as instructions in the prompt. A rule written into the prompt is a request the model can be talked out of; a guardrail enforced in code holds even when the model is wrong or an attacker has manipulated it. Guardrails are the practical mechanism for deploying autonomy you don’t fully trust: the goal isn’t a less capable agent but a small blast radius when it fails.
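To make the contrast with a prompt rule concrete, the following Python sketch shows one way an action guardrail might be enforced in code: a permission allowlist plus an approval gate in front of tool execution. The tool names, the ToolDenied exception, and the approve callback are assumptions made up for this example, not part of any particular framework.

```python
from typing import Any, Callable, Dict


class ToolDenied(Exception):
    """Raised when a tool call falls outside what the agent is permitted to do."""


# Hypothetical policy, for illustration only.
ALLOWED_TOOLS = {"search_docs", "send_refund"}
NEEDS_APPROVAL = {"send_refund"}  # high-stakes calls


def execute_tool_call(
    name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a tool call requested by the model, subject to policy held in code."""
    # Permission check: the model's reasoning or stated intent is never consulted.
    if name not in ALLOWED_TOOLS or name not in tools:
        raise ToolDenied(f"tool not permitted: {name}")
    # Approval gate: high-stakes calls need an explicit yes from outside the model.
    if name in NEEDS_APPROVAL and not approve(name, args):
        raise ToolDenied(f"approval refused: {name}")
    return tools[name](**args)
```

However persuasive the text that led the model to request a call, a tool outside ALLOWED_TOOLS raises ToolDenied, and a refund goes through only if the approve callback (for example, a human review step) returns True. The agent keeps its capabilities; the policy only bounds what a failure can reach.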