Before any agent touches a client's customers or live data, we run it through a four-gate AI agent evaluation process. Pass all four gates and it ships. Fail any one and it goes back to the build, not to production.
This matters because most agent evaluation frameworks you'll find online are written for ML researchers. They measure benchmark accuracy, MMLU scores, and perplexity. None of that tells you whether an agent will quietly send a draft email to the wrong contact at 2 AM, or bill a customer twice because an API timeout left a transaction in limbo. Different question, different rubric.
Why AI Agent Evaluation for SMBs Requires a Different Rubric
Academic benchmarks measure general capability. What an SMB needs to know is narrower and more consequential: does this specific agent fail safely when something unexpected happens, does it stay within the permissions we gave it, can we read a log and explain what it did and why, and if it breaks, what does that cost?
Those four questions map directly to our four gates.
The stakes are also asymmetric in ways that matter for smaller organizations. A Fortune 500 company absorbs a bad agent run as a footnote in a quarterly review. For a 12-person professional services firm, one agent that emails the wrong segment, deletes the wrong records, or locks up the CRM for three hours is a client-relationship event. The evaluation process has to reflect that asymmetry.
Gate One: Does It Fail Gracefully on Edge Cases?
The first gate is adversarial testing, not happy-path testing. We deliberately feed the agent inputs it was not designed for: malformed data, out-of-scope requests, missing required fields, and inputs that could be interpreted multiple ways.
Acceptable failure looks like a clean stop, a logged error, and a handoff to a human. Unacceptable failure looks like hallucinated output presented as fact, partial execution with no indication that something went wrong, or silent data corruption.
Concretely: an agent that processes inbound lead forms and encounters a blank required phone field should stop and flag the record, not fill in a plausible-looking number and keep going. That distinction sounds obvious until you've watched a production agent invent phone numbers. We have.
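A minimal sketch of that stop-and-flag behavior, assuming a hypothetical lead schema with three required fields. The field names, the `IntakeResult` type, and the `needs_human` status are illustrative only, not taken from any real build.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the required fields are an assumption for illustration.
REQUIRED_FIELDS = ("name", "email", "phone")


@dataclass
class IntakeResult:
    status: str                      # "processed" or "needs_human"
    record: dict                     # the submitted form, unmodified
    missing: list = field(default_factory=list)


def process_lead(form: dict) -> IntakeResult:
    """Stop and flag a lead when a required field is blank; never invent a value."""
    missing = [
        name for name in REQUIRED_FIELDS
        if not str(form.get(name) or "").strip()
    ]
    if missing:
        # Clean stop: the record is returned untouched and marked for human review.
        return IntakeResult(status="needs_human", record=dict(form), missing=missing)
    return IntakeResult(status="processed", record=dict(form))


flagged = process_lead({"name": "Ada", "email": "ada@example.com", "phone": "  "})
```

Here `flagged.status` is `"needs_human"` and `flagged.missing` names the blank field, so the handoff carries the reason with it. The record itself is never altered on the failure path.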
Gate Two: Does It Stay Within Its Permission Scope?
Every agent we build gets a defined permission boundary at spec time. Read-only access to certain tables. Write access to a specific object type. No access to financial records. No external API calls outside an approved list.
Gate two verifies that the agent actually respects those boundaries under pressure, not just on the happy path. We probe the edges: what happens if the agent's task could be completed faster by accessing a resource it wasn't granted? Does it error cleanly, or does it find a way around the constraint?
We also check for scope creep in the prompt layer. An agent operating on a single-tenant workflow that has somehow been given a system prompt broad enough to query company-wide data is a gate-two failure even if the specific test case works fine. The permission set should match the narrowest possible grant that allows the task, and nothing broader.
This is where the draft-only pattern we default to in email agents earns its keep. An agent that can only create drafts, never send, has a much smaller blast radius than one with send permissions. Gate two quantifies that blast radius explicitly before anything ships.
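One way to picture a narrow grant is a deny-by-default allowlist of (resource, action) pairs. The sketch below is an assumption about how such a check could look, not a description of a specific permission system. `PermissionScope`, `ScopeViolation`, and the resource names are hypothetical.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when an agent attempts an action outside its grant."""


@dataclass(frozen=True)
class PermissionScope:
    grants: frozenset  # set of (resource, action) pairs fixed at spec time

    def check(self, resource: str, action: str) -> None:
        # Deny by default: anything not explicitly granted is an error.
        if (resource, action) not in self.grants:
            raise ScopeViolation(f"{action} on {resource} is outside the granted scope")


# Draft-only email agent: it can read contacts and create drafts, nothing else.
EMAIL_AGENT_SCOPE = PermissionScope(grants=frozenset({
    ("contacts", "read"),
    ("email", "create_draft"),
}))


def attempt(scope: PermissionScope, resource: str, action: str) -> str:
    """Probe the boundary and report whether the action errors cleanly."""
    try:
        scope.check(resource, action)
        return "allowed"
    except ScopeViolation as exc:
        return f"denied: {exc}"
```

Probing with `attempt(EMAIL_AGENT_SCOPE, "email", "send")` or `attempt(EMAIL_AGENT_SCOPE, "invoices", "read")` returns a denial rather than a workaround, which is the clean error the gate looks for.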
Gate Three: Does It Produce Auditable Output?
This is the gate where agents that performed perfectly in the demo most often fail.
Auditability means: given the agent's output, a human reviewer can reconstruct what decision was made, what data triggered it, and what action was taken, without reverse-engineering anything. The log exists, it's structured, and it's readable by someone who wasn't in the build.
Why does this matter? Because when something goes wrong in production (and eventually something will), the path to fixing it runs through the audit log. If the log is incomplete, ambiguous, or buried in a format only the original engineer can parse, you've created a support burden that scales directly with the error rate.
The practical test: hand the output log from five agent runs to someone who wasn't involved in building it, give them fifteen minutes, and ask them to explain what the agent did and whether it did it correctly. If they can't, the agent fails gate three.
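As an illustration of what a reviewer-friendly log entry might contain, here is a sketch that records the decision, the data that triggered it, the action taken, and a plain-language reason as one JSON object per line. The field names and example values are assumptions chosen for readability, not a prescribed schema.

```python
import json
from datetime import datetime, timezone


def audit_record(run_id: str, decision: str, inputs: dict, action: str, reason: str) -> dict:
    """Build one structured log entry for a single agent decision."""
    return {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,  # what was decided
        "inputs": inputs,      # the data that triggered the decision
        "action": action,      # what the agent actually did
        "reason": reason,      # explanation a non-builder can read
    }


def to_log_line(record: dict) -> str:
    # One JSON object per line, so the log can be searched or loaded without custom parsing.
    return json.dumps(record, sort_keys=True)


# Hypothetical routing decision, for illustration only.
entry = audit_record(
    run_id="run-001",
    decision="route_to:tax",
    inputs={"service_description": "quarterly filing", "matched_rule": "tax_keywords"},
    action="set routing_field to 'tax'",
    reason="Service description matched the tax keyword rule.",
)
line = to_log_line(entry)
```

A reviewer reading `line` can see both the category and the input that drove it, without re-running the submission through the agent.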
Gate Four: What Does a Failure Actually Cost?
The final gate is not a technical test. It's a financial and reputational risk model run before any agent goes live.
We ask three questions: What is the worst realistic outcome if this agent fails? What is the realistic frequency of that failure given our edge case coverage? And is there a human checkpoint between agent action and external impact?
An agent that drafts internal reports has a near-zero blast radius even if it occasionally produces nonsense. An agent that initiates outbound client communications or processes financial transactions has a blast radius that warrants a mandatory human-review step between agent output and external action, at least for the first several weeks of production deployment.
This gate sometimes kills agents that passed the first three. If the failure cost is too high and the human-review overhead eliminates the efficiency gain, we don't ship. We either redesign the permission boundary to reduce the blast radius, or we recommend keeping the task human.
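The three questions can be turned into rough arithmetic. The expected-value formula below is an assumption made for illustration: the text describes the questions, not a specific calculation, and every number here is invented. It also ignores reputational cost, which rarely reduces to a single figure.

```python
from dataclasses import dataclass


@dataclass
class RiskEstimate:
    worst_case_cost: float        # cost of the worst realistic failure
    failures_per_month: float     # estimated frequency given edge-case coverage
    monthly_gain: float           # value of the time the agent saves
    review_cost_per_month: float  # cost of the human checkpoint, 0 if there is none
    review_catch_rate: float      # fraction of failures the checkpoint stops, 0.0 to 1.0


def net_monthly_value(r: RiskEstimate) -> float:
    """Efficiency gain minus review overhead minus expected failure cost."""
    expected_loss = r.worst_case_cost * r.failures_per_month * (1 - r.review_catch_rate)
    return r.monthly_gain - r.review_cost_per_month - expected_loss


def should_ship(r: RiskEstimate) -> bool:
    return net_monthly_value(r) > 0


# Invented numbers: an internal report drafter with a small blast radius.
internal_reports = RiskEstimate(200, 1.0, 800, 0, 0.0)

# Invented numbers: an outbound client agent, with and without a human checkpoint.
outbound_unreviewed = RiskEstimate(8000, 0.25, 1500, 0, 0.0)
outbound_reviewed = RiskEstimate(8000, 0.25, 1500, 1450, 0.95)
```

With these made-up figures, the internal agent clears the bar, while the outbound agent fails both ways: unreviewed, the expected loss exceeds the gain, and reviewed, the overhead eats the gain. That second case is the one where the options are to shrink the blast radius or keep the task human.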
Two Agents That Passed the Demo and Failed Gate Three
Both of these happened in real builds, so the specifics are generalized to protect client confidentiality.
The first was an intake routing agent for a professional services firm. In the demo, it correctly categorized and routed every test submission we threw at it. Gate one passed. Gate two passed (read-only on the intake form, write to a routing field only). Gate three failed because the routing log wrote the category label but not the data that drove the decision. When a client complained that their intake had been miscategorized, nobody could reconstruct why without manually re-running the submission through the logic. We rebuilt the logging layer before it shipped.
The second was a CRM data enrichment agent. Passed the demo, passed gates one and two. Failed gate three because the enrichment overwrote existing contact fields without preserving the previous values. The change was irreversible, and that made it unauditable: with no before-state to compare against, a reviewer could not tell what the agent had changed. We added a pre-write snapshot to a shadow table and a conflict-resolution step before any overwrite, then ran it through gate three again. It passed the second time.
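A pre-write snapshot can be sketched with Python's built-in `sqlite3` module. The table names, columns, and the `enrich_contact` helper are hypothetical, and the conflict-resolution step is left out. The only point shown is that the old value is recorded in the same transaction as the overwrite.

```python
import sqlite3

# Column names are interpolated into SQL below, so restrict them to a known list.
ALLOWED_FIELDS = {"email", "title"}


def enrich_contact(conn: sqlite3.Connection, contact_id: int, field: str, new_value: str) -> bool:
    """Snapshot the previous value to a shadow table, then overwrite it."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"field not permitted: {field}")
    with conn:  # one transaction: snapshot and update commit or roll back together
        row = conn.execute(
            f"SELECT {field} FROM contacts WHERE id = ?", (contact_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no contact with id {contact_id}")
        old_value = row[0]
        if old_value == new_value:
            return False  # nothing to change, nothing to log
        conn.execute(
            "INSERT INTO contacts_shadow (contact_id, field, old_value, new_value) "
            "VALUES (?, ?, ?, ?)",
            (contact_id, field, old_value, new_value),
        )
        conn.execute(
            f"UPDATE contacts SET {field} = ? WHERE id = ?", (new_value, contact_id)
        )
    return True


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, title TEXT);
CREATE TABLE contacts_shadow (
    contact_id INTEGER, field TEXT, old_value TEXT, new_value TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO contacts (id, email, title) VALUES (1, 'ada@example.com', 'Analyst');
""")
changed = enrich_contact(conn, 1, "title", "Senior Analyst")
```

After the call, `contacts_shadow` holds the before and after values for the field, so the overwrite can be reviewed or reversed later.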
Both agents went to production eventually. The gate-three failure was not a reason not to build them. It was a reason not to ship them yet.
The Gate Process Is Calibration, Not Gatekeeping
The point of the four-gate evaluation is not to find reasons to kill agents. Most agents that reach evaluation already have a reasonable design; the work happened earlier. The gates are a forcing function that converts "this seems to work" into "we know what it does, what it can't do, and what happens when it fails."
That is what separates a shipped agent from a demo. Demos optimize for the happy path. Production systems have to survive contact with reality.
If you're evaluating whether an agent is ready to ship and you haven't explicitly mapped the failure modes and their costs, you're not done with the evaluation. You're done with the demo.