    AI Agent Regression Testing: Keeping Prod Honest

    Pre-ship eval qualifies an agent once. Here's how to build AI agent testing that catches production drift before users do.

    Most AI agent testing stops the moment the agent ships. That is the bug. AI agent regression testing is the practice of running a fixed set of behavioral checks against your deployed agents on a schedule, after every model update, and whenever an upstream dependency changes, so you catch drift before your users do. Pre-deployment eval qualifies an agent before it ships. Regression testing keeps it honest afterward.

    That distinction matters more than most operators realize. Pre-ship testing is a gate. Regression testing is an ongoing process. The discipline is the same in traditional software. Almost nobody applies it to agents.

    Why Prod Agents Break Silently

    Traditional software breaks loudly. A 500 error shows up in logs. A type mismatch throws an exception. Agent failures look different. The agent still runs. It still returns a response. The response is just subtly, measurably, progressively worse, until someone complains.

    Three things cause this in practice.

    Model updates. Your provider rolls out a new model version. Output at the same temperature setting shifts. Instruction following degrades on specific patterns your prompt relied on. You never requalified because the changelog said "minor improvements." The chains most likely to break on model updates are the ones doing structured extraction from unstructured text, especially when the prompt relies on specific phrasing to coax JSON out of a model that now wants to explain its reasoning first.

    Document and input format changes. A client starts sending invoices from a new vendor. The PDF layout is different. Your chunking strategy pulls the wrong fields. The agent returns a plausible-looking result that is wrong. Nobody notices until month-end reconciliation.

    Tool and schema changes. An API your agent calls adds a required field, or deprecates an optional one, or changes an enum value. Depending on how you handle tool call failures, the agent may not raise an error at all. It just silently does the wrong thing.

    Observability catches these after the fact. Regression testing catches them before users are affected.

    What AI Agent Testing Looks Like After Deploy

    A regression suite for an agent is a collection of frozen test cases, each with a known-good input and a behavioral assertion about the expected output. Not "the output must exactly match this string." That is too brittle. But "the output must contain the extracted invoice total," or "the summary must be under 100 words," or "the tool must be called with these parameters," or "the final answer must not contradict the source document."

    The assertions are what make this work. They encode the behaviors that matter to you. A test case without an assertion is just a log entry.

    I keep these test cases in a folder in the repo, versioned alongside the agent's prompt and tool definitions. Each one is a JSON file: input, any fixture documents, and a list of checks. The checks run against the agent's actual output when I trigger the suite.
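    As an illustration of the layout described above, here is a minimal sketch of one test case file and a loader for the folder. The field names (`input`, `fixtures`, `checks`), the check types, and the file name are assumptions made for this example, not a fixed format.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RegressionCase:
    name: str                                           # taken from the file name
    input: str                                          # the frozen, known-good input
    fixtures: list[str] = field(default_factory=list)   # paths to fixture documents
    checks: list[dict] = field(default_factory=list)    # behavioral assertions to run


def load_cases(folder: Path) -> list[RegressionCase]:
    """Load every *.json file in the folder as one case, sorted by file name."""
    cases = []
    for path in sorted(folder.glob("*.json")):
        raw = json.loads(path.read_text(encoding="utf-8"))
        cases.append(
            RegressionCase(
                name=path.stem,
                input=raw["input"],
                fixtures=raw.get("fixtures", []),
                checks=raw.get("checks", []),
            )
        )
    return cases


# Write one hypothetical case to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp:
    example = {
        "input": "Extract the total from the attached invoice.",
        "fixtures": ["fixtures/acme_invoice.pdf"],
        "checks": [
            {"type": "contains", "value": "1250.00"},
            {"type": "max_words", "value": 100},
        ],
    }
    Path(tmp, "invoice_total.json").write_text(json.dumps(example), encoding="utf-8")
    cases = load_cases(Path(tmp))
```

    Keeping each case in its own file means a diff on the folder shows exactly which behaviors were added or changed alongside a prompt edit.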

    This is a discipline most operators skip. They write the agent, ship it, add some tracing, and call it done. Tracing tells you what happened. A regression suite tells you whether the behavior you care about is still intact.

    The Assertions That Actually Catch Failures

    The specific checks depend on the agent, but a few categories cover most failure modes.

    Structural assertions. Did the agent call the right tool? Did it return JSON when JSON was expected? Did it populate the required fields? These are the easiest to write and the fastest to run.
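    A rough sketch of two structural checks follows. It assumes the agent's output has been captured as a dict with a `tool_calls` list and a `final` string; that shape, and the tool and field names in the sample, are hypothetical and will differ by framework.

```python
import json


def check_tool_called(output: dict, tool: str, expected_args: dict) -> bool:
    """True if some tool call has the given name and includes the expected arguments."""
    for call in output.get("tool_calls", []):
        args = call.get("arguments", {})
        if call.get("name") == tool and all(
            args.get(key) == value for key, value in expected_args.items()
        ):
            return True
    return False


def check_json_fields(output: dict, required: list[str]) -> bool:
    """True if the final answer parses as a JSON object containing every required field."""
    try:
        parsed = json.loads(output["final"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return isinstance(parsed, dict) and all(name in parsed for name in required)


# Hypothetical captured output from one run of the agent.
sample = {
    "tool_calls": [
        {"name": "lookup_vendor", "arguments": {"vendor_id": "V-17", "region": "EU"}}
    ],
    "final": '{"invoice_total": "1250.00", "currency": "EUR"}',
}

# A model that starts explaining itself before the JSON fails the structural check.
chatty = {"tool_calls": [], "final": 'Sure! Here is the JSON: {"invoice_total": "1250.00"}'}
```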

    Semantic assertions. Did the extracted value match the ground truth in the fixture? Is the summary factually consistent with the source? These need either a deterministic check (regex, field comparison) or a lightweight LLM judge. For the LLM-judge checks, I use a small model with a tight prompt: "Does this answer contradict the source document? Yes or no." Binary. Cheap. Fast. Haiku-class models handle this reliably at a fraction of what the production model costs.
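    Both kinds of semantic check can be sketched in a few lines. The deterministic one compares amounts numerically. The judge takes any callable that sends a prompt to a model and returns its reply, so no particular provider SDK is assumed; `ask_model` and the exact prompt template are placeholders for this example.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Callable

JUDGE_PROMPT = (
    "Does this answer contradict the source document? Answer Yes or No.\n\n"
    "Source:\n{source}\n\nAnswer:\n{answer}"
)


def amounts_match(extracted: str, truth: str) -> bool:
    """Compare two money strings numerically, ignoring currency symbols and commas."""
    def to_decimal(text: str) -> Decimal:
        return Decimal(re.sub(r"[^\d.\-]", "", text))

    try:
        return to_decimal(extracted) == to_decimal(truth)
    except InvalidOperation:
        return False  # nothing numeric to compare, so the check fails


def contradicts_source(answer: str, source: str, ask_model: Callable[[str], str]) -> bool:
    """Binary LLM-judge check. Any reply that is not a clear 'no' counts as a contradiction."""
    reply = ask_model(JUDGE_PROMPT.format(source=source, answer=answer))
    return not reply.strip().lower().startswith("no")


# Stand-in for a call to a small judge model; swap in a real client in practice.
def fake_judge(prompt: str) -> str:
    return "No."


flagged = contradicts_source("The total is 1250.00 EUR.", "Total due: EUR 1,250.00", fake_judge)
```

    Treating an unclear judge reply as a failure is a deliberate choice in this sketch: a noisy false alarm in the suite is cheaper than a contradiction that slips through.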

    Behavioral boundary assertions. Did the agent refuse to answer out-of-scope questions? Did it stay within its tool permissions? These are especially important for agents with access to sensitive data or the ability to write to external systems.
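    A minimal sketch of boundary checks, under the same assumption that output is captured as a dict with `tool_calls` and `final`. The allowlist and the refusal phrases are invented for illustration; phrase matching is a crude heuristic, and a small LLM judge may be a better fit for detecting refusals in real traffic.

```python
ALLOWED_TOOLS = {"search_orders", "get_order_status"}  # hypothetical allowlist
REFUSAL_MARKERS = ("can't help with", "cannot help with", "outside my scope")  # hypothetical


def made_disallowed_call(output: dict, allowed_tools: set[str]) -> bool:
    """True if the agent called any tool that is not on the allowlist."""
    return any(
        call["name"] not in allowed_tools for call in output.get("tool_calls", [])
    )


def looks_like_refusal(output: dict) -> bool:
    """Heuristic: the final answer contains one of the known refusal phrases."""
    text = output.get("final", "").lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def passes_out_of_scope_check(output: dict) -> bool:
    """For an out-of-scope request: the agent must refuse and must not call any tool."""
    return looks_like_refusal(output) and not output.get("tool_calls")


in_scope = {
    "tool_calls": [{"name": "get_order_status", "arguments": {"order_id": "A1"}}],
    "final": "Your order shipped yesterday.",
}
leaky = {
    "tool_calls": [{"name": "issue_refund", "arguments": {"order_id": "A1"}}],
    "final": "Refund issued.",
}
refusal = {"tool_calls": [], "final": "Sorry, I can\u2019t help with legal advice."}
```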

    Not every test needs every category. A document processing agent needs heavy structural and semantic coverage. A customer-facing chat agent needs more boundary testing.

    When to Run the Suite

    The trigger list is short.

    Before deploying a prompt change. When the underlying model version changes (pin your model ID; don't let the provider float you to "latest" without a deliberate bump). When a tool schema or downstream API changes. On a weekly schedule regardless of changes, because the world around your agent shifts even when your code does not.
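    One way to make the first three triggers mechanical is to fingerprint everything the agent's behavior depends on and rerun the suite whenever the fingerprint differs from the last qualified one. This is a sketch of that idea, not a prescribed method; the model IDs and the tool schema below are made up.

```python
import hashlib
import json
from typing import Optional


def fingerprint(prompt: str, model_id: str, tool_schemas: list[dict]) -> str:
    """Stable SHA-256 hash over the prompt, the pinned model ID, and the tool schemas."""
    canonical = json.dumps(
        {"prompt": prompt, "model_id": model_id, "tools": tool_schemas},
        sort_keys=True,  # key order inside the schemas must not change the hash
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_requalification(current: str, last_qualified: Optional[str]) -> bool:
    """True when nothing was qualified yet or anything in the fingerprint changed."""
    return current != last_qualified


tools = [{"name": "lookup_vendor", "parameters": {"vendor_id": "string"}}]
qualified = fingerprint("Extract the invoice total as JSON.", "example-model-2024-01-15", tools)
after_bump = fingerprint("Extract the invoice total as JSON.", "example-model-2024-06-01", tools)
```

    The fingerprint only covers changes you control or can see. It says nothing about new document formats or quiet upstream changes, which is what the scheduled run is for.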

    The weekly run is the one most operators skip. It catches the silent drift, the document format changes, the upstream API quirks that nobody sends a changelog for. A weekly CI job that runs the suite and sends a Slack notification on failure takes a few hours to wire up. That job has caught more production issues for me than any other reliability investment.
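    The core of such a job might look like the sketch below. It assumes each case is a dict with a name, an input, and a list of check functions, and that the agent can be called as a plain function; both are simplifications. `notify_slack` posts a JSON body with a `text` field, which is what Slack incoming webhooks accept.

```python
import json
import urllib.request
from typing import Callable


def notify_slack(webhook_url: str, text: str) -> None:
    """POST a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10):
        pass


def run_suite(cases: list[dict], agent: Callable[[str], str]) -> list[str]:
    """Run each case through the agent and return the names of the cases that failed."""
    failures = []
    for case in cases:
        try:
            output = agent(case["input"])
            passed = all(check(output) for check in case["checks"])
        except Exception:  # a crashing agent or check is a failure, not a skipped test
            passed = False
        if not passed:
            failures.append(case["name"])
    return failures


def main(cases: list[dict], agent: Callable[[str], str], send: Callable[[str], None]) -> int:
    """Return a process exit code: 0 when every case passes, 1 otherwise."""
    failures = run_suite(cases, agent)
    if failures:
        summary = f"Agent regression suite: {len(failures)} of {len(cases)} cases failed: "
        send(summary + ", ".join(failures))
        return 1
    return 0


# Dry run with a fake agent and a fake sender, so nothing leaves the machine.
sent = []
demo_cases = [
    {"name": "total_present", "input": "invoice A", "checks": [lambda out: "1250.00" in out]},
    {"name": "short_summary", "input": "invoice B", "checks": [lambda out: len(out.split()) < 100]},
]
exit_code = main(demo_cases, agent=lambda text: "Total due: 980.00", send=sent.append)
```

    In a real CI job, `send` would wrap `notify_slack` with a webhook URL read from a secret, and the return value of `main` would be passed to `sys.exit` so the scheduled run shows up as failed.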

    For agents running in n8n or a similar visual workflow environment, you can trigger the suite externally by calling a webhook that replays test inputs through the production workflow. For agents built with code frameworks like LangGraph or CrewAI, you wire the test runner directly into your CI pipeline.
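    For the webhook route, a replay script can be as small as the sketch below. The URL and the payload fields (`input`, `regression_test`) are hypothetical, and it assumes the workflow is configured to respond to the webhook with the agent's output as JSON.

```python
import json
import urllib.request
from typing import Callable


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and parse the JSON response body."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def replay_cases(
    webhook_url: str,
    cases: list[dict],
    post: Callable[[str, dict], dict] = http_post_json,
) -> dict[str, dict]:
    """Send each frozen input through the workflow and collect outputs by case name."""
    results = {}
    for case in cases:
        # The flag lets the workflow tell test traffic apart from real traffic.
        payload = {"input": case["input"], "regression_test": True}
        results[case["name"]] = post(webhook_url, payload)
    return results


# Dry run with a fake transport instead of a live workflow.
def fake_post(url: str, payload: dict) -> dict:
    return {"final": f"echo: {payload['input']}", "flagged": payload["regression_test"]}


outputs = replay_cases(
    "https://n8n.example.com/webhook/agent-regression",  # placeholder URL
    [{"name": "invoice_total", "input": "Extract the total."}],
    post=fake_post,
)
```

    Marking test traffic matters most when the workflow writes to external systems, since replayed inputs should not trigger real side effects.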

    Building the Fixture Library Over Time

    The fixture library is the most valuable artifact you accumulate. Each time an edge case surfaces in prod, turn it into a test case. The document type that broke chunking becomes a fixture. The tool call that misfired becomes a frozen input. The user query that produced a wrong answer becomes a boundary test.
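    Lowering the friction of that step helps it actually happen. Here is a small sketch of a helper that freezes an incident into a new case file; the file naming scheme and the JSON fields are assumptions for the example.

```python
import json
import re
import tempfile
from datetime import date
from pathlib import Path


def save_incident_as_case(
    folder: Path, title: str, agent_input: str, checks: list[dict], found_on: date
) -> Path:
    """Write a production incident to disk as a frozen regression case."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    path = folder / f"{found_on.isoformat()}_{slug}.json"
    if path.exists():
        raise FileExistsError(f"case already recorded: {path.name}")
    case = {"note": title, "input": agent_input, "fixtures": [], "checks": checks}
    path.write_text(json.dumps(case, indent=2), encoding="utf-8")
    return path


# Record one hypothetical incident in a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_incident_as_case(
        Path(tmp),
        title="Vendor X PDF: total read from wrong column",
        agent_input="Extract the total from the attached invoice.",
        checks=[{"type": "contains", "value": "1250.00"}],
        found_on=date(2024, 5, 2),
    )
    saved_name = saved.name
    saved_case = json.loads(saved.read_text(encoding="utf-8"))
```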

    Over several months, you end up with a library that encodes the specific ways your agent has failed. That is more useful than any generic eval framework, because it reflects your actual traffic, your actual documents, your actual users.

    I start the fixture library on day one, even before the agent is in prod. Seed it with synthetic examples that cover the obvious paths, then add real cases as they arrive. By the time the agent has been live for a month, the fixture library is doing real work.

    How Regression Testing Fits the Broader Reliability Stack

    Regression testing sits between pre-deployment eval and production observability. The three layers are not redundant. They answer different questions.

    Eval tells you the agent is qualified to ship. Regression testing tells you it is still behaving the way it was when it shipped. Observability tells you what it is doing right now.

    The failure mode I see most often is teams with strong observability and no regression testing. They watch the traces carefully. They notice when something goes wrong. But they notice after it has already affected users. A regression suite moves detection earlier, to before the change goes out, when fixing it is cheap.


    The goal is not a perfect agent. There is no such thing. The goal is an agent whose failure modes you understand, whose regressions you catch fast, and whose behavioral contracts you can verify on demand. A production agent without a regression suite is a bet that nothing important changes. That bet does not hold.

    Frequently Asked Questions

    What is AI agent regression testing?
    AI agent regression testing is running a fixed set of behavioral checks against deployed agents on a schedule, after model updates, and when dependencies change. It catches regressions before they affect users, while pre-deployment eval only qualifies agents before they ship.
    Why do AI agents fail silently in production?
    Agent failures often go unnoticed because the agent still returns a response, just one that is subtly wrong. Model updates, new document formats, and tool schema changes cause agents to degrade without throwing errors that appear in logs.
    How often should I run regression tests?
    Run the suite before deploying prompt changes, whenever the model version changes, when tool schemas change, and on a weekly schedule regardless of changes. The weekly run is the one most often skipped, and it is the one that catches silent drift from upstream changes.
    What should a regression test include?
    Each test case needs a known-good input, fixture documents, and behavioral assertions about expected output. Assertions verify structure (right tool called), semantic accuracy (extracted values match ground truth), and boundaries (agent refuses out-of-scope requests). Without assertions, you just have a log entry.
    How do I set up regression testing?
    For n8n workflows, call a webhook that replays test inputs through the production workflow. For code frameworks like LangGraph or CrewAI, wire the test runner into your CI pipeline. Start with a weekly job that sends a Slack notification on failure.
