
We recently made an architectural change to how our AI agent generates end-to-end tests. No new model. No prompt engineering breakthroughs. No additional training data. We changed what we asked the agent to do — and our composite eval score improved by 47%.
At Checksum, we're building Continuous Quality — the infrastructure that takes AI-generated code from prompt to production. AI coding tools have made writing code trivial. But testing, trusting, and deploying that code is the hard part, and it's the central chokepoint for modern software teams. Our AI agents generate E2E tests, validate code against real user sessions, and ensure that every deployment is green. The quality of those tests — whether they run, whether they cover the right behavior, whether they catch the bugs that matter — is everything. So when we saw a 47% jump across our evaluation suite, we dug into why.
The answer was simple, and it applies to anyone building with AI agents: we stopped teaching the agent our stack and started letting it just write code.
Let agents code, not operate tools
LLMs are trained on code. Billions of tokens of JavaScript, TypeScript, Python, test files, documentation. When you ask a model to write a Playwright test, it's working in territory it knows deeply. That's where it performs best.
But that's not what we were asking it to do.
Like most teams building agent systems, our first instinct was to give the agent tools. Lots of them. `createTest`, `updateMetadata`, `validateFormat`, `queryExecutionHistory`, `analyzeDOMStructure`, `syncToDatabase` — 13 tools in total. Each tool came with a schema definition, usage examples, parameter descriptions, and error handling instructions. All of that lived in the system prompt.
The result: 34,000 tokens of context dedicated to teaching the agent how to be a Checksum API client. That's 34k tokens not spent on reasoning about your application, not spent on understanding test coverage gaps, not spent on writing better code.
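To make the overhead concrete, here is a hypothetical tool definition in the style of that old setup — the name, fields, and descriptions are illustrative, not our actual schema. Multiply something like this by 13 tools, plus usage examples and error-handling instructions for each, and the system prompt balloons.

```javascript
// A hypothetical tool schema (illustrative, not Checksum's real API).
// Every property description and constraint here costs context tokens
// before the agent has reasoned about anything.
const createTestTool = {
  name: "createTest",
  description: "Create a new E2E test record in the Checksum database.",
  parameters: {
    type: "object",
    properties: {
      name: { type: "string", description: "Human-readable test name" },
      specBody: { type: "string", description: "Playwright spec source" },
      tags: { type: "array", items: { type: "string" } },
    },
    required: ["name", "specBody"],
  },
};

// Rough heuristic: ~4 characters per token for English/JSON text.
const approxTokens = Math.round(JSON.stringify(createTestTool).length / 4);
console.log(approxTokens);
```

Even this stripped-down sketch runs to roughly a hundred tokens; real tool definitions, with examples and error-handling notes, cost far more.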
Here's a useful framing: think of your agent like a user. You'd never onboard a new team member by handing them 13 internal CLI tools and a 34,000-word manual. You'd give them a clean interface, sensible defaults, and let them focus on their actual job. The same principle applies to agents.
The trend in the AI tooling ecosystem is to build more tools — more integrations, more capabilities, more surface area for the model. But more tools means more context overhead, more failure modes, and more ways for the agent to get confused. More tools doesn't mean better agents.
What we built instead
We made two changes.
First, we built Repo Mirror. This is bidirectional sync infrastructure between GitHub (or GitLab) and Checksum's database. The agent writes standard `.spec` files and pushes them to a repository. That's it. Checksum handles everything else — formatting, validation, schema compliance, database writes — on the platform side. The sync runs in real time, both directions. When an agent pushes code, the UI updates. When someone edits in the UI, the repo updates.
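The platform side of that flow can be sketched as a small ingestion step — function names, field names, and the validation rule below are illustrative assumptions, not Checksum's real pipeline. The point is where the logic lives: on a push event, the platform validates and maps each spec file, and the agent never sees any of it.

```javascript
// Conceptual sketch of Repo Mirror's ingestion side (hypothetical
// names and shapes). The agent just pushes files; validation,
// formatting, and database mapping happen here, on the platform.
function ingestPush(pushEvent) {
  return pushEvent.files
    .filter((file) => file.path.endsWith(".spec.ts"))
    .map((file) => ({
      path: file.path,
      // Platform-side validation: the agent never learns these rules.
      valid: file.content.includes("test("),
      syncedAt: pushEvent.timestamp,
    }));
}

const records = ingestPush({
  timestamp: "2024-05-01T12:00:00Z",
  files: [
    { path: "tests/checkout.spec.ts", content: 'test("checkout", ...)' },
    { path: "README.md", content: "# docs" },
  ],
});
// records contains a single validated entry for checkout.spec.ts
```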
This is part of a broader architectural principle at Checksum: coding agents shouldn't need to understand the full complexity of a running software system. There's a Context Void — coding agents see code, but they don't see real user sessions, cloud environments, third-party integrations, or production behavior. Bridging that void is our job, not the agent's. So we moved all the "how to interact with Checksum" complexity out of the agent and into our infrastructure. The agent doesn't need to know about our database schema, our API endpoints, or our internal data formats. It writes code to a repo, and we handle ingestion — powered by 50TB of proprietary data spanning user sessions, browser events, SDLC data, and execution traces.
Second, we replaced 13 tools with 2. Both remaining tools follow a code-interpreter pattern. They provide the agent with a JSON structure — past execution data, current DOM analysis — and let the agent write JavaScript against it. Instead of building a bespoke tool for every query the agent might want to make (`getFailedTests`, `findUncoveredRoutes`, `compareExecutionResults`), we hand it structured data and say: write code to extract what you need.
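A minimal sketch of the pattern, with a hypothetical JSON shape (the field names are illustrative, not our actual format): instead of calling a bespoke `getFailedTests` tool, the agent is handed the execution data and writes ordinary JavaScript against it.

```javascript
// Illustrative execution data in the shape the platform might provide
// (hypothetical fields, not Checksum's real format).
const executionData = {
  runs: [
    { spec: "checkout.spec.ts", status: "failed", durationMs: 4200 },
    { spec: "login.spec.ts", status: "passed", durationMs: 1100 },
    { spec: "search.spec.ts", status: "failed", durationMs: 3900 },
  ],
};

// The agent writes plain code to extract what it needs —
// no tool schema, no parameter marshalling.
const failedSpecs = executionData.runs
  .filter((run) => run.status === "failed")
  .map((run) => run.spec);

console.log(failedSpecs); // → ["checkout.spec.ts", "search.spec.ts"]
```

Any query the agent can imagine — slowest tests, flaky specs, coverage gaps — is a few lines of `filter` and `map` rather than a new tool to build, document, and keep in the prompt.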
This plays directly to the model's strengths. LLMs are dramatically more reliable at writing JavaScript to query a JSON structure than at correctly orchestrating sequences of custom tool calls. The failure modes we'd been fighting — malformed parameters, incorrect tool sequencing, schema mismatches — disappeared.
The agent went from needing to understand Checksum's internal APIs, database schema, and formatting rules to needing to understand two things: here's a JSON, here's a repo. Write code.
The numbers
The impact was immediate and consistent across every model we tested:
System prompt + tools: 34k tokens → 8k tokens. A 76% reduction in context overhead.
Tool count: 13 → 2.
Composite eval score: 47% improvement, measured across a hybrid evaluation combining automated checks with human review.
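The context reduction above works out as:

```javascript
// Checking the headline context-overhead numbers.
const before = 34000; // tokens: system prompt + 13 tool schemas
const after = 8000;   // tokens: slim prompt + 2 tools
const reduction = Math.round(((before - after) / before) * 100);
console.log(reduction); // → 76
```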
Both factors contributed roughly equally. The context savings meant the agent had more room to reason about the actual application and write better tests. The reliability improvement meant fewer malformed outputs, fewer retries, and fewer tests that looked right but didn't run.
The failure modes that dominated our error logs — formatting errors from incorrect database schema usage, tool interaction failures from misunderstood parameters, degraded output quality from context window saturation — largely vanished.
We wrote the business logic ourselves: the integrations with GitHub and GitLab, the bidirectional data transformation, the real-time sync pipeline, the validation and formatting on ingestion. That engineering effort is what made the agent's job simple. The agent's job is to write code. Our job is everything else — the continuous quality layer that takes that code from prompt to production.
What this means for agent builders
If you're building AI agent systems, here are three principles we'd take from this:
Align the task with the training data. The model already knows how to write code. We abstracted away everything that wasn't code. Ask yourself: what is my model already excellent at? Then shape your architecture so the agent spends its time doing that — and only that. Every responsibility you add beyond the model's core strength is a tax on performance.
Write business logic to improve performance. The lazy approach — giving the agent a stack of tools and hoping it figures out the orchestration — doesn't scale. It might work for demos, but it breaks down in production. Instead, invest engineering effort in the surrounding infrastructure. Build the ingestion pipelines, the validation layers, the sync mechanisms. Think of your agent like a user: you want everything to be as familiar as possible and as foolproof as possible.
Protect your context window. It is your most precious resource. Every token of tool schema, every usage example, every formatting instruction is a token the agent can't use for reasoning about the actual problem. We cut 26k tokens of overhead and saw a 47% improvement. Measure your context overhead. Then cut it ruthlessly.
The 47% didn't come from a breakthrough in AI. It came from a shift in how we thought about the agent's role. We stopped trying to make the agent understand our system. We rebuilt our system so the agent could just do what it already does best.
There is no AI coding revolution without AI quality. Writing code is solved. The bottleneck is everything after — testing, validation, deployment. At Checksum, we're building the other half: continuous quality infrastructure that lets agents write code and ships it to production. This architectural insight — let agents code, handle the rest yourself — is how we got there.
Gal Vered is a Co-Founder at Checksum, where they use AI to generate end-to-end Cypress and Playwright tests so that dev teams know their product is thoroughly tested and shipped bug-free, without the need to manually write or maintain tests.
In his role, Gal has helped many teams build their testing infrastructure, solve typical (and not-so-typical) testing challenges, and deploy AI to move fast and ship high-quality software.

