For developers who spend more time fixing AI-generated tests than writing features.
There is a specific kind of afternoon that has become common on AI-accelerated teams.
You used an agent to build a feature. The feature looks right. You have some tests -- either written manually or generated. You run them. Three fail. You look at the failures. Two are because the agent changed UI selectors that your tests depended on. One is a real bug.
You fix the selector failures. You re-run. One still fails. You prompt the agent to fix the bug. The agent produces a fix. You run the tests again. The fix broke something else. You prompt again.
It is 4pm. You have been in this loop since 11am. You have shipped nothing.
This is the prompt-test-prompt loop. If you are shipping with AI coding tools and not using a quality layer, you are living in it.
Why Tests Break the Way They Do
The fragility problem has two distinct causes, and they require different solutions.
The first cause is selector brittleness. End-to-end tests that reference specific DOM elements -- button IDs, CSS classes, ARIA labels -- break every time a UI change touches those elements. If your agent is shipping UI changes at AI speed, your selectors are breaking at AI speed too. This is not a problem with your tests. It is a structural property of how most tests are written.
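To make the brittleness concrete, here is a deliberately tiny sketch -- a toy "DOM" element as a dictionary, not real Playwright or Checksum code. The element names and lookup helpers are invented for illustration; the point is that a lookup keyed on an implementation detail breaks on a rename, while a lookup keyed on user-visible semantics survives it.

```python
# Toy illustration of selector brittleness. Element shapes and helper
# names are invented; this is not how any real test framework works.

old_ui = {"id": "submit-btn", "role": "button", "label": "Submit order"}
new_ui = {"id": "order-submit", "role": "button", "label": "Submit order"}  # agent renamed the id

def find_by_id(element, element_id):
    # Brittle: couples the test to an implementation detail.
    return element if element["id"] == element_id else None

def find_by_role(element, role, label):
    # More resilient: couples the test to what the user actually sees.
    return element if element["role"] == role and element["label"] == label else None

assert find_by_id(old_ui, "submit-btn") is not None
assert find_by_id(new_ui, "submit-btn") is None          # test breaks; no bug exists
assert find_by_role(new_ui, "button", "Submit order")    # survives the rename
```

The rename changed nothing a user can observe, yet the id-based lookup now fails. Multiply that by every selector an agent touches in a day and you get the 4pm debug loop.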
The second cause is context blindness. When a coding agent writes a feature, it works from the code it can see. It does not know about the state of your database, the behavior of your third-party APIs, the permission model your application enforces in production, or the edge cases that have bitten your team before. So the code it generates is plausibly correct in isolation and frequently wrong in context. And the tests it generates test the wrong things, because they are written with the same blind spots the code was.
These are not problems you can solve by prompting more carefully. They are structural.
What Self-Healing Tests Actually Mean
The phrase "self-healing tests" gets used loosely. It is worth being precise.
A test heals itself when it can identify that a failure is caused by a change in the application -- a moved element, a renamed field, a restructured flow -- rather than a genuine bug, and update itself accordingly without human intervention.
This is different from a flaky test that randomly fails and passes. It is also different from a test that just retries until it passes. A self-healing test understands why it failed and makes a judgment: is this a bug, or is this the application changing? If it is the application changing, fix the test. If it is a bug, surface it.
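As a sketch of that judgment -- and only a sketch, with invented names and a vastly simplified model -- the decision might look like this: if the element the test targeted is gone but something with the same user-facing role and label exists, the application changed shape and the test should heal; if the intended interaction has no counterpart at all, surface a bug.

```python
# Hypothetical sketch of the self-healing judgment. Field names and the
# matching rule are invented for illustration, not Checksum's actual logic.

def classify_failure(step, new_screen):
    """step: what the failed test step expected; new_screen: elements now present.
    Returns ("bug", None) or ("heal", replacement_id)."""
    # The exact target is still there, so the failure is something else:
    # treat it as a real bug and surface it.
    if any(e["id"] == step["target_id"] for e in new_screen):
        return ("bug", None)
    # The target is gone, but an element with the same user-visible
    # semantics exists: the UI changed, so update the test to match.
    for e in new_screen:
        if e["role"] == step["role"] and e["label"] == step["label"]:
            return ("heal", e["id"])
    # Nothing matches the intended interaction at all: surface a bug.
    return ("bug", None)

step = {"target_id": "submit-btn", "role": "button", "label": "Submit order"}
screen = [{"id": "order-submit", "role": "button", "label": "Submit order"}]
assert classify_failure(step, screen) == ("heal", "order-submit")
assert classify_failure(step, []) == ("bug", None)
```

A retry loop cannot make this distinction; it needs a model of what the test was trying to do, not just whether it passed.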
Checksum's End-to-End agent does this by building a graph of your entire application -- every screen, every interaction, every flow -- before generating tests. Because the tests are generated from a structural understanding of the application rather than from specific selectors, they can adapt when the structure changes. When a UI evolves, the agent identifies the change in the graph and updates the test to match. The test stays green because it understands the application, not because it blindly passes.
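The graph idea can be sketched in miniature. The structure below is a toy with invented names, far simpler than a real application graph, but it shows the key property: test steps are generated from intents and flows, not from DOM identifiers, so an id rename does not appear anywhere in the generated test.

```python
# Toy model of generating test steps from a structural graph of the app
# rather than from selectors. Screen and field names are invented.

app_graph = {
    "checkout": {
        "actions": [{"intent": "submit_order", "role": "button", "label": "Submit order"}],
        "next": "confirmation",
    },
    "confirmation": {"actions": [], "next": None},
}

def generate_steps(graph, start):
    """Walk a flow and emit steps in terms of intent, not DOM ids."""
    steps, screen = [], start
    while screen is not None:
        node = graph[screen]
        for action in node["actions"]:
            steps.append((screen, action["intent"]))
        screen = node["next"]
    return steps

assert generate_steps(app_graph, "checkout") == [("checkout", "submit_order")]
```

Because the steps reference the graph, updating the graph when the UI evolves updates every test derived from it in one place.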
The Verification Loop That Should Replace the Prompt-Test-Prompt Loop
The better pattern looks like this:
You prompt your coding agent to build a feature. The agent builds it. Before you do anything else, Checksum's CI Guard agent automatically generates targeted tests for the code that changed and runs them. If there are failures, they appear in your workflow -- in your CI pipeline, in your IDE via a slash command, or as a comment on the PR.
You see the failure. It is specific: this endpoint returns a 403 for unauthenticated users when the spec says it should return a 401. You prompt the agent with the specific failure context. The agent fixes it. The tests run again automatically. They pass.
You have shipped a verified feature without manually writing a single test and without spending an afternoon in a debug loop.
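The 401-versus-403 failure in that walkthrough is worth pinning down, because it is exactly the kind of spec-level bug a targeted test catches and a visual check misses. Here is a hypothetical reproduction with a stub handler -- the handler and its fix are invented for illustration.

```python
# Hypothetical reproduction of the failure described above: the handler
# returns 403 for an unauthenticated request when the spec says 401.

def handle_request_buggy(headers):
    if "Authorization" not in headers:
        return 403  # wrong: 403 means "authenticated but not allowed"
    return 200

def handle_request_fixed(headers):
    if "Authorization" not in headers:
        return 401  # correct: the client has not authenticated at all
    return 200

# The generated test encodes the spec, so the buggy version fails it
# with a specific, actionable message the agent can be prompted with.
assert handle_request_buggy({}) != 401
assert handle_request_fixed({}) == 401
assert handle_request_fixed({"Authorization": "Bearer x"}) == 200
```

The failure message is specific enough to paste straight into the next prompt, which is what makes the second round of prompting short.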
The difference is not that you wrote better prompts. The difference is that there is infrastructure catching the failures before you have to find them yourself.
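Schematically, the loop the infrastructure runs on your behalf looks like this. Both the test runner and the agent below are stubs invented for the sketch; the point is the control flow -- run, collect specific failures, hand them to the agent, repeat until green or give up loudly.

```python
# Schematic of the verification loop with stubbed runner and agent.
# Both stubs are invented for illustration; only the control flow matters.

def run_targeted_tests(code):
    return [] if "401" in code else ["expected 401, got 403"]  # stub runner

def prompt_agent(code, failures):
    return code.replace("403", "401")  # stub "fix" for the sketch

def verify_and_fix(code, max_rounds=3):
    """Run tests; on failure, hand the agent the specific failure context."""
    for _ in range(max_rounds):
        failures = run_targeted_tests(code)
        if not failures:
            return code  # verified: ready to ship
        code = prompt_agent(code, failures)
    raise RuntimeError("still failing after automated retries")

assert verify_and_fix("return 403") == "return 401"
```

The human's job shrinks to the one step that needs judgment: reading a specific failure and deciding what the fix should be.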
A Note on Test Ownership
One thing worth knowing: Checksum generates tests as real Playwright code that lives in your repository. You own it. You can read it, modify it, run it anywhere. There is no vendor lock-in, no proprietary test format, no black box you have to trust.
This matters because the goal is not to replace your test suite with something you do not understand. It is to make your test suite comprehensive, current, and maintained automatically -- so you can spend your time on the problems that actually require a human.
The prompt-test-prompt loop is a symptom of missing infrastructure. Once the infrastructure exists, the loop goes away.
Checksum integrates directly with Claude Code, Cursor, and 100+ AI coding agents. Type /checksum to generate tests on the fly. Learn more at checksum.ai.

