Flaky tests: why they happen and how to cut failures fast

If your CI fails and the first response is “rerun it,” you do not have a test problem. You have a trust problem: failures that appear, disappear, and pull engineers into investigations that do not improve the product.

The good news is that flakiness is not random. In real production automation, failures fall into a handful of repeatable buckets, and there are fixes you can apply right away.

What are flaky tests?

A flaky test is an automated test that sometimes passes and sometimes fails without any meaningful change in application behavior.

Flakiness is especially damaging because it trains teams to ignore red builds. And once that happens, you lose on both fronts:

  • You waste time chasing false alarms

  • You ship real regressions because you stop believing the signal

Why flaky tests happen

Most teams assume flakiness is a “timing” issue. Timing is part of it, but production data paints a more useful picture.

In an analysis of over one million end-to-end automation runs across hundreds of production web applications, failures were “surprisingly predictable.”

Here is how those failures broke down:

  • 32% selector changes

  • 27% flow changes

  • 22% environment instability

  • 19% loading and timing issues

That breakdown matters because it tells you where to focus. Most flakes are not mysterious. They live in the layers where automation is coupled to a UI that changes often.

1) Selector changes (the most common cause)

A button moves. A class name changes. A component gets refactored. The user experience is fine, but the locator is not.

How to reduce it:

  • Use semantic locators and stable attributes (for example, roles, labels, test IDs)

  • Avoid deep CSS chains and DOM-structure assumptions

  • Treat selectors like an interface, not an implementation detail
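One way to make "selectors as an interface" concrete is to centralize locator choice behind a single helper. The sketch below is a hypothetical illustration (the `choose_locator` function and its priority order are assumptions, not a specific tool's API): given an element's attributes, it prefers stable semantic attributes and only falls back to the tag, never a deep class chain.

```python
# Hypothetical helper: given an element's attributes, pick the most
# stable locator, preferring semantic/test attributes over CSS chains.
PRIORITY = ["data-testid", "role", "aria-label", "id"]

def choose_locator(attrs: dict) -> str:
    """Return a selector string built from the most stable attribute."""
    for key in PRIORITY:
        if key in attrs:
            if key == "id":
                return f'#{attrs[key]}'
            return f'[{key}="{attrs[key]}"]'
    # Fall back to the tag name rather than a brittle class chain.
    return attrs.get("tag", "*")
```

Because every test goes through one function, a team-wide change of locator policy becomes one code change instead of a suite-wide rewrite.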

2) Flow changes (product evolves, tests do not)

Many failures happen because the path changed: a new step, a new validation, a new permission check, or a feature-flag branch.

How to reduce it:

  • Write tests around the outcome that matters, not every micro-step

  • Keep end-to-end coverage focused on critical user journeys

  • Push lower-level validation into API and integration tests
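A minimal sketch of an outcome-focused test, using a hypothetical `run_checkout` stand-in for the flow under test: the assertion targets the result that matters (the order is paid and complete), so inserting a new validation step or reordering screens inside the flow does not break it.

```python
# Hypothetical checkout stand-in: intermediate steps (address, shipping,
# review) may change order or gain new validations; the test below does
# not depend on any of them, only on the outcome.
def run_checkout(cart: list[str]) -> dict:
    return {"status": "paid", "items": list(cart)}

def test_checkout_outcome():
    order = run_checkout(["sku-1", "sku-2"])
    assert order["status"] == "paid"
    assert order["items"] == ["sku-1", "sku-2"]
```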

3) Environment instability (data, dependencies, and CI drift)

This includes unstable test data, third-party dependencies, noisy staging environments, and infrastructure variance.

How to reduce it:

  • Make test data seeded, isolated, and resettable

  • Mock external services where it makes sense

  • Separate “gating” suites from “monitoring” suites so you do not block releases on the wrong signal
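"Seeded, isolated, and resettable" can be as simple as a data factory that never touches global state. The sketch below is an assumed pattern, not a specific framework's fixture API: a per-test seed makes the data reproducible, and a per-test namespace keeps parallel tests from fighting over the same records.

```python
import random
import uuid

# Hypothetical data factory: every test gets its own RNG (seeded, so
# runs are deterministic) and its own namespace (so tests never share
# or clobber each other's records).
def make_test_user(seed: int, namespace: str = "") -> dict:
    rng = random.Random(seed)           # isolated RNG, not the global one
    ns = namespace or uuid.uuid4().hex  # per-test namespace for isolation
    return {
        "email": f"user-{rng.randrange(10**6)}@{ns}.test",
        "plan": rng.choice(["free", "pro", "enterprise"]),
    }
```

Resetting is then trivial: delete everything in the test's namespace after the run, without touching any other test's data.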

4) Loading and timing issues (the classic flake)

Modern UIs are asynchronous. Tests click too early. Assertions run during transitions. Animations interfere.

How to reduce it:

  • Wait for meaningful readiness (not arbitrary sleeps)

  • Prefer explicit conditions tied to UI state or network idleness

  • Avoid “retry until green” patterns that hide real regressions
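The difference between an arbitrary sleep and a meaningful wait can be captured in a few lines. This is a minimal polling sketch (the `wait_until` helper is an assumption; most frameworks ship their own equivalent): it checks a readiness condition against a deadline and fails loudly on timeout instead of silently retrying forever.

```python
import time

# Minimal explicit-wait sketch: poll a readiness condition against a
# deadline instead of sleeping for an arbitrary duration.
def wait_until(condition, timeout: float = 5.0, interval: float = 0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")
```

The condition should express real UI state, for example `wait_until(lambda: spinner_is_gone())`, so the test proceeds exactly when the app is ready, no sooner and no later.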

The core idea most teams miss: reliability is about staying operational

A lot of public “agent” and automation benchmarks focus on whether something works once.

Production reliability is different. It is about whether workflows keep working over weeks and months as UI, data, and flows change. 

That is why flaky tests accumulate. It is not because engineers are bad at writing tests. It is because the app moves and the suite does not.

How to fix flaky tests (a practical playbook)

If you want to reduce flakiness quickly, do not start by rewriting everything. Start by making failures rarer and cheaper.

1) Measure flakiness in a way leaders actually feel

Track:

  • failures per 100 runs (by test and by category)

  • time from first failure to “trusted green” (how long you are blocked or distracted)
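The first metric is straightforward to compute from a flat run log. A sketch, assuming each record is a `(test_name, passed)` pair (your CI's data shape will differ):

```python
from collections import defaultdict

# Sketch: failures per 100 runs, per test, from a flat run log of
# (test_name, passed) records.
def failures_per_100(runs: list[tuple[str, bool]]) -> dict[str, float]:
    totals = defaultdict(int)
    fails = defaultdict(int)
    for name, passed in runs:
        totals[name] += 1
        if not passed:
            fails[name] += 1
    return {name: 100.0 * fails[name] / totals[name] for name in totals}
```

Grouping the same calculation by failure category (selector, flow, environment, timing) tells you which bucket to attack first.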

2) Shrink the end-to-end surface area

Every extra step is another chance to break. Keep E2E tests for the flows that protect revenue and customer trust.

3) Standardize selectors and enforce it

Create a team rule: no brittle locators in new tests. Review selectors like you review code.

4) Fix test data like you would fix production data

If tests rely on shared state, they will fight each other. Make data isolation part of the system.

5) Make failures actionable by default

A red build should come with enough context to answer: “Is this a product bug or a test coupling issue?”

Even a lightweight standard helps: traces, screenshots, last-known-good runs, and what changed recently.

6) Stop “rerun until green” from becoming policy

Reruns are sometimes necessary, but they should be a signal to fix the system. Otherwise you teach everyone to ignore failures.

7) Add a maintenance loop that matches your shipping velocity

This is the step teams skip. If the product changes weekly, maintenance cannot be an emergency-only activity.

What changes when AI maintains tests (without going product-first)

The shift is not “AI writes tests.” Lots of tools can generate a first draft.

The shift is continuous maintenance:

  • many failures can be resolved automatically

  • the rest become quick reviews instead of deep investigations

In one benchmark summary, AI-driven recovery and auto-healing resolved about 70% of failures autonomously, with about 30% needing a quick human review. 

That same dataset highlighted an 80% reduction in failure rates, plus a step-change in triage speed, with time per failure dropping from hours to minutes.

That is the real unlock: stable automation that survives change instead of breaking every week.

The hidden cost of flaky tests (why it never feels “worth fixing” until it is)

Flakiness is expensive because the cost is spread out. It shows up as small interruptions, not one big invoice.

Depending on failure volume and triage habits, a 500-test suite can translate to seven figures per year in maintenance time. Our benchmark examples range from $1.7M+ to ~$4.3M/year.

Want a quick estimate of what flaky failures are costing you? Use the QA Benchmark calculator to model it with your current test count.

Even if your numbers differ, the pattern is consistent: flaky tests steal attention first, then velocity.

FAQ 

What causes flaky tests?
Most flakiness comes from selector changes, flow changes, environment instability, and loading or timing issues.

How do I identify flaky tests?
Look for tests that fail inconsistently across runs, pass after reruns with no meaningful code changes, or fail mainly in CI.
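That heuristic can be automated. A sketch, assuming run records of the form `(test_name, commit_sha, passed)`: any test that both passed and failed on the same commit flipped with no code change in between, which is the definition of a flake.

```python
# Sketch: flag tests whose outcomes flip across runs of the same commit,
# i.e. they both passed and failed with no code change in between.
def find_flaky(runs: list[tuple[str, str, bool]]) -> set[str]:
    outcomes: dict[tuple[str, str], set[bool]] = {}
    flaky = set()
    for name, sha, passed in runs:
        seen = outcomes.setdefault((name, sha), set())
        seen.add(passed)
        if len(seen) == 2:  # both True and False observed on one commit
            flaky.add(name)
    return flaky
```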

Do better frameworks eliminate flaky tests?
They help, but they cannot prevent brittle selectors, unstable data, or changing flows. Flakiness is as much a maintenance problem as a tooling problem.

What is the fastest way to reduce flakiness this quarter?
Standardize selectors, isolate test data, shrink E2E scope to critical flows, and make failures more diagnosable so fixes are fast and repeatable.

How do I estimate ROI from reducing flaky failures?
Use a simple model: failures avoided × time saved per failure × blended engineer cost. If you want a quick starting point, there’s a savings calculator tied to the QA benchmark report you can use with your current test count.
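The model above is just multiplication, but writing it down makes the inputs explicit. A sketch with illustrative numbers (all three inputs are placeholders; substitute your own):

```python
# Simple ROI model from the FAQ: failures avoided × time saved per
# failure × blended engineer cost. All inputs are illustrative.
def flake_savings(failures_avoided_per_year: int,
                  hours_saved_per_failure: float,
                  engineer_cost_per_hour: float) -> float:
    return (failures_avoided_per_year
            * hours_saved_per_failure
            * engineer_cost_per_hour)
```

For example, avoiding 2,000 failures a year at 1.5 hours saved each and a $120/hour blended rate comes to $360,000 per year.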

Checksum Team

Checksum.ai is an AI-powered end-to-end test automation platform that helps engineering teams automatically generate and maintain high-quality tests for web applications — without manually writing or updating them. 

Checksum is now a Google Partner

Checksum AI and Google Cloud: End-to-End Testing AI Innovation
