Flaky Tests: Why They Happen and How We Eliminated 97% of Them with AI

What Are Flaky Tests?

Flaky tests are automated tests that sometimes pass and sometimes fail without any changes to the codebase or application. These tests produce inconsistent results, leading to unreliable CI/CD pipelines, wasted developer time, and slower release cycles.

Flaky tests typically result from:
- Timing issues (e.g., elements not fully loaded)
- Improper use of waits or retries
- Asynchronous behavior that is not handled properly
- Fragile selectors or brittle DOM assumptions
- External dependencies that introduce variability (e.g., third-party APIs)

Over time, flaky tests accumulate and become a silent tax on engineering velocity. Teams either ignore them or spend hours debugging problems that aren’t real.
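
To make the timing problem concrete, here is a minimal sketch of the difference between a fixed sleep and a bounded polling wait. It is framework-agnostic: `wait_until` and `FakePage` are hypothetical names invented for this illustration, not part of Selenium, Cypress, or Playwright.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose element appears after a delay."""

    def __init__(self, ready_after: float) -> None:
        self._ready_at = time.monotonic() + ready_after

    def element_visible(self) -> bool:
        return time.monotonic() >= self._ready_at


# A fixed sleep of 0.1 s would pass or fail depending on how slow the page is.
# Polling returns as soon as the element shows up, bounded by the timeout.
page = FakePage(ready_after=0.2)
appeared = wait_until(page.element_visible, timeout=2.0)
```

The test no longer encodes a guess about how long the page takes. It only encodes the longest delay it is willing to tolerate.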

Flaky Test Benchmarks: Selenium vs Cypress vs Playwright vs Checksum

We benchmarked flaky test rates across hundreds of test suites in real-world applications. Here are the results:

Framework	Median Flake Rate	Common Causes (or Mitigations)
Selenium	20–25%	Timing bugs, stale element errors, and manual waits
Cypress	12–18%	DOM mutation sensitivity, race conditions
Playwright (manual tests)	3–7%	Residual flakiness from user-authored test logic
Checksum (AI-generated Playwright)	0.6%	Smart retries, stable selectors, automated healing
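
For readers who want to run a similar comparison on their own suites, the sketch below shows one way a median flake rate can be computed. It assumes a suite's flake rate is the share of its tests that showed flaky behavior; that definition and the sample counts are assumptions for illustration, not the benchmark's actual method or data.

```python
from statistics import median


def suite_flake_rate(flaky_tests: int, total_tests: int) -> float:
    """Share of tests in one suite that showed flaky behavior, as a percentage."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return 100.0 * flaky_tests / total_tests


# Hypothetical (flaky, total) counts for five suites; not data from the benchmark.
suites = [(4, 100), (12, 200), (3, 150), (30, 120), (5, 250)]
rates = [suite_flake_rate(flaky, total) for flaky, total in suites]
median_rate = median(rates)
```

A median is a reasonable summary here because a single badly behaved suite (the 25% one in this sample) pulls a mean upward far more than it moves the median.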

Why Selenium and Cypress Struggle with Flaky Tests

Selenium and Cypress were designed before modern frontend frameworks became the norm. Both tools expose developers to low-level control over waits, assertions, and selectors, which often leads to fragile tests. For their time they were good open-source tools, but Microsoft's Playwright (which Checksum uses) was built with enterprise-grade capabilities, which is why most teams want to make the switch.

Selenium requires manual management of waits and uses brittle XPath or CSS selectors. Flaky tests often arise when elements become stale, asynchronous behavior isn’t handled, or the page structure changes slightly.

Cypress attempts to improve this with auto-waiting and easier assertions. However, Cypress struggles with iframe support, flake-prone animations, and implicit state that causes test leakage across specs.

Flaky tests in both tools require constant manual triage, and debugging them adds drag to every release.

Why Playwright Reduces Flaky Tests — But Doesn’t Eliminate Them

Playwright improves flaky test stability in several ways:
- Built-in auto-waiting on page loads, elements, and network conditions
- Better support for parallel execution and iframe handling
- Rich debugging tools like trace playback, video recording, and screenshots

While Playwright significantly reduces flaky tests, it doesn’t eliminate them. If tests are written manually, developers can still introduce instability through poor selector choices, non-deterministic logic, or improper use of retries.
Our internal data shows that even Playwright suites written by hand typically retain a 3–7% flake rate.
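
Non-deterministic logic is the easiest of these to show in isolation. The sketch below assumes a hypothetical `make_order_id` helper that depends on the clock and a random number; injecting both makes the result repeatable from run to run.

```python
import random
from datetime import datetime, timezone
from typing import Callable


def make_order_id(rng: random.Random, now: Callable[[], datetime]) -> str:
    """Build an ID from the current date and a random suffix (hypothetical app logic)."""
    return f"{now():%Y%m%d}-{rng.randint(1000, 9999)}"


def fixed_now() -> datetime:
    """Frozen clock used only by the test."""
    return datetime(2024, 1, 15, tzinfo=timezone.utc)


# Same seed and same clock give the same ID on every run.
first = make_order_id(random.Random(42), fixed_now)
second = make_order_id(random.Random(42), fixed_now)
```

A test that called `datetime.now()` and the global `random` module directly could pass today and fail at midnight or on an unlucky draw.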

How Checksum Eliminates Flaky Tests Using AI

At Checksum, we use AI agents to generate and maintain Playwright tests that are built to avoid flaky tests from the outset. Here’s how:
- Stable selectors: AI selects robust, semantically meaningful locators rather than brittle CSS chains or XPath.
- Flake-aware retries: The platform applies retries only when necessary, using a classifier trained on known flaky test patterns.
- Continuous healing: When the app UI changes, the AI updates test logic and selectors in real time to prevent flakiness.
- Noise filtering: Our system automatically distinguishes between true failures and false positives due to test or infrastructure noise.

The result is a typical flake rate of just 0.6% — even in highly dynamic multi-tenant applications.
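
The flake-aware retry idea can be sketched in a few lines. This is not Checksum's implementation: the trained classifier is replaced by a simple rule, and `TransientError`, `looks_flaky`, and `run_with_flake_aware_retry` are hypothetical names used only for illustration.

```python
from typing import Callable, Tuple


class TransientError(Exception):
    """Hypothetical marker for infrastructure noise such as a timeout."""


def looks_flaky(error: Exception) -> bool:
    """Placeholder classifier; a real system might use a trained model here."""
    return isinstance(error, TransientError)


def run_with_flake_aware_retry(test: Callable[[], None], max_attempts: int = 3) -> Tuple[str, int]:
    """Run `test`, retrying only when the failure looks like noise.

    Returns (outcome, attempts); outcome is "passed", "failed", or "flaky-failed".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            test()
            return ("passed", attempt)
        except Exception as error:
            if not looks_flaky(error):
                return ("failed", attempt)  # real failure: do not mask it
    return ("flaky-failed", max_attempts)


calls = {"n": 0}


def sometimes_times_out() -> None:
    """Fails with noise on the first call, then passes."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise TransientError("timeout")


def real_bug() -> None:
    """Fails the same way every time."""
    raise AssertionError("wrong total")


noisy_result = run_with_flake_aware_retry(sometimes_times_out)
bug_result = run_with_flake_aware_retry(real_bug)
```

The point of the design is the asymmetry: noise gets another attempt, while a genuine assertion failure is reported on the first run instead of being retried until it happens to pass.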

Case Study: Reducing Flaky Tests from 22% to 0.6%

One B2B SaaS platform migrated 1,500 Selenium tests to Checksum. Their baseline metrics:

- 22% flaky test rate across their CI pipeline
- Daily test reruns caused 3+ hours of engineering delay
- 4 out of 10 releases delayed due to test-related false alarms

After switching to Checksum:
- Flaky test rate dropped to 0.6%
- Regression test time reduced by 70%
- No test-related rollbacks in the following quarter

This transformation came with no manual intervention — the AI agents handled the migration, selector stabilization, and trace integration.
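
As a quick check on the case study figures, the relative reduction from 22% to 0.6% can be worked out directly. The helper name below is just for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


reduction = relative_reduction(before=22.0, after=0.6)
print(f"{reduction:.1f}% fewer flaky tests")  # prints "97.3% fewer flaky tests"
```

That result is consistent with the 97% figure in the title.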

Flaky Tests FAQ

What causes flaky tests?
Flaky tests are usually caused by timing issues, race conditions, unstable selectors, or interactions with asynchronous UI behavior. Network delays or environmental dependencies can also contribute.

How do I identify flaky tests?
Common symptoms include: tests that pass locally but fail on CI, failures that are inconsistent across runs, or tests that pass after rerunning without code changes. Monitoring test history and using trace tools can help diagnose flaky behavior.
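
One simple way to act on test history is to flag any test that both passed and failed on the same commit. The sketch below assumes your CI can export one record per test execution; the record shape and the sample history are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One record per test execution: (test name, commit SHA, passed?)
Run = Tuple[str, str, bool]


def find_flaky_tests(runs: Iterable[Run]) -> List[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    flaky = {test for (test, _), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


# Hypothetical CI history.
history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),    # same commit, different result
    ("test_checkout", "abc123", False),
    ("test_checkout", "def456", True),  # result changed only after a new commit
    ("test_search", "abc123", True),
]
flaky_tests = find_flaky_tests(history)
```

Here only `test_login` is flagged. `test_checkout` changed outcome across commits, which may well be a real fix rather than a flake.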

Can Cypress eliminate flaky tests?
Cypress reduces some flakiness compared to Selenium, but it still suffers from issues like DOM mutations, state leakage between tests, and limited iframe support. In our benchmark, Cypress suites showed a median flake rate of 12–18%.

Is Playwright fully immune to flaky tests?
No. While Playwright includes better waits and tracing, flaky tests still occur if tests are written poorly. Selector instability, bad retry logic, or assumptions about timing can still lead to flakiness.

How does Checksum differ?
Checksum’s AI doesn’t just run tests — it writes and maintains them. This includes choosing stable selectors, repairing failing tests, and detecting flaky patterns automatically. The result is a flake rate that’s 5–10x lower than what teams achieve with Playwright alone.

Say Goodbye to Flaky Tests

Flaky tests slow down development, undermine trust, and waste engineering time. Whether you're using Selenium, Cypress, or Playwright, there's a better path forward. At Checksum, we help teams eliminate flaky tests by replacing manual test maintenance with AI-generated automation that stays stable even as your app evolves.

If you're ready to eliminate flaky tests, we can help. Book a test suite audit or contact us.

Neel Punatar

Neel Punatar is an engineer from UC Berkeley - Go Bears! He has worked at places like NASA and Cisco as an engineer but quickly switched to marketing for tech. He has worked for companies like OneLogin, Zenefits, and Foxpass before joining Checksum. He loves making engineers more productive with the tools he promotes. Currently he is leading marketing at Checksum.
