February 9, 2026
The Problem With Web Agent Benchmarks (And Why We Need Better Ones)
Web agent benchmarks are broken. Benchmarks like Browserbase's Stagehand evals and agent-evals have fundamental gaps that make them poor predictors of real-world automation success.
Here's the uncomfortable truth: current web agent benchmarks tell us AI isn't ready to automate the web. But that's not the whole story.

The Compound Action Problem
Look at Stagehand's atomic action evals. They report around 85% accuracy per action. That sounds pretty good until you do the math on a real workflow.
Booking a table at a restaurant takes maybe 10 actions: navigate to the site, click reservations, select date, select time, enter party size, fill in contact info, submit. Even simple enterprise workflows are longer. At 85% per-action accuracy, your success rate for a 10-action flow is 0.85^10 ≈ 19.7%.
One in five attempts succeeds.
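The compounding math above can be checked directly. A minimal sketch, assuming each action fails independently (a simplification, but the one the headline numbers imply):

```typescript
// Per-action accuracy compounds multiplicatively across a workflow,
// assuming independent failures at each step.
function flowSuccessRate(perActionAccuracy: number, numActions: number): number {
  return Math.pow(perActionAccuracy, numActions);
}

const rate = flowSuccessRate(0.85, 10);
console.log((rate * 100).toFixed(1) + "%"); // 19.7%
```

Even a modest per-action improvement moves the needle a lot: at 95% per action, the same 10-step flow lands around 60%, which is why small atomic-accuracy gains matter so much for long workflows.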
The agent-evals benchmark from Stagehand shows end-to-end success rates around 65% on WebVoyager-style tasks. That's more realistic for multi-step flows, but it's still telling us the same thing: you cannot reliably automate web processes with AI today.
Except... people are. Every day. So what's missing?
The Reliability Question
A 65% success rate means very different things depending on how that 65% is distributed across tasks.
If 65% of tasks succeed reliably, that's transformative. Give me 100 daily workflows and I'll happily automate 65 of them. The 35% that fail? I'll handle those manually while the AI takes care of the rest.
But if every task succeeds 65% of the time, that's unusable. I can't trust any individual automation to work. I'm constantly checking, retrying, fixing. That's not automation—that's babysitting.
Current benchmarks don't tell us which world we're in. They give us aggregate numbers without showing us the distribution of successes across tasks. Without that, we can't answer the most important question: which processes can I actually automate today?
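The two worlds are easy to tell apart once you look at per-task run histories instead of one aggregate. A sketch with hypothetical numbers, defining "automatable" as passing at least 95% of repeated runs:

```typescript
// Two worlds with the same ~65% aggregate success rate.
// World A: 65 of 100 tasks always succeed, 35 always fail.
// World B: every task succeeds 65% of the time, independently per run.

type TaskRuns = boolean[]; // outcomes of repeated runs of one task

function reliableTaskCount(tasks: TaskRuns[], threshold = 0.95): number {
  return tasks.filter(runs => {
    const rate = runs.filter(Boolean).length / runs.length;
    return rate >= threshold;
  }).length;
}

// World A: the first 65 tasks pass every one of their 20 runs.
const worldA: TaskRuns[] = Array.from({ length: 100 }, (_, i) =>
  Array.from({ length: 20 }, () => i < 65)
);

// World B: each run of each task is an independent 65% coin flip.
const worldB: TaskRuns[] = Array.from({ length: 100 }, () =>
  Array.from({ length: 20 }, () => Math.random() < 0.65)
);

console.log("World A automatable tasks:", reliableTaskCount(worldA)); // 65
console.log("World B automatable tasks:", reliableTaskCount(worldB)); // almost always 0
```

Same headline number, opposite conclusions: World A hands you 65 workflows you can ship today; World B hands you none.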
The Agent Harness Gap
Nobody runs these models raw in production. Nobody.
Every real-world AI automation system has an agent harness: retry logic, validation checks, caching for known-good paths, fallback strategies, confidence thresholds. These aren't nice-to-haves. They're table stakes.
A model that scores 60% raw might hit 90% with good harness logic. Another model at 65% raw might plateau at 70% because it fails in ways that are harder to recover from. The benchmark numbers don't tell you which is which.
This matters because different models fail differently. One might make locator errors that are easy to detect and retry. Another might succeed at actions but misunderstand intent in ways that are harder to catch. Raw accuracy scores hide these differences.
Real production systems layer multiple recovery strategies. When Playwright code fails mid-execution, AI can kick in to complete the action in real-time. When a selector breaks, the system can try alternative approaches before giving up. When validation fails, it can retry with different parameters. This is engineering, not magic—but it changes the success rates dramatically.
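The layered-recovery idea can be sketched in a few lines. This is a toy harness, not any particular product's implementation; `ActionFn` is a stand-in for a Playwright step or an AI fallback:

```typescript
// Minimal agent-harness sketch: try the deterministic step first,
// then fall back to alternative strategies, with bounded retries each.
type ActionFn = () => Promise<void>;

async function runWithRecovery(
  strategies: ActionFn[],   // ordered: deterministic code first, AI fallback last
  retriesPerStrategy = 2
): Promise<boolean> {
  for (const attempt of strategies) {
    for (let i = 0; i <= retriesPerStrategy; i++) {
      try {
        await attempt();
        return true;          // first success wins
      } catch {
        // swallow the error: retry this strategy, then move to the next
      }
    }
  }
  return false;               // every strategy exhausted
}
```

The arithmetic explains the 60%-raw-to-90%-harnessed jump: with independent attempts at per-try success p, k tries succeed with probability 1 − (1 − p)^k, so three tries at 60% already gives about 94%. The catch is the independence assumption: a model that fails by misunderstanding intent fails the same way on every retry, which is exactly the failure mode raw scores hide.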
The benchmarks measure model capability in isolation. But production is about model capability plus engineering. Without testing the full stack, we're optimizing for the wrong thing.
The Auto-Healing Problem
Here's the thing about browser automation: writing the initial script is the easy part.
I can write a Selenium script to automate almost any web workflow in an afternoon. No AI needed. Just good old DOM inspection and XPath selectors. For a high-value process that runs 100 times a day, spending 5-10 hours on the script is nothing. Spending 30 minutes reviewing AI-generated Playwright code? Even easier.
The killer is maintenance. That script breaks next week when they redesign the button. It breaks again when they add a loading spinner. Again when they change the form validation. Again when they add a cookie banner. Again when...
You get the idea.
This is why LLMs matter for browser automation. Not because they can take a vague prompt and figure out a website they've never seen—though that's cool. They matter because they can adapt when things break.
But here's what's missing from the benchmarks: they don't test healing at the right layer. The most effective approach isn't pure AI that figures everything out from scratch every time. It's AI that maintains deterministic code.
Generate Playwright code as your source of truth. When it breaks, have AI update the code and submit it as a PR. Now you have version control, human review, and a clear definition of what the automation does. You're not debugging AI behavior—you're reviewing code changes. When something fails, you get clear, actionable reports about what broke and how it was fixed.
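The shape of that healing loop, as a sketch. `runScript`, `proposeFix`, and `openPullRequest` are hypothetical names standing in for a Playwright runner, an LLM call, and your version-control integration:

```typescript
// Sketch of a code-as-source-of-truth healing loop: the script, not the
// AI session, is what gets repaired, reviewed, and versioned.
interface RunResult { ok: boolean; error?: string; }

async function healingLoop(
  script: string,
  runScript: (code: string) => Promise<RunResult>,
  proposeFix: (code: string, error: string) => Promise<string>,
  openPullRequest: (patched: string, reason: string) => Promise<void>
): Promise<RunResult> {
  const result = await runScript(script);
  if (result.ok) return result;

  // On failure, AI proposes a patch to the code and submits it for review.
  const patched = await proposeFix(script, result.error ?? "unknown failure");
  await openPullRequest(patched, `automation broke: ${result.error}`);
  return runScript(patched); // retry with the proposed fix
}
```

The artifact a human reviews is a diff to deterministic code, with the failure message as context; that's a far more tractable review than replaying an agent's reasoning trace.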
This is a different problem than "navigate to a random website and complete a random task." It assumes some human oversight at the beginning. It assumes you care about the same workflows over time, not one-shot tasks. It assumes code as the source of truth, with AI as the maintenance layer.
Current benchmarks don't test this. They test zero-shot performance on static tasks. They don't test: "This workflow ran yesterday. Today the UI changed. Can you still complete it?" That's the actual problem we need AI to solve.
What We Really Need
The enterprise automation problem is different from the benchmarks. It's not "can AI figure out any website from scratch?" It's "can AI help me maintain automations that I care about?"
Real enterprise automation looks like this:
I define a workflow I need to repeat, maybe with AI-generated Playwright code I can review
Maybe I give some structure, some guardrails, some validation
The automation runs daily, weekly, constantly
When code breaks mid-run, AI recovery kicks in to complete the task
After runs complete, the system reviews failures and submits fixes
Sites change, the automation adapts
When it breaks, I get clear, actionable reports—not "the AI failed," but "this selector broke, here's the fix"
Over time, it learns the stable paths and gets faster
This is a different problem from the one benchmarks measure. It assumes some human oversight at the beginning. It assumes retry logic and validation. It assumes you care about the same workflows over time, not one-shot tasks. It assumes code as infrastructure, not AI as a black box.
Benchmarks that test one-shot performance on diverse tasks are measuring something. But they're not measuring the thing that matters most for production automation: resilience over time with human-reviewable artifacts.
The Path Forward
We need benchmarks that:
Report per-task success rates, not just aggregates. Show us which tasks are automatable today.
Test with agent harnesses, not just raw models. Measure production systems with real-time recovery, not lab conditions.
Measure adaptation over time. Run workflows, change the sites, see what survives. Test if AI can maintain Playwright code, not just execute actions from scratch.
Value reliability over breadth. A system that automates 10 workflows at 95% reliability is more useful than one that attempts 100 at 50%.
Test code generation and healing. Can the system produce reviewable artifacts? Can it fix its own code when things break?
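The first criterion is also the cheapest to implement: keep per-task run records and report a rate per task, not one pooled number. A minimal sketch:

```typescript
// Per-task reporting instead of a single aggregate number.
// Each record is one benchmark run of one named task.
interface RunRecord { task: string; passed: boolean; }

function perTaskRates(runs: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { pass: number; total: number }>();
  for (const r of runs) {
    const t = totals.get(r.task) ?? { pass: 0, total: 0 };
    t.total++;
    if (r.passed) t.pass++;
    totals.set(r.task, t);
  }
  return new Map([...totals].map(([task, t]) => [task, t.pass / t.total]));
}
```

Per-task rates of [1.0, 1.0, 0.3] and [0.77, 0.77, 0.76] pool to roughly the same aggregate, but only the first contains workflows you can actually ship.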
The current benchmarks tell us AI isn't ready. But AI is already automating real work for real companies. That gap means we're measuring the wrong things.
We're building a new benchmark that addresses these gaps. Not because Stagehand's evals are wrong—they're measuring what they're meant to measure. But because what we need to measure is different.
The web is already being automated by AI. The question isn't "can it work?" It's "how do we know which workflows will work, and how do we make them work better?"
Better benchmarks are the first step.