
Web agent benchmarks are broken. Benchmarks like Browserbase's Stagehand evals and agent-evals have fundamental gaps that make them poor predictors of real-world automation success.
Here's the uncomfortable truth: judged by current web agent benchmarks, AI isn't ready to automate the web. But that's not the whole story.
Gal Vered is a Co-Founder at Checksum, where they use AI to generate end-to-end Cypress and Playwright tests, so that dev teams know their product is thoroughly tested and shipped bug-free, without the need to manually write or maintain tests.
In his role, Gal has helped many teams build their testing infrastructure, solve typical (and not-so-typical) testing challenges, and deploy AI to move fast and ship high-quality software.

