AI in QA - Issue #12

If you’re tired of spending more time fixing tests than writing them, this webinar is worth a look. BrowserStack is hosting a live session on how QA teams are using AI to create tests faster and reduce locator maintenance with self-healing automation workflows.

June 9 | 11:00 AM CET | Register Now!

Expectation-Driven Development

I ran across a technique this week I hadn't seen before: Expectation-Driven Development. Andrea Laforgia wrote about this on his Substack, and it's worth reading about even if you don't fully adopt it.

The idea is that you write down expectations of your system before any code is written. The AI implements the feature from these expectations, then you ask the AI tool to produce evidence that the expectations are being met. You review that evidence and decide whether it is sufficient to ship.

While this is a new approach to development, I believe testers can take the concept and bake it into how we work with AI tools while testing. As we review user stories and refine them with the team, we share our expectations with the product and development team. Once code is implemented and there is something to test, we can use those same expectations to help generate automated tests that prove they hold. What's nice is that the expectations can also live as comments in our test files, so we know exactly which expectations the tests in that file validate.
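Here is a minimal sketch of what expectations living as comments in a test file could look like. The expectations, the SignupService class, and the test names are all hypothetical and invented for illustration; they are not taken from Andrea's post.

```python
# Expectations validated by this file (hypothetical, for illustration):
#   E1: A new user can sign up with an email that is not yet registered.
#   E2: Signing up with an already-registered email is rejected,
#       regardless of letter case.


class SignupService:
    """Tiny in-memory stand-in for the system under test (hypothetical)."""

    def __init__(self) -> None:
        self._emails: set[str] = set()

    def sign_up(self, email: str) -> bool:
        """Return True if the email was registered, False if it was a duplicate."""
        normalized = email.strip().lower()
        if normalized in self._emails:
            return False
        self._emails.add(normalized)
        return True


def test_e1_new_user_can_sign_up() -> None:
    # Validates E1.
    assert SignupService().sign_up("ada@example.com") is True


def test_e2_duplicate_email_is_rejected() -> None:
    # Validates E2.
    service = SignupService()
    service.sign_up("ada@example.com")
    assert service.sign_up("Ada@Example.com") is False
```

Naming each test after the expectation it covers makes it easy to spot an expectation that has no test yet.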

The key distinction between TDD (Test-Driven Development) and EDD is that in EDD you're not using e2e tests to drive the design (that already happened via expectations). You're writing e2e tests after the fact to protect the working system from future regressions. The expectations plus verification provide future context, while the e2e tests are the safety net that catches regressions.

It's essentially the difference between "tests as specification" and "tests as insurance." You already had the specification (EDD expectations). Now you're buying insurance.

Big thanks to Andrea Laforgia for sharing. Read the full post here: Expectation-Driven Development: A Validation Framework for the Age of AI Agents

Headlines & Launches

Vibium Raises Funding - Jason Huggins Announces
Jason Huggins (LinkedIn)
Jason Huggins (creator of Selenium) announces Vibium raised funding from Commit Capital, a M fund investing in dev tools and infrastructure. Vibium is building AI-powered testing tooling. The fund's unique angle: four of five partners are engineers who embed in portfolio companies to ship alongside founders.

MCP Is Growing Up
Angie Jones (Agentic AI Foundation)
Angie Jones breaks down the 2026-07-28 MCP release candidate: MCP is going stateless at the protocol layer, making it easier to scale behind load balancers. State moves to explicit handles the model can reason about. Extensions get a real governance process (reverse-DNS IDs, independent versioning). Roots, Sampling, and Logging are deprecated in favor of tool parameters, direct LLM provider APIs, and OpenTelemetry. Authorization gets tighter OAuth/OIDC rules for multi-server deployments. Full JSON Schema 2020-12 support for tool definitions. Feature lifecycle policy: 12 months between deprecation and removal.

Vince Graics on wdio-agent-service at SeleniumConf Valencia
Vince Graics (LinkedIn)
Vince Graics demoed wdio-agent-service at SeleniumConf Valencia as an impromptu lightning talk. The plugin adds LLM-powered agentic browser actions to WebdriverIO via browser.agent(prompt), using small models (qwen2.5-coder:3b) through Ollama. Hybrid approach: stable selectors use regular WDIO, unpredictable UI (cookie banners, popups, A/B tests) delegates to the LLM. Lightning talk - GitHub
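To make the hybrid idea concrete, here is a rough sketch in Python. The real plugin is for WebdriverIO (JavaScript), so nothing below is its actual API: FakeBrowser, login_flow, the selectors, and the prompt text are hypothetical stand-ins that only show which steps go to selectors and which go to the agent.

```python
class FakeBrowser:
    """Hypothetical stand-in that records which path handled each step."""

    def __init__(self) -> None:
        self.log: list[tuple[str, str]] = []

    def click(self, selector: str) -> None:
        # Deterministic path: a stable selector, no model involved.
        self.log.append(("selector", selector))

    def agent(self, prompt: str) -> None:
        # Agentic path: in the real plugin an LLM would decide what to do here.
        self.log.append(("agent", prompt))


def login_flow(browser: FakeBrowser) -> list[tuple[str, str]]:
    # Unpredictable UI (cookie banners, popups): delegate to the agent.
    browser.agent("Dismiss the cookie banner if one is shown")
    # Stable UI: keep using plain selectors, which are fast and repeatable.
    browser.click("#username")
    browser.click("#password")
    browser.click("button[type=submit]")
    return browser.log
```

The point of the split is that only the flaky step pays the latency and nondeterminism of a model call.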

Tools & Frameworks

GitHub - vostride/agent-qa: The self-improving Agentic QA harness with Memory.
Vostride (GitHub)
Self-improving agentic QA harness with memory. Write tests in natural language for web and mobile. Self-heals when actions fail by re-observing the UI and trying alternate paths. Builds execution memory from past runs to avoid repeating mistakes. Supports sandboxed hooks in Docker, MCP for coding agents, and bring-your-own-LLM (OpenAI, Anthropic, local models, Codex, Claude Code subscriptions).

AI-Automation-QA-Engineer: Autonomous Web Testing & Bug Fixing Agent
Rohail Siddiqi (LinkedIn)
Autonomous two-stage QA agent (detect + fix) built by Rohail Siddiqi. Uses a deep Playwright-based agent loop to explore live web apps, generate test cases, classify failures, and auto-patch source code. Case study: 21 tests generated in 2:16, zero false positives, found an unplanted duplicate-email-signup bug by accumulating state across tests like a real user. Supports Ollama/Anthropic/OpenAI. GitHub

qTest MCP Server for Test Automation
Usman Ghani (LinkedIn)
Unofficial MCP server for qTest Manager that fills gaps in the official version - supports test execution cycles, suites, test runs, and filtering. Works with Claude and other MCP-compatible AI agents via natural language. MIT licensed, available on npm.

Postman + Playwright: Integration Testing UI and API Together
Postman (Postman Blog)
Postman's official guide on integrating Postman with Playwright to test UI and API together in a single workflow. Covers how to combine API-level tests with browser automation for end-to-end integration testing, leveraging Postman's collection runner alongside Playwright's browser capabilities.

Aruna Swaminathan: AI-Assisted Test Generation with v2 of Her Selenium+AI Tool
Aruna Swaminathan (LinkedIn)
Aruna Swaminathan shares v2 of her AI test generation tool that takes Jira/stories as input and generates Selenium Python test scripts. Covers edge cases, boundary conditions, and error scenarios automatically. Demo video and GitHub included. Next focus (v3): generating framework-aware tests that drop into POM structures and CI/CD pipelines - not just individual scripts but maintainable, team-ready test code.

strands-agents/evals: AWS Agent Evaluation Framework
Nikesh K. (LinkedIn)
Nikesh K. highlights two key tools for SDETs moving into AI quality: (1) confident-ai/deepeval - pytest for LLMs with 50+ metrics (hallucination, relevancy, faithfulness, MCP eval) that plugs into existing CI/CD; (2) strands-agents/evals - built by AWS, goes deeper on agent-specific quality: did the agent use the right tools, follow the right trajectory, and reach the goal the right way. The gap between shipping AI and testing AI is where the next wave of SDET work lives.

Techniques & Tutorials

The Code Passed. But Did It Work? Testing AI-Generated Code in 2026
Vishal Karivelil (Medium)
Explores the widening gap between AI-generated code volume and QA testing strategies. AI writes syntactically clean but confidently wrong code - no TODO comments, no PR flags, no Slack threads to signal risk. Systematic failure modes: edge case blindness, hallucinated dependencies, silent logic errors. Argues QA needs to shift from finding breaks to finding features that behave incorrectly under edge cases.

The Green Report | Claude Code Hooks as a Test Quality Gate
The Green Report (The Green Report) SOFTWARE TESTING WEEKLY
Shows how to use Claude Code hooks (PreToolUse/PostToolUse) as automated test quality gates. Configures shell scripts that fire when Claude Code writes or edits files, catching hardcoded waits, missing assertions, and forbidden selectors at write-time instead of code review. When a hook fails, Claude Code self-corrects and re-runs until it passes - enforcement without interrupting workflow.

The Question That Followed Me Home
Dragan Spiridonov (Forge Quality)
Dragan Spiridonov chronicles a whirlwind week for the Nagual-QE agentic quality engineering platform: six releases in eight days driven entirely by community bug reports and feature requests. Covers the journey from v3.9.31 to v3.10.1 - stacked pipeline bugs, unbounded vector files hitting 59GB, dead LLM router code discovered by an external contributor, and an external embedder endpoint. Also documents hands-on testing meetups in Belgrade and Novi Sad, and the growing question of how to bring agentic QE to teams.

What Building an AI Testing System Taught Me About Where the Field Actually Is
Neil Duggan (Medium)
Neil Duggan spent a year building QA Brain - an autonomous web app exploration + Playwright test generation system (~170K lines). Five hard-won lessons: (1) Intent capture is upstream of everything - without clear acceptance criteria, AI tests validate the wrong thing. (2) The oracle problem is unsolved - knowing what to assert is harder than generating the actions. (3) Deterministic codegen beats runtime LLM interpretation for reliability. (4) Selector healing needs multi-dimensional fingerprinting, not just one strategy. (5) Origin-aware quality scoring governs the full test lifecycle. The gap between vendor claims and real capability is still wide.
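Lesson 4 is easier to picture with a toy example. The sketch below is not QA Brain's implementation; the Fingerprint fields, the weights, and the threshold are all assumptions chosen to show the idea of scoring candidates on several dimensions instead of relying on one selector strategy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fingerprint:
    """Several independent signals captured for one element (hypothetical)."""
    tag: str
    text: str
    role: str
    test_id: str


# Hypothetical integer weights: how much each matching dimension counts.
WEIGHTS = {"test_id": 4, "text": 3, "role": 2, "tag": 1}


def similarity(known: Fingerprint, candidate: Fingerprint) -> int:
    """Sum the weights of every non-empty dimension that still matches."""
    score = 0
    for field, weight in WEIGHTS.items():
        value = getattr(known, field)
        if value and value == getattr(candidate, field):
            score += weight
    return score


def heal(known: Fingerprint, candidates: list[Fingerprint],
         threshold: int = 5) -> Optional[Fingerprint]:
    """Pick the best-scoring candidate, or None if nothing is close enough."""
    best = max(candidates, key=lambda c: similarity(known, c), default=None)
    if best is None or similarity(known, best) < threshold:
        return None
    return best
```

With these weights, an element whose test ID was renamed can still be found through its tag, text, and role, while an element that only shares a tag falls below the threshold and is rejected.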

Research & Data

AI and Testing: Using Model Pipelines for Testing
Jeff Nyman (TesterStories)
Jeff Nyman builds a multi-model testing pipeline using Ollama: ts-reasoner generates test cases from a spec, ts-coder validates them against implementation, and DeepEval's Faithfulness metric evaluates spec adherence. The post explores pipeline vs. agentic AI, showing how evaluation gates and decision loops create progressively more autonomous testing systems.
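For a feel of how an evaluation gate and a decision loop fit together, here is a bare-bones sketch. It is not Jeff's code: run_pipeline and its parameters are hypothetical, and the generate, validate, and score callables stand in for the model calls and the metric, which you would supply yourself.

```python
from typing import Callable


def run_pipeline(
    spec: str,
    generate: Callable[[str], list[str]],
    validate: Callable[[str], bool],
    score: Callable[[str, list[str]], float],
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> dict:
    """Generate, validate, and score test cases; retry until the gate passes."""
    last_score = 0.0
    for attempt in range(1, max_attempts + 1):
        candidates = generate(spec)  # stage 1: propose test cases from the spec
        kept = [case for case in candidates if validate(case)]  # stage 2: filter
        last_score = score(spec, kept)  # stage 3: evaluate spec adherence
        if last_score >= threshold:  # evaluation gate
            return {"passed": True, "cases": kept,
                    "score": last_score, "attempts": attempt}
    # Decision loop exhausted: report failure instead of shipping weak tests.
    return {"passed": False, "cases": [],
            "score": last_score, "attempts": max_attempts}
```

The more decisions you move inside the loop (what to regenerate, when to stop), the further the system shifts from a fixed pipeline toward agentic behavior.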

Hot Take 🔥

Why AI Evals Are Just Software Testing
Adam White (LinkedIn)
Adam White’s LinkedIn post draws a direct line between established software testing practices and the emerging field of AI evaluations, arguing that testers already possess the core skills needed for AI eval work: risk-based thinking, investigation under uncertainty, and distinguishing checking from testing. He provides a terminology translation guide and encourages testing professionals to engage with AI teams, bringing decades of testing canon with them.

Checking Isn’t Testing. Soon It Won’t Be Employment Either | Quality Remarks Keith Klain
Keith Klain (Quality Remarks)
Keith Klain argues the 'checking vs testing' debate was never about vocabulary - it was permission to stop thinking. AI can already do checking well enough to replace most of the testing workforce. The .36B AI model evaluation market should have been testers' moment, but too many accepted a narrow definition of their craft. If testing = checking, AI is your competitor. If testing = skilled investigation, AI is a tool.

Quick Links

Peekaboo Documentation
@steipete (Peekaboo)
Peekaboo is a macOS automation toolkit that enables humans and AI agents to capture screen pixels, read the accessibility tree, and drive input through a CLI, MCP server, or native Mac app. It integrates with AI clients like Codex, Claude Code, and Cursor, providing a natural-language agent loop for desktop automation tasks. The documentation covers installation, configuration, command references, and architecture details.

QA didn't and won't die, but will adapt
Reddit
ML engineer shares how they use agent-driven testing at Droidrun: markdown test flow files describe how agents should test the app (unit, UI, API). Claude, Codex, or Gemini execute the flows on every PR/release, running static analyzers, clicking through UIs, testing across OS versions and settings, and generating comparison reports with screenshots. Reduced testing from full manual regression to 10-15 min smoke tests. Argues QAs using AI tools will generate 5x more value - the job adapts, not dies.

QA Engineers Using AI Tools Are Rebuilding the Role From Scratch
Jaren Charles Cudilla (LinkedIn)
Jaren Charles Cudilla argues QA engineers using AI are rebuilding the role around skill files - structured context documents loaded into LLMs as system prompts encoding methodology, formats, severity logic, and decision rules. He runs a one-person AI QA agency using a skill file + local Ollama, keeping client data on-machine. Key insight: AI dev teams already generate their own automated tests; your job is testing what was built against what real users expect, not duplicating the automation layer. Manual testing tells you which flows are worth automating at all.

Getting Claude to Actually Read Your CLAUDE.md | HumanLayer Blog
HumanLayer shares a pattern for making CLAUDE.md instructions stick: wrap sections in conditional tags so Claude knows exactly when to apply each rule, instead of treating the whole file as optional.

If something in this issue made you think differently about how your team approaches AI in testing, pass it along. The best conversations about AI and QA are happening in Slack channels and stand-ups, not just newsletters.

Have something worth featuring? Reply and send it my way. I read every link.

Thanks for reading,
Butch Mayhew