AI in QA - Issue #6

In today's testing landscape we have more AI in our workflows than ever before, but this issue asks whether we actually have better quality as a result.

In a lot of ways we can move much faster at certain tasks, but the output from our AI tools isn't at the point where we can trust it without review. I think the key is making the review process useful and efficient: making sure we have the right information in front of us and are checking the right things, rather than what merely looks good on a report. Producing 1,000 lines of code and 30 new automated tests, each with a single assertion that the URL is correct on the final step, is not likely to detect many bugs.
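To make the assertion-quality point concrete, here is a minimal sketch of a review aid. It assumes the generated tests are Playwright-style source where assertions are written as `expect(...)` calls. The function name `flag_url_only_tests` and the sample tests are hypothetical, and the regexes are a rough heuristic rather than a real parser.

```python
import re

# Heuristic patterns (assumptions): a test starts with test('title', ...),
# assertions are expect(...) calls, and URL checks use .toHaveURL(...).
TEST_START = re.compile(r"""\btest\(\s*(['"])(?P<title>.*?)\1""")
EXPECT_CALL = re.compile(r"\bexpect\(")
URL_ASSERTION = re.compile(r"\.toHaveURL\(")


def flag_url_only_tests(source: str) -> list[str]:
    """Return titles of tests whose only assertion is a URL check."""
    starts = list(TEST_START.finditer(source))
    flagged = []
    for i, match in enumerate(starts):
        # A test body runs from its title to the start of the next test.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(source)
        body = source[match.end():end]
        expects = len(EXPECT_CALL.findall(body))
        url_checks = len(URL_ASSERTION.findall(body))
        if expects == 1 and url_checks == 1:
            flagged.append(match.group("title"))
    return flagged


# Hypothetical generated tests used only for illustration.
SAMPLE = '''
test('checkout completes', async ({ page }) => {
  await page.goto('/cart');
  await page.click('#checkout');
  await expect(page).toHaveURL('/done');
});

test('cart shows item total', async ({ page }) => {
  await page.goto('/cart');
  await expect(page.locator('#total')).toHaveText('$29.99');
  await expect(page).toHaveURL('/cart');
});
'''

print(flag_url_only_tests(SAMPLE))
```

A check like this does not tell you whether a test is good. It only points the reviewer at the tests most likely to pass without verifying anything meaningful.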

If you do find yourself in this situation, there are ways to get better results. You have multiple levers: you can change the prompt, the context/files, the model, and even the AI harness. These are all variables in your formula for getting great output from these tools.
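One way to treat those levers as variables is to change one at a time against a fixed baseline, so you can tell which change actually moved the output. The sketch below is only an illustration of that idea: `RunConfig`, `one_lever_variants`, and the placeholder values such as `model-a` and `harness-a` are all hypothetical names, not part of any tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """The four levers from the paragraph above (hypothetical structure)."""
    prompt: str
    context_files: tuple[str, ...]
    model: str
    harness: str


def one_lever_variants(baseline, alternatives):
    """Build configs that differ from the baseline in exactly one lever."""
    variants = []
    for lever, values in alternatives.items():
        for value in values:
            if getattr(baseline, lever) != value:
                variants.append((lever, replace(baseline, **{lever: value})))
    return variants


# Placeholder values for illustration only.
baseline = RunConfig(
    prompt="Write Playwright tests for the checkout flow.",
    context_files=("checkout.spec.ts",),
    model="model-a",
    harness="harness-a",
)
alternatives = {
    "prompt": ["Write Playwright tests that assert on cart totals and errors."],
    "context_files": [("checkout.spec.ts", "requirements.md")],
    "model": ["model-a", "model-b"],  # model-a equals the baseline and is skipped
}

variants = one_lever_variants(baseline, alternatives)
for lever, config in variants:
    print(lever, "->", getattr(config, lever))
```

Running each variant through the same review checklist keeps the comparison honest, since only one thing changed between the baseline and each run.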

This week I wanted to shout out a few newsletters that I subscribe to and learn from, inspired by Alan Richardson's example in his newsletter, the Software Testing Roundup.

These are all newsletters that I find helpful in shaping my opinions and views. When I find interesting items from these sources, I'll tag them in this and future issues.

Catch us live on YouTube

Headlines & Launches

Vernon Version 3 - Now with Added AI!
via Vernon Richards (Yeah But Does It Work Substack)
Vernon maps out how QEs need to evolve in the AI era: shifting from hands-on ICs to Quality Coaches who enable whole-team testing. Covers four key skill gaps to plug - task analysis, evaluation (evals), cost/token thinking, and outer loop ownership.

Can You Prompt Claude Into Being A Good Tester?
via T.J. Maher (LinkedIn)
T.J. Maher ran multi-hour experiments prompting Claude to build a Playwright test framework for SauceDemo. Claude generated 188 tests with broken imports, claimed 100% pass rates on code that couldn't compile, and repeatedly folded on pushback without actually fixing the issues. A candid look at AI reliability gaps in test automation.

AI-Powered API Testing at Scale
via Nik Gupta (Medium) SOFTWARE TESTING WEEKLY
Enterprise-scale API testing pipeline using Claude Code with three entry points: Gherkin feature files to Jest tests, OpenAPI specs to feature files to tests, and application-to-tests. Covers data strategy for parameterized testing and CI integration.

Cognitive Automation: This Isn’t About How Fast You Can Generate Scripts
via Simon Prior (Lead Test Include)
Simon Prior argues that most 'AI-powered testing' is just mechanical automation wearing a smarter mask - replacing brittle scripts with AI-generated brittle scripts. Real cognitive automation means AI augmenting the quality of your thinking: risk reasoning at scale, exploratory thought partnerships, and adaptive test selection. Part of the Intelligent Quality Leadership series.

Falling behind on test automation and AI adoption? DevClarity's QA Practice gets your team up to speed fast - with hands-on training, proven workflows, and measurable results within 30 days.

Tools & Frameworks

Playwright AI QA Agent: Auto-Classifying Test Failures with LLMs
via Nir Arad (LinkedIn)
Open-source project combining Playwright tests with a Claude-powered agent that automatically classifies test failures in CI, distinguishing between broken locators and real bugs, with future plans for auto-healing PRs.

How We Use Cursor: Triaging Bug Reports with AI Agents
via Eric Zakariasson (X)
Automated bug triage pipeline: when a bug is reported, Cursor asks follow-up questions, searches for duplicates, examines the code, creates a Linear ticket, and then spawns an agent to attempt a fix.

Trowser: Exploratory Testing for Humans and AI
via Rikard Edgren (LinkedIn)
Rikard Edgren updates Trowser, an exploratory testing tool designed for both humans and AI. New features include better quicktests, API improvements, session logging, and random test data. Future releases will be restricted due to the power of new robustness and security features.

Making AI testing as simple as the rest of your system
via Priyanshu Shekhar (LinkedIn)
Priyanshu Shekhar built Evaliphy, a QA-first RAG testing framework that uses human-readable assertions (toBeFaithful, toBeRelevant, toBeHarmless) via LLM-as-Judge. Designed for QA engineers who want to test AI systems as black boxes without learning ML concepts. Open-source, TypeScript/Node.js.

Evaluating Skills
via Robert Xu (LangChain) CODING JAG BY TESTMU AI
LangChain shares best practices for evaluating AI agent skills - curated instructions that improve agent performance in specialized domains. Covers clean testing environments, A/B comparisons with and without skills, and iterative improvement using LangSmith.

Techniques & Tutorials

I Built a QA Quality Gate System With Claude Code Hooks
via ScrollTest (Medium) SOFTWARE TESTING WEEKLY
Complete guide to building deterministic QA quality gates using Claude Code hooks - middleware that intercepts every AI action. Covers PreToolUse, PostToolUse, Stop, and SubagentStop hooks with shell scripts that block bad tests before they are written.
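For a feel of what a gate like this can look like, here is a minimal sketch written in Python rather than the shell scripts the article uses. It assumes, based on my understanding of the Claude Code hooks documentation, that a PreToolUse hook receives a JSON payload on stdin with `tool_name` and `tool_input` fields, and that exiting with code 2 blocks the action and returns stderr to the model. Verify both against the current docs before relying on it. The rule itself (block test files with no assertions), the file-name check, and the function names are hypothetical.

```python
import json
import re
import sys

# Hypothetical rule: a test file must contain at least one assertion.
ASSERTION = re.compile(r"\bexpect\(|\bassert\b")


def evaluate(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message) for a proposed tool call."""
    # Assumed payload shape: {"tool_name": ..., "tool_input": {...}}.
    if payload.get("tool_name") != "Write":
        return 0, ""
    tool_input = payload.get("tool_input", {})
    path = tool_input.get("file_path", "")
    content = tool_input.get("content", "")
    is_test_file = ".spec." in path or ".test." in path
    if is_test_file and not ASSERTION.search(content):
        # Exit code 2 is assumed to mean "block and report stderr".
        return 2, f"Blocked: {path} contains no assertions."
    return 0, ""


def main() -> None:
    """Entry point when wired up as a hook command (reads stdin)."""
    code, message = evaluate(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)


# Demonstration with made-up payloads instead of real hook input.
empty_test = {
    "tool_name": "Write",
    "tool_input": {"file_path": "tests/login.spec.ts",
                   "content": "test('login', async () => {});"},
}
real_test = {
    "tool_name": "Write",
    "tool_input": {"file_path": "tests/login.spec.ts",
                   "content": "test('login', async () => { expect(1).toBe(1); });"},
}
print(evaluate(empty_test))
print(evaluate(real_test))
```

In an actual hook script you would drop the demonstration lines and call `main()` at the bottom. The point of the design is that the check is deterministic: it runs the same way every time, regardless of what the model claims about its own output.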

Research & Data

When Metrics Lie: Agent Drift and Why Your AI Might Be Succeeding at the Wrong Thing
via Maryna Didkovska (LinkedIn)
Case study of Agent Drift: an AI agent tasked with optimizing test coverage deleted half a Jira backlog because its goal was not aligned with human intent. Proposes new governance metrics like drift score and side-effect score for detecting agents that succeed at the wrong thing.

AI and Testing: A Testing Example
via Jeff Nyman (Tester Stories)
Jeff Nyman builds a substantive AI test case from scratch - not assertions or pass/fail, but a controlled conversational environment that probes how an LLM reasons, scopes answers, and maintains conceptual continuity across turns. Treats testing like a physics experiment: define boundaries, set initial conditions, observe behavior.

Sponsorship slots are available for tool companies and platforms who want to get in front of this audience.

Foundations

How I Built an Agentic SDLC Demo with Cursor, MCP Servers, Playwright, GitHub, and Vercel in Under an Hour
via Beth Marshall (Beth The Tester Blog)
Complete walkthrough of building an agentic SDLC demo using Cursor with MCP servers to orchestrate scaffolding, Playwright test generation, Git operations, CI, and deployment - zero hand-written code. Great intro to how MCP servers connect AI agents to real development tooling.

Quick Links

QA Orchestra - AI QA Agents for Automated QA Orchestration (Anass Rach, QA Orchestra)

Why Do I Have to Answer AI Questions If I Am Not an AI Engineer? (Adrian Maciuc, LinkedIn)

AI Hit 80% Coverage in an Afternoon. And You Have No Idea If It Matters. (Simon Prior, LinkedIn)

Cloud MCP: Give Your AI Assistant Access to Your Test Runs (Emily Wisniewski, Cypress Blog)

LambdaTest TestMu AI Skills (TestMu, GitHub Repo)

If something in this issue made you think differently about how your team approaches AI in testing, pass it along. The best conversations about AI and QA are happening in Slack channels and stand-ups, not just newsletters.

Have something worth featuring? Reply and send it my way. I read every link.

Thanks for reading,
Butch Mayhew