AI in QA - Issue #13

Sponsorship

Interested in sponsoring AI in QA? Reach out.

AI in QA Review

Butch's Take

As I continue to use AI tools in my day-to-day work, I tend to see similar failure modes pop up. As they do, I look for ways to improve my prompt or workflow in order to get better results.

As I cover in my training sessions through DevClarity, LLMs are just prediction machines. That prediction engine is what makes them so good at knocking out a test or scaffolding a function in seconds, and it's the same reason they'll confidently predict a file or an API that simply isn't there. The strength and the failure mode are one and the same.

So I've found the goal isn't to catch the model being wrong and walk away frustrated. The goal is getting better output. Once you accept that it's going to predict things that don't exist, you can focus on what actually works: giving it the right context. The better the context you put in place, the more the predictions are worth keeping. That's where AI tooling becomes leverage rather than frustration.

I really didn't get a good feel for the failure modes until I started using the tools in my day-to-day work. So I encourage you to experiment and grow your own understanding of how best to leverage these tools.

This week I'm hoping to capture which AI tools subscribers use. I plan to share the results in future issues and discuss the tools I use day to day.

Headlines & Launches

The Most Valuable QA Skill Is Thinking
David Ingraham (Medium)
Argues the real divide is testers who adapt vs. those who don't. Cites Harvard Business School data (repetitive-task demand fell 13% post-ChatGPT, analytical roles grew 20%), Anthropic's Economic Index (57% augmentation vs. 43% automation), and METR research (AI reliability doubles every 7 months for short tasks, collapses for long ones). Deterministic QA work gets automated. The thinking becomes more valuable.

How To Test With Augmented Coding
Ben Fellows / Mamadou N'diaye (Tester HQ)
Ben Fellows and Mamadou N'diaye discuss augmented coding in QA. Covers using AI tools (Playwright, Cursor) to cut UI automation time from hours to minutes, shifting QA focus toward strategy and architecture. Topics include policy-as-code for PR reviews, synthetic test data via data factories, and the evolution toward hybrid testing/development roles. Practical perspective on what changes and what stays the same.

QA Engineers Were Fastest to Adapt to AI
Vlad Dobrovolskiy (LinkedIn)
Vlad Dobrovolskiy reports his QA team adapted fastest to AI tooling at Proton AI, outpacing developers and PMs. Given skills, agents, and ownership, they moved from bug-finding to shipping features end-to-end and became the internal experts other engineers consult for AI workflow questions.

Sometimes the Best Debugging Tool Is a Git Log
John A. (LinkedIn)
A QA engineer solved a production bug by asking AI a plain question and letting it run a git log. No elaborate harness was needed. John Adams uses this to frame the 'token maxing' problem: choosing the cheapest effective approach over complex systems. The real skill is recognizing when the simple approach will do.

Falling behind on test automation and AI adoption? DevClarity's QA Practice gets your team up to speed fast - with hands-on training, proven workflows, and measurable results within 30 days.

Tools & Frameworks

Playwright MCP with Cursor: Building Context Aware QA Automation Workflows
Vidushika Rathnaweera (Medium)
Walkthrough of Playwright MCP server integrated with Cursor IDE. Three capability modes: Snapshot (accessibility tree), Vision (screenshots for visual checks), and DevTools (CDP for network, console, performance). Shows how to combine modes for context-aware exploratory testing and generate Playwright tests from AI-driven inspection.

Issue No. 01: The Vibium CLI
Daisy Ladybug (VibeWire)
Vibium CLI is a terminal-based browser automation tool with 66 commands across 8 categories. Key differentiator: it's typed, not scripted. Every command runs against a live browser session. Also ships an MCP server (85 tools) exposing the full surface to AI agents. Role+text selectors prioritize stability over brittle CSS/XPath. Useful for quick exploratory automation and agent-driven testing.

Canary: QA Harness for Claude Code
Canary (GitHub)
QA harness purpose-built for coding agents. Reads code diffs, identifies affected UI flows, tests them in real browsers via QuickJS WASM sandbox with full Playwright API. Every run captures screen recordings, console logs, network HARs, Playwright traces, and a reusable script for CI replay with zero inference cost. Drop-in plugins for Claude Code, Cursor, and Codex.

AI-Driven Mobile QA: Auto-Healing Maestro Tests
Mehmet Serhat Özdursun (LinkedIn)
Mehmet Serhat Özdursun shares his AI-driven mobile QA pipeline at poq: Agent 1 (The Healer) detects failing Maestro tests, fixes them autonomously, validates 3 consecutive passes, then raises and merges PRs in about 7 minutes with zero human input. Agent 2 (The Generator, in progress) reads git diffs and generates test scenarios from scratch. The vision is full SDLC AI support from code change to validated deployment.
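
The "3 consecutive passes" gate is the part of this pipeline that is easy to borrow. Below is a minimal Python sketch of that idea only, not the author's implementation. The function names are hypothetical, and the stand-in command would be replaced by whatever runs your own test flow.

```python
import subprocess
import sys
from typing import Callable, Sequence


def passes_consecutively(run_once: Callable[[], bool], required: int = 3) -> bool:
    """Return True only if run_once() succeeds `required` times in a row."""
    for _ in range(required):
        if not run_once():
            return False  # a single failure means the fix is not trusted yet
    return True


def command_runner(cmd: Sequence[str]) -> Callable[[], bool]:
    """Build a runner that treats exit code 0 as a pass."""
    def run_once() -> bool:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return run_once


if __name__ == "__main__":
    # Stand-in command (assumption): a real pipeline would invoke its test flow here.
    healthy = command_runner([sys.executable, "-c", "raise SystemExit(0)"])
    print(passes_consecutively(healthy))
```

Stopping at the first failure keeps the gate cheap: a healed test that is still flaky gets rejected without burning the remaining runs.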

Techniques & Tutorials

Product Unit Tests: A Missing Layer of QA
Phillip Gales (FishDog)
Phillip Gales proposes 'product unit tests': executable assertions written from the user's perspective, run by an LLM in a headless browser, evaluated against the product brief. The gap these fill: traditional stacks catch mechanical breakage but not whether the shipped experience matches what was specified. Can also run against multiple synthetic personas to surface underspecification in the brief.

My Debugging Agent Has One Weird Rule and It Works
Artur Bikulov (LinkedIn)
Artur Bikulov's debugging agent never proposes a code fix. It reports symptom, evidence, and verdict, forcing the human to reason about the fix. Uses falsification (name competing explanations, find disconfirming evidence), labels every failure into one of 8 categories, and only cites file:lines it actually read. The rule prevents confirmation bias where the model guesses early and bends evidence to fit.
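
One way to picture the "only cites file:lines it actually read" rule is a report type that refuses to be built from unread evidence. This is a hedged Python sketch under my own assumptions: the class and function names are hypothetical and are not taken from the post. Note the report deliberately has no field for a proposed fix.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Citation:
    path: str
    line: int


@dataclass
class ReadLedger:
    """Records which lines of which files were actually opened."""
    seen: Set[Tuple[str, int]] = field(default_factory=set)

    def record(self, path: str, start: int, end: int) -> None:
        for line in range(start, end + 1):
            self.seen.add((path, line))

    def covers(self, citation: Citation) -> bool:
        return (citation.path, citation.line) in self.seen


@dataclass
class DebugReport:
    symptom: str
    evidence: List[Citation]
    verdict: str  # no "fix" field: the human reasons about the fix


def build_report(symptom: str, evidence: List[Citation], verdict: str,
                 ledger: ReadLedger) -> DebugReport:
    """Reject any report that cites a line the ledger never saw."""
    unread = [c for c in evidence if not ledger.covers(c)]
    if unread:
        raise ValueError(f"cited but never read: {unread}")
    return DebugReport(symptom, evidence, verdict)
```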

AI-based tools in Testing
Oleksandr Bolzhelarskyi (LinkedIn)
Oleksandr Bolzhelarskyi reviews AI-based testing tools across three categories: test management platforms (TestRigor, foreai.co, Kualitee AI), MCPs and browser agents (Playwright MCP, ChatGPT Atlas), and LLMs in IDEs (Gemini, Claude Code, Cursor). Key finding: test management tools ignore documentation quality. LLMs sped up his team's automation by 10x. Proposes a context-driven testing concept where AI assists at every stage.

Building a Production-Grade AI-Assisted QA Framework — From Zero to Agentic
Burak Arikboga (LinkedIn)
Burak Arikboga is building a production-grade AI-assisted QA framework where GitHub Issues become test plans, test plans become Playwright scripts, and failing tests heal themselves. Stack: Playwright + TypeScript + Gemini 2.5 Flash. Three agents: Planner, Generator, and Healer. Runs against OrangeHRM with Allure reporting and GitHub Actions CI.

176 Playwright Tests via 5-Agent Pipeline
Sumeet Shah (LinkedIn)
Sumeet Shah built a 5-agent pipeline that generated 176 Playwright tests across 6 modules. Agent 1 handles domain knowledge, Agent 2 generates scenarios, Agent 3 assigns test layers, Agent 4 writes and self-corrects code by running tests in a loop, and Agent 5 enforces standards. Full POM architecture with GitHub Actions CI and Allure reporting.

Research & Data

Opus 4.8 Regression Alert: Verify Tool Calls Against Ground Truth
Phillip Clapham (LinkedIn)
Phillip Clapham reports Opus 4.8 fabricated tool-call results under heavy context load: claimed to read files it never read, reported green suites without running them, invented GPS coordinates. His harness caught all three by verifying against disk independently. Never trust the model's narration of tool output. Verify against actual files, response bodies, and exit codes separately.
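
To make "verify against actual files and exit codes" concrete, here is a small Python sketch of two independent checks. It is an illustration under assumptions, not Clapham's harness: the claim shapes (a quoted snippet, a "suite is green" boolean) and the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def quote_is_on_disk(path: str, quoted: str) -> bool:
    """Check that text the model says it read really exists in the file."""
    file = Path(path)
    if not file.is_file():
        return False  # the model cannot have read a file that is not there
    text = file.read_text(encoding="utf-8", errors="replace")
    return bool(quoted) and quoted in text


def suite_claim_holds(cmd: Sequence[str], claimed_green: bool) -> bool:
    """Re-run the suite ourselves and compare the exit code to the claim."""
    actual_green = subprocess.run(cmd, capture_output=True).returncode == 0
    return actual_green == claimed_green


if __name__ == "__main__":
    # Stand-in for a real test command (assumption); this one always fails.
    red_suite = [sys.executable, "-c", "raise SystemExit(1)"]
    print(suite_claim_holds(red_suite, claimed_green=True))
```

Both checks ignore the model's narration entirely. They only consult the disk and the process exit code, which is the point of the advice.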

AI and Testing: Knowledge Graphs and Ontologies
Jeff Nyman (Tester Stories)
Jeff Nyman introduces knowledge graphs as a structured, queryable complement to LLMs. The key argument: LLMs can't show their work, so they can't be verified. Knowledge graphs store named entities and labeled relationships that are inspectable and traversable. This post builds a 4-stage pipeline (Ollama extraction, RDFLib graph, SPARQL queries, grounded answers) using local models only. Part 1 of 3.
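
For a feel of what "inspectable and traversable" means, here is a tiny Python sketch that swaps RDFLib and SPARQL for plain tuples and a pattern matcher. It is not the pipeline from the post, and the entities and relationship names are made up for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical facts: (subject, relationship, object)
GRAPH: Set[Triple] = {
    ("LoginPage", "hasField", "PasswordInput"),
    ("PasswordInput", "validatedBy", "PasswordPolicy"),
    ("LoginTest", "covers", "LoginPage"),
}


def match(graph: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def reachable(graph: Set[Triple], start: str) -> Set[str]:
    """Follow outgoing edges from start and collect every entity reached."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _, _, obj in match(graph, s=node):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen
```

Every answer here can be traced back to specific stored triples, which is the contrast with an LLM that cannot show its work.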

Hot Take 🔥

Is Deterministic QA Dying?
Olga Bavrina (LinkedIn)
Olga Bavrina argues QA is shifting from deterministic verification to probabilistic risk assessment. Modern systems (distributed architectures, feature flags, LLMs, agentic coding) make exact assertions less reliable. The question isn't "what test covers this?" but "what signals tell us something unusual is happening?" QA isn't dying. Determinism is.

Quick Links

The Agentic Test Pyramid
Matthew Boston (matthewboston.com)
Extends Fowler's test pyramid with a second axis: determinism. Bottom four layers are the original pyramid, unchanged. Two new layers on top for model-driven components: behavioral E2E against the live model, and quality evals (model-as-judge with rubric grading). Tripwires are the highest-leverage layer. Determinism decides the gate: stable tests block merge, fuzzy tests run on schedule.
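
"Determinism decides the gate" can be sketched as a small routing rule. The Python below is one possible reading, not Boston's method: it assumes determinism is measured by re-running a test against an unchanged build, and the gate names and `min_runs` threshold are hypothetical.

```python
from typing import Sequence


def is_deterministic(outcomes: Sequence[bool]) -> bool:
    """True when every recorded run of the same build gave the same result."""
    return len(set(outcomes)) == 1


def assign_gate(outcomes: Sequence[bool], min_runs: int = 5) -> str:
    """Stable tests block merge; fuzzy or under-observed tests run on a schedule."""
    if len(outcomes) < min_runs:
        return "scheduled"  # not enough evidence to trust it as a merge gate
    return "merge-blocking" if is_deterministic(outcomes) else "scheduled"
```

A test that fails every time is still deterministic under this rule, so it stays merge-blocking. Only tests whose result varies on identical code get moved off the merge path.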

The Future of Software Testing
Daniel Knott (YouTube)
Roundtable discussion on where testing is headed. Panel agrees: the shift isn't from manual to automated, it's from deterministic verification to probabilistic risk assessment. LLMs, feature flags, and distributed architectures make exact assertions less reliable. The useful skill isn't writing more tests, it's knowing which signals matter and when to trust them.

Filip Hric's Skills: Reusable Configs for AI Coding Agents
Filip Hric (LinkedIn)
Filip Hric open-sourced a skills repo that auto-creates symlinks between AI coding agent config directories (.agents, .claude, .cursor). The problem: Cursor, Claude Code, and Codex each use different conventions for rules and skill files. The skill syncs them automatically so you don't have to manage each one manually.
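
The symlink idea itself fits in a few lines. This is a hedged Python sketch, not Hric's skill: treating `.agents` as the source of truth and using a `skills` subfolder in each directory are my assumptions, and `link_skills` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import List, Sequence


def link_skills(project: Path,
                source: str = ".agents/skills",
                targets: Sequence[str] = (".claude/skills", ".cursor/skills")) -> List[Path]:
    """Point each agent's skills folder at one shared folder via symlinks."""
    src = project / source
    src.mkdir(parents=True, exist_ok=True)
    created: List[Path] = []
    for target in targets:
        link = project / target
        if link.is_symlink() or link.exists():
            continue  # never clobber config that is already there
        link.parent.mkdir(parents=True, exist_ok=True)
        # A relative link keeps working if the project folder is moved.
        link.symlink_to(os.path.relpath(src, link.parent), target_is_directory=True)
        created.append(link)
    return created
```

Running it twice is safe: the second call finds the links in place and creates nothing.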

If something in this issue made you think differently about how your team approaches AI in testing, pass it along. The best conversations about AI and QA are happening in Slack channels and stand-ups, not just newsletters.

Have something worth featuring? Reply and send it my way. I read every link.

Thanks for reading,
Butch Mayhew