AI Glossary for QA Teams

Most confusion about AI tools in QA comes from mixing up four terms that sound similar but mean very different things. Get these right and everything else starts to make sense.

Model

A model is a blob of parameters produced during training. It does one thing: predict the next token given what came before. It has no memory, no tools, no goals. It is stateless; the model has no awareness of your previous conversations, test runs, or anything outside the current input.
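A minimal sketch of what "stateless" means in practice. The `toy_model` function is a hypothetical stand-in for a real model, not any vendor's API; the only point is that its output depends on the current input and nothing else.

```python
def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: output depends only on the input text."""
    if "checkout" in prompt:
        return "expect(page).to_have_url('/checkout')"
    return "# not enough context to suggest an assertion"


# Two separate calls: the second call knows nothing about the first.
first = toy_model("We are testing the checkout page.")
second = toy_model("Write an assertion for it.")

# To "remember", the caller has to resend the earlier text itself.
resent = toy_model("We are testing the checkout page. Write an assertion for it.")

print(second)
print(resent)
```

Anything that looks like memory in a real tool comes from the caller resending earlier text, not from the model.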

For QA teams: When an AI-generated test produces a wrong assertion, that is a model limitation. The model predicted plausible-sounding code based on training data, not your actual system behavior. Treat model outputs as a starting point, not ground truth. Always verify any output from models against your real system.

Examples: Claude Sonnet 4.7, GPT-5, Gemini 1.5 Pro

Harness

The harness is everything around the model that turns it into something useful: the system prompt, available tools, context window management, memory, and any scaffolding that shapes how the model behaves.

For QA teams: This is why Codex (with access to your terminal, files, and browser) behaves so differently from ChatGPT on the web, even when both run the same underlying model. The harness determines what the model knows, what it can do, and how it approaches problems. When you write an AGENTS.md file or configure a system prompt for your QA agent, you are shaping the harness.
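One way to picture this is a rough sketch in which the harness assembles everything the model sees on each call. The `Harness` class, its fields, and the tool names are all hypothetical; real harnesses are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical harness: owns the instructions, tool list, and memory."""
    system_prompt: str
    tool_names: list[str]
    history: list[str] = field(default_factory=list)

    def build_input(self, user_message: str) -> str:
        # The harness resends instructions, tool names, and prior turns
        # on every call, because the model keeps nothing between calls.
        self.history.append(f"user: {user_message}")
        parts = [
            f"system: {self.system_prompt}",
            f"tools: {', '.join(self.tool_names)}",
            *self.history,
        ]
        return "\n".join(parts)


# Same question, two harnesses: the model would see very different inputs.
web_chat = Harness("You are a helpful assistant.", tool_names=[])
qa_agent = Harness(
    "You are a QA agent. Tests live in tests/e2e.",
    tool_names=["run_tests", "read_file"],
)

question = "Why is the login test failing?"
web_input = web_chat.build_input(question)
agent_input = qa_agent.build_input(question)
print(agent_input)
```

The same underlying model would receive `web_input` in one product and `agent_input` in another, which is one reason their behavior differs.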

Examples: Claude Code, Codex, GitHub Copilot, and Cursor are all different harnesses, often running the same underlying models

AGENTS.md: A markdown file that gives an agent persistent context about a project. It typically describes the codebase, conventions, key files, and preferred approaches. In QA contexts, it might explain the test framework, where tests live, and how to run them.

System Prompt: The base instructions that define how a model should behave. It shapes personality, constraints, and capabilities before any user input. A system prompt is the foundation of the harness; everything else (tools, context, memory) builds on top of it.

Environment

The environment is everything outside the harness that the agent can perceive and act on through tools. It is the world the agent operates in.

For QA teams: Your test suite is an environment. Your browser, your API endpoints, and your CI/CD pipeline make up the environment. An agent can only interact with what has been exposed to it via tools. If your test agent cannot read your API docs, those docs do not exist to it. Configuring MCP servers, giving an agent browser access, providing a CLI tool with instructions, or pointing it at your test repo all expand its environment.

Examples: File system, browser via Playwright, GitHub Actions, your application under test, MCP servers, command-line programs.

Agent

An agent is a model, harnessed, in an environment. All three parts together. The agent reasons, plans, and takes actions: calling tools, reading results, and deciding what to do next in a loop until the task is done.
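The loop can be sketched in a few lines. This is an illustration under loud assumptions: `toy_policy` stands in for the model's decision, and `run_tests` and `inspect_element` are hypothetical tools that return canned strings.

```python
def run_tests() -> str:
    """Hypothetical tool: pretend one test fails."""
    return "1 failed: test_login (selector '#submit' not found)"


def inspect_element() -> str:
    """Hypothetical tool: pretend to look at the page."""
    return "button id is 'login-submit'"


TOOLS = {"run_tests": run_tests, "inspect_element": inspect_element}


def toy_policy(observations: list[str]) -> str:
    """Stand-in for the model: pick the next action from what was seen so far."""
    if not observations:
        return "run_tests"
    if "not found" in observations[-1]:
        return "inspect_element"
    return "done"


def run_agent(max_steps: int = 5) -> list[str]:
    observations: list[str] = []
    for _ in range(max_steps):  # a step limit keeps the loop from running forever
        action = toy_policy(observations)
        if action == "done":
            break
        observations.append(TOOLS[action]())  # act, then read the result
    return observations


trace = run_agent()
for step in trace:
    print(step)
```

The step limit is a design choice worth copying: an agent that never decides it is done should still stop.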

For QA teams: When someone says "AI is writing my tests," they usually mean an agent, not just a model. This distinction matters for reliability and control. You cannot configure a model; it is fixed. But you can configure the harness and environment to shape how an agent behaves. That is where custom instructions, tool permissions, and context files come in.

Example: Claude Code running in your repo = Claude Sonnet (model) + Claude Code (harness) + your terminal and filesystem (environment)

Supporting Terms

Hallucination

When a model produces confident but incorrect output. The model is not lying; it is pattern-matching to what seemed plausible during training. For QA teams, this is the core reliability concern with AI-generated test code. Always review AI-generated assertions, selectors, and API calls against your actual system before committing them.

LLM-as-Judge

Using a language model to score or grade the output of another language model or agent, instead of writing traditional assertions or relying on a human reviewer. The pattern emerged alongside instruction-tuned models (popularized by the 2023 MT-Bench and Chatbot Arena work) once teams realized example-based asserts couldn’t keep up with open-ended generation. It’s now standard plumbing in agent eval frameworks, and a standard target for skepticism, since judge models bring their own biases to the bench.

For QA teams: This is the eval pattern you’ll reach for when you can’t write a deterministic assertion against AI output. When an agent generates test plans, exploratory test notes, or bug reports, there’s no single correct answer to assert against. LLM-as-judge gives you a scalable way to evaluate quality, but treat the judge like any other test tool: it has its own failure modes and biases. Always ask: who’s judging the judge?
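A minimal sketch of the pattern, including one way to "judge the judge". Everything here is illustrative: `fake_judge` is a hypothetical stand-in for a real judge model call, and the rubric, reply format, and passing score are assumptions.

```python
import re

RUBRIC = "Score the bug report from 1 to 5 for reproducibility. Reply as 'SCORE: <n>'."


def build_judge_prompt(candidate: str) -> str:
    return f"{RUBRIC}\n\nBug report:\n{candidate}"


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a judge model call."""
    return "SCORE: 4" if "Steps to reproduce" in prompt else "SCORE: 2"


def parse_score(reply: str) -> int:
    # Judges do not always follow the format, so fail loudly when they do not.
    match = re.search(r"SCORE:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"judge reply has no score: {reply!r}")
    return int(match.group(1))


def judge(candidate: str, passing_score: int = 4) -> bool:
    reply = fake_judge(build_judge_prompt(candidate))
    return parse_score(reply) >= passing_score


# Judging the judge: compare its verdicts with a few human-labelled examples.
labelled = [
    ("Steps to reproduce: 1) open /login 2) submit empty form", True),
    ("Login is broken sometimes", False),
]
agreement = sum(judge(text) == human for text, human in labelled) / len(labelled)
print(f"judge/human agreement: {agreement:.0%}")
```

Tracking agreement against human labels over time is one way to notice when a judge prompt or judge model has drifted.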

Examples: MT-Bench, Chatbot Arena, agent-skills-eval, LLM-based grading in CI pipelines

Context Window

The maximum amount of text a model can consider at once; in effect, its working memory. Anything outside the context window does not exist to the model. This explains why an AI agent may seem to forget earlier instructions in a long session, or why pasting in a large test file can push out important context. Keeping prompts focused and context tight is a core skill for working with AI in QA.
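The "pushed out" effect can be shown with a toy budget. This sketch assumes one word is roughly one token and that the harness drops the oldest messages first; real tokenizers and real harnesses behave differently.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is about one token.
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit; older ones fall out."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


instruction = "Always use data-testid selectors."   # 4 words
large_file = "assert True " * 10                     # 20 words of pasted test code
request = "Write a login test."                      # 4 words

visible = fit_to_window([instruction, large_file, request], max_tokens=25)
print(instruction in visible)  # the early instruction no longer fits
```

Here the pasted file crowds out the earlier instruction, which is the "forgetting" described above.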

Prompt Engineering

Designing inputs that reliably produce useful outputs from a model. For QA teams, this means being specific: stating what kind of test you want, what the expected behavior is, what framework to use, and what not to do. Vague prompts produce vague tests.
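As a small illustration, a prompt can be built from exactly those four pieces. The `build_test_prompt` helper and its parameter names are hypothetical, and the wording of the template is only one reasonable choice.

```python
def build_test_prompt(
    test_type: str, behavior: str, framework: str, avoid: list[str]
) -> str:
    """Assemble a prompt that states test type, behavior, framework, and limits."""
    lines = [
        f"Write a {test_type} test using {framework}.",
        f"Expected behavior: {behavior}",
        "Do not:",
        *[f"- {item}" for item in avoid],
    ]
    return "\n".join(lines)


vague = "Write a test for login."

specific = build_test_prompt(
    test_type="end-to-end",
    behavior="submitting a wrong password shows an inline error and stays on /login",
    framework="Playwright",
    avoid=["use sleep-based waits", "assert on CSS class names"],
)
print(specific)
```

A template like this also makes prompts reviewable: a missing expected behavior is as visible as a missing assertion.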

RAG (Retrieval-Augmented Generation)

A technique that pairs an LLM with external knowledge retrieval. Instead of relying solely on training data, the system pulls in relevant documents at query time and adds them to the context. For QA teams, this means an agent can be grounded in your actual API documentation, requirements, or test plans, reducing hallucinations on system-specific behavior.
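A toy version of the retrieve-then-prompt flow, assuming a tiny in-memory document set and plain word overlap as the relevance score. Production systems typically use embeddings and a vector store; the document names and contents below are invented for illustration.

```python
import re

# Hypothetical project documents.
DOCS = {
    "orders-api": "POST /orders returns 201 and an order id on success.",
    "auth-api": "POST /login returns 401 when the password is wrong.",
    "style-guide": "Test names use snake_case and start with test_.",
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        DOCS,
        key=lambda name: len(query_words & tokenize(DOCS[name])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str) -> str:
    # Retrieved text is added to the context at query time.
    context = "\n".join(DOCS[name] for name in retrieve(query))
    return f"Context:\n{context}\n\nTask: {query}"


prompt = build_prompt("Write a test for login with a wrong password")
print(prompt)
```

With the retrieved line in the context, the model has the documented status code in front of it instead of having to guess one.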

Spec-Driven Development (SDD)

A development approach where specifications, not code, are the primary artifact. A spec is a structured, behavior-oriented document written in natural language that describes what software should do before any code is written. The spec becomes the source of truth that guides AI coding agents through implementation.

For QA teams: This is directly relevant to how you’ll work with AI-generated code. SDD shifts the human’s role from writing code to writing and verifying specs. The spec defines acceptance criteria and expected behavior, which means QA thinking (requirements clarity, edge cases, testable outcomes) becomes the core skill, not just writing assertions after the fact. Three levels exist: spec-first (spec before code), spec-anchored (spec maintained alongside code), and spec-as-source (spec is the only thing humans edit; code is fully generated).

Examples: Kiro, GitHub spec-kit, Tessl

Skills

Modular capability packages that give AI agents reusable abilities. Think of them as plugins for your AI assistant. A skill typically includes instructions (SKILL.md), reference materials, and sometimes scripts that tell an agent how to perform a specific task well. Skills are the building blocks that turn a general-purpose model into a domain-specific tool.

For QA teams: Skills are how you encode your testing expertise into your AI workflow. A Playwright skill gives your agent deep browser automation knowledge. A test strategy skill helps it write better test plans. Instead of re-explaining your approach every session, you package it once as a skill and compose them as needed.

A well-designed skill is:

Concise - laser-focused on the task, no filler or distraction
Composable - works alongside other skills without conflict
Progressively disclosed - loads what’s needed, defers the rest
Harness-agnostic - works across Claude Code, Codex, Cursor, or any agent
Portable - shareable across projects and teams
Opinionated - makes clear choices so the agent doesn’t have to guess
Secure - never exposes secrets, respects permissions

Examples: OpenClaw skills, Claude Code custom instructions, Codex agent configs

Non-Deterministic Systems

Systems where the same input does not guarantee the same output across runs. LLM-based systems are non-deterministic in practice: output is sampled, so wording and structure can vary from run to run, and for open-ended tasks there is often no single correct answer to assert against. For QA teams, this shifts the testing approach entirely: instead of exact output matching, you evaluate output quality, intent, and acceptable range.
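A sketch of what that shift can look like: run the generator several times and check a property of each output instead of comparing strings. `flaky_generator` is a hypothetical stand-in for an LLM call, and the intent check and run count are assumptions to adapt.

```python
import random

VARIANTS = [
    "Login fails with 401 for a wrong password.",
    "A wrong password results in a 401 response.",
    "Wrong password: the API answers 401.",
]


def flaky_generator(rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM: wording varies from run to run."""
    return rng.choice(VARIANTS)


def meets_intent(output: str) -> bool:
    # Check what must be true of any acceptable answer, not its exact wording.
    text = output.lower()
    return "401" in text and "wrong password" in text


rng = random.Random(0)
outputs = [flaky_generator(rng) for _ in range(20)]

exact_matches = sum(output == VARIANTS[0] for output in outputs)
pass_rate = sum(meets_intent(output) for output in outputs) / len(outputs)

print(f"exact matches with one expected string: {exact_matches}/20")
print(f"pass rate on intent check: {pass_rate:.0%}")
```

An exact-match assertion would flag most of these runs as failures even though every one of them states the right behavior.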

MCP (Model Context Protocol)

An open standard that lets AI agents connect to external tools and data sources through a consistent interface. MCP servers expose capabilities like file access, database queries, API calls, and browser control that any MCP-compatible agent can use. Instead of building custom integrations for every tool, you spin up an MCP server once and any agent can talk to it.

For QA teams: MCP is how you build a reusable test environment for your agents. A Playwright MCP server gives any agent browser automation. A database MCP server lets agents query test data. You compose servers to expand what the agent can perceive and act on, without changing the agent itself.
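For orientation, MCP messages are JSON-RPC 2.0, and a tool invocation is sent with the `tools/call` method. The sketch below only builds such a request as a Python dictionary. The tool name `query_test_data` and its arguments are hypothetical, and a real client would use an MCP SDK and handle the connection and initialization steps omitted here.

```python
import json
from itertools import count

_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking a server to run one of its tools."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool exposed by a custom test data server.
request = make_tool_call("query_test_data", {"table": "users", "limit": 5})
wire_message = json.dumps(request)
print(wire_message)
```

Because every server accepts the same message shape, swapping the test data server for a browser server changes the tool name and arguments but not the agent.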

Examples: Playwright MCP, GitHub MCP, Filesystem MCP, custom test data servers

Tool Calling (Function Calling)

The mechanism that lets a model interact with the outside world by requesting a function be executed and reading the result. The model decides which tool to use, what parameters to pass, and how to interpret the response. This is the bridge between reasoning and action.

For QA teams: This is the core loop of agent-based testing. An agent calls a "run test" tool, reads the results, calls an "inspect element" tool, then updates its approach. Understanding tool calling helps you design better test tools and diagnose why an agent got stuck or made a bad decision.
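The request and result halves of that mechanism can be sketched as follows. The JSON shape of the request, the `run_test` tool, and its canned result are all assumptions; each model provider defines its own format for tool requests.

```python
import json


def run_test(name: str) -> dict:
    """Hypothetical tool: returns a canned result instead of running anything."""
    return {"test": name, "status": "failed", "error": "timeout after 30s"}


TOOLS = {"run_test": run_test}


def execute_tool_call(raw_request: str) -> str:
    """Parse the model's request, run the named tool, and return the result as text."""
    request = json.loads(raw_request)
    tool = TOOLS.get(request["name"])
    if tool is None:
        # Report unknown tools back to the model instead of crashing the loop.
        return json.dumps({"error": f"unknown tool: {request['name']}"})
    result = tool(**request["arguments"])
    return json.dumps(result)


# What the model might emit when it decides to run a test.
model_output = '{"name": "run_test", "arguments": {"name": "test_checkout"}}'
tool_result = execute_tool_call(model_output)
print(tool_result)  # this text goes back into the model's context
```

When an agent gets stuck, logging both `model_output` and `tool_result` at this boundary usually shows whether the request was wrong or the tool's answer was unhelpful.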

Examples: Running shell commands, querying APIs, browser actions, database lookups