Most confusion about AI tools in QA comes from mixing up four terms that sound similar but mean very different things. Get these right and everything else starts to make sense.

Model

A model is a blob of parameters produced during training. It does one thing: predict the next token given what came before. It has no memory, no tools, no goals. It is stateless; the model has no awareness of your previous conversations, test runs, or anything outside the current input.

For QA teams: When an AI-generated test produces a wrong assertion, that is a model limitation. The model predicted plausible-sounding code based on training data, not your actual system behavior. Treat model outputs as a starting point, not ground truth. Always verify any output from models against your real system.

Examples: Claude Sonnet 4.7, GPT-5, Gemini 1.5 Pro

Harness

The harness is everything around the model that turns it into something useful: the system prompt, available tools, context window management, memory, and any scaffolding that shapes how the model behaves.

For QA teams: This is why Codex (with access to your terminal, files, and browser) behaves so differently from ChatGPT on the web, even when both run the same underlying model. The harness determines what the model knows, what it can do, and how it approaches problems. When you write an AGENTS.md file or configure a system prompt for your QA agent, you are shaping the harness.

Examples: Claude Code, Codex, GitHub Copilot, and Cursor are all different harnesses, often running the same underlying models
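
To make the split between model and harness concrete, here is a minimal sketch. Everything in it is hypothetical: `fake_model` stands in for a real model call, and the `Harness` class is an illustration, not how any of the products above is built. The point it shows is that the model only ever sees the single input the harness assembles for it, and any "memory" lives in the harness.

```python
from dataclasses import dataclass, field


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: stateless, sees only `prompt`."""
    return f"(model saw {len(prompt.splitlines())} lines)"


@dataclass
class Harness:
    """Illustrative harness: owns the system prompt, tool list, and history."""
    system_prompt: str
    tools: list[str]
    history: list[str] = field(default_factory=list)

    def build_input(self, user_message: str) -> str:
        # The harness decides what the model gets to see on every call.
        lines = [
            f"SYSTEM: {self.system_prompt}",
            f"TOOLS: {', '.join(self.tools)}",
        ]
        lines += self.history
        lines.append(f"USER: {user_message}")
        return "\n".join(lines)

    def send(self, user_message: str) -> str:
        reply = fake_model(self.build_input(user_message))
        # Memory is kept here, in the harness, not inside the model.
        self.history += [f"USER: {user_message}", f"ASSISTANT: {reply}"]
        return reply


harness = Harness("You are a QA agent.", ["run_tests", "read_file"])
first_reply = harness.send("List the flaky tests.")
second_reply = harness.send("Now rerun them.")
```

The second call sees more lines than the first only because the harness replayed the earlier exchange into the input.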

Environment

The environment is everything outside the harness that the agent can perceive and act on through tools. It is the world the agent operates in.

For QA teams: Your test suite, your browser, your API endpoints, and your CI/CD pipeline are all part of the environment. An agent can only interact with what has been exposed to it via tools. If your test agent cannot read your API docs, those docs do not exist to it. Configuring MCP servers, giving an agent browser access, providing a CLI tool with instructions, or pointing it at your test repo all expand its environment.

Examples: File system, browser via Playwright, GitHub Actions, your application under test, MCP servers, command-line programs.

Agent

An agent is a model, harnessed, in an environment. All three parts together. The agent reasons, plans, and takes actions: calling tools, reading results, and deciding what to do next in a loop until the task is done.

For QA teams: When someone says "AI is writing my tests," they usually mean an agent, not just a model. This distinction matters for reliability and control. You cannot configure a model — it is fixed. But you can configure the harness and environment to shape how an agent behaves. That is where custom instructions, tool permissions, and context files come in.

Example: Claude Code running in your repo = Claude Sonnet (model) + Claude Code (harness) + your terminal and filesystem (environment)
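
The loop described above can be sketched in a few lines. This is a toy under stated assumptions: `scripted_model` is a hypothetical stand-in that follows a fixed script instead of calling a real model, and the `CALL` / `RESULT` / `DONE` text protocol and both tools are invented for illustration.

```python
from typing import Callable


def list_tests() -> str:
    return "test_login, test_checkout"


def run_test(name: str) -> str:
    return f"{name} passed"


# The environment, as far as the agent is concerned, is this tool table.
TOOLS: dict[str, Callable[..., str]] = {
    "list_tests": list_tests,
    "run_test": run_test,
}


def scripted_model(transcript: list[str]) -> str:
    """Hypothetical model: picks the next action from the transcript so far."""
    if not any(line.startswith("RESULT list_tests") for line in transcript):
        return "CALL list_tests"
    if not any(line.startswith("RESULT run_test") for line in transcript):
        return "CALL run_test test_login"
    return "DONE test_login passed"


def run_agent(task: str, model, tools, max_steps: int = 5) -> list[str]:
    transcript = [f"TASK {task}"]
    for _ in range(max_steps):  # a step cap keeps the loop from running forever
        decision = model(transcript)
        transcript.append(decision)
        if decision.startswith("DONE"):
            break
        _, name, *args = decision.split()
        if name not in tools:
            # A tool that was never exposed does not exist to the agent.
            transcript.append(f"RESULT {name}: unknown tool")
            continue
        transcript.append(f"RESULT {name}: {tools[name](*args)}")
    return transcript


transcript = run_agent("Check that login works", scripted_model, TOOLS)
```

Swapping `TOOLS` changes what the agent can do without touching the model, which is the configuration point the section describes.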

Spec-Driven Development (SDD)

A development approach where specifications — not code — are the primary artifact. A spec is a structured, behavior-oriented document written in natural language that describes what software should do before any code is written. The spec becomes the source of truth that guides AI coding agents through implementation.

For QA teams: This is directly relevant to how you’ll work with AI-generated code. SDD shifts the human’s role from writing code to writing and verifying specs. The spec defines acceptance criteria and expected behavior, which means QA thinking — requirements clarity, edge cases, testable outcomes — becomes the core skill, not just writing assertions after the fact. Three levels exist: spec-first (spec before code), spec-anchored (spec maintained alongside code), and spec-as-source (spec is the only thing humans edit; code is fully generated).

Examples: Kiro, GitHub spec-kit, Tessl

Supporting Terms

Hallucination

When a model produces confident but incorrect output. The model is not lying; it is pattern-matching to what seemed plausible during training. For QA teams, this is the core reliability concern with AI-generated test code. Always review AI-generated assertions, selectors, and API calls against your actual system before committing them.
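
Part of that review can be automated. The sketch below assumes you can list the endpoints your system really exposes (`KNOWN_ENDPOINTS` is made up here) and that generated tests call a `client` object in the style shown; both are illustrative assumptions. It flags API calls in generated code that match nothing known, and it does not replace human review.

```python
import re

# Assumption: in practice this set would come from your API spec or router.
KNOWN_ENDPOINTS = {("GET", "/users"), ("POST", "/users"), ("GET", "/orders")}

# Hypothetical AI-generated test code; the second endpoint does not exist.
GENERATED_TEST = '''
def test_create_user(client):
    assert client.post("/users", json={"name": "a"}).status_code == 201

def test_archive_user(client):
    assert client.post("/users/archive").status_code == 200
'''

CALL_PATTERN = re.compile(r'client\.(get|post|put|delete)\(\s*"([^"]+)"')


def unknown_calls(source: str, known: set) -> list:
    """Return (METHOD, path) pairs used in `source` that are not in `known`."""
    found = {
        (match.group(1).upper(), match.group(2))
        for match in CALL_PATTERN.finditer(source)
    }
    return sorted(found - known)


suspects = unknown_calls(GENERATED_TEST, KNOWN_ENDPOINTS)
```

A non-empty `suspects` list is a prompt to check the system, not proof of a hallucination; the endpoint list itself may be stale.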

LLM-as-Judge

Using a language model to score or grade the output of another language model or agent, instead of writing traditional assertions or relying on a human reviewer. The pattern emerged alongside instruction-tuned models (popularized by the 2023 MT-Bench and Chatbot Arena work) once teams realized example-based asserts couldn’t keep up with open-ended generation. It’s now standard plumbing in agent eval frameworks — and a standard target for skepticism, since judge models bring their own biases to the bench.

For QA teams: This is the eval pattern you’ll reach for when you can’t write a deterministic assertion against AI output. When an agent generates test plans, exploratory test notes, or bug reports, there’s no single correct answer to assert against. LLM-as-judge gives you a scalable way to evaluate quality, but treat the judge like any other test tool — it has its own failure modes and biases. Always ask: who’s judging the judge?

Examples: MT-Bench, Chatbot Arena, agent-skills-eval, LLM-based grading in CI pipelines
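
A minimal sketch of the pattern, assuming a hypothetical `call_model` function that takes a prompt and returns text. The rubric, the JSON reply format, and the threshold are all illustrative choices, not a standard. Note that the code treats an unparseable judge reply as a failure instead of guessing.

```python
import json
from typing import Callable

RUBRIC = (
    "Score the bug report from 1 to 5 for clarity. A 5 has a title, "
    "reproduction steps, expected result, and actual result."
)
FORMAT_HINT = 'Reply with JSON only: {"score": <1-5>, "reason": "<one sentence>"}'


def build_judge_prompt(report: str) -> str:
    return f"{RUBRIC}\n\n{FORMAT_HINT}\n\nBug report:\n{report}"


def judge(report: str, call_model: Callable[[str], str], threshold: int = 4) -> dict:
    """Ask a judge model to grade `report`; fail closed on bad judge output."""
    raw = call_model(build_judge_prompt(report))
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (ValueError, KeyError, TypeError):
        return {"passed": False, "score": None, "reason": "unparseable judge output"}
    if not 1 <= score <= 5:
        return {"passed": False, "score": score, "reason": "score out of range"}
    return {"passed": score >= threshold, "score": score,
            "reason": verdict.get("reason", "")}


# Stub judges so the sketch runs without any model access.
good = judge("Login fails with 500...", lambda p: '{"score": 5, "reason": "clear"}')
garbled = judge("Login fails with 500...", lambda p: "Looks fine to me!")
```

Judging the judge then becomes ordinary testing: feed `judge` a small set of reports with known-good and known-bad quality and check that its verdicts agree with yours.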

Context Window

The maximum amount of text, measured in tokens, that a model can consider at once. It is effectively the model's working memory: everything outside the context window does not exist to the model. This explains why an AI agent may seem to forget earlier instructions in a long session, or why pasting in a large test file can push out important context. Keeping prompts focused and context tight is a core skill for working with AI in QA.
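
One way a harness might handle a full window is to keep the system prompt and drop the oldest messages first. The sketch below is an assumption about a common strategy, not a description of any specific tool, and it counts whitespace-separated words as a crude stand-in for real tokenization.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: one word is one token. Real tokenizers differ."""
    return len(text.split())


def fit_to_window(system_prompt: str, messages: list[str], max_tokens: int) -> list[str]:
    """Keep the system prompt, then as many of the newest messages as fit."""
    budget = max_tokens - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older falls out of the window
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))


window = fit_to_window(
    "Use pytest only",
    ["first instruction here", "second message", "third one now"],
    max_tokens=9,
)
```

Here the earliest instruction no longer fits and is silently dropped, which is the "forgetting" behavior described above.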

Prompt Engineering

Designing inputs that reliably produce useful outputs from a model. For QA teams, this means being specific: state what kind of test you want, what the expected behavior is, what framework to use, and what not to do. Vague prompts produce vague tests.
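
A small template can force those four details into every request. The function and field names below are hypothetical; the wording of the prompt is one possible phrasing, not a proven recipe.

```python
def build_test_prompt(test_type: str, behavior: str, framework: str,
                      avoid: list[str]) -> str:
    """Assemble a prompt that names the test type, behavior, framework, and limits."""
    lines = [
        f"Write a {test_type} test using {framework}.",
        f"Expected behavior: {behavior}",
        "Do not:",
    ]
    lines += [f"- {item}" for item in avoid]
    return "\n".join(lines)


prompt = build_test_prompt(
    test_type="API",
    behavior="POST /users with a valid body returns 201 and a user id.",
    framework="pytest",
    avoid=["mock the HTTP layer", "invent endpoints that are not listed"],
)
```

Compare this with "write some tests for the users API": the template leaves far less for the model to guess.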

RAG (Retrieval-Augmented Generation)

A technique that pairs an LLM with external knowledge retrieval. Instead of relying solely on training data, the system pulls in relevant documents at query time and adds them to the context. For QA teams, this means an agent can be grounded in your actual API documentation, requirements, or test plans, reducing hallucinations on system-specific behavior.
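
The retrieve-then-prompt flow can be sketched without any model at all. This example assumes a tiny in-memory document set (`DOCS` is invented) and ranks by simple word overlap; production systems usually use embedding-based search, so treat the scoring here as a placeholder.

```python
import re

# Assumption: stand-in snippets for your real API documentation.
DOCS = {
    "create_user": "POST /users creates a user and returns 201 with the new id.",
    "list_orders": "GET /orders lists orders and supports the status query parameter.",
    "auth": "All endpoints require a bearer token in the Authorization header.",
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        docs.values(),
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 1) -> str:
    """Put the retrieved text into the context ahead of the question."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


question = "Which status code is returned when POST /users creates a user?"
grounded_prompt = build_prompt(question, DOCS)
```

The model now answers from your documentation placed in its context, not from whatever its training data suggests a users API usually does.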

Non-Deterministic Systems

Systems where the same input does not guarantee the same output across runs. LLM output is typically non-deterministic because each token is sampled from a probability distribution, so unlike traditional code there is often no single exact output to assert against. For QA teams, this shifts the testing approach entirely: instead of exact output matching, you evaluate output quality, intent, and acceptable range.
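
In practice that often means running the same input several times and asserting on a property and a pass rate, not on one exact string. The sketch below uses a hypothetical `sampled_generator` that picks among canned outputs to imitate sampling; the property check and the idea of a pass-rate threshold are illustrative choices.

```python
import random


def sampled_generator(rng: random.Random) -> str:
    """Hypothetical stand-in for a model whose output varies between runs."""
    return rng.choice([
        "assert response.status_code == 201",
        "assert response.status_code == 201  # created",
        "assert response.ok",
    ])


def checks_status_201(output: str) -> bool:
    """Property check: the assertion must pin the expected status code."""
    return "201" in output


def pass_rate(generate, check, runs: int, seed: int = 0) -> float:
    """Fraction of runs whose output satisfies `check`."""
    rng = random.Random(seed)  # seeded only so this demo is repeatable
    return sum(check(generate(rng)) for _ in range(runs)) / runs


rate = pass_rate(sampled_generator, checks_status_201, runs=50)
```

A suite built this way would fail when `rate` drops below an agreed threshold, which you would tune to how much variation the task can tolerate.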