There's a moment in every recursive sci-fi film where someone points out the paradox and everything gets weird. I had that moment last month, except instead of time travel it was an MCP server, and instead of a dramatic musical score it was the quiet hum of my office fan and the realisation that I'd built something genuinely strange.
Let me explain. Claude Code built the TestPlan MCP server. TestPlan is our QA test management product. The MCP server exposes 82 tools - yes, eighty-two - that let an AI interact with every part of the system. Create test cases, run test suites, record results, manage releases, the lot.
Then I connected Claude Code to that MCP server. So now Claude Code could operate TestPlan. And what was I using TestPlan to test? TestPlan. The product that Claude Code was actively building.
If that makes your head hurt, welcome to my Tuesday.
82 Tools and Counting
When I say TestPlan's MCP server has 82 tools, I should clarify what that actually means in practice. Each tool is a discrete operation - create a feature, update a test case, start a test run, record a result, get a report. Every meaningful action you can take in TestPlan's UI has a corresponding MCP tool.
Building this was a substantial undertaking. Claude Code did the implementation, but the design was mine. I had to think carefully about what each tool should be called, what parameters it should accept, and critically, how the descriptions should be worded so that Claude would understand when to use each one.
With 82 tools, there's a real risk of the AI getting confused about which tool to use for a given task. "Create test case" and "suggest test case" are different operations with different purposes. "Start test run" and "start smoke test" serve different workflows. The tool descriptions had to be precise enough that Claude could reliably pick the right one without me specifying it explicitly every time.
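To make the naming problem concrete, here's a minimal sketch of how tool definitions like these might be structured. The tool names and description wording are illustrative, not TestPlan's actual schema - the point is that each description has to draw a boundary the model can't mistake.

```python
from dataclasses import dataclass

# Hypothetical tool definitions. The description field is what the
# model reads when deciding which tool to call, so each one states
# when to use it - and, implicitly, when not to.
@dataclass(frozen=True)
class ToolDef:
    name: str
    description: str

TOOLS = [
    ToolDef(
        "create_test_case",
        "Create and persist a new test case. Use only when the test "
        "case content has been explicitly approved.",
    ),
    ToolDef(
        "suggest_test_case",
        "Draft a test case for human review without saving it. Use "
        "when asked for ideas or when nothing has been approved yet.",
    ),
    ToolDef(
        "start_test_run",
        "Start a full test run executing all test cases for a release.",
    ),
    ToolDef(
        "start_smoke_test",
        "Start a minimal run covering only smoke-tagged test cases. "
        "Prefer this for quick sanity checks after small changes.",
    ),
]
```

Notice that "create" and "suggest" differ in one verb but describe entirely different side effects - exactly the kind of boundary that had to be spelled out rather than implied.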
Getting this right took multiple iterations. Early versions had Claude occasionally using the wrong tool - creating a full test case when I wanted a suggestion, or starting a comprehensive test run when a smoke test would do. Each mistake meant refining a description, clarifying a boundary, and redeploying.
The Feedback Loop That Made Me Pause
Here's where it gets properly interesting. The workflow I settled into was this:
Step 1: Claude Code implements a new feature in TestPlan.
Step 2: Claude Code, via the MCP server, creates test cases for that feature in TestPlan.
Step 3: Claude Code runs those test cases against the feature it just built.
Step 4: Claude Code records the results back into TestPlan.
So the AI builds the feature, writes the tests, runs the tests, and records the results - all within the product it's simultaneously building and testing. The AI is the developer, the QA engineer, and the test management tool operator, all at once.
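The four steps above can be sketched as a single loop. Every function body here is a stand-in - in the real workflow each step is Claude Code acting through MCP tools, whose names I'm assuming for illustration - but the shape of the loop is the same.

```python
# Stand-in sketch of the build -> test -> record loop. In practice,
# steps 2-4 go through MCP tools (e.g. something like create_test_case,
# start_test_run, record_test_result); these bodies just model the flow.

def implement_feature(name):
    return {"feature": name, "built": True}          # step 1

def create_test_cases(feature):
    return [f"{feature['feature']}: case {i}" for i in range(1, 4)]  # step 2

def run_test_case(case):
    return {"case": case, "status": "passed"}        # step 3

def record_results(results):
    return {"recorded": len(results)}                # step 4

def development_loop(feature_name):
    feature = implement_feature(feature_name)
    cases = create_test_cases(feature)
    results = [run_test_case(c) for c in cases]
    return record_results(results)
```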
I'll be honest with you, the first time this full loop executed successfully, I sat back in my chair and stared at the screen for a good thirty seconds. Not because it was impressive in a "look what AI can do" way. Because it was weird. There's something fundamentally odd about watching an AI test its own work using a tool it built to test things.
Screenshots and Playwright: AI With Eyes
The testing wasn't just API-level validation. Claude Code was taking actual screenshots during test runs using Playwright. It would navigate to a page, perform an action, capture what the browser rendered, and evaluate whether the result matched expectations.
This matters because a lot of bugs are visual. The API returns the right data, the component renders, but the layout is broken, or a button is behind another element, or the text is truncated. These are the bugs that unit tests miss entirely and integration tests only catch if someone's written a very specific assertion.
With Playwright screenshots feeding back through the MCP server, Claude could see the actual rendered output. Not just the DOM. Not just the response payload. The actual pixels on the screen. And it would flag issues: "The modal is rendering behind the overlay", "The date picker is cut off at the bottom of the viewport", "The loading spinner is still visible after the data has loaded".
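Checks like "the date picker is cut off" or "the modal is behind the overlay" ultimately reduce to geometry on element bounding boxes. Here's an illustrative sketch of that kind of check - the box format mirrors what Playwright's `bounding_box()` returns, but the checks themselves are my simplified stand-ins, not what Claude actually runs.

```python
# Geometric checks of the kind behind visual-regression flags.
# Boxes use Playwright-style {x, y, width, height} dictionaries.

def is_cut_off(box, viewport):
    """True if an element extends past the bottom of the viewport."""
    return box["y"] + box["height"] > viewport["height"]

def overlaps(a, b):
    """True if two bounding boxes intersect at all."""
    return not (
        a["x"] + a["width"] <= b["x"] or b["x"] + b["width"] <= a["x"]
        or a["y"] + a["height"] <= b["y"] or b["y"] + b["height"] <= a["y"]
    )

viewport = {"width": 1280, "height": 720}
date_picker = {"x": 400, "y": 600, "width": 300, "height": 200}
```

A box at y=600 with height 200 bottoms out at 800 pixels - 80 past a 720-pixel viewport, so the date picker gets flagged.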
Is it perfect? No. Claude sometimes flags things that are fine, or misses things that are obviously wrong to a human eye. But it catches enough visual regressions to be genuinely useful, and it does it across every page, every time, without getting bored or distracted.
When the Human Has to Step In
Not every test case is straightforward enough for AI automation. Some tests require human judgement - does this workflow feel intuitive? Is this error message helpful? Does the animation timing feel right? These are subjective assessments that Claude can't reliably make.
So I built a mechanism for human override. When a human tester runs a test case that the AI previously failed or couldn't complete, the human can pass it and leave an explanation of why it's actually fine. That explanation gets fed back into the automation context, so the next time Claude encounters a similar situation, it has the benefit of the human's reasoning.
This turned out to be one of the most valuable parts of the whole system. It's not just AI testing and humans testing separately. It's a genuine feedback loop where human judgement improves the AI's future test runs. The AI fails a test, a human explains why it's actually acceptable, and the AI incorporates that understanding going forward.
Over time, the AI gets better at testing. Not because the model improved, but because the context around the testing improved. The accumulated human explanations become a knowledge base that makes each subsequent test run more accurate.
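At its core, the override mechanism is a small knowledge base keyed by test case: human explanations accumulate, and get replayed into the automation context on later runs. This is a hedged sketch of that shape - the identifiers and wording are invented, not TestPlan's actual storage.

```python
# Sketch of the human-override store. When a human passes a test the
# AI previously failed, their explanation is saved and prepended to
# the AI's context the next time that test case runs.

overrides = {}  # test_case_id -> list of human explanations

def record_override(test_case_id, explanation):
    overrides.setdefault(test_case_id, []).append(explanation)

def context_for(test_case_id):
    """Prior human reasoning to feed into the automation context."""
    notes = overrides.get(test_case_id, [])
    return "\n".join(f"Human note: {n}" for n in notes)

record_override(
    "TC-104",
    "The 300ms animation delay is intentional; do not fail on timing.",
)
```

The important property is that nothing about the model changes - only the context it's handed, which is exactly why the system improves without a model upgrade.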
The Recursive Nature of the Thing
I keep coming back to the recursion. Claude Code built TestPlan's MCP server. That MCP server lets Claude Code operate TestPlan. Claude Code uses TestPlan to test TestPlan. The test results go back into TestPlan. Claude Code reads those results and adjusts its development approach accordingly.
At what point does this stop being "a developer using an AI tool" and start being "an AI development loop with a human supervisor"? I'm not sure. I'm not sure the distinction matters practically. What matters is whether the software gets better, the bugs get caught, and the features ship correctly. And on all three counts, this workflow delivers.
But it does make me think about what my role actually is in this process. I'm not writing the code. I'm not writing the tests. I'm not running the tests. I'm not recording the results. I'm making the decisions about what to build and when to ship. I'm reviewing the AI's work and overriding its judgement when it's wrong. I'm the product manager and the quality gate, not the developer.
That's not a metaphor for some futuristic vision of AI-first development. Just my actual Wednesday.
What 82 Tools Teach You About AI Interface Design

Building an MCP server with 82 tools taught me things I didn't expect about designing interfaces for AI rather than humans.
Humans can handle ambiguity. If a button says "Run", a human can look at the context and figure out what it runs. An AI needs the tool description to say "Start a new test run for the specified release, executing all non-archived test cases associated with that release, and return the test run ID for tracking progress." Precision matters more than brevity.
Humans can recover from wrong choices. If you click the wrong button, you hit undo. An AI that calls the wrong tool might create a test run you didn't want, or record a result against the wrong test case. Error prevention beats error recovery when your user is a language model.
Humans read documentation once. An AI reads the tool description every single time it's invoked. So those descriptions need to be optimised for repeated comprehension, not first-time learning. Short, consistent, unambiguous.
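These principles are checkable, which is part of what makes them useful. As a toy illustration, here's a lint pass over tool descriptions - the verb list, word budget, and checks are my inventions, not an MCP requirement, but they encode the same "precise, action-first, states its result" rules.

```python
# Toy lint for AI-facing tool descriptions: open with a known action
# verb, say what is returned, stay within a budget the model re-reads
# on every invocation. All thresholds are illustrative.

IMPERATIVE_VERBS = {"create", "start", "record", "get",
                    "update", "delete", "suggest"}

def lint_description(desc, max_words=60):
    issues = []
    words = desc.split()
    if words and words[0].lower().rstrip(".") not in IMPERATIVE_VERBS:
        issues.append("does not open with a known action verb")
    if len(words) > max_words:
        issues.append(f"too long ({len(words)} words)")
    if "return" not in desc.lower():
        issues.append("does not say what it returns")
    return issues

good = ("Start a new test run for the specified release, executing all "
        "non-archived test cases, and return the test run ID.")
```

Run against the "Start a new test run..." description from earlier, this passes cleanly; a vague description like "Runs stuff" fails on two counts.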
These are genuinely new design principles. We're all still figuring them out. But if you're building MCP servers, start thinking about your AI as a user with very specific needs, very literal interpretation, and zero tolerance for ambiguity.
Where This Goes Next
The immediate next step is connecting this feedback loop to our CI/CD pipeline. When code is committed, TestPlan automatically creates a test run, Claude executes the relevant test cases, and the results feed back into the pull request. If tests fail, Claude can attempt to fix the issue before a human ever looks at it.
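Since this part is still a plan, here's only a speculative sketch of the hook's shape: a commit event triggers a test run and the verdict is posted back to the pull request. The callables are placeholders for whatever the CI system and TestPlan's MCP tools would actually provide.

```python
# Speculative CI hook: commit -> test run -> verdict on the PR.
# run_tests and post_to_pr are injected placeholders for the real
# pipeline and MCP calls, which don't exist yet.

def on_commit(commit, run_tests, post_to_pr):
    run = {"commit": commit, "status": "running"}
    results = run_tests(run)
    verdict = "passed" if all(r == "passed" for r in results) else "failed"
    post_to_pr(commit, verdict)
    return verdict
```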
The longer-term vision is that TestPlan's test coverage expands organically. As Claude builds new features, it also builds the test cases for those features. As humans interact with the product and find edge cases, those get fed back into the test suite. The product and its test coverage grow together, maintained by a combination of AI automation and human judgement.
But I'm getting ahead of myself. Right now, the system works. The loop runs. The AI builds, tests, records, and adjusts. And I sit in the middle, making sure it's all heading in the right direction.
It's a strange way to build software. But it works, and that matters more than whether it sounds normal.