
Rewriting 417 Test Cases with a 6-Agent Team

Paul Allington · 24 March 2026 · 9 min read

I was auditing the test suite for a client project - a large web application with what appeared to be solid test coverage. 417 test cases across six feature domains. The numbers looked healthy. Management was satisfied. The QA team was maintaining green dashboards.

Then I actually read the test scripts.

88% of them were page-load checks. Navigate to URL. Verify page loads. Done. That's not a test. That's a health check dressed up in a test case's clothing. "Can the user manage their membership?" Step 1: Navigate to the membership page. Expected result: The page loads. Brilliant. That tells you precisely nothing about whether membership management actually works.

Four hundred and seventeen test cases, and the vast majority of them were confirming that the web server was running. We needed to rewrite them. All of them. With proper step coverage - actual user flows, actual assertions, actual validation that features work the way they're supposed to.

Doing that manually would take weeks. So I did something I'd never tried before at this scale: I built a team of AI agents and had them rewrite everything simultaneously.

The Team Structure

The application had six distinct feature domains. Members management. Dashboard and analytics. Diary and scheduling. Reporting. Till and point-of-sale activities. Staff messaging. Each domain had its own set of test cases, its own page structures, and its own user workflows.

So I created six agents, one per domain: a members-writer, a dashboard-writer, a diary-writer, a reporting-writer, a till-activities-writer, and a staff-messages-writer. Each agent had the same core instructions but different domain context - the relevant pages, the user flows, the business rules specific to their area.

The idea was simple. Each agent takes its domain's existing test cases, understands the feature being tested, and rewrites the test script with proper step-by-step coverage. Navigate to the page, yes, but then actually do the thing. Fill in the form. Click the button. Verify the result. Check the edge cases. Validate the error states.
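As a rough sketch of that setup, assuming hypothetical names throughout (the instruction text, `AgentSpec` structure, and context strings are mine, not the actual project's): every agent shares one core instruction template, and only the domain context varies.

```python
from dataclasses import dataclass

# Hypothetical sketch: one shared instruction template for all six agents.
CORE_INSTRUCTIONS = """\
Rewrite each test case with full step coverage:
navigate, interact (fill forms, click buttons), then assert outcomes.
Cover edge cases and error states. Output one numbered step list per test case.
"""

@dataclass
class AgentSpec:
    name: str            # e.g. "members-writer"
    domain: str          # feature domain the agent owns
    domain_context: str  # pages, user flows, business rules for that domain

# The six domains from the article, one agent per domain.
DOMAINS = {
    "members-writer": "Members management",
    "dashboard-writer": "Dashboard and analytics",
    "diary-writer": "Diary and scheduling",
    "reporting-writer": "Reporting",
    "till-activities-writer": "Till and point-of-sale activities",
    "staff-messages-writer": "Staff messaging",
}

def build_agents(context_by_domain: dict[str, str]) -> list[AgentSpec]:
    """Identical core instructions for every agent; only context differs."""
    return [
        AgentSpec(name=name, domain=domain,
                  domain_context=context_by_domain.get(domain, ""))
        for name, domain in DOMAINS.items()
    ]
```

The point of the structure is that adding a seventh domain is one dictionary entry, not a new prompt to maintain.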

The Step Coverage System

Before letting the agents loose, I needed a way to validate the output. "Write better tests" is a vague instruction, and vague instructions produce inconsistent results when you're running six agents in parallel.

So I built a step coverage validation system. Each test case had to include specific markers: navigation steps, interaction steps (clicking, typing, selecting), assertion steps (verifying outcomes), and at minimum a defined number of meaningful steps per test case. A test case with only a navigation step and a page-load assertion would be flagged as insufficient.

This gave me an automated way to check whether the rewritten tests were actually better than the originals. Not perfect - you can game step counts by adding trivial steps - but it caught the most common failure mode: agents that rewrote the test case description without actually adding meaningful coverage.
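A minimal sketch of that kind of validator, assuming illustrative keyword lists and thresholds (the real rules on the project may have differed): classify each step as navigation, interaction, or assertion, then flag test cases that are just a navigation plus a page-load check.

```python
# Sketch of a step coverage validator. The keyword lists and the
# minimum-step threshold are assumptions for illustration only.
NAV_WORDS = ("navigate", "go to", "open")
INTERACT_WORDS = ("click", "type", "enter", "select", "fill", "choose")
ASSERT_WORDS = ("verify", "assert", "check", "confirm", "expect")

def classify_step(step: str) -> str:
    s = step.lower()
    if any(w in s for w in INTERACT_WORDS):
        return "interaction"
    if any(w in s for w in ASSERT_WORDS):
        return "assertion"
    if any(w in s for w in NAV_WORDS):
        return "navigation"
    return "other"

def validate_test_case(steps: list[str], min_steps: int = 5) -> dict:
    """Flag test cases with no real interactions or too few meaningful steps."""
    kinds = [classify_step(s) for s in steps]
    counts = {k: kinds.count(k) for k in ("navigation", "interaction", "assertion")}
    sufficient = (
        len(steps) >= min_steps
        and counts["interaction"] >= 1
        and counts["assertion"] >= 1
    )
    return {"counts": counts, "sufficient": sufficient}
```

A "navigate and verify page loads" test fails this check on two counts: no interaction steps, and too few steps overall.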

Running Six Agents Simultaneously

There's something slightly surreal about watching six AI agents work on the same project at the same time. Each one is reading existing test cases, understanding the feature area, and producing rewritten scripts. The terminal fills up with output from all six. They don't know about each other. They're working independently, following the same patterns, against different domains.

Some agents finished faster than others. The staff-messages-writer had about 20 test cases - relatively straightforward, mostly around sending, receiving, and filtering messages. It was done in minutes. The members-writer had 50+ test cases covering everything from member registration to payment processing to membership freezes. That one took considerably longer.

The diary-writer was particularly interesting because scheduling has complex business rules - available slots, booking conflicts, recurring events, cancellation policies. The existing test cases for diary were among the worst offenders for page-load-only testing, so the rewrite was substantial.

What the Agents Produced

Most of the output was good. Genuinely good. The agents understood the features from the existing test case descriptions and page context, and produced step-by-step test scripts that actually validated functionality. "Navigate to member profile, click Edit, change the email address, click Save, verify the confirmation message, navigate back to the member list, search for the updated email, verify the member appears."

That's a real test. It tests that editing works, that saving works, that the change persists, and that search reflects the updated data. Compared to the original "navigate to member profile, verify page loads" - night and day.

The step coverage validation system quantified the improvement. Average steps per test case went from 2.1 (navigate, verify page load) to 7.8. Interaction steps per test case went from 0.3 to 4.2. The test suite went from confirming "the app is running" to confirming "the features work."

What the Agents Got Wrong

Here's the part that matters most, and the part that would have been a disaster if I hadn't caught it.

Some agents produced test scripts that looked thorough but were testing the wrong things. One agent, working on the till domain, produced a beautifully detailed test case for processing a refund. Twelve steps, proper assertions at each stage, excellent coverage. One problem: the refund workflow it described didn't match the actual application. The agent had inferred a refund process from the feature name and general domain knowledge, but the actual implementation in the client's application worked differently.

The test steps were plausible. They followed a logical refund workflow. A reviewer skimming the output would likely approve it. But running that test against the real application would fail, and - worse - someone might assume the failure indicated a bug in the application rather than a wrong test.

This happened across multiple agents and multiple domains. Not every test case, but enough that I couldn't just accept the output at face value. Maybe 15-20% of the rewritten tests had steps that were reasonable guesses rather than accurate descriptions of the actual application behaviour.

The Quality Audit

This is the stage that nobody wants to talk about, because it's the unglamorous bit. I had 417 rewritten test cases. I needed to verify that each one accurately reflected the actual application, not just a plausible version of it.

I ended up doing a two-pass review. First pass: automated validation using the step coverage system to flag any tests that seemed structurally weak. Second pass: manual review of each domain, comparing the test scripts against the actual application. This second pass is where I caught the "plausible but wrong" tests.

The manual review took longer than I'd have liked. But it was still dramatically faster than writing 417 test cases from scratch. The agents did maybe 80% of the work. The quality audit and corrections were the remaining 20%. That's still a massive time saving.

The Coordination Pattern

For anyone considering a similar multi-agent approach, here's what I learned about coordination:

Give each agent a clearly bounded domain. The reason six agents worked was that each had a distinct, non-overlapping area of responsibility. No two agents were working on the same test cases. No dependencies between agents. Clean boundaries mean no coordination overhead.

Use identical instruction templates. Every agent got the same core instructions - the same output format, the same step coverage requirements, the same quality criteria. Only the domain-specific context varied. This ensures consistency across agents even though they're working independently.

Build automated validation. You cannot manually review the output of six agents working simultaneously if you don't have automated checks. The step coverage system was essential. Without it, I'd have been reading hundreds of test cases with no systematic way to identify the weak ones.

Plan for a quality audit. The agents will produce output that looks good. Some of it won't be good. Budget time for review. In my case, the ratio was roughly 4:1 - four hours of agent work followed by one hour of quality review and corrections. Your ratio will vary depending on domain complexity.

Parallelism is the point. The entire value of multi-agent work is that six agents running simultaneously complete six domains' worth of work in the time one agent would complete one domain. If you have serial dependencies between agents - if one agent's output feeds into another - you lose this advantage. Design for independence.
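The fan-out pattern above can be sketched in a few lines. Everything here is a stand-in: `rewrite_domain` represents whatever actually drives an agent (an LLM API call, a CLI invocation), and the thread pool simply models six independent workers with no shared state and no inter-agent dependencies.

```python
from concurrent.futures import ThreadPoolExecutor

def rewrite_domain(agent_name: str, test_cases: list[str]) -> dict:
    # Placeholder: the real system would invoke the agent with the
    # shared instruction template plus this domain's context.
    rewritten = [f"[rewritten] {tc}" for tc in test_cases]
    return {"agent": agent_name, "rewritten": rewritten}

def run_all(assignments: dict[str, list[str]]) -> list[dict]:
    """Run every agent simultaneously; no agent waits on another's output."""
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        futures = [
            pool.submit(rewrite_domain, name, cases)
            for name, cases in assignments.items()
        ]
        return [f.result() for f in futures]
```

Because no agent's output feeds another, total wall-clock time is bounded by the slowest domain, not the sum of all six.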

The Result

417 test cases rewritten. Average step coverage increased nearly fourfold. The test suite now actually tests the application rather than just confirming it loads. The work took a day instead of the weeks it would have taken manually - and that includes the quality audit pass.

Was it perfect? No. I found and fixed issues in the agent output that would have been caught by an experienced QA person writing the tests manually. But "good enough to ship with reasonable review effort" is a very different proposition from "write 417 test cases by hand."

The multi-agent pattern works. Not because each individual agent is perfect - they're not - but because parallelism at this scale changes the economics of test coverage. When rewriting your entire test suite is a day's work instead of a month's, you actually do it. And a test suite that gets rewritten is infinitely more useful than one that everyone agrees should be better but nobody has time to fix.

Just check the output. Especially the refund workflows. Trust me on that one.

Want to talk?

If you're on a similar AI journey or want to discuss what I've learned, get in touch.


Ready To Get To Work?

I'm ready to get stuck in whenever you are...it all starts with an email

...oh, and tea!

paul@thecodeguy.co.uk