I had an idea for a feature in Task Board that was, in hindsight, far too ambitious for a single spec. Conversations. The concept: email flows into the app, tasks get created from email threads, external people keep using their normal email client and never know they're interacting with a task management system. Seamless, transparent, and - as I would discover - spectacularly complex to implement.
So I did what I always do now. I wrote a detailed product spec, handed it to Claude Code, and said "build this." What followed was an education in the limits of context windows, the art of specification writing for AI, and the importance of actually clicking the buttons before you declare victory.
The Spec
I don't do thin specs. When I'm building something with Claude Code, I've learned that the quality of the output is directly proportional to the quality of the input. So the Conversations spec was detailed. Really detailed.
It covered the data model - conversations, messages, participants, the relationship between conversations and tasks. It covered the email integration - inbound email parsing, outbound sending via Azure Communication Services, threading, reply detection. It covered the UI - conversation list, message thread view, compose interface, the task creation flow. It covered edge cases - what happens when an email bounces, how to handle CC recipients, what to do with attachments.
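To make the data model concrete, here is a minimal sketch of the shapes a spec like this describes. All of the names and fields are my illustration, not the actual Task Board schema; the point is the relationships: participants belong to messages, messages thread into conversations, and a conversation optionally links to a task.

```typescript
// Hypothetical shapes for a Conversations data model.
// Field names are illustrative, not taken from the actual spec.

interface Participant {
  email: string;
  name?: string;
  isExternal: boolean; // external people only ever see normal email
}

interface Message {
  id: string;
  conversationId: string;
  from: Participant;
  body: string;
  inReplyTo?: string; // email Message-ID, used for reply threading
  sentAt: Date;
}

interface Conversation {
  id: string;
  subject: string;
  participants: Participant[];
  messages: Message[];
  taskId?: string; // set when a task is created from the thread
}

// A conversation starts from its first message and that message's sender.
function startConversation(subject: string, first: Message): Conversation {
  return {
    id: first.conversationId,
    subject,
    participants: [first.from],
    messages: [first],
    taskId: undefined,
  };
}
```

Even a sketch this small surfaces the edge cases the spec had to cover: what `inReplyTo` means for threading, and when `taskId` gets populated.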
Nine implementation phases. Each phase had acceptance criteria. The spec was long enough that I started wondering whether I should have just built the feature myself in the time it took to write the spec. That thought lasted about three seconds before I remembered what building email integration from scratch actually involves.
The Context Window Problem
The spec was so large that it consumed a significant chunk of the context window before Claude Code had written a single line of code. This is the fundamental tension with detailed specs: you need the detail to get good output, but the detail itself takes up space that could be used for implementation context.
The conversation ran through four-plus context resets. For anyone who hasn't hit this: Claude Code has a finite context window. When you fill it up - with your spec, the code it's reading, the code it's writing, and the back-and-forth conversation - it hits a wall. The session summarises what's been done, what's pending, and continues in fresh context.
The first session covered phases one through three. Data models, the basic conversation UI, the email sending infrastructure. Good progress. Clean code. I was feeling optimistic.
The second session picked up at phase four. This is where email receiving and parsing lived - the complex bit. Claude Code had the summary from session one but not the actual code context. It was working from a description of what had been built rather than the code itself. This matters, and I'll come back to why.
Sessions three and four handled the remaining phases - task creation from conversations, the reply threading logic, UI polish, and integration testing.
The Summary Pattern
When a context reset happens, Claude Code produces a continuation summary. This is essentially a handover document: here's what was built, here's how the code is structured, here's what still needs doing. The next session starts by reading this summary and continuing from where things left off.
In theory, this is seamless. In practice, it's lossy. The summary captures the intent and the structure, but it loses nuance. Specific implementation details, edge cases that were discussed and handled, the reasoning behind certain architectural choices - some of this survives the summary, but not all of it.
I noticed the effects in session two. Code that was written in session one followed certain patterns - specific naming conventions, particular approaches to error handling. Code written in session two, working from the summary rather than the actual codebase, sometimes deviated from those patterns. Not wrong, exactly, but inconsistent. The kind of inconsistency you get when two different developers work on the same project without sitting next to each other.
By session four, there were three slightly different approaches to the same pattern in different parts of the codebase. All functional. None wrong. But the kind of thing that makes you twitch when you review the code.
The Button That Didn't Do Anything
Here's the moment that earned its own section heading.
After all four sessions were complete and the feature was nominally "done", I opened the app to test it. Clicked "Start a conversation." The page navigated to /conversations/new. Fine so far. On that page, I found two buttons: "Start a Conversation" and "+New Conversation." Neither of these buttons actually started a conversation.
I'll say that again. I clicked "Start a conversation" and was taken to a page with two more buttons that said "Start a Conversation", and neither of them did the thing they said they would do.

This is the kind of thing that's simultaneously funny and deeply instructive. The AI had built the navigation, the page structure, the button components, and the routing. It had not connected the buttons to the actual conversation creation logic. The entire path from "user clicks button" to "conversation is created in the database" was missing.
It looked complete. The UI was polished. The buttons were styled. The page layout was thoughtful. But the fundamental action - the reason the page exists - didn't work.
Why This Happens
I've thought about this a lot, because it's not an isolated incident. It happens regularly with large features built across multiple context windows. And I think there are two reasons.
First, context loss. The button handler was supposed to be wired up in phase six. By phase six, we were in session three. The session had the summary of what was built in sessions one and two, but not the detailed component structure. So when Claude Code built the "create conversation" UI, it built the visible parts - the page, the layout, the buttons - but missed the invisible part: the actual event handler that makes things happen.
Second, AI builds what it can see. There's a pattern I've noticed where AI is better at producing visible output (UI components, page layouts, styled elements) than invisible behaviour (event handlers, state management, database operations). I suspect this is because the visible parts have clear, describable structures - "a button with the text Start a Conversation" - while the invisible parts require understanding the full flow from user action to system response.
Structuring Specs for AI
After this experience, I changed how I write specs for large features. Some practical things that help:
Put the most critical behaviour first. Context windows mean the AI pays most attention to what it sees first. If the core user flow - the thing that absolutely must work - is buried in phase seven of nine, it's going to get less attention than the data model in phase one. Front-load the critical path.
Repeat key requirements across phases. If a button on page X needs to call function Y, mention it in the phase where page X is built AND in the phase where function Y is implemented. Redundancy in specs compensates for context loss between sessions.
Include "smoke test" criteria for each phase. Not full test cases, but simple "after this phase, you should be able to click X and see Y" checks. This forces the AI to validate that the visible and invisible parts are connected before moving on.
Keep phases small enough to complete in a single context window. This is the biggest lesson. If a phase spans a context reset, the second half of that phase will have degraded context for the first half. Better to have twelve small phases than nine large ones.
Write specs in terms of user actions, not system components. "The user clicks Start a Conversation, enters a subject and message, clicks Send, and sees the conversation appear in their list" is better than "build a ConversationCreateComponent with a form that posts to the ConversationService." The first version describes the complete flow. The second describes a component in isolation.
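The smoke-test idea above is small enough to sketch. The app code here is a hypothetical stand-in, but the shape of the check is the point: perform the user action, then verify the invisible side effect actually happened, before moving to the next phase.

```typescript
// A per-phase smoke test of the "click X, see Y" kind described above.
// The handler and list are illustrative stand-ins for the real app.

type Conversation = { id: string; subject: string };
const conversations: Conversation[] = [];

// The behaviour this phase is supposed to deliver: the invisible part
// behind the "Start a Conversation" button.
function onStartConversationClick(subject: string): void {
  conversations.push({ id: `conv-${conversations.length + 1}`, subject });
}

// Phase smoke test: simulate the click, check the visible result.
function smokeTestPhase(): boolean {
  const before = conversations.length;
  onStartConversationClick("Smoke test subject");
  return conversations.length === before + 1;
}
```

A check like this at the end of each phase would have caught the inert buttons immediately: the click happens, the list doesn't grow, the phase fails.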
The Continuation Pattern
After four resets, I've developed a pattern for managing continuations that works reasonably well.
At the start of each new session, I don't just let it read the summary and continue. I explicitly re-state the most important requirements. I point it at the code that was generated in previous sessions and ask it to read key files before continuing. I highlight any inconsistencies I've noticed and ask it to align with the patterns from session one.
This takes ten minutes at the start of each session, but it dramatically improves consistency. Think of it as onboarding a new developer to a project - you wouldn't just hand them a summary document and say "carry on." You'd walk them through the important bits, show them the patterns, and make sure they understood the approach before they started writing code.
Was It Worth It?
The Conversations feature shipped. After the button incident and a fair amount of additional iteration, it works well. Email flows in, tasks get created, external people reply via email and it threads correctly. The feature I specced is the feature that exists.
Would I do it this way again? Yes, but with smaller phases, better smoke tests, and a realistic expectation that the first "complete" version will need a solid testing pass to catch the gaps. The spec-driven approach works for large features. It just requires understanding that the AI will build exactly what you describe, and anything you don't describe explicitly has a good chance of being missed.
The buttons that didn't do anything taught me more about working with AI than a dozen successful features. Because the failures are where the real lessons are, and "it looks done but doesn't actually work" is the most common failure mode when you're building big things with AI.
Check the buttons. Always check the buttons.