You Don't Have a Prompting Problem. You Have a Context Problem.


You know the moment. You ask Claude Code to refactor a module, it comes back fast and confident, and it’s almost right — except it ignored the pattern you use everywhere else in the repo, invented a helper you already have, and named things like it had never seen your codebase. So you rewrite the prompt. Add “follow existing conventions.” Paste in an example. Run it again. Still almost right.

At some point you blame yourself for not prompting well enough. Don’t. You’ve been debugging the wrong layer.

The thing that failed wasn’t your wording. It was what the model knew when you hit enter. And once you see that, a lot of your daily friction stops looking like a skill problem and starts looking like an engineering problem — the kind that can actually be solved, and mostly automated away.

The tell: “almost right, but not quite”

Stack Overflow’s 2025 Developer Survey put a number on the feeling. The single biggest frustration developers report with AI tools isn’t wrong answers or useless ones — it’s solutions that are “almost right, but not quite.” Sixty-six percent of developers named it.

Sit with what “almost right” actually means. The model understood your question fine. It wrote valid code. It just didn’t have the one thing that would have made the answer yours: your conventions, your architecture, the reason you structured things the way you did.

Qodo’s 2025 State of AI Code Quality research drills into exactly this: 65% of developers say AI assistants “miss relevant context” specifically when refactoring. The tool can refactor. It can’t refactor for you, because it doesn’t know you.

Here’s the part that matters for your workflow: you cannot prompt your way out of a missing-context problem. Write the most surgical instruction in history — if the context isn’t in the window, the answer comes back almost right. Every time. The ceiling isn’t your phrasing. It’s the knowledge in the room.

Three people who’d tell you the same thing

If it were just a hunch, I’d hold it loosely. But three of the sharpest voices in the field, from three unrelated corners, landed on the identical reframe within months of each other:

Andrej Karpathy (former Tesla AI Director, OpenAI co-founder): “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.”

Tobi Lutke (CEO, Shopify): “I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”

Anthropic (September 2025): “Building with language models is becoming less about finding the right words and more about engineering the right context.”

A co-founder of OpenAI, the CEO of a $100B+ company, and the lab behind Claude, all converging on the same term. That’s not a trend piece. It’s three voices telling you the skill you’ve been grinding on (better prompts) has a ceiling, and the skill that actually moves your output is a different one: getting the right context in front of the model.

What “just the right” means for your next prompt

Reread Karpathy’s line, because the load-bearing part is easy to skim past: “just the right information for the next step.”

Not everything. Not your whole repo pasted in. Not a CLAUDE.md so long the model skims it. The right information. For this step. Right now. More context is not better context — it’s often worse, because the signal you needed gets buried in the noise you added.

That precision is an engineering problem, and if you’ve ever hand-managed context for a session, you’ve done all five stages of it yourself without naming them:

Classification: figuring out what this request even needs. You already do this. A quick “why is this test flaky?” needs almost nothing; “redesign this service” needs the architecture, the constraints, the history. Guessing wrong is why the answer misses.

Retrieval: pulling the relevant stuff from wherever it lives, whether that is your files, past conversations, or the decisions you made three weeks ago. This is the part everyone means when they say “AI memory.” It matters, and it’s one stage of five, not the whole game.

Assembly: ordering it so the model can use it. Dumping raw files in degrades the answer. Put the most relevant material first, summarize project state rather than enumerating it, and keep instructions specific. This is the actual engineering in context engineering.

Delivery: keeping the payload inside the model’s effective range. Research and hard experience both suggest that many models start to degrade above roughly 32,000 tokens, whatever their advertised window. If your context window is stuffed with 50,000 tokens of “helpful background,” you’ve sabotaged the very request you were trying to help. Targeted beats complete-but-bloated.

Feedback: noticing when it worked and when it didn’t, and adjusting. When you do this by hand, you are the feedback loop, which means it only improves as fast as you remember to tune it.

Most of what’s on your machine handles exactly one of these. RAG does retrieval. Prompt templates do assembly. Token counters watch delivery. Very little treats all five as one connected pipeline — which is the entire reason the term exists.
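To make the five stages concrete, here is a minimal Python sketch of them wired together as one pipeline. Everything in it is an assumption made for illustration: the `Snippet` type, the keyword-based `classify`, the four-characters-per-token estimate, and the 32,000-token default budget are all hypothetical stand-ins, and none of it describes how Anneal or any real tool implements these stages.

```python
from dataclasses import dataclass

# Hypothetical budget, borrowed from the rough 32,000-token figure above.
TOKEN_BUDGET = 32_000


@dataclass
class Snippet:
    source: str      # where it came from (file, note, past decision)
    text: str
    tags: set[str]   # what kind of context it is


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); a real tokenizer will differ."""
    return max(1, len(text) // 4)


def classify(request: str) -> set[str]:
    """Stage 1: guess what kinds of context the request needs (keyword heuristic)."""
    lowered = request.lower()
    needs: set[str] = set()
    if "refactor" in lowered or "redesign" in lowered:
        needs |= {"conventions", "architecture"}
    if "test" in lowered or "flaky" in lowered:
        needs.add("tests")
    return needs


def retrieve(needs: set[str], store: list[Snippet]) -> list[Snippet]:
    """Stage 2: pull only the snippets whose tags overlap the classified needs."""
    return [s for s in store if s.tags & needs]


def assemble(snippets: list[Snippet], needs: set[str],
             weights: dict[str, float]) -> list[Snippet]:
    """Stage 3: most relevant first, by tag overlap and then by learned source weight."""
    return sorted(
        snippets,
        key=lambda s: (len(s.tags & needs), weights.get(s.source, 0.0)),
        reverse=True,
    )


def deliver(snippets: list[Snippet], budget: int = TOKEN_BUDGET) -> list[Snippet]:
    """Stage 4: keep snippets in order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for s in snippets:
        cost = estimate_tokens(s.text)
        if used + cost > budget:
            continue  # a smaller snippet later in the list may still fit
        kept.append(s)
        used += cost
    return kept


def record_feedback(weights: dict[str, float], used: list[Snippet], helped: bool) -> None:
    """Stage 5: nudge source weights up or down depending on whether the answer landed."""
    delta = 1.0 if helped else -1.0
    for s in used:
        weights[s.source] = weights.get(s.source, 0.0) + delta


def build_context(request: str, store: list[Snippet],
                  weights: dict[str, float], budget: int = TOKEN_BUDGET) -> list[Snippet]:
    needs = classify(request)
    return deliver(assemble(retrieve(needs, store), needs, weights), budget)


store = [
    Snippet("CONVENTIONS.md", "Use the repository pattern for data access.", {"conventions"}),
    Snippet("ARCHITECTURE.md", "Services talk over an internal event bus.", {"architecture"}),
    Snippet("tests/README.md", "Integration tests need a local database.", {"tests"}),
]
weights: dict[str, float] = {}
context = build_context("Refactor the billing module", store, weights)
print([s.source for s in context])
```

The refactoring request pulls in the conventions and architecture notes and leaves the test notes out. Calling `record_feedback` after a good or bad answer changes the ordering the next time around, which is the part that otherwise lives in your head.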

The tax you’re paying by hand

Right now, if you care about this at all, you’re doing it manually. Curating a CLAUDE.md. Re-pasting the same project background at the start of every session. Deciding by hand what to include for each task. Re-explaining your codebase to a fresh chat for the tenth time this week.

It works. It also has two problems you feel every day. It doesn’t scale past you: that context lives in your head and your habits, so nobody else benefits. And it never improves on its own: every gain is one more thing you have to maintain. You’re not just doing the work; you’re doing the work of remembering to do the work.

That’s the gap Anneal is built to close. Anneal is one workspace for all your AI, and context engineering is the product — the five stages run for you, across every tool you touch, so the model shows up already knowing your work instead of you rebuilding that context by hand every session. (Under the hood that pipeline is grāmatr; you just get the result.) The difference between a discipline and a product is who does the work: a discipline tells you what to do, a product does it — every request, without you thinking about it.

Why everyone’s still trading prompts

If context is the real lever, why is your feed still full of prompt templates?

Because prompts are visible. You can see one, tweak it, screenshot it, post it. There’s a genuine craft satisfaction to a clean prompt, same as a tidy SQL query. Context engineering is invisible — when it works, the model just answers better and you never see the classification, the selective retrieval, or the effort call that decided this needed a 200-token reply, not a 2,000-token essay. The best plumbing is invisible. That’s the job.

And it’s genuinely harder to build. A prompt tool is a text box with model access. Doing context engineering properly needs classification, memory, routing, feedback loops, effort calibration, delivery tuning — an order of magnitude more machinery. Which is precisely why, for now, most people are still doing it by hand.

This isn’t going to get fixed by a bigger model

It’s tempting to think the next release just handles all this for you. The data says otherwise. IEEE Spectrum reported in January 2026 that “over the course of 2025, most of the core models reached a quality plateau, and more recently, seem to be in decline.” In Stack Overflow’s surveys, positive developer sentiment toward AI tools slid from above 70% in 2023–2024 to 60% in 2025.

Read that as good news for how you work: the next jump in how useful your tools feel isn’t gated behind a model you have to wait for. It’s available now, and it’s on the context side — the model showing up already knowing your codebase, your conventions, your history. That’s the lever you can actually pull.

The plumbing just got standardized

One more reason this is landing now. In December 2025, Anthropic donated the Model Context Protocol (MCP) to the Linux Foundation, which formed the Agentic AI Foundation (co-founded by Anthropic, Block, and OpenAI, with Google, AWS, Microsoft, Cloudflare, and Bloomberg backing governance). When competitors co-govern a protocol, it becomes infrastructure: MCP now has 97 million monthly SDK downloads, 10,000+ active servers, and adoption across ChatGPT, Cursor, Gemini, and Copilot.

What that means for you: MCP defines how your tools connect to context. It says nothing about how smart that context is. A source that only stores and retrieves through MCP is a filing cabinet with a standard plug. A source that classifies your request, pulls just what it needs, and learns from what worked is doing context engineering. Same pipe — completely different water. And because the pipe is now standard, that intelligence can reach every tool in your stack through one connection. The question stops being “can it connect?” and becomes “how smart is what I’m connecting?”
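The filing cabinet versus the smarter source can be sketched in a few lines of Python. The `ContextSource` interface below is a hypothetical stand-in for "the pipe" and is not the MCP SDK or its API; `FilingCabinet`, `SelectiveSource`, and the keyword matching are likewise invented for illustration, and the sketch leaves out the learning-from-feedback part entirely.

```python
from typing import Protocol


class ContextSource(Protocol):
    """Hypothetical interface standing in for the standard plug (not the MCP SDK)."""

    def fetch(self, request: str) -> list[str]: ...


class FilingCabinet:
    """Stores notes and returns all of them, whatever the request is."""

    def __init__(self, notes: dict[str, str]) -> None:
        self.notes = notes

    def fetch(self, request: str) -> list[str]:
        return list(self.notes.values())


class SelectiveSource:
    """Returns only notes whose topic appears in the request (naive keyword match)."""

    def __init__(self, notes: dict[str, str]) -> None:
        self.notes = notes

    def fetch(self, request: str) -> list[str]:
        lowered = request.lower()
        return [text for topic, text in self.notes.items() if topic in lowered]


def build_prompt(source: ContextSource, request: str) -> str:
    """The caller is identical for both sources; only the context it receives differs."""
    context = "\n".join(source.fetch(request))
    return f"{context}\n\nTask: {request}"


notes = {
    "billing": "Billing retries use exponential backoff, capped at five attempts.",
    "auth": "Auth tokens are rotated every 24 hours.",
    "deploy": "Deploys go through the staging cluster first.",
}
request = "Refactor the billing retry logic"

print(len(FilingCabinet(notes).fetch(request)))    # every note comes back
print(len(SelectiveSource(notes).fetch(request)))  # only the billing note comes back
```

Both classes plug into `build_prompt` unchanged. What differs is how much irrelevant material reaches the model.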

So what do you actually do

Next time an answer comes back almost right, resist the reflex to rewrite the prompt a fifth time. Ask instead: what did the model not know? Nine times out of ten that’s the fix — not sharper wording, but the convention, the file, the decision it never saw.

From there you have two options. Keep doing context engineering by hand — curating instructions, re-pasting background, managing what the model sees for every task. Or let it run for you. Either way, the skill the field’s sharpest voices now agree on isn’t writing better prompts. It’s engineering better context. The only real question is whether you keep paying that tax by hand, or hand it to something built to do it.