
You Can Run the Whole Thing on Your Own Machine Now


There’s a specific moment that gets you. You’re on a plane, no Wi-Fi, half-finished feature open in your editor, and out of habit you fire off a question to your assistant expecting the usual “you’re offline” shrug. Instead it answers. Correctly. From a model sitting on your own SSD, that never once reached for the network.

That moment used to be a party trick. Now it’s just Tuesday. Every layer of a real AI workflow — the model, the context it needs, and the tools you already live in — can run entirely on your machine. No cloud you don’t control. No token meter ticking. No terms of service quietly changing under your work. And the reason to care isn’t a compliance memo. It’s that the setup is finally good enough to be your daily driver, and it belongs entirely to you.

The pieces are already on your laptop. The interesting part is what it takes to actually wire them together — and the one step most people skip.

The models got small enough to stop being the problem

Two years ago, “run it locally” meant a quantized model that lost the thread halfway through a sentence. That era is over.

Apple ships Foundation Models tuned for on-device inference — around 3 billion parameters, squeezed to 2 bits per weight with quantization-aware training, free for developers, fast on Apple Silicon, and handling long-context work without a single byte leaving the device. On everything else, Microsoft’s Phi models and Meta’s Llama 3.2 3B do the same job, and the ONNX Runtime keeps landing NPU acceleration, in-browser inference through WebNN, even on-device training.
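To see why 2 bits per weight matters, here is a back-of-the-envelope sketch of the weight footprint for a 3-billion-parameter model. It assumes every weight is stored at the same precision and ignores embeddings, the KV cache, and runtime overhead, so treat the numbers as rough lower bounds rather than measured figures for any specific model.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Compare common precisions for a 3B-parameter model.
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: {weight_footprint_gb(3e9, bits):.2f} GB")
```

Under those assumptions, 16-bit weights need about 6 GB while 2-bit weights need about 0.75 GB, which is the difference between crowding out your other apps and fitting comfortably beside them.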

For focused, domain-specific work (the kind of thing you’re actually doing at 11pm), a 3B model with the right context can beat a 400B model with none. The horsepower question is settled. Your laptop can do this. Which means the bottleneck moved somewhere less obvious.

A local model is a genius with amnesia

Here’s what nobody tells you before you spin one up: the model on your machine knows nothing about you.

Every session starts from a blank slate. It has no idea which framework you’re on, that you already ruled out the obvious approach last Tuesday, that this repo has a hard rule against a certain pattern, or that “the client” means a specific person with specific opinions. Cloud tools at least keep a conversation history you can lean on. A raw local model doesn’t even give you that. You get a brilliant collaborator with total amnesia, and you pay for it in the first ten minutes of every session re-explaining your own project back to a machine that should already know.

Local inference without local context is a powerful engine with an empty tank. It can reason beautifully. It just has nothing to reason about — nothing that’s actually yours.

And this is the part people get wrong: they assume the fix is storage. Dump the repo in, paste the docs, point it at a folder, and surely it’ll figure things out. It won’t. A model handed everything is nearly as lost as a model handed nothing — it drowns. Context isn’t a pile you accumulate. It’s a thing you build.

Context is an engineering problem, not a filing problem

The move that makes a local stack feel like magic instead of a chore is this: at the moment you ask a question, something decides what small slice of everything you know is actually relevant — and hands the model only that.

Not the whole repo. The three files this change touches. Not your entire decision log. The one call from last week that contradicts what you’re about to do. Not a dump of your preferences. The two that apply to this exact task. That’s a routing and classification job, and it’s the difference between a model that feels psychic and one that feels like it’s reading a phone book out loud.

That’s engineered context. Storage is passive — it just sits there and hopes. Engineering is active — it looks at this request, figures out what matters, and delivers a targeted briefing instead of a document dump. Do it right and you can run classification like that in well under a second, on CPU, with nothing leaving the machine. (grāmatr is built around exactly this idea — classify the request first, then deliver only the context that earns its place. Its approach can run in local or otherwise controlled settings, which is what makes it a natural fit for the context layer of a stack like this.)
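As a rough illustration of “classify first, then deliver only what earns its place,” here is a toy sketch. The keyword map, the `ContextItem` type, and the sample store are all hypothetical, and a keyword match is only a stand-in for a real classifier; this is not how grāmatr or any particular product does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str        # e.g. "file", "decision", "preference"
    tags: frozenset  # topics this item is about
    text: str


# Hypothetical keyword-to-topic map; a real router might use a small local classifier.
TOPIC_KEYWORDS = {
    "auth": {"login", "token", "session", "password"},
    "billing": {"invoice", "payment", "refund"},
    "testing": {"test", "pytest", "fixture"},
}


def classify(request: str) -> set:
    """Return the topics whose keywords appear in the request."""
    words = {w.strip(".,?!").lower() for w in request.split()}
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}


def select_context(request: str, store: list, limit: int = 3) -> list:
    """Keep only items that share a topic with the request, best matches first."""
    topics = classify(request)
    scored = [(len(item.tags & topics), item) for item in store]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep store order
    return [item for _, item in relevant[:limit]]


STORE = [
    ContextItem("file", frozenset({"auth"}), "auth/session.py: session refresh logic"),
    ContextItem("decision", frozenset({"auth"}), "Decided against storing tokens in localStorage"),
    ContextItem("preference", frozenset({"testing"}), "Prefer pytest fixtures over setUp methods"),
    ContextItem("file", frozenset({"billing"}), "billing/invoice.py: invoice rendering"),
]

briefing = select_context("Why does the login token expire early?", STORE)
for item in briefing:
    print(f"[{item.kind}] {item.text}")
```

The question about a login token pulls in the auth file and the auth decision, and leaves the billing file and the testing preference out of the prompt. The `limit` parameter is the other half of the idea: even relevant context gets a budget.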

Skip this step and a local model is just a slower, dumber version of the cloud one. Nail it and it starts to feel like it’s been paying attention the whole time.

MCP is the glue you don’t have to build

The reason a hobbyist can assemble this at all — instead of it being a six-month integration project — is the Model Context Protocol.

MCP is a standard interface between models and context sources. An MCP server on your machine can serve context from a local knowledge graph, a local database, whatever you’ve got, and any MCP-compatible client can talk to it: Claude Code, Codex, Gemini, Cursor, VS Code. The protocol is identical whether that server lives in a cloud region or on your desk under a coffee mug. When it runs locally, the context request just hops from one process to another on the same machine and comes back in milliseconds. No outside call, no network round trip, no data leaving.
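To make that process-to-process hop concrete, here is a stripped-down sketch of the kind of message exchange involved. MCP messages are JSON-RPC 2.0, and the `tools/call` shape below follows my reading of the spec, but this is not a conforming server: it skips the initialization handshake and capability negotiation, the `get_context` tool and the `NOTES` store are hypothetical, and a real setup would use an MCP SDK over stdio or another supported transport.

```python
import json

# Hypothetical local notes; a real server would query a knowledge graph or database.
NOTES = {"auth": "Decided last week: no tokens in localStorage."}


def handle_message(line: str) -> str:
    """Answer one JSON-RPC 2.0 request with one JSON-RPC 2.0 response."""
    request = json.loads(line)
    params = request.get("params", {})
    if request.get("method") == "tools/call" and params.get("name") == "get_context":
        topic = params.get("arguments", {}).get("topic", "")
        result = {"content": [{"type": "text", "text": NOTES.get(topic, "")}]}
        response = {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    else:
        # -32601 is the standard JSON-RPC "Method not found" error code.
        response = {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": "Method not found"},
        }
    return json.dumps(response)


# What a client might send: one JSON object asking a tool for context.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_context", "arguments": {"topic": "auth"}},
}
reply = json.loads(handle_message(json.dumps(request)))
print(reply["result"]["content"][0]["text"])
```

Nothing in those messages says where the server lives. The same request could cross a pipe on your laptop or a network connection to a data center, which is why the client mostly can’t tell the difference.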

Which lets the whole thing snap together as three honest layers:

The model: Apple Foundation Models, Phi, Llama 3.2, or any ONNX-compatible model. Runs on CPU, GPU, or NPU, whatever you’ve got.

The context engine: a local MCP server with a knowledge graph and routing logic that does the engineering job above. It decides what this request actually needs and hands it over as a targeted packet.

The tools you already use: your editor, your assistant, the interface you live in every day. They connect to the local context server exactly the way they would connect to a cloud one, and mostly don’t know the difference.

The model never has to know where its context came from. The context server never has to know which model is asking. MCP handles the seam. Everything else stays on your hardware.
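One way to picture that seam in code is two small interfaces that never reference each other. This is a minimal sketch with made-up names (`Model`, `ContextSource`, `ask`, and the stub classes are all hypothetical); in a real stack MCP plays the role of the boundary, and the stubs would be a local inference runtime and a local context server.

```python
from typing import Protocol


class Model(Protocol):
    def generate(self, prompt: str) -> str: ...


class ContextSource(Protocol):
    def briefing(self, request: str) -> str: ...


class EchoModel:
    """Stand-in for a local model: reports the first line of context it was given."""

    def generate(self, prompt: str) -> str:
        return "ANSWER based on: " + prompt.splitlines()[0]


class StaticContext:
    """Stand-in for a context engine: always returns the same briefing."""

    def briefing(self, request: str) -> str:
        return "Repo rule: no global state."


def ask(model: Model, context: ContextSource, request: str) -> str:
    """The tool layer: fetch a briefing, build the prompt, call the model."""
    prompt = f"{context.briefing(request)}\n\nQuestion: {request}"
    return model.generate(prompt)


print(ask(EchoModel(), StaticContext(), "Can I cache this in a module-level dict?"))
```

Swapping `EchoModel` for a different model, or `StaticContext` for a smarter router, changes nothing in `ask`. That independence is the property the three-layer split is meant to buy you.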

Why you’d actually bother

Not for a compliance checkbox. For reasons that are yours.

Because the side project you’re not ready to show anyone stays genuinely private — not “we promise not to look” private, but never left the machine private. Because the client code under NDA doesn’t have to trust a third party’s retention policy to stay contained. Because you do some of your best work on planes and trains and in cabins with one bar of signal, and a local stack gives you the same capability whether or not the network shows up. And because there’s a real, durable difference between privacy by policy and privacy by architecture: policies change, terms get updated, companies get acquired — but a stack that never phones home doesn’t have those failure modes.

None of this is an argument that the cloud is over. For the biggest frontier models, heavy shared-team work, and anything that needs collaboration across people, the cloud is still the right call and will stay that way. The point isn’t that local is better. It’s that local is now a real choice instead of a compromise you tolerate. That choice didn’t exist a year ago. It does now.

The piece that has to follow you

There’s one honest limit to the all-local dream, and it’s worth naming: context that lives on one laptop lives on one laptop. The second you switch from your editor at your desk to a different tool on a different machine, the engineered context you built up doesn’t come with you. You’re a stranger again.

That’s the thread Anneal is built around — a hosted workspace, not a local install, where what you’ve established in one tool carries into the next instead of being re-explained from scratch. Local models are what made a zero-cloud inference workflow possible in the first place. But context that travels with you across Claude Code, Codex, Gemini, and everything else is what turns a pile of one-off sessions into something that feels like it remembers.

The parts are already on your machine. The interesting move — the one worth your evening — is wiring them together so the model finally knows what you’re working on.


Apple Foundation Models, ONNX Runtime, Phi, and Llama are products of their respective companies. Performance details cited here come from their official documentation, linked above.