
I Killed My Local-First AI Stack This Morning

I spent 5 days building and testing a system to run my AI consultancy on local models. This morning I ripped it all out. Here's why the math stopped making sense, and what it means for how I work with clients.

Eric Fadden


At 8:20 AM I called time of death on a system I'd been building since late last week. By lunchtime it was gone. A dozen background services, three code repositories, the whole monitoring setup that watched it all run. Deleted, uninstalled, wiped.

Every automated process in my business now calls Claude directly. No clever routing. No local fallback. Just the simplest possible path from "my system needs to think about something" to "my system has an answer."

This is a post-mortem on a project I believed in, built hard, and killed deliberately once the math stopped making sense. If you run a business that's getting serious about AI, and anyone has pitched you the idea of hosting your own models to save on costs, this one is for you.

Why I Built It

Forged Cortex is a one-person AI consultancy, and my own operation is heavily automated. Briefings compiled at 5 AM. A dispatch system that hands work off to specialist agents. Tools that watch my inbox, my calendar, my content pipeline, my coding projects, my entire business. All of it runs on AI calls... a LOT of AI calls.

The logic for going local was the kind of logic a business owner recognizes immediately.

Cost control. I can very easily see a future where my AI bill scales with my business, and I wanted to decouple the two. If I could route the routine work to models running on hardware I already owned, I'd cap my variable costs no matter how aggressive my automation got.

Data control. Keeping data on my own machines instead of sending it out to Claude felt like it could be faster and, even if only slightly, reduce the attack surface I was exposing.

Stack control. I don't love being dependent on any single vendor for anything, let alone something this central to the business. Having my own inference layer meant I could switch, mix, or fall back without rewriting anything.

This is the same logic that drives good infrastructure decisions in any business. Own what's core. Rent what's convenient. Keep your options open. I believed in it enough to invest a good chunk of my own time building it.

What It Actually Cost Me to Run

Here's the part that doesn't show up in a spreadsheet.

Every time a better AI model came out, I'd have to spend a weekend evaluating it and teaching my system when to use it. That's me. That's my Saturday. Miss that update for too long and the system starts making worse decisions, quietly, without telling anyone.

Every time a cheaper model returned a bad answer, my code had to figure out what to do about it. Retry? Escalate to Claude? Fail? I built logic for all three and none of them felt great. The cleanest answer was always "escalate to Claude," which meant the savings on those particular requests evaporated and I'd paid for the failed attempt in my own time.
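The retry/escalate/fail decision above collapses into a single fallback function. This is a minimal sketch, not my actual routing code: `cheap_model`, `frontier_model`, and `looks_good` are hypothetical stand-in callables for the real pieces.

```python
def answer(prompt, cheap_model, frontier_model, looks_good, max_retries=1):
    """Try the cheap model first; retry on a bad answer, then escalate.

    cheap_model / frontier_model are callables (prompt -> answer) and
    looks_good is a quality check (answer -> bool). All three are
    hypothetical stand-ins for the real routing stack.
    """
    for _ in range(1 + max_retries):
        result = cheap_model(prompt)
        if looks_good(result):
            return result
    # Escalation: the savings on this request just evaporated. You paid
    # for every failed cheap attempt on top of the frontier call.
    return frontier_model(prompt)
```

Note the built-in losing position: every path through this function either risks a worse answer or costs more than calling the frontier model directly would have.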

I needed a whole additional set of monitoring tools just to know whether the routing was working. Four more services to keep running, keep updated, keep an eye on. None of them generate revenue. All of them demand attention the moment they misbehave.

Every piece of code that called my system had to label its request correctly so the router knew where to send it. Enforcing that discipline across two dozen automated processes became its own full-time job. Miss a label, the router sends it to the wrong model, you get a worse answer, and you lose trust in the whole thing.
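The labeling contract looks roughly like this. The labels and model names here are invented for illustration; the real point is that every caller has to get the label right, or the router quietly does the wrong thing. This sketch at least fails loudly instead.

```python
# Hypothetical label -> model table; names are illustrative only.
ROUTES = {
    "summarize": "local-small",
    "draft": "local-medium",
    "research": "claude",
}

def route(request_label: str) -> str:
    """Pick a model from the request's label.

    A missing or wrong label is the failure mode described above:
    work lands on the wrong model and quality silently degrades.
    """
    try:
        return ROUTES[request_label]
    except KeyError:
        raise ValueError(f"unlabeled request: {request_label!r}")
```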

None of this showed up in the "how much am I spending on tokens" math. All of it was real. All of it was attention I'd be spending on my own plumbing instead of on my clients.

The Moment It Became Obvious

I ran an honest comparison this morning, and the answer was uncomfortable.

I had my system compile this morning's briefing two ways. The first was the normal path, routed through the local stack under all the cost-optimization logic I'd built. The second was a straight pass through Claude with none of the routing in the way.

Then I read them side by side.

The no-routing version was, in my own words in my notes at the time, "a HELL of a lot better. Way more complete." Not marginally better. Not close enough to justify the savings. Obviously, unambiguously better. No contest.

The routing layer had been trading quality for marginal savings, and I hadn't really seen it until I ran the baseline for comparison.

Warning

This is the thing nobody tells you about building this kind of cost-optimization layer. The savings show up cleanly in a dashboard, in a number you can point at. The quality loss shows up as a vague sense that your outputs are a little less sharp than they used to be, and you rationalize it away. Until you run the baseline and see what you actually gave up. If you're making this call in your own business, the only trustworthy comparison is reading the outputs side by side. The dashboard will lie to you by omission.
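One way to make that side-by-side baseline cheap and repeatable is to dump both outputs to files and force yourself to read them. A minimal sketch, with `routed` and `direct` as hypothetical stand-ins for the two pipelines:

```python
import datetime
import pathlib

def run_baseline(prompt, routed, direct, out_dir="baseline_runs"):
    """Run the same prompt through both paths and save the outputs
    side by side for human review. `routed` and `direct` are callables
    (prompt -> text) standing in for the two pipelines."""
    stamp = datetime.date.today().isoformat()
    run_dir = pathlib.Path(out_dir) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "routed.md").write_text(routed(prompt))
    (run_dir / "direct.md").write_text(direct(prompt))
    # Open both files and read them yourself. No dashboard, no metric.
    return run_dir
```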

That was the moment. The cost-control logic that had felt so responsible on paper was costing me quality on the outputs I actually rely on. Briefings that shape my morning. Drafts that eventually reach clients. Research my decisions sit on top of.

A small savings on my AI bill was not worth worse versions of those things. It was never going to be worth that. I just hadn't measured the trade honestly until this morning.

The Pivot

I ripped the whole thing out before lunch.

Out came the routing layer. Out came the management layer that sat on top of it. Out came the monitoring stack I'd stood up to watch it all. A dozen background services unloaded. Three code directories deleted. Four cost-optimization tiers stripped out of the code that actually does my work. Almost a gig of supporting software uninstalled.

The core of my business returned to its previous running state... roughly 98% of my automated workload runs through Claude. Direct calls, no routing, no "what if we used a cheaper model for this one" logic anywhere in the path.
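For reference, the "direct path" is a single HTTP call against the Anthropic Messages API. A minimal, stdlib-only sketch; the model id is a placeholder, so check current model names against Anthropic's docs before using:

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"  # Anthropic Messages API

def build_request(prompt: str, api_key: str,
                  model: str = "claude-sonnet-4-5") -> urllib.request.Request:
    """Build the one HTTP request that replaces the whole routing stack.

    The default model id is a placeholder; substitute a current one.
    """
    payload = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )

def ask(prompt: str, api_key: str) -> str:
    """The direct path: one call, no routing, no fallback tiers."""
    with urllib.request.urlopen(build_request(prompt, api_key)) as resp:
        body = json.load(resp)
    return body["content"][0]["text"]
```

That is the entire inference layer now: one function, one vendor, one failure mode.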

Note

This is not me abandoning my interest in local models. Local and open source models will only become more important as the frontier model landscape continues to shift. I still have about 15 models available to me, but they're largely experimental at this point and not involved in the production path of my business. For an operation like mine, Claude is the pragmatic choice: the highest-quality output I can get, effectively zero operational overhead, predictable cost I can plan around (at least, for the time being).

The core of my automation is entirely intact. The dispatch system still runs. The worker agents still work. The self-healing processes, the message board, the briefings, the whole operational layer. All of it still functions. It just takes the direct path now instead of the clever one.

The briefing this morning ran about fifteen percent slower without the optimization layer in place. I will take that trade every day of the week in exchange for the quality I got back and the weekend maintenance I no longer have to do.

The Actual Lesson

Here is what I want you to walk away with, because I think it matters more than the technical story.

The consultant or partner you bring into your business should have the self-awareness to tell you when the shiny thing is not the right thing for you. Even when it's the shiny thing they wanted to build. Even when they've already invested in building it.

I had to do that to myself this morning. I built something I was proud of, watched it underperform, ran the honest comparison, and killed it the same day. That discipline matters more than any specific technology choice. The AI landscape changes every quarter. The willingness to measure, admit the answer, and adjust does not.

If you are evaluating an AI partner or an internal build, the question to ask is not "are they using the newest model" or "are they running their own infrastructure." The question is: will they tell me when the thing I'm paying them for is not the right answer? Will they kill their own work when the math says to?

That's the trust signal. Everything else is noise.

I'll probably circle back to local inference at some point, for specific problems where it genuinely makes sense. Privacy-sensitive data. Narrow use cases where a fine-tuned small model outperforms a general one. But it will be a deliberate return to a specific problem, not a philosophical commitment I talked myself into trying.

In the meantime, my stack is back to running on Claude. It's faster to build on. It's more reliable. The outputs are noticeably better. And I get my Saturdays back.

That's the trade I'll be keeping in mind the next time the urge strikes me to try again.
