Orbit 3: Building an Agent That Deploys Itself

A few weeks ago I told someone, in passing, that my AI agent had shipped a fix to itself overnight and that I’d woken up to a green deploy. I watched their face cycle through about three emotions. Then they asked the right question:

“Okay — but what does that actually mean?”

Fair. Let me show my work.

This is a technical tour of what I’ve actually built over the last few weeks in a project I call Orbit 3. It is — I’ll just say it up front — the most personally useful piece of software I’ve ever written. Not because the ideas are novel. Because the pieces finally fit.

What Orbit 3 Actually Is

Strip the marketing off: Orbit 3 is a directed acyclic graph (DAG) execution engine for agentic work, with two main job shapes.

Code jobs. Spawn one or more coder agents in isolated git worktrees, run a verify gate, merge to trunk through a serialized merge train, and optionally deploy with smoke tests and auto-rollback.
Content pipelines. Scheduled, multi-step workflows — research, draft, review, publish — driven by the same engine on a cron-like trigger.

The chat UI on top of it is rebuildable surface area. The engine is the product.

That framing matters because once you treat the chat agent as just one consumer of the engine — the same engine that runs your TWIE newsletter pipeline, the same engine that runs your coder jobs — a lot of design pressure resolves itself. You stop asking “how do I make the chat smarter” and start asking “what jobs would I want to spawn from the chat.”

Okay. With that framing, the tour.

The First Dogfood Run

On May 23rd, Orbit 3 executed its first real end-to-end dogfood run: a coder agent picked up an instruction, worked in its own git worktree, committed, handed off to the integrator, passed the verify gate, merged to trunk, and ran a deploy with a smoke test. No human in the loop after the spawn.

That’s the boring summary. The interesting part was what broke first.

The coder finished. The integrator looked at the worktree, saw no new commits ahead of base, and politely refused to merge “nothing.” The coder had done the work — it just hadn’t actually committed it. The worktree was full of staged changes the model had described in its final message instead of saving with git commit.

This is what I now call the coder-must-commit invariant, and it’s enforced in two places:

In the worker prompt. The coder is told, in plain language and with examples, that the only signal of completion is a commit on its branch. Describing what you did doesn’t count.
In the integrator. A branch-ahead guard checks that the coder’s branch has at least one commit ahead of the base. If it doesn’t, the run fails with a clear error instead of silently merging an empty change.
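
Here is a minimal sketch of what the integrator-side guard could look like, assuming the check counts commits with git rev-list. The function names and the injected runGit helper are hypothetical and are not Orbit 3’s actual code.

```typescript
// Hypothetical names throughout; only the idea (refuse to merge a branch
// with zero commits ahead of base) comes from the post.

/** Parse the output of `git rev-list --count <base>..<branch>`. */
function parseAheadCount(stdout: string): number {
  const text = stdout.trim();
  if (!/^\d+$/.test(text)) {
    throw new Error(`unexpected rev-list output: ${JSON.stringify(stdout)}`);
  }
  return Number.parseInt(text, 10);
}

/** Fail loudly instead of merging an empty change. */
function requireCommitsAhead(ahead: number, base: string, branch: string): void {
  if (ahead < 1) {
    throw new Error(`branch ${branch} has no commits ahead of ${base}; refusing to merge`);
  }
}

/** `runGit` is injected: it runs git with the given args and resolves to stdout. */
async function branchAheadGuard(
  runGit: (args: string[]) => Promise<string>,
  base: string,
  branch: string,
): Promise<number> {
  const stdout = await runGit(["rev-list", "--count", `${base}..${branch}`]);
  const ahead = parseAheadCount(stdout);
  requireCommitsAhead(ahead, base, branch);
  return ahead;
}
```

In a real integrator, runGit would wrap a non-blocking child-process call, and the thrown error would fail the run before the merge train touches trunk.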

Belt and suspenders. The model is fast and capable, and occasionally it lies about what it did. The guard makes sure the lie can’t succeed.

Self-Deploy, Same Day

Later on May 23rd, I did something I’d been quietly afraid of: I registered Orbit 3 itself as a project in Orbit 3, and asked it to modify its own code.

It did. It merged. And then it deployed itself.

The trick — and it is a trick, the kind you only learn by getting burned — is that the deploy step can’t run inside the process you’re about to restart. If your deploy script lives in the same Node/Bun process that’s serving requests, the moment you tell systemd to restart the unit, your script dies mid-step. You never get the smoke test. You never get the rollback. You get a crash and a coin flip.

So the deploy step writes a small detached shell script and hands it to systemd via systemd-run. That script lives outside the server process. It restarts the unit, waits, runs smoke tests, and either declares victory or rolls back to the previous trunk. The whole thing leaves a verdict in deploy.log — which is the source of truth, because the process making the decision can’t be trusted to still exist when the decision needs reading.

If that sentence sounded paranoid, good. It should. Self-deploying systems are a tight little nest of “the thing that knows is the thing that just died.”

The other lesson from that day: spawnSync blocks the server. Naively running a deploy script with spawnSync from inside the agent loop will freeze every other request the server is handling, for as long as the deploy takes. I had to learn that twice before I believed it.
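
To make the detached deploy concrete, here is a rough sketch under stated assumptions. The unit names, paths, smoke-test URL, five-second wait, and git-based rollback are all hypothetical; the only parts taken from the description above are the shape of the script (restart, wait, smoke-test, roll back, record a verdict in deploy.log) and the hand-off to systemd-run through a non-blocking spawn.

```typescript
// Sketch only; see the assumptions listed in the paragraph above.
interface DeployPlan {
  unit: string;        // e.g. "orbit3-web.service"
  repoDir: string;     // checkout the service runs from
  previousRef: string; // trunk commit to return to if the smoke test fails
  smokeUrl: string;
  logPath: string;     // deploy.log
}

/** Render the script that runs outside the server process.
 *  Values are single-quoted for the shell and must not contain quotes. */
function renderDeployScript(p: DeployPlan): string {
  return [
    "#!/bin/sh",
    `log() { echo "$(date -u +%FT%TZ) $1" >> '${p.logPath}'; }`,
    `systemctl restart '${p.unit}'`,
    "sleep 5",
    `if curl -fsS --max-time 10 '${p.smokeUrl}' > /dev/null; then`,
    "  log 'verdict=ok'",
    "else",
    `  git -C '${p.repoDir}' reset --hard '${p.previousRef}'`,
    `  systemctl restart '${p.unit}'`,
    "  log 'verdict=rolled_back'",
    "fi",
    "",
  ].join("\n");
}

/** Arguments for `systemd-run`, which starts the script as a transient unit. */
function systemdRunArgs(transientUnit: string, scriptPath: string): string[] {
  return [`--unit=${transientUnit}`, "--collect", "/bin/sh", scriptPath];
}

// The slice of child_process.spawn this sketch needs, injected for testability.
type SpawnFn = (
  cmd: string,
  args: string[],
  opts: { detached: boolean; stdio: "ignore" },
) => { unref(): void };

/** Non-blocking launch: unlike spawnSync, this returns immediately. */
function launchDetachedDeploy(spawn: SpawnFn, transientUnit: string, scriptPath: string): void {
  spawn("systemd-run", systemdRunArgs(transientUnit, scriptPath), {
    detached: true,
    stdio: "ignore",
  }).unref();
}

type Verdict = "ok" | "rolled_back";

/** The last verdict line in deploy.log wins. */
function lastVerdict(logText: string): Verdict | undefined {
  for (const line of logText.split("\n").reverse()) {
    const m = /verdict=(ok|rolled_back)\s*$/.exec(line);
    if (m) return m[1] as Verdict;
  }
  return undefined;
}
```

The spawn function from node:child_process should fit the SpawnFn slot. Whoever wants to know how the deploy went reads the log with lastVerdict instead of asking the process that started it.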

The Test Seam

I wrote a bun test suite that covers:

The orchestrator DAG — pending → ready → running → done transitions, dependency gating, cancel-cascade behavior.
The integrator’s merge train — branch-ahead guard, dirty-checkout guard, deploy step ordering, rollback on smoke-test failure.
Auth — token issuance, scope checks, the ORBIT3_WALL_TOKEN legacy path.
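
As an illustration of the first item, here is a small sketch of the kind of state machine those orchestrator tests exercise. The Step shape, the cancelled state, and the function names are assumptions made for the example; only the pending, ready, running, done sequence, dependency gating, and cancel-cascade come from the list above.

```typescript
type StepStatus = "pending" | "ready" | "running" | "done" | "cancelled";

interface Step {
  id: string;
  dependsOn: string[];
  status: StepStatus;
}

// Legal forward transitions; anything else is a bug in the caller.
const ALLOWED: Record<StepStatus, StepStatus[]> = {
  pending: ["ready", "cancelled"],
  ready: ["running", "cancelled"],
  running: ["done", "cancelled"],
  done: [],
  cancelled: [],
};

function transition(step: Step, to: StepStatus): void {
  if (!ALLOWED[step.status].includes(to)) {
    throw new Error(`illegal transition ${step.status} -> ${to} for ${step.id}`);
  }
  step.status = to;
}

/** Dependency gating: pending becomes ready only once every dependency is done. */
function promoteReady(steps: Step[]): string[] {
  const byId = new Map<string, Step>();
  steps.forEach((s) => byId.set(s.id, s));
  const promoted: string[] = [];
  for (const step of steps) {
    if (step.status !== "pending") continue;
    if (step.dependsOn.every((dep) => byId.get(dep)?.status === "done")) {
      transition(step, "ready");
      promoted.push(step.id);
    }
  }
  return promoted;
}

/** Cancel a step and everything downstream of it that has not finished. */
function cancelCascade(steps: Step[], rootId: string): string[] {
  const cancelled: string[] = [];
  const queue = [rootId];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    const step = steps.find((s) => s.id === id);
    if (!step || step.status === "done" || step.status === "cancelled") continue;
    transition(step, "cancelled");
    cancelled.push(id);
    for (const s of steps) {
      if (s.dependsOn.includes(id)) queue.push(s.id);
    }
  }
  return cancelled;
}
```

Because every function here is plain data in and plain data out, a test can build a four-step graph in memory and assert on the exact order in which steps become ready or get cancelled.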

Each test boots an isolated temp SQLite database. None of them invoke a live LLM. The agent boundary — the place where an actual model call would happen — is mocked.

I mention this because it’s the single most important design choice in the test suite, and it’s not obvious. The temptation when you’re testing an “AI system” is to test the AI. Don’t. Test the plumbing around the AI. The plumbing has well-defined inputs and outputs and predictable failure modes. The model does not. The model is a vendor you can swap, and your tests shouldn’t depend on which one you’re using today.

Mock the agent boundary, not the agent. Test the parts you actually own.
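
A minimal sketch of that seam, with hypothetical names (AgentRunner, runCodeJob, FakeAgent) that are illustrative and not the project’s real interfaces. The agent is reduced to one narrow interface, and the test double stands in for it without calling a model.

```typescript
// Hypothetical shapes. The point is the seam, not these exact names.
interface AgentRunner {
  run(instruction: string, worktree: string): Promise<{ summary: string }>;
}

interface JobResult {
  ok: boolean;
  reason?: string;
}

/** Plumbing under test: run the agent, then believe only the commit count. */
async function runCodeJob(
  agent: AgentRunner,
  countCommitsAhead: (worktree: string) => Promise<number>,
  instruction: string,
  worktree: string,
): Promise<JobResult> {
  await agent.run(instruction, worktree);
  const ahead = await countCommitsAhead(worktree);
  return ahead > 0 ? { ok: true } : { ok: false, reason: "coder produced no commits" };
}

/** Test double: scripted and deterministic, never calls a model. */
class FakeAgent implements AgentRunner {
  calls: string[] = [];
  summary: string;

  constructor(summary: string) {
    this.summary = summary;
  }

  async run(instruction: string): Promise<{ summary: string }> {
    this.calls.push(instruction);
    return { summary: this.summary };
  }
}
```

A test can hand runCodeJob a FakeAgent that claims success and a commit counter that returns 0, then assert that the job fails anyway. That is the coder-must-commit invariant checked without a single model call.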

The Chat Can Spawn Jobs

The chat agent has direct, in-process MCP tools for spawning and monitoring orchestration jobs. When you say “go fix the wall page’s calendar coloring,” the chat doesn’t paste a command for you to run. It calls spawn_code_job directly, returns the job id, and the run shows up live in the Runs view.

This is the single feature that flipped Orbit 3 from “interesting toy” to “daily driver.”

But it broke things in a specific way you should know about if you build something similar.

Never start a second query() inside a live streaming query(). The Claude Agent SDK’s streaming query holds state. Starting a nested query while one is mid-stream is undefined behavior in the polite sense and a soft hang in the rude sense. If the chat needs to do something heavy, it spawns a job and lets the orchestrator do it — out of process, in a worker. The chat itself never blocks on model work other than its own turn.

Never block the chat loop on a synchronous tool call. This is the bigger one, and it cost me an entire afternoon. More on that in a moment.

The SSE Idle Timeout That Looked Like a Cloudflare Bug

For a while, the chat would die mid-response on long replies. The browser would report unexpected EOF, the cloudflared logs would show a connection reset, and I went looking for problems in the tunnel.

The tunnel was fine. Bun’s HTTP server has a default idleTimeout of 10 seconds. If a stream goes quiet for more than 10 seconds — say, because the model is thinking — Bun kills the connection. Cloudflare sees the dropped TCP stream and reports it the only way it knows how: as an unexpected EOF.

The fix:

Set idleTimeout: 255 on the Bun server. The value is in seconds, and 255 is the maximum Bun accepts.
Send a heartbeat over the SSE stream every few seconds during quiet periods.
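
Both halves of the fix can be sketched roughly as follows. The helper names, the five-second interval, and the exampleReply generator are assumptions for illustration. This version also sends the heartbeat on a fixed timer, not only during quiet periods, which is simpler and harmless because SSE clients ignore comment lines.

```typescript
const encoder = new TextEncoder();

/** An SSE comment line: clients ignore it, but it keeps the connection active. */
function heartbeatFrame(): string {
  return ": heartbeat\n\n";
}

/** One SSE event; multi-line payloads need a `data:` prefix on every line. */
function dataFrame(payload: string): string {
  return payload.split("\n").map((line) => `data: ${line}`).join("\n") + "\n\n";
}

/** Wrap a slow chunk source in an SSE response with a periodic heartbeat. */
function sseResponse(chunks: AsyncIterable<string>, heartbeatMs = 5000): Response {
  let timer: ReturnType<typeof setInterval> | undefined;
  const body = new ReadableStream<Uint8Array>({
    start(controller) {
      timer = setInterval(
        () => controller.enqueue(encoder.encode(heartbeatFrame())),
        heartbeatMs,
      );
      // Pump the source in the background so start() returns right away.
      void (async () => {
        try {
          for await (const chunk of chunks) {
            controller.enqueue(encoder.encode(dataFrame(chunk)));
          }
          clearInterval(timer);
          controller.close();
        } catch (err) {
          clearInterval(timer);
          controller.error(err);
        }
      })();
    },
    cancel() {
      clearInterval(timer); // the client went away
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}

// Hypothetical stand-in for a model reply that may go quiet for a while.
async function* exampleReply(): AsyncGenerator<string> {
  yield "first token";
  yield "last token";
}

// Pass this object to Bun.serve(...) in the server entry point.
const serveOptions = {
  idleTimeout: 255, // seconds
  fetch(): Response {
    return sseResponse(exampleReply());
  },
};
```

With both in place, the raised timeout covers long silences and the heartbeat keeps intermediaries from treating the stream as dead.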

The error wasn’t where it looked. The error never is. Whenever you’ve got a misbehaving stream and the proxy is taking the blame, check the origin’s idle timeout first. It is shockingly often the answer.

When MCP Tools Go “Offline”

A more subtle problem showed up later. The chat would sometimes report that an MCP tool (say, the librarian) was “offline.” It wasn’t. The tool was alive. It just couldn’t get a turn on the event loop.

The in-process MCP bridge runs alongside the chat agent. When the chat spawns a heavy code job — say, an Opus-tier coder running flat out — that job can starve the bridge for tens of seconds at a time. The bridge’s health-check pings start timing out. The chat assumes the tool is down. It says so.

The fix, in pieces:

The librarian-propose tool self-detached. Shipped 2026-05-28. The librarian’s heavy step now runs in a separate process, so the chat’s view of “is the tool alive” no longer depends on whether the librarian is currently doing real work.
The bash tool needs the same treatment. Still open. When a chat-launched bash command runs long, it still occupies the bridge. The fix is the same shape — self-detach — but it’s not done yet.

The meta-lesson — and this is the one I’d put on a poster — is that your event loop is a shared resource, and every synchronous call on it is a tax that every other consumer of the loop pays. Treat blocking calls as bugs even when they appear to work.

The Architecture Decision

Around the end of May, chat quality started degrading in a way that wasn’t obvious from logs. Streaming felt laggy. Tool calls felt sluggish. Polls were spiky.

The culprit was the same spawnSync lesson from earlier, scaled up. The orchestration engine was running inside the same Node/Bun process as the chat server. Every time the engine did something synchronous — and the engine does a lot of synchronous things, because that’s what makes it correct — the chat paid for it.

I considered a rewrite to Go. The case for it was real: better concurrency primitives, no event loop to share. The case against it was bigger: the agent layer has to be TypeScript because the Claude Agent SDK is TypeScript and the subscription-auth path I lean on isn’t portable. Splitting languages across the same project would have given me the worst of both.

The decision I landed on, and the one I’d recommend to anyone with a similar shape:

Fix async first. Audit the hot paths in the existing code for blocking calls. Make them async. This is unglamorous and pays for itself within hours.
Then carve the engine behind a process boundary. Don’t rewrite the engine. Just move it into its own OS process and let SQLite be the command channel between the chat tier and the engine tier.
Defer the language change. Maybe forever. The agent layer stays TypeScript.

The Engine Carve-Out

Finished on May 29th. The engine now runs as orbit3-engine, its own systemd unit. The web tier — chat, UI, API — runs as orbit3-web. They communicate via a SQLite command channel.

This sounds elaborate. It’s not. SQLite is a file. One process writes commands. The other process polls for them. SQLite’s WAL mode makes this safe under concurrent access: readers and the writer don’t block each other, and writes from the two processes are serialized by SQLite itself. There’s no message broker, no Redis, no Kafka, no “infrastructure.” There’s a file.
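
Here is one way such a channel could look. It is a sketch under assumptions and not the real schema: the engine_commands table, its columns, and the function names are invented for the example, and the Db interface mirrors only the small slice of Bun’s built-in bun:sqlite Database API that the sketch calls (run, and query(...) with run/get).

```typescript
// Structural stand-in for the database handle (assumed to be bun:sqlite).
interface Stmt {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): unknown;
}
interface Db {
  run(sql: string): unknown;
  query(sql: string): Stmt;
}

interface EngineCommand {
  id: number;
  kind: string;
  payload: string; // JSON text
}

function initChannel(db: Db): void {
  db.run("PRAGMA journal_mode = WAL;");
  db.run(`CREATE TABLE IF NOT EXISTS engine_commands (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    kind       TEXT NOT NULL,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
  );`);
}

/** Web tier: write the command and return immediately. */
function sendCommand(db: Db, kind: string, payload: unknown): void {
  db.query("INSERT INTO engine_commands (kind, payload) VALUES (?, ?)")
    .run(kind, JSON.stringify(payload));
}

/** Engine tier: atomically claim the oldest pending command, or null if idle. */
function claimNext(db: Db): EngineCommand | null {
  const row = db.query(
    `UPDATE engine_commands SET status = 'claimed'
     WHERE id = (SELECT id FROM engine_commands
                 WHERE status = 'pending' ORDER BY id LIMIT 1)
     RETURNING id, kind, payload`,
  ).get();
  return (row as EngineCommand | null | undefined) ?? null;
}

/** One polling pass: handle everything that is currently pending. */
function drainCommands(db: Db, handle: (cmd: EngineCommand) => void): number {
  let handled = 0;
  for (let cmd = claimNext(db); cmd !== null; cmd = claimNext(db)) {
    handle(cmd);
    handled++;
  }
  return handled;
}
```

The engine process would call drainCommands on a short timer, and the web tier only ever calls sendCommand. Doing the claim in a single UPDATE ... RETURNING statement means a command cannot be picked up twice, even if a second consumer is added later.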

The rollout was four slices:

Move engine code into its own entry point, run it under a separate process, still in the same repo and the same deploy.
Add the SQLite command channel and switch the chat → engine path to use it.
Add the systemd drop-in so engine restarts don’t take the web tier down with them.
Cut over live on the home server. Old engine in-process code path removed.

The rollback recipe — and you should write the rollback recipe before the rollout, every time — is: remove the systemd drop-in, disable the engine unit, restart the web service. The old code path is gone but the schema is forward-compatible. If I’d needed to back out, the worst case was a few minutes of downtime, not a database migration.

The follow-up I haven’t done yet: the systemd drop-in lives as a separate file on disk that the deploy script writes. It should be vendored into the repo so a fresh checkout of the project gives you a working production setup. That’s a tomorrow problem. Today the cut-over is live and the web tier is finally insulated from the engine loop.

The Runs Page and the Polling Firehose

The Runs view shows live job status. The first version polled the orchestrator every second. That worked fine — until I left a Runs tab open in a background browser window for a few hours and the server quietly burned CPU answering polls for a tab nobody was looking at.

The fix:

Visibility-aware polling. When the tab isn’t visible (Page Visibility API), stop polling.
Exponential backoff. If nothing’s changed for a while, slow down.
A single shared poller across components on the page, instead of each component polling on its own.
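
The three pieces fit together roughly like this. It is a sketch, not the Runs page code: SharedPoller, nextDelay, and the option names are hypothetical, and the visibility check is injected so the class does not depend on the DOM directly.

```typescript
interface PollerOptions {
  minMs: number;
  maxMs: number;
  isVisible: () => boolean;           // browser: () => document.visibilityState === "visible"
  fetchStatus: () => Promise<string>; // e.g. the serialized run list
}

/** Backoff rule: reset to the minimum on change, otherwise double up to the cap. */
function nextDelay(current: number, changed: boolean, minMs: number, maxMs: number): number {
  return changed ? minMs : Math.min(current * 2, maxMs);
}

/** One poller per page; components subscribe to it and never poll themselves. */
class SharedPoller {
  private opts: PollerOptions;
  private listeners = new Set<(snapshot: string) => void>();
  private delay: number;
  private lastSeen: string | undefined;
  private timer: ReturnType<typeof setTimeout> | undefined;
  private busy = false;

  constructor(opts: PollerOptions) {
    this.opts = opts;
    this.delay = opts.minMs;
  }

  subscribe(listener: (snapshot: string) => void): () => void {
    this.listeners.add(listener);
    if (this.listeners.size === 1) this.schedule(0);
    return () => {
      this.listeners.delete(listener);
      if (this.listeners.size === 0) this.clear();
    };
  }

  /** Call from a `visibilitychange` handler when the tab becomes visible again. */
  wake(): void {
    if (this.listeners.size === 0) return;
    this.delay = this.opts.minMs;
    this.schedule(0);
  }

  private clear(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  private schedule(ms: number): void {
    this.clear();
    this.timer = setTimeout(() => void this.tick(), ms);
  }

  private async tick(): Promise<void> {
    // A hidden tab stops here and does not reschedule until wake() is called.
    if (this.busy || !this.opts.isVisible()) return;
    this.busy = true;
    let changed = false;
    try {
      const snapshot = await this.opts.fetchStatus();
      changed = snapshot !== this.lastSeen;
      this.lastSeen = snapshot;
      if (changed) this.listeners.forEach((fn) => fn(snapshot));
    } catch {
      // Treat a failed poll like "no change" and keep backing off.
    } finally {
      this.busy = false;
    }
    this.delay = nextDelay(this.delay, changed, this.opts.minMs, this.opts.maxMs);
    if (this.listeners.size > 0) this.schedule(this.delay);
  }
}
```

With minMs at 1000 and maxMs at 30000, for example, an idle but visible tab settles at one request every 30 seconds, and a hidden tab sends none.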

The Runs page got quieter, the server got happier, and the firehose is gone.

The other bug I want to flag here, because it’s a category of bug worth recognizing: at one point I had cloudflared resetting connections under load. I traced it to a known tunnel configuration issue. I almost reached for the config file. I didn’t, because the production fix at that moment was simpler: restart cloudflared. The proper config change is still deferred. The rule I took away: don’t blind-edit tunnel config under stress. Restart first, fix later, when you can think.

Inheriting Sol

Now the part that made all of this possible.

A few years ago, I loaned my son my workstation when he went off to college. It was a serious machine — a Ryzen Threadripper, 128 GB of RAM, the kind of box you build because you decided once that you were tired of waiting on computers. When he finished college this spring, the box came home.

I named it Sol. Because it runs Orbit. (You can groan; it’s fine.)

The point isn’t the silicon. The point is what 128 GB of RAM and a Threadripper change about what’s feasible on a single machine. Orbit 3 had always been designed to be a home-server agent. It was just stuck pretending to be a laptop tool. With Sol back in the house, I had something I hadn’t had before:

A 24/7 machine with the CPU and RAM to run the engine, the chat agent, parallel coder worktrees, an embedded database, MCP bridges, a Cloudflare tunnel, and a kiosk endpoint simultaneously, without trading any of them off against the others.
Real uptime. Not laptop-lid-closed uptime. Actual uptime.
A box that could deploy itself overnight while I slept.

The word that kept coming to mind during that week was opened. Things I’d been putting off for ages (“yeah, someday Orbit could also do that”) suddenly had somewhere to live. The pipeline that builds my weekly email newsletter, TWIE, used to run on a personal laptop on a fragile cron. It runs on Sol now. The chat agent used to bounce between my desk and my Mac. It lives in one place now.

And then I killed DakBoard.

Replacing DakBoard

For years, the TV in my office had been running DakBoard — a perfectly fine paid SaaS kiosk app that displays a calendar, weather, photos, that kind of thing. Fine. Locked. Subscription. Couldn’t read from any of my own systems without an integration I didn’t want to build.

Once Orbit 3 had a real always-on host, I built /wall: an Orbit-native kiosk page on the same server as everything else. Pointed the TV at it. DakBoard, gone.

The key isn’t that I saved a subscription fee. It’s what owning the kiosk unlocked.

The wall reads from the same data Orbit already has. Today’s todos, this week’s goals, the calendar feed, the weather probe, my company’s stock price, the TWIE newsletter’s stats. There’s no “second system of record that has to be synced to the wall.” The wall is a view of Orbit. When I check off a todo from my phone, the wall updates. When the calendar feed gets a new event, the wall shows it.

A daily-rotation strip. A 140-pixel band between the weather and the two-column body shows a Greek word of the day and a Bible verse. DakBoard would never have shipped this. It’s ~50 lines of React on a page I own.

A live bottom ticker. A scrolling band shows email metrics, my stock ticker, daily deltas. It’s backed by a new metric_daily_snapshots table — one row per metric per calendar day — so the kiosk doubles as a quiet dashboard. Every time I walk into the office, I get a glance at how the business is doing.
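
For readers who want to picture the snapshot table, here is a hedged sketch. Only the table name and the one-row-per-metric-per-day rule come from the description above; the column names, the UTC day boundary, and the helper functions are assumptions.

```typescript
// Assumed columns. The composite primary key is what enforces
// "one row per metric per calendar day".
const CREATE_SNAPSHOTS = `
CREATE TABLE IF NOT EXISTS metric_daily_snapshots (
  metric TEXT NOT NULL,
  day    TEXT NOT NULL, -- 'YYYY-MM-DD'
  value  REAL NOT NULL,
  PRIMARY KEY (metric, day)
);`;

// Recording the same metric twice in one day overwrites that day's row.
const UPSERT_SNAPSHOT = `
INSERT INTO metric_daily_snapshots (metric, day, value) VALUES (?, ?, ?)
ON CONFLICT (metric, day) DO UPDATE SET value = excluded.value;`;

/** Calendar-day key in UTC (an assumption; local time may suit a wall display better). */
function dayKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}

interface MetricSnapshot {
  metric: string;
  day: string;
  value: number;
}

/** Latest value minus the previous recorded day, or null without two data points. */
function dailyDelta(rows: MetricSnapshot[], metric: string): number | null {
  const series = rows
    .filter((r) => r.metric === metric)
    .sort((a, b) => (a.day < b.day ? -1 : a.day > b.day ? 1 : 0));
  if (series.length < 2) return null;
  return series[series.length - 1].value - series[series.length - 2].value;
}
```

A ticker entry would then be the latest value plus dailyDelta for that metric, read from whatever rows the table holds.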

A live control surface. I can dim the wall, push a banner announcement, hide a section, or reset everything by talking to the chat agent. There’s no admin UI to log into. No build cycle. No deploy. I say “dim the wall a little” and the wall dims. This was the whole point: ad-hoc personalization without a build.

Auth tuned for an office TV. A single environment-variable token (ORBIT3_WALL_TOKEN) on the query string. No login. No session. The device is mounted on my wall in my house and never moves. Anything more than a token would be theater.
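
A token check of this kind is only a few lines. The sketch below assumes the query parameter is called token, which the description above does not specify, and the function names are hypothetical.

```typescript
/** Constant-time comparison, so response timing does not reveal how much matched. */
function safeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

/** True only when the request carries the expected token on its query string. */
function isWallAuthorized(requestUrl: string, expectedToken: string | undefined): boolean {
  if (!expectedToken) return false; // fail closed when the env var is unset
  const supplied = new URL(requestUrl).searchParams.get("token");
  return supplied !== null && safeEqual(supplied, expectedToken);
}
```

In the /wall route handler this would be called as isWallAuthorized(req.url, process.env.ORBIT3_WALL_TOKEN), returning a 401 when it is false.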

Here’s the frame I’d offer: DakBoard didn’t lose because Orbit beat its feature list. It lost because once I had a server that could host the kiosk endpoint, the integration cost of building one inside Orbit collapsed. Sol turned “one more tool I pay for” into “one more route on the system I already own.”

That’s a pattern worth naming. When you own the platform, every new feature is a route. When you don’t, every new feature is an integration.

What’s Still Rough

In the spirit of not pretending — here’s the honest list of what I haven’t fixed yet.

The bash MCP tool still blocks the bridge under heavy chat-launched jobs. The librarian-propose tool was the prototype for the fix; bash needs the same self-detach treatment.
The systemd drop-in isn’t vendored. A fresh checkout of the repo doesn’t give you a working production setup. You’d have to know to write the drop-in. That’s a footgun.
The cloudflared origin-reset config fix is deferred. Restarting the tunnel works as a stopgap, but the proper fix is a config change I haven’t made.
I don’t have a real test for the deploy-and-rollback path. It’s been exercised plenty in production, which is one kind of testing. It’s not the kind I’d recommend.

That’s the state of it. Honest, warts and all.

Lessons, Stated Plainly

If you were going to take four things from this and put them somewhere you’d see them again, take these:

Detach long-running side effects from the request that triggered them. Deploys, heavy MCP tools, anything that lives longer than a few seconds. If it can outlive the process that started it, it should.
The event loop is a shared resource. Treat blocking calls on it as bugs even when they appear to work. Especially when they appear to work.
Logs beat in-memory state when processes can die mid-decision. deploy.log is the verdict because the process making the decision can’t be trusted to still exist when someone needs to read it.
Mock the agent boundary, not the agent. Test the plumbing. The model is a vendor you can swap. The plumbing is what you own.

And one more, the one this whole tour is really about: when you own the platform, every new feature is a route. That’s the difference between paying a subscription for an office kiosk and writing a 140-pixel React strip that shows a Greek word every morning.

The Greek word today, in case you’re curious, is χάρις.

Grace.

Fitting, for a project that started as a chat agent and ended up as the thing that quietly deploys itself while I sleep.