For about fourteen months I was the only person on the rotation for a system that wasn’t allowed to go down. Officially I had backup — a manager who would have answered the phone if it really mattered, an upstream team I could file tickets against during business hours. Practically, every page was mine, every diagnosis was mine, and every decision about what to do at 3am was mine.
This was, in the standard sense, a bad situation. I learned more in that year than in the three before it combined. Both of those things are true.
The first thing that changes is your relationship with your own tooling
When there’s a senior engineer on the other end of a call, sloppy tools are a personal annoyance. When there isn’t, sloppy tools are a liability. The runbook that “works for me” because I can fill in the missing context doesn’t work at 4am for the version of me who’s been awake for forty minutes and is missing their glasses.
I started writing every runbook for the worst version of myself. Tired me. Day-six-of-the-flu me. The runbook had to be executable by that person, and any step that required judgment had to spell out the judgment criteria explicitly. If you see X, do Y. If you see X and also Z, do W. If you see anything not on this list, page my manager.
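To make that concrete, here is a sketch of what “spell out the judgment criteria” looks like when you push it all the way to code. Everything in it is hypothetical; the symptom names, actions, and step numbers stand in for whatever a real runbook would branch on.

```python
# Hypothetical triage sketch: a runbook's judgment criteria written as
# explicit branches, so the 4am version of me never has to improvise.
# Symptom names, actions, and step numbers are invented for illustration.

def triage(symptoms: set[str]) -> str:
    """Map the observed symptoms to exactly one runbook action."""
    if {"queue_depth_high", "consumer_lag_rising"} <= symptoms:
        return "restart the consumer pool (step 4), re-check in 10 minutes"
    if "queue_depth_high" in symptoms:
        return "scale the worker pool up one step (step 3)"
    if "disk_pressure" in symptoms:
        return "rotate logs and confirm free space (step 7)"
    # Anything not on the list is, by design, not tired-me's call to make.
    return "page the manager; do not improvise"

print(triage({"queue_depth_high"}))  # -> scale the worker pool up one step (step 3)
```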
The runbooks got long. They also got used. By me, mostly, by the version of me I was writing them for.
The second thing is that your time horizon collapses
Normally, engineering work runs on a multi-week clock. Solo on-call shrinks that to a multi-hour clock for everything that touches the production system. You stop shipping changes you can’t immediately revert. You stop deploying on Fridays, even when you swear you’ll have time to babysit the release. You stop working on the bigger refactor because the cost of being mid-flight when something breaks is too high.
This sounds bad. It is, mostly. But it forces a kind of discipline about scope that I didn’t have before. Every change had to be small enough that I could roll it back during a page without losing the thread. Every feature had to ship behind a flag I could disable without a deploy. Every database migration had to be reversible in practice, not just in theory.
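A minimal sketch of that flag pattern, assuming the flags live in some mutable runtime store (a file, a config service, a database row) rather than baked into the binary. The path, flag name, and pricing functions here are all hypothetical:

```python
import json
from pathlib import Path

# Sketch of a kill-switch flag read at request time from mutable state,
# so flipping it means editing the store, not deploying. The file path,
# flag name, and pricing functions are hypothetical stand-ins.
FLAG_FILE = Path("/etc/myapp/flags.json")  # e.g. {"new_pricing_engine": true}

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read the flag fresh on every call; a long-lived cache defeats the point."""
    try:
        return bool(json.loads(FLAG_FILE.read_text()).get(name, default))
    except (OSError, ValueError):
        return default  # unreadable store fails toward the known-good path

def price_v1(payload: dict) -> dict:
    return {"total": payload.get("qty", 0) * 100}  # known-good path

def price_v2(payload: dict) -> dict:
    return {"total": payload.get("qty", 0) * 95}   # the risky new path

def handle_request(payload: dict) -> dict:
    if flag_enabled("new_pricing_engine"):
        return price_v2(payload)
    return price_v1(payload)
```

Disabling the risky path during a page is then one edit to the store, reversible in seconds, with no deploy mid-incident.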
The codebase that came out of that year was the most operable codebase I’ve ever worked on. Not by accident.
The third thing is that you become brutal about automation
When there’s a team, “we should automate this” is a backlog item. When you’re solo, “we should automate this” is a thing you’re going to do tomorrow because if you don’t, you’re going to do the manual version of it three more times before the quarter ends, each time at an inconvenient hour.
The threshold dropped. Anything I did manually twice got a script the third time. Anything that woke me up twice got an alert silencer with a TTL and a calendar reminder to fix the underlying cause. Anything I had to remember became a cron job, a healthcheck, or a config validation that ran at deploy time.
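The deploy-time config validation, for instance, can be as small as this. The required keys and bounds below are invented, but the shape is the point: fail the deploy at 2pm instead of fielding the page at 3am.

```python
import json
import sys

# Sketch of a deploy-time config check, run from the deploy script so a
# bad config fails the deploy instead of paging me later. The required
# keys and the pool_size bounds are hypothetical.
REQUIRED = {"db_url": str, "pool_size": int, "timeout_s": (int, float)}

def validate(cfg: dict) -> list[str]:
    errors = []
    for key, expected in REQUIRED.items():
        if key not in cfg:
            errors.append(f"missing key: {key}")
        elif not isinstance(cfg[key], expected):
            errors.append(f"{key}: wrong type {type(cfg[key]).__name__}")
    if isinstance(cfg.get("pool_size"), int) and not 1 <= cfg["pool_size"] <= 100:
        errors.append("pool_size out of range [1, 100]")
    return errors

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        cfg = json.load(f)
    if problems := validate(cfg):
        print("\n".join(problems), file=sys.stderr)
        sys.exit(1)
```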
I shipped more automation in that year than in any year since, before or after. Not because I was trying to. Because every minute I didn’t spend automating was a minute the system was extracting from me directly.
The thing nobody tells you
The hard part wasn’t the work. It was the mental load of being always-on.
You can’t fully sleep when you might get paged. You can’t fully be present at dinner when your laptop is on the table. Vacation requires either explicitly handing off coverage (and sometimes there was no one to hand it to) or accepting that you might be debugging from a hotel lobby. The tax isn’t the pages themselves; it’s the way your brain refuses to fully exhale.
What worked, partially:
- Hard boundaries on tooling. Pager went to one app, on one device. Other comms could wait.
- Aggressive auto-resolution. A page that resolved itself before I got to it was a page I didn’t have to think about. Half my alert work was making this true (there’s a sketch of the idea after this list).
- A weekly “what should never have paged me” review. Sunday morning, ten minutes, the previous week’s pages. Anything I shouldn’t have been woken up for got the underlying issue fixed or the alert tuned.
- A no-deploy window from Friday afternoon to Monday morning. Even when I was confident. Especially when I was confident.
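The auto-resolution item above is mostly a policy change: page on a sustained condition, not a spike. Here is a hand-rolled sketch of that debounce; the probe, thresholds, and pager hook are hypothetical, and most alerting systems let you express the same thing declaratively as a hold duration on the alert rule.

```python
import time

GRACE_SECONDS = 300   # the condition must persist this long before anyone is paged
POLL_SECONDS = 30

def condition_is_bad() -> bool:
    """Hypothetical health probe; replace with a real check."""
    return False

def page_human(message: str) -> None:
    print(f"PAGE: {message}")  # stand-in for the real pager integration

def watch() -> None:
    bad_since = None
    while True:
        if condition_is_bad():
            bad_since = bad_since or time.monotonic()
            if time.monotonic() - bad_since >= GRACE_SECONDS:
                page_human("condition persisted past the grace window")
                bad_since = None  # reset so we don't re-page every poll
        else:
            bad_since = None  # self-resolved: nobody gets woken up
        time.sleep(POLL_SECONDS)
```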
What didn’t work:
- Telling myself it was fine. It wasn’t fine. It got better when I stopped pretending.
What I’d tell anyone in the same spot
Don’t romanticize it. Solo on-call is not a rite of passage. It’s a staffing problem that has been turned into your problem, and it has a cost that doesn’t show up on any dashboard.
But while you’re in it: write down everything. Automate aggressively. Ship small. Refuse the heroic late-night fix in favor of the boring rollback. The system will outlast your tenure on call, and the next person on the rotation — even if it’s still you, six months from now — will thank the version of you who took the time.
The thing I’m proudest of from that year isn’t an uptime number. It’s that when I finally got a partner on the rotation, the handoff took an afternoon. Everything they needed was written down. Nothing important lived only in my head.
That’s the real benchmark for solo on-call: when you’re not solo anymore, how much of the system have you handed off? If the answer is “everything,” you did it right.