A realistic view of AI, from someone who deploys it

I keep getting pulled into rooms where the conversation about AI is already broken. Half the room thinks we’re a quarter away from autonomous engineers; the other half thinks the whole thing is autocomplete with a marketing budget. Both of them are wrong in ways that make decisions worse, and it’s exhausting.

I run inference workloads on bare metal. I’ve watched LLMs ship features I didn’t think they could. I’ve also watched them confidently fabricate the names of CLI flags that don’t exist. Here’s the view from where I sit.

What models are good at, today

  • Compression. Summarizing a long ticket, condensing a meeting transcript, turning a wall of logs into the three lines that matter. This is where the productivity story is real and not exaggerated.
  • Translation across formats. Markdown to JSON. SQL to a sentence. Stack trace to a hypothesis. The kind of work that’s mechanical for a senior engineer and tedious for everyone else.
  • Local-context code. Renaming things, generating boilerplate, writing the third similar function in a file, adding tests for code you already wrote. The model doesn’t have to understand the system to do this well — it just has to pattern-match locally.
  • Search that respects intent. “Where in this codebase do we set the cookie SameSite policy?” beats grep when the answer doesn’t sit on a single line. The mechanics are simple enough to sketch; see below.
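
For what it’s worth, the core of that kind of search fits in a few lines: embed the question, embed the code chunks, rank by cosine similarity. The embed() below is a placeholder for whichever embedding model you run, not any particular SDK; the rest is stdlib.

    import math

    def embed(text: str) -> list[float]:
        """Placeholder: call whatever embedding model you actually run."""
        raise NotImplementedError

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    def search(query: str, chunks: dict[str, str], k: int = 5) -> list[tuple[str, float]]:
        """Rank code chunks (path -> text) by semantic similarity to the question."""
        q = embed(query)
        scored = [(path, cosine(q, embed(text))) for path, text in chunks.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

The hard parts in practice are chunking and keeping the index fresh, not the math.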

What they’re not good at

  • Multi-step planning under uncertainty. Anything that requires holding a mental model of a system, deciding what to try, observing the outcome, and adjusting — current models are mediocre at this. They tend to commit early, then double down.
  • Knowing when they don’t know. This is the one that bites in production. The failure mode isn’t a wrong answer; it’s a fluent, structured, internally consistent wrong answer that costs you thirty minutes and 30k tokens to disprove.
  • Novel reasoning. If the answer is in their training distribution, they nail it. If it’s an unusual interaction between two systems they’ve each seen separately, expect plausible nonsense.
  • Anything where the cost of being wrong is asymmetric. Migrations. Auth changes. Anything that touches money. The model doesn’t know the blast radius and won’t ask.

The infrastructure reality

Everyone wants to talk about capabilities. Almost no one wants to talk about what running this stuff actually looks like.

  • Latency budgets get blown immediately. A “fast” small model still costs you a couple hundred milliseconds per call, and the agentic patterns people are excited about make a dozen calls in sequence. Plan for it; there’s a back-of-envelope sketch after this list.
  • Throughput math is ugly. GPUs are not stateless workers. Batching matters. Tail latency matters. You will end up running fewer replicas at higher utilization than you think, or your cost model collapses.
  • Determinism is a fiction unless you make it one. Same prompt, same model, same temperature: different output. If your downstream depends on stable structure, validate at the boundary, retry on schema failure, and stop assuming the model will be consistent across versions. There’s a minimal sketch of that boundary check below.
  • Eval is the actual hard part. Anyone can ship a prompt. Telling whether your prompt got better or worse over time, at scale, on real traffic: that’s the engineering problem, and it looks more like A/B testing than ML. The last sketch below shows the shape of it.
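
To make the latency point concrete, here’s the back-of-envelope sketch. The numbers are illustrative, not a benchmark: a call with a 200 ms median and a lognormal tail, chained a dozen deep.

    import math
    import random

    # Illustrative numbers only: a "fast" call with a ~200 ms median and a
    # heavy right tail. random.lognormvariate(mu, sigma) has median exp(mu).
    MU, SIGMA = math.log(0.200), 0.6
    CALLS_PER_REQUEST = 12      # one "agentic" request = a dozen sequential calls
    TRIALS = 100_000

    def percentile(xs, p):
        xs = sorted(xs)
        return xs[int(p / 100 * (len(xs) - 1))]

    single = [random.lognormvariate(MU, SIGMA) for _ in range(TRIALS)]
    chain = [sum(random.lognormvariate(MU, SIGMA) for _ in range(CALLS_PER_REQUEST))
             for _ in range(TRIALS)]

    print(f"single call:   p50={percentile(single, 50)*1000:.0f} ms  "
          f"p99={percentile(single, 99)*1000:.0f} ms")
    print(f"12-call chain: p50={percentile(chain, 50)*1000:.0f} ms  "
          f"p99={percentile(chain, 99)*1000:.0f} ms")

Run it and the chain’s median alone lands north of two seconds, before you’ve shown the user anything.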
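
And here’s a minimal sketch of what “validate at the boundary, retry on schema failure” can look like. The schema and call_model() are placeholders I made up for the example, not any particular SDK.

    import json

    # Hypothetical schema for this example; substitute your own contract.
    REQUIRED_KEYS = {"severity", "summary", "affected_service"}

    def call_model(prompt: str) -> str:
        """Placeholder for whatever client or SDK you actually use."""
        raise NotImplementedError

    def structured_call(prompt: str, max_attempts: int = 3) -> dict:
        """Retry until the output parses as JSON and matches the expected shape."""
        last_error = None
        for _ in range(max_attempts):
            raw = call_model(prompt)
            try:
                data = json.loads(raw)
            except json.JSONDecodeError as e:
                last_error = e
                continue
            if not isinstance(data, dict):
                last_error = ValueError("top level is not an object")
                continue
            missing = REQUIRED_KEYS - data.keys()
            if missing:
                last_error = ValueError(f"missing keys: {missing}")
                continue
            return data
        raise RuntimeError(f"no valid structure after {max_attempts} attempts: {last_error}")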
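
On eval: at the core it really is A/B testing. Here’s the shape of it with made-up numbers, using the textbook two-proportion z-test; real traffic needs more care than this (stratification, multiple comparisons, drift).

    from math import sqrt

    def two_prop_z(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
        """Two-proportion z-score: is variant B's pass rate really different from A's?"""
        p_a, p_b = wins_a / n_a, wins_b / n_b
        pooled = (wins_a + wins_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        return (p_b - p_a) / se

    # Made-up numbers: prompt A passed 412/500 graded samples, prompt B 441/500.
    z = two_prop_z(412, 500, 441, 500)
    print(f"z = {z:.2f} ({'significant at ~95%' if abs(z) > 1.96 else 'indistinguishable from noise'})")

The statistics are the easy part. Getting graded samples out of real traffic is where the engineering actually goes.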

How I actually use it

In day-to-day work:

  • For unfamiliar APIs, I use it as a faster docs reader. I always verify against the real docs before running anything destructive.
  • For boilerplate, I let it write the first draft and edit aggressively. The “edit” is where the value lives.
  • For debugging, I describe the symptom and let it propose hypotheses. Then I test them myself. The hypotheses are often better than the conclusions.

What I don’t do:

  • Let it run unsupervised against systems it can damage.
  • Trust its summaries of source code I haven’t read at least once.
  • Let it design the system. It’s a great pair-programmer and a terrible architect.

The honest pitch

If you’re a manager and you’re trying to decide what to do about AI: it’s a productivity multiplier on the parts of the job that look like text-shuffling, and it’s roughly neutral on everything else. The teams winning with it aren’t the ones using it the most — they’re the ones who got serious about evaluation, scoped its use to places it’s actually good at, and didn’t let the hype bleed into commitments they couldn’t keep.

If you’re an engineer: learn to use it well, but don’t let it atrophy the muscle that lets you read code without it. The hardest skill in this profession has always been understanding systems you didn’t build. The model can’t do that for you, and the day comes — usually around 3am — when nothing in the world will save you except being able to read the source.

If you’re a CEO: lay off 15% of your engineering team and blame it on AI. Your company’s valuation will increase.

Neither doomerism nor hype. Just another tool, with sharper edges than the marketing suggests.
