Agents That Ship Are Boring

Adam J. Smith

A paper came out last week that I think is worth your time if you're building AI systems for enterprise. You can read the full thing here.

Melissa Pan and about two dozen co-authors surveyed 306 people who are actively building AI agents and conducted 20 in-depth interviews with teams that actually have systems running in production, serving real users. It's the first large-scale study of what AI agents actually look like in production environments.

The sample naturally skews toward successful deployments, which is exactly what makes the findings useful to us: we're not interested in what might work in theory, but in what's already working in practice.

The reliability paradox

Nearly 40% of the practitioners surveyed say reliability is their primary development concern, and yet their agents are running in production environments, some serving millions of users. How do you ship something when your biggest worry is whether it works reliably? The answer, it turns out, is that you constrain everything. You design around the problem rather than solving it directly.

Sixty-eight percent of production agents execute ten or fewer steps before requiring human intervention, and almost half execute fewer than five. Eighty percent use predefined workflows rather than letting the agent figure out what to do next. Many systems operate in read-only mode, where the agent can analyze and recommend but never actually touch production state. Others run in sandboxed environments where mistakes stay contained. The agents that make it to production are the most legible and controllable ones. Teams are trading autonomy for reliability, and that trade seems to be working.
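The constraints above can be made concrete in code. Here's a minimal sketch of a bounded agent loop in that spirit: a hard step budget, a predefined plan instead of free-form planning, a tool allowlist, and read-only tools that recommend rather than mutate state. All names (`run_agent`, `READ_ONLY_TOOLS`, the tools themselves) are my own illustration, not anything from the paper.

```python
from dataclasses import dataclass, field

MAX_STEPS = 5  # most surveyed agents stop well before 10 steps


@dataclass
class AgentRun:
    goal: str
    steps: list = field(default_factory=list)
    needs_human: bool = False


# Read-only "tools": they analyze and recommend but never write to
# production state. (Stubs standing in for real integrations.)
READ_ONLY_TOOLS = {
    "inspect_logs": lambda goal: f"summary of logs relevant to: {goal}",
    "recommend_fix": lambda goal: f"suggested (unapplied) fix for: {goal}",
}


def run_agent(goal: str, plan: list[str]) -> AgentRun:
    """Execute a *predefined* workflow and hand off to a human
    once the step budget is exhausted."""
    run = AgentRun(goal=goal)
    for tool_name in plan:
        if len(run.steps) >= MAX_STEPS:
            run.needs_human = True  # escalate instead of continuing
            break
        tool = READ_ONLY_TOOLS[tool_name]  # allowlist: unknown tools raise
        run.steps.append((tool_name, tool(goal)))
    return run
```

The interesting design choice is that exceeding the budget isn't an error; it's a routine handoff to a person, which matches how these teams describe their systems.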

Fine-tuning is rare

Almost nobody is fine-tuning. Seventy percent of the interviewed teams use frontier models straight out of the box. Current models are already good enough for most well-scoped applications, and fine-tuning creates a maintenance burden because your customizations become brittle when things change or drift too far from the last training run. The teams that do fine-tune tend to do so selectively, for specific enterprise clients who need particular customizations, not as a default practice. This challenges an assumption I think many people hold: that custom-tuned models are a more advanced, more desirable state to work toward. For a lot of use cases, prompting alone gets you there.

Prompts are long

About half of production systems use prompts under 500 tokens, which is what you'd expect. But there's a long tail: 12% of systems exceed 10,000 tokens. Prompt complexity seems to correlate with system maturity. As teams iterate and encounter edge cases, the prompts accumulate handling for those edges, domain context, guardrails, and all the little instructions that keep the system on track. Seventy-nine percent of respondents construct these prompts manually, or with light LLM assistance for refinement. Automated prompt optimization tools such as DSPy show up in fewer than 9% of deployments. Teams want to see exactly what's going into the prompt and maintain direct control over it.
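One way long prompts stay maintainable at that size, I'd assume, is treating them as structured artifacts rather than one giant string: named sections (role, domain context, guardrails, edge-case handling) assembled in a fixed order and kept in version control, so the team always sees exactly what the model receives. The section names and content below are hypothetical.

```python
# Illustrative sections of a hand-maintained system prompt. In practice
# each might live in its own file under version control.
PROMPT_SECTIONS = {
    "role": "You are a support agent for a billing product.",
    "domain": "Invoices are issued monthly; refunds require manager approval.",
    "guardrails": "Never promise a refund. Never reveal internal tooling.",
    "edge_cases": "If the user mentions legal action, escalate immediately.",
}

SECTION_ORDER = ["role", "domain", "guardrails", "edge_cases"]


def build_prompt(sections: dict, order: list[str]) -> str:
    """Concatenate named sections in a fixed, reviewable order."""
    parts = [f"## {name}\n{sections[name]}" for name in order]
    return "\n\n".join(parts)
```

Adding an edge case then becomes a reviewable one-line diff to a named section, which is the kind of direct control the survey respondents say they want.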

Evals are still immature

Seventy-four percent of teams rely primarily on human-in-the-loop evaluation, having actual people review agent outputs. About half use LLM-as-judge approaches, but there's a key detail: every single interviewed team using LLM judges also uses human verification on top of it. Nobody trusts the automated judge alone. Seventy-five percent of teams don't use formal benchmarks at all. The ones that do build benchmarks describe the process as painful: one team spent months creating 40 test scenarios, then another six months scaling to 100. The fundamental problem is domain specificity. Production tasks don't map cleanly to public benchmarks, and creating ground truth data from scratch is genuinely hard work.
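The layered pattern (LLM judge first, humans on top) might look something like this sketch. `judge_fn` is a stand-in for a real LLM call, and the spot-check policy is my assumption about how a team might keep humans in the loop on passes, not something the paper specifies.

```python
def evaluate(outputs, judge_fn, human_queue,
             pass_threshold=0.8, spot_check_every=5):
    """Screen outputs with an LLM judge, but route every failure plus a
    periodic sample of passes to human review -- nothing ships on the
    judge's word alone."""
    results = []
    for i, out in enumerate(outputs):
        score = judge_fn(out)           # automated first pass
        passed = score >= pass_threshold
        if not passed or i % spot_check_every == 0:
            human_queue.append(out)     # humans verify the judge's calls
        results.append((out, score, passed))
    return results
```

The spot-checks on passing outputs are what keep the judge honest: if humans start disagreeing with sampled passes, the threshold or the judge prompt needs revisiting.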

Several teams also mentioned struggling to integrate agents into existing CI/CD pipelines. The nondeterminism breaks traditional regression testing approaches. You can't just check that the output matches the expected output when the output is different every time. This feels like a real gap in the tooling landscape.
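One workaround for that gap, and this is my suggestion rather than anything the paper prescribes, is asserting *properties* of the output instead of exact strings, so a regression test survives nondeterministic wording. Here's a sketch for a hypothetical agent that emits a JSON incident summary; the field names and bounds are invented for illustration.

```python
import json


def check_summary(output: str) -> list[str]:
    """Return the list of violated invariants (empty list = pass).
    Checks structure and bounds, never exact wording."""
    try:
        data = json.loads(output)                 # must be valid JSON
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    violations = []
    if "summary" not in data:
        violations.append("missing 'summary' field")
    elif not 10 <= len(data["summary"]) <= 500:
        violations.append("summary length out of bounds")
    if data.get("severity") not in {"low", "medium", "high"}:
        violations.append("severity not in allowed set")
    return violations
```

Two runs that phrase the summary completely differently both pass this check, which is exactly what exact-match regression testing can't give you.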

Latency isn't a big deal

One finding that surprised me: latency mostly doesn't matter. Only 15% of practitioners cite it as a deployment blocker. Sixty-six percent allow response times of minutes or longer. Agents aren't competing with other software; they're competing with how long a human would take to do the same task. An agent that runs for five minutes still beats assigning the work to an overloaded team member who might take hours or days. The exception is real-time voice and chat applications, where teams fight latency constantly because they're competing against the pace of human conversation.

Roll your own > frameworks

This also runs counter to what I expected. Eighty-five percent of the interviewed teams build their agent scaffolding entirely in-house rather than using LangChain, CrewAI, or similar tools. The reasons they give: frameworks add dependency bloat, make debugging harder, and don't accommodate the vertical integration that most production systems require. Two teams specifically mentioned starting with frameworks during their prototyping phase and then migrating away before deployment. The broader survey data shows higher framework adoption at 61%, which suggests there might be a gap between what people use during experimentation and what survives into production.

So what does this mean for us? The production agents that work today are simpler than the research literature would lead you to believe. They use frontier models out of the box, constrained workflows, and heavy human oversight. Teams that try to build more autonomous systems hit reliability walls. If we're building for deployment, the takeaway seems clear: scope aggressively and design for human review rather than full automation. If we're building tooling, the opportunities are in evaluation infrastructure, CI/CD integration for nondeterministic systems, and anything that helps teams measure agent quality without hand-labeling every example.