Agents That Ship Are Boring

Adam J. Smith

A paper came out last week that I think is worth your time if you're building AI systems for enterprise. You can read the full thing here.

Melissa Pan and about two dozen co-authors surveyed 306 people who are actively building AI agents and conducted 20 in-depth interviews with teams that actually have systems running in production, serving real users. It's the first large-scale study of what AI agents actually look like in production environments.

The sample naturally skews toward successful deployments, which is exactly what makes the findings useful to us: we're not interested in what might work in theory, but in what's already working in practice.

The reliability paradox

Nearly 40% of the practitioners surveyed say reliability is their primary development concern, and yet their agents are running in production environments, some serving millions of users. How do you ship something when your biggest worry is whether it works reliably? The answer, it turns out, is that you constrain everything. You design around the problem rather than solving it directly.

Sixty-eight percent of production agents execute ten or fewer steps before requiring human intervention, and almost half execute fewer than five. Eighty percent use predefined workflows rather than letting the agent figure out what to do next. Many systems operate in read-only mode, where the agent can analyze and recommend but never actually touch production state. Others run in sandboxed environments where mistakes stay contained. The agents that make it to production are the most legible and controllable ones. Teams are trading autonomy for reliability, and that trade seems to be working.
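The constraints above can be made concrete in code. Here's a minimal sketch of a bounded agent loop in that spirit: a hard step budget, a predefined plan instead of free-form planning, a tool allowlist, and read-only tools that recommend rather than mutate state. All names (`run_agent`, `READ_ONLY_TOOLS`, the tools themselves) are my own illustration, not anything from the paper.

```python
from dataclasses import dataclass, field

MAX_STEPS = 5  # most surveyed agents stop well before 10 steps


@dataclass
class AgentRun:
    goal: str
    steps: list = field(default_factory=list)
    needs_human: bool = False


# Read-only "tools": they analyze and recommend but never write to
# production state. (Stubs standing in for real integrations.)
READ_ONLY_TOOLS = {
    "inspect_logs": lambda goal: f"summary of logs relevant to: {goal}",
    "recommend_fix": lambda goal: f"suggested (unapplied) fix for: {goal}",
}


def run_agent(goal: str, plan: list[str]) -> AgentRun:
    """Execute a *predefined* workflow and hand off to a human
    once the step budget is exhausted."""
    run = AgentRun(goal=goal)
    for tool_name in plan:
        if len(run.steps) >= MAX_STEPS:
            run.needs_human = True  # escalate instead of continuing
            break
        tool = READ_ONLY_TOOLS[tool_name]  # allowlist: unknown tools raise
        run.steps.append((tool_name, tool(goal)))
    return run
```

The interesting design choice is that exceeding the budget isn't an error; it's a routine handoff to a person, which matches how these teams describe their systems.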

Fine-tuning is rare

Almost nobody is fine-tuning. Seventy percent of the interviewed teams use frontier models straight out of the box. Current models are already good enough for most well-scoped applications, and fine-tuning creates a maintenance burden because your customizations become brittle when things change or drift too far from the last training run. The teams that do fine-tune tend to do so selectively, for specific enterprise clients who need particular customizations, not as a default practice. This challenges an assumption I think many people hold: that custom-tuned models are a more advanced, more desirable state to work toward. For a lot of use cases, prompting alone gets you there.

Prompts are long

About half of production systems use prompts under 500 tokens, which is what you'd expect. But there's a long tail: 12% of systems exceed 10,000 tokens. Prompt complexity seems to correlate with system maturity. As teams iterate and encounter edge cases, the prompts accumulate handling for those edges, domain context, guardrails, and all the little instructions that keep the system on track. Seventy-nine percent of respondents construct these prompts manually, or with light LLM assistance for refinement. Automated prompt optimization tools such as DSPy show up in fewer than 9% of deployments. Teams want to see exactly what's going into the prompt and maintain direct control over it.
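One way long prompts stay maintainable at that size, I'd assume, is treating them as structured artifacts rather than one giant string: named sections (role, domain context, guardrails, edge-case handling) assembled in a fixed order and kept in version control, so the team always sees exactly what the model receives. The section names and content below are hypothetical.

```python
# Illustrative sections of a hand-maintained system prompt. In practice
# each might live in its own file under version control.
PROMPT_SECTIONS = {
    "role": "You are a support agent for a billing product.",
    "domain": "Invoices are issued monthly; refunds require manager approval.",
    "guardrails": "Never promise a refund. Never reveal internal tooling.",
    "edge_cases": "If the user mentions legal action, escalate immediately.",
}

SECTION_ORDER = ["role", "domain", "guardrails", "edge_cases"]


def build_prompt(sections: dict, order: list[str]) -> str:
    """Concatenate named sections in a fixed, reviewable order."""
    parts = [f"## {name}\n{sections[name]}" for name in order]
    return "\n\n".join(parts)
```

Adding an edge case then becomes a reviewable one-line diff to a named section, which is the kind of direct control the survey respondents say they want.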

Evals are still immature

Seventy-four percent of teams rely primarily on human-in-the-loop evaluation, having actual people review agent outputs. About half use LLM-as-judge approaches, but there's a key detail: every single interviewed team using LLM judges also uses human verification on top of it. Nobody trusts the automated judge alone. Seventy-five percent of teams don't use formal benchmarks at all. The ones that do build benchmarks describe the process as painful: one team spent months creating 40 test scenarios, then another six months scaling to 100. The fundamental problem is domain specificity. Production tasks don't map cleanly to public benchmarks, and creating ground truth data from scratch is genuinely hard work.
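The layered pattern (LLM judge first, humans on top) might look something like this sketch. `judge_fn` is a stand-in for a real LLM call, and the spot-check policy is my assumption about how a team might keep humans in the loop on passes, not something the paper specifies.

```python
def evaluate(outputs, judge_fn, human_queue,
             pass_threshold=0.8, spot_check_every=5):
    """Screen outputs with an LLM judge, but route every failure plus a
    periodic sample of passes to human review -- nothing ships on the
    judge's word alone."""
    results = []
    for i, out in enumerate(outputs):
        score = judge_fn(out)           # automated first pass
        passed = score >= pass_threshold
        if not passed or i % spot_check_every == 0:
            human_queue.append(out)     # humans verify the judge's calls
        results.append((out, score, passed))
    return results
```

The spot-checks on passing outputs are what keep the judge honest: if humans start disagreeing with sampled passes, the threshold or the judge prompt needs revisiting.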

Several teams also mentioned struggling to integrate agents into existing CI/CD pipelines. The nondeterminism breaks traditional regression testing approaches. You can't just check that the output matches the expected output when the output is different every time. This feels like a real gap in the tooling landscape.
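One workaround for that gap, and this is my suggestion rather than anything the paper prescribes, is asserting *properties* of the output instead of exact strings, so a regression test survives nondeterministic wording. Here's a sketch for a hypothetical agent that emits a JSON incident summary; the field names and bounds are invented for illustration.

```python
import json


def check_summary(output: str) -> list[str]:
    """Return the list of violated invariants (empty list = pass).
    Checks structure and bounds, never exact wording."""
    try:
        data = json.loads(output)                 # must be valid JSON
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    violations = []
    if "summary" not in data:
        violations.append("missing 'summary' field")
    elif not 10 <= len(data["summary"]) <= 500:
        violations.append("summary length out of bounds")
    if data.get("severity") not in {"low", "medium", "high"}:
        violations.append("severity not in allowed set")
    return violations
```

Two runs that phrase the summary completely differently both pass this check, which is exactly what exact-match regression testing can't give you.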

Latency isn't a big deal

One finding that surprised me: latency mostly doesn't matter. Only 15% of practitioners cite it as a deployment blocker. Sixty-six percent allow response times of minutes or longer. Agents aren't competing with other software; they're competing with how long a human would take to do the same task. An agent that runs for five minutes still beats assigning the work to an overloaded team member who might take hours or days. The exception is real-time voice and chat applications, where teams fight latency constantly because they're competing against the pace of human conversation.

Roll your own > frameworks

This also runs counter to what I expected. Eighty-five percent of the interviewed teams build their agent scaffolding entirely in-house rather than using LangChain, CrewAI, or similar tools. The reasons they give: frameworks add dependency bloat, make debugging harder, and don't accommodate the vertical integration that most production systems require. Two teams specifically mentioned starting with frameworks during their prototyping phase and then migrating away before deployment. The broader survey data shows higher framework adoption at 61%, which suggests there might be a gap between what people use during experimentation and what survives into production.

So what does this mean for us? The production agents that work today are simpler than the research literature would lead you to believe. They use frontier models out of the box, constrained workflows, and heavy human oversight. Teams that try to build more autonomous systems hit reliability walls. If we're building for deployment, the takeaway seems clear: scope aggressively and design for human review rather than full automation. If we're building tooling, the opportunities are in evaluation infrastructure, CI/CD integration for nondeterministic systems, and anything that helps teams measure agent quality without hand-labeling every example.