Part 1 of 2 on agentic system adoption.
A 25% bump in individual AI usage correlates with a 1.5% drop in delivery throughput and a 7.2% drop in delivery stability. That's not a hot take — that's the 2024 DORA report, surveying thousands of engineers across hundreds of orgs.
Worse: in a randomized controlled trial from METR on experienced open-source developers, engineers expected AI tools to speed them up by 24%, reported a 20% speedup afterward, and were actually 19% slower. Perceived uplift and real uplift were moving in opposite directions.
If you're a platform lead or eng manager rolling an agentic tool out to your own team, those two stats should wake you up. The problem isn't that the tech doesn't work. The problem is that "works in a demo" and "works on Tuesday afternoon when Sarah is three coffees in and needs to ship a hotfix" are different problems, and most internal rollouts optimize for the first.
I've been running NEXUS — my own agentic system — as a daily driver for months. It handles content drafts, trading decisions on real money in DeFi, financial dashboards across two people's accounts, and a Slack approval workflow fanning out to four publishing platforms. I've made every mistake below at least once. Here's what I've learned about getting an internal agent past the demo cliff.
The demo cliff
Every agentic rollout has a moment. The kickoff meeting goes great. Three engineers try it, one ships something impressive, the Slack channel fills with lightning-bolt emojis. Then week three hits, and the channel goes quiet. When you audit usage, maybe 20% of the team is still touching it weekly. The rest tried it once, got burned, and went back to the old way.
That gap — between first use and fourth use — is where most internal agents die. Stack Overflow's 2024 developer survey puts it plainly: 76% of developers are using or planning to use AI tools, but only 43% trust the accuracy of the output. By early 2026, the trust number had dropped to 29%. Usage is up and to the right; trust is the opposite.
The people who keep using the agent after week three aren't the people who liked the demo. They're the people who figured out what the agent is actually good at and built a working relationship with its specific failure modes. That's the adoption you want. Here's how to design for it.
1. Pilot with skeptics, not enthusiasts
The default instinct is to recruit the AI-curious engineer — the one who already has Cursor, Claude Code, and three Ollama models running locally. That's a mistake. That engineer will adopt anything. Their feedback tells you nothing about your ceiling.
Give the pilot to the two engineers who rolled their eyes in the kickoff. The ones who said "we tried this in 2023 and it hallucinated three migrations." Those people find every failure mode on day one, and if you can earn their trust, the middle 60% of your team follows.
The pattern you want to avoid: an AI-guild pilot group reports 98% adoption after six weeks, and org-wide adoption settles at 14% six months later. The pilot group was not the org. It never is. Skeptics are signal, not noise. DORA's trust deep-dive frames this as an organizational practice — trust calibration lives with the user, not the tool, and it's built through repeated exposure to honest failure, not through marketing.
2. Prove it on work nobody cares about first
The worst place to debut an agentic system is production incident response. The second-worst is customer-facing code. The best place is cron jobs, cleanup scripts, log triage, PR description drafts, release notes, and the 20 little engineering chores that never make it onto a roadmap.
I learned this the hard way from my own content pipeline. The first agentic workflow I shipped in NEXUS wasn't "draft my LinkedIn posts." It was "scan 15 subreddits at 7am and summarize what changed." If the summary was bad, I ignored it. If it was good, I skimmed it. There was no blast radius. After three weeks of watching it get better, I trusted it enough to let it draft. Six weeks in, I trusted it enough to auto-publish short-form content with Slack approvals.
If I'd started at "auto-publish to LinkedIn," the first mistake would have been public and I'd have killed the project. Boring work is a safe harbor for building calibration.
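For concreteness, here's the shape of that kind of zero-blast-radius job: a read-only digest that costs nothing to ignore when it's bad. Everything below (the Post fields, the ranking) is an illustrative sketch, not NEXUS's actual pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    subreddit: str
    title: str
    score: int


def daily_digest(posts: list[Post], top_n: int = 3) -> str:
    """Build a skimmable morning digest: top posts per subreddit by score.

    Worst-case failure is a boring summary, which is exactly the point.
    """
    by_sub: dict[str, list[Post]] = {}
    for p in posts:
        by_sub.setdefault(p.subreddit, []).append(p)

    lines = [f"Digest for {date.today().isoformat()}"]
    for sub in sorted(by_sub):
        lines.append(f"\n## r/{sub}")
        top = sorted(by_sub[sub], key=lambda p: p.score, reverse=True)[:top_n]
        lines.extend(f"- ({p.score}) {p.title}" for p in top)
    return "\n".join(lines)
```

The agentic part (fetching, summarizing) bolts on later; the skeleton's job is to stay read-only until trust exists.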
3. Readable logs beat magic
Anthropic's engineering team published "Building Effective Agents" at the end of 2024, synthesizing lessons from dozens of production rollouts. Their three design principles: keep the architecture simple, make the agent's planning steps transparent, and invest in the tool interface. Every successful team was doing the boring version on purpose.
What this means in practice: when the agent does something, an engineer should be able to pull up a single file or dashboard and read what it did, what it saw, and why it made the call. Not a stack trace. Not a wall of JSON. A paragraph a human can skim.
My DeFi trader posts every decision to Slack with three lines: the market, the fair-value estimate the model produced, and why the position was above or below the spread. When the trader loses money — and it does — I can tell within 10 seconds whether the model was wrong, the fill was bad, or the strategy is off. That 10-second diagnosis is why I still trust it with real capital.
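A sketch of that three-line format. The field names and the buy/sell framing are my illustration here, not NEXUS's actual schema; the point is that the log is built for skimming, not parsing.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    market: str
    fair_value: float      # the model's estimate
    market_price: float    # what the venue is quoting
    action: str            # "buy", "sell", or "pass"
    reason: str            # one human-readable sentence, written at decision time


def decision_log(d: Decision) -> str:
    """Render a decision as three skimmable lines, not a wall of JSON."""
    edge = d.fair_value - d.market_price
    return "\n".join([
        f"market: {d.market}",
        f"fair value: {d.fair_value:.4f} vs price {d.market_price:.4f} (edge {edge:+.4f})",
        f"{d.action}: {d.reason}",
    ])
```

The discipline is writing the `reason` line at decision time, from the inputs the model actually saw, so the 10-second diagnosis is possible later.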
If your team has to read code to understand what the agent did, they won't. They'll just stop using it.
4. Make the override obvious
Every agentic tool needs a visible off-ramp. Not buried in settings — on the main surface. A button that says "ignore this and do it myself," a keybind that kills the agent mid-task, a config flag that puts it in suggest-only mode.
The override isn't a bailout feature. It's how trust accumulates. When engineers know they can always take the wheel, they're more willing to let the agent drive first. When they can't, they'll never get in the car.
NEXUS has an approval surface in Slack for every content post. 80% of the time I hit approve. The other 20% I edit or reject, and what was posted, what was edited, and what was rejected all persist in SQLite. I can audit the agent's batting average on demand. That audit trail is the reason I let it draft at all.
5. Default to boring
The fastest way to kill an internal agent is to make it creative. Engineers don't want the agent to surprise them. They want it to do the obvious thing, predictably, 95% of the time, and escalate the other 5%.
Pick the default behavior that would make a senior engineer nod. Default to "ask before acting." Default to the lower-risk option when two paths exist. Default to writing to a branch, not main. Default to suggesting a command, not executing it.
You can add more autonomy later, once the team trusts the agent's reflexes. You cannot take autonomy back once someone gets burned.
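Those defaults can be structural rather than aspirational. A minimal sketch, assuming a command-running agent; the mode names and signature are mine:

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"  # print the command, never run it -- the default
    ASK = "ask"          # run only after explicit confirmation
    AUTO = "auto"        # run without asking -- earned later, never the default


def plan_action(command: str, mode: Mode = Mode.SUGGEST,
                confirmed: bool = False) -> tuple[bool, str]:
    """Decide whether to execute. Every default biases toward the reversible path."""
    if mode is Mode.SUGGEST:
        return False, f"suggested: {command}"
    if mode is Mode.ASK and not confirmed:
        return False, f"awaiting approval: {command}"
    return True, f"executing: {command}"
```

The key property: reaching `AUTO` requires someone to change a config value on purpose, which is exactly the moment to ask whether the trust has actually been earned.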
How you know it's working
Hamel Husain's field guide to improving AI products, drawn from 30+ production rollouts, makes the same point I keep making to platform teams: generic metrics are useless. BERTScore, ROUGE, cosine similarity — none of it correlates with whether your team actually uses the thing on Tuesday afternoon.
What correlates: binary pass/fail evals tied to real failure modes you've observed in your own usage, tracked over time, with a human in the loop redefining "pass" as the product evolves. Husain calls this "criteria drift." It's not a one-time eval setup; it's an ongoing practice.
The leading indicator I watch for any internal agent: daily active use by engineers who aren't on the project team. Weekly active by skeptics. Ratio of accepted-without-edit to edited-or-rejected over time. Those numbers tell you whether the trust curve is going up or down. Thumbs-up counts and star emojis do not.
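A binary eval harness in the spirit of Husain's advice is deliberately small. The two checks below are hypothetical stand-ins; real ones come from failure modes you've logged in your own usage, and the dict gets rewritten as "pass" drifts.

```python
from typing import Callable

# Each eval is a named binary check tied to one observed failure mode.
Check = Callable[[str], bool]


def run_evals(outputs: list[str], checks: dict[str, Check]) -> dict[str, float]:
    """Pass rate per check. Rerun on every change; track the trend, not one score."""
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in checks.items()
    }


# Hypothetical failure modes -- yours come from your own audit trail.
checks: dict[str, Check] = {
    "mentions_ticket_id": lambda o: "JIRA-" in o,
    "under_280_chars": lambda o: len(o) <= 280,
}
```

Each check is embarrassingly simple on purpose: a human can argue with a binary verdict, which is what keeps the criteria honest as they drift.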
What's next
Internal adoption is the easier half. You control the tools, the data, the engineers, and the feedback loop. Your team shows up every day whether the agent is good or not.
External adoption — users who paid for a product and can leave the moment the agent confidently does the wrong thing — is a different problem. That's Part 2.