
The Rigor Has to Go Somewhere

AI Leadership Engineering Org Design

I used Claude (Anthropic) to help research and write this piece. The analysis and perspective are mine, but Claude did the heavy lifting on synthesizing industry data and making my draft readable.

At a retreat in Deer Valley, Utah this past February, Martin Fowler gathered fifty of the sharpest minds in software engineering to confront a question nobody had a clean answer to: if AI handles the code, where does the engineering actually go?

They spent a day and a half in open discussion. Kent Beck was there. Steve Yegge. Gene Kim. Practitioners, researchers, and enterprise leaders who collectively represent decades of thinking about how software gets built. The retreat was held on the 25th anniversary of the Agile Manifesto signing, in the same mountains, deliberately echoing that earlier inflection point.

The most honest moment came from a participant whose framing became the event’s unofficial thesis: “The rigor has to go somewhere.” If we stop caring about the code, the discipline doesn’t disappear. It migrates. To specifications. To tests. To architecture. To oversight. And to leadership.

That migration is already happening. But most organizations don’t realize it yet, because they’re measuring the wrong thing.

1. The Cognitive Debt Nobody’s Measuring

Ninety-five percent of developers now use AI coding tools weekly. AI generates roughly a quarter of all production code. By every adoption metric, the transformation is complete.

And yet. Measured productivity gains have plateaued at around 10%. A landmark study by METR found that experienced developers were actually 19% slower with AI tools, despite believing they were 20% faster. That’s a 39-point perception gap. Carnegie Mellon researchers found that Cursor adoption produces a velocity bump that fades quickly, while the increase in code complexity persists. Microsoft’s own field experiments showed real gains, but only among less experienced developers working on well-defined problems.

The tools work. But the outcomes depend almost entirely on how leaders deploy them.

Margaret-Anne Storey, a researcher at the University of Victoria, coined a term at the Deer Valley retreat that captured what the numbers were circling around: cognitive debt. The organizational knowledge loss that occurs when teams offload too much work to AI and stop understanding their own systems. The software may be working. But the team’s theory of the system, the shared understanding of why things are built the way they are, erodes quietly underneath.

Rachel Laycock of Thoughtworks offered the sharpest synthesis: “AI is just an accelerator of whatever you already have.” The 2025 DORA report confirmed it empirically. Strong engineering practices plus AI produce fewer incidents. Weak practices plus AI produce twice as many.

This is the staircase problem all over again. AI doesn’t redesign the path. It just makes you faster on whatever path you’re already on. If the path is well-designed, you get to the right floor sooner. If it isn’t, you arrive at the wrong floor faster, with more confidence, and more cognitive debt to repay later. The variable isn’t the technology. It’s whether leaders are doing the harder work of redesigning the staircase, or just polishing the steps.

2. Three Companies, Three Answers

Three companies deployed essentially the same AI coding tools and produced wildly different results. The contrast reveals which leadership patterns build durable capability and which produce fast metrics.

Three vignettes

Plaid treated AI adoption as an internal product. Engineering leaders built dashboards tracking adoption by cohort and team, messaged every engineer who stopped using the tools to understand why, and created short internal videos showing AI running on their actual codebase. Generic vendor demos didn’t move the needle. Videos of AI debugging Plaid’s own systems did. The breakthrough insight came from the data itself: teams with the highest adoption had engineering managers who were already excited about AI. Plaid stopped targeting individual engineers and started targeting EMs directly. Within six months, over 75% of engineers were regular users.

Coinbase took the opposite approach. CEO Brian Armstrong mandated all engineers onboard Cursor and Copilot within one week. The following Saturday, he met with every engineer who hadn’t complied. Those without a valid reason were fired. The numbers came fast. A third of Coinbase’s code is now AI-generated. But Stripe’s John Collison offered the counterpoint that should keep every leader up at night: “It’s clear that it is very helpful to have AI helping you write code. It’s not clear how you run an AI-coded code base.”

Stripe chose neither product thinking nor mandate. They built infrastructure. Their internal system, Minions, produces over 1,300 pull requests per week, all merged into production. The key innovation is what they call “Blueprints,” orchestration flows that alternate between deterministic code nodes and open-ended agent loops. Each agent gets at most two CI rounds before handing back to a human. Engineers learn not through training programs but through reviewing agent output. The work shifted from writing code to reviewing code, a different job entirely, enabled by architectural decisions from leadership.
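
Stripe hasn’t published the internals of Blueprints, but the shape is worth making concrete. Below is a minimal sketch, in Python, of what an orchestration flow like that could look like: a deterministic node around each agent turn, a hard cap of two CI rounds, and a human handoff when the budget runs out. Every name here (PullRequest, run_ci, agent_propose_fix) is an illustrative placeholder, not Stripe’s actual API.

```python
# Hypothetical sketch of a Blueprint-style orchestration loop.
# All names are placeholders for illustration, not a real system's API.
from dataclasses import dataclass

MAX_CI_ROUNDS = 2  # the agent gets at most two CI rounds before a human takes over


@dataclass
class PullRequest:
    description: str
    diff: str = ""
    ci_passed: bool = False


def deterministic_lint(pr: PullRequest) -> PullRequest:
    """Deterministic code node: runs the same way every time (formatting, lint)."""
    pr.diff = pr.diff.rstrip() + "\n"
    return pr


def agent_propose_fix(pr: PullRequest, feedback: str) -> PullRequest:
    """Open-ended agent node: in a real system this would call a coding model."""
    pr.diff += f"# agent revision based on feedback: {feedback}\n"
    return pr


def run_ci(pr: PullRequest) -> tuple[bool, str]:
    """Stand-in for a CI run; returns (passed, feedback)."""
    passed = "revision" in pr.diff
    return passed, "" if passed else "tests failed: no revision applied"


def hand_off_to_human(pr: PullRequest) -> None:
    print(f"Escalating to human review: {pr.description}")


def run_blueprint(pr: PullRequest) -> PullRequest:
    pr = deterministic_lint(pr)           # deterministic node before any agent work
    feedback = "initial attempt"
    for _ in range(MAX_CI_ROUNDS):        # bounded agent loop
        pr = agent_propose_fix(pr, feedback)
        pr = deterministic_lint(pr)        # deterministic node after each agent turn
        pr.ci_passed, feedback = run_ci(pr)
        if pr.ci_passed:
            return pr                      # ready for human review and merge
    hand_off_to_human(pr)                  # budget exhausted: a human takes over
    return pr


if __name__ == "__main__":
    result = run_blueprint(PullRequest(description="Update billing retry logic"))
    print("CI passed:", result.ci_passed)
```

The interesting design choice is the bounded loop: the agent never gets unlimited retries, so a human sees the work before it drifts too far from something reviewable.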

What the contrast reveals

Three companies. Same tools. Three fundamentally different leadership philosophies. Plaid built a feedback loop. Stripe built infrastructure. Coinbase built compliance.

The approaches that create durable capability share a common thread: they create conditions for learning rather than mandating compliance. Plaid’s product thinking and Stripe’s architectural investment both treat AI adoption as an organizational design problem, not a procurement problem. They invested in the staircase, not just the steps.

The approaches that produce fast metrics may or may not produce lasting results. It’s too early to know. But the early signals from the research, on cognitive debt, on the perception gap, on the persistence of complexity over the transience of velocity, suggest that compliance-driven adoption is fragile. It optimizes for the wrong variable.

The decision is a leadership decision, not a technology decision. And it’s a decision most leaders are making by default rather than by design.

3. The Role Boundary Problem Is Your Problem

Engineers were just the first

Everything above describes what happened to software engineers. They were the first job family to absorb the full force of agentic AI. But the pattern is now spreading to every role that touches the software development lifecycle, and to knowledge work more broadly.

Product managers are building working prototypes in 20 minutes using AI coding tools. Designers are launching monetized apps with zero coding experience. Business analysts are constructing dashboards that once required engineering support. LinkedIn scrapped its Associate Product Manager program entirely and replaced it with a “Full Stack Builder” program, where new hires learn to code, design, and do PM simultaneously. Applicants submit a 60-second demo of something they built. No resume required.

This raises questions that only leaders can answer. When a PM ships a vibe-coded prototype to production, who owns the security review? When a designer launches an app with no coding experience, who owns the maintenance? When a business analyst builds a dashboard feeding executive decisions, who validates the data pipeline?

The new vulnerability

A paper accepted to ICSE 2026, the premier software engineering conference, studied vibe coding practices through 101 practitioner sources. It found that 36% of vibe coders skip quality assurance entirely. Eighteen percent place uncritical trust in AI-generated output. Ten percent delegate QA back to AI, creating a circular quality loop with no human verification. The researchers concluded that vibe coding is creating “a new class of vulnerable software developers” who can build products but cannot debug them.

Anthropic’s own research reinforces the concern. A randomized controlled trial with 52 software engineers found that AI-assisted developers scored 17% lower on comprehension tests, with the largest gap on debugging, the skill that matters most when things break in production.

Microsoft’s Azure CTO Mark Russinovich and VP of Developer Community Scott Hanselman published a piece in Communications of the ACM proposing a “preceptor-based organization” model borrowed from medical education, where senior practitioners actively supervise AI-augmented juniors through structured mentorship embedded in the work itself. Kent Beck captured the deeper shift: “When anyone can build anything, knowing what’s worth building becomes the skill.”

Financial services is already in production

Financial services leaders face a particularly sharp version of this challenge, and the major banks are already deep into it. JPMorgan has made AI adoption a measurable performance requirement for its 65,000 engineers, classifying employees as “light,” “heavy,” or “non” users on internal dashboards tied to annual reviews. Goldman Sachs embedded Anthropic engineers directly inside the bank for six months to co-develop autonomous agents that now handle workflows tied to $2.5 trillion in assets under supervision.

These are not pilot programs. They are production-scale leadership decisions about how work gets done, who does it, and what “good” looks like when the tools keep changing underneath you. The firms that navigate this well will share a common trait: their leaders will have recognized that the bottleneck moved.

What Leaders Should Do Next

The next twelve months will separate the organizations that built durable AI-native capability from the ones that built impressive adoption metrics.

Three things should be on every leader’s agenda right now.

Start measuring outcomes, not adoption. Adoption rates and tool usage dashboards tell you nothing about whether your organization is actually getting better at building software. The METR perception gap is a warning that even individual developers can’t reliably tell you whether AI is making them faster. Replace adoption metrics with outcome metrics: cycle time on real features, defect rates in production, comprehension tests on systems your team is supposed to own. If you can’t measure understanding, you can’t manage cognitive debt.
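
To make the distinction concrete, here is a small illustrative sketch. The data shapes and the three-feature history are invented for the example; real numbers would come from your issue tracker and incident system. The point is only that an adoption rate and an outcome metric are computed from entirely different things.

```python
# Hedged illustration of outcome metrics vs. an adoption metric.
# Feature and its fields are made-up shapes for the example.
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Feature:
    started: datetime
    shipped: datetime
    author_used_ai: bool
    caused_production_defect: bool


def adoption_rate(features: list[Feature]) -> float:
    """The metric most dashboards report: how many changes involved AI."""
    return sum(f.author_used_ai for f in features) / len(features)


def median_cycle_time_days(features: list[Feature]) -> float:
    """Outcome metric: how long real features take, start to ship."""
    return median((f.shipped - f.started).days for f in features)


def defect_rate(features: list[Feature]) -> float:
    """Outcome metric: share of shipped features that caused a production defect."""
    return sum(f.caused_production_defect for f in features) / len(features)


if __name__ == "__main__":
    history = [
        Feature(datetime(2025, 3, 1), datetime(2025, 3, 6), True, False),
        Feature(datetime(2025, 3, 2), datetime(2025, 3, 12), True, True),
        Feature(datetime(2025, 3, 4), datetime(2025, 3, 9), False, False),
    ]
    print(f"AI adoption: {adoption_rate(history):.0%}")              # looks great
    print(f"Median cycle time: {median_cycle_time_days(history)} days")
    print(f"Production defect rate: {defect_rate(history):.0%}")     # the number that matters
```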

Redesign the role boundaries you’ve been letting drift. PMs, designers, and business analysts are already crossing into engineering territory using AI tools. The question is not whether to allow it; that ship has sailed. The question is who owns quality, security, and maintenance for cross-boundary work, and whether anyone in your organization is building the capability to handle the hard parts that remain when AI handles the easy ones. This is an org design exercise, not a training exercise. Make it explicit.

Invest in where the rigor is migrating. If AI handles the code, the engineering discipline moves to specifications, tests, architecture, and oversight. Most organizations have under-invested in all four for years. The companies that build durable capability over the next twelve months will be the ones that treat specification quality, test coverage, architectural clarity, and review practices as first-class engineering work, with the headcount, tooling, and career paths to match.

The tools will keep getting better. The models will keep getting more capable. None of that will matter if leaders don’t decide where the rigor goes. Engineering teams have spent the last eighteen months running the experiment for everyone else. The lesson is clear, and the window to act on it is narrower than most leaders think.

The rigor has to go somewhere. The question is whether your organization is sending it somewhere intentional, or letting it dissipate into the gap between roles that no one redesigned.


