I'm still in Australia as I write this.

Two weeks of meeting founders and CTOs. Fourteen conversations so far.

Every single one had a strong opinion about AI-generated code.

The conversations kept splitting into two camps. And I think the gap between those camps, where almost nobody is talking, is where the actual answers live.

The two camps

Camp A: "People are shipping real apps with no dev background."

One of my clients (an Aussie founder working out of Bali) told me he's watching entrepreneurs around him build apps with thousands, sometimes hundreds of thousands of users, with zero engineering knowledge.

And he's not wrong. They are. The tools have gotten good enough that a non-technical person can prompt their way to a working product. People in this camp have seen it happen.

They believe, sincerely, that engineering teams are becoming optional.

Camp B: "We tried it. The gains were marginal at best."

These are the CTOs who rolled out Copilot across their team, expected a step change, and got incremental improvements.

Some saw real but modest gains: faster boilerplate and quicker prototyping.

Others saw no measurable difference at all. A few quietly rolled it back.

The optimism was there. The results didn't match.

The METR study validated what they felt: experienced developers using AI believed they were about 20% faster, but objective measurement showed they were actually around 19% slower. On SWE-bench, the best models resolve about 75–77% of curated issues. On live, real-world problems, the number drops significantly.

Both camps have real evidence. And both camps are only seeing part of the picture.

What each camp is missing

Camp A is right that the barrier to building software has dropped to nearly zero. That's real, and it's not going back. But there's a difference between an app that works and an app that works at scale — with security, with data integrity, with the kind of architecture that doesn't collapse when real users hit edge cases. Amazon ordered a 90-day reset on its deployment controls after a string of incidents tied to AI-generated code in Q3 2025. The Bali entrepreneurs are building real products. Some of those products will hit a wall the moment they need to handle payments, comply with regulations, or survive a security audit.

Camp B is right that the tools alone don't deliver the revolution. Giving your team Copilot and expecting a step change is like giving someone a piano and expecting a concert. The METR finding validates their experience — unstructured AI use doesn't produce the gains people expect. But what Camp B hasn't seen is what happens when you build real engineering discipline around these tools. A small number of teams have done that, and they're shipping production systems at 5–10x the pace. Not because the AI is better than Camp B thinks — but because those teams built the methodology that makes the AI reliable.

The spectrum nobody talks about

Andrej Karpathy coined "vibe coding" in early 2025. By February 2026, he'd moved on to "agentic engineering."

The difference between the two isn't the tools. It's the discipline.

Here's how I'd map the spectrum based on what we actually see in production:

Level 1: Vibe Coding
Prompt, accept, iterate by pasting error messages back in. Don't read the diffs. Ship and hope. This works for prototypes, internal tools, one-off scripts. It does not work for anything a real user will touch with real data.

Level 2: AI-Assisted Development
Copilot, tab-complete, and inline suggestions. The engineer still writes every line; AI accelerates the typing. 10–20% faster on a good day. This is where most teams are right now. It's Gen 1: useful, but not transformative.

Level 3: Agent-in-the-Loop
IDE agents like Cursor or Windsurf making multi-file edits. The engineer watches closely, catches mistakes in real time, and approves or rejects each change. Up to 2x productivity on good days, less on complex systems. Gen 2.

Level 4: Spec-Driven Agentic Engineering
This is where it changes structurally. Project constitutions (CLAUDE.md). Phase-based specs. Reusable skills. Hooks that enforce constraints automatically (there's a sketch of one just after this list). The human designs the system, the agent builds it, and the hooks verify it. 2–3x. Gen 3. This is where we operate.

Level 5: Multi-Agent Orchestration
Multiple agents coordinating in parallel: an orchestrator dispatching work, specialized agents handling different domains, and verification agents checking the output. Humans define the system architecture and the verification criteria. The frontier.
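To make "hooks that enforce constraints automatically" less abstract: in Claude Code, a hook is a shell command the harness runs at fixed points in the agent loop, with no human in the path. Here's a minimal sketch of one that re-runs the TypeScript compiler after every file edit. I'm assuming the hooks schema from Claude Code's docs at the time of writing, and `npx tsc --noEmit` stands in for whatever check your project actually needs:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```

Every time the agent touches a file, the typechecker runs. If it fails, the failure surfaces immediately and can be fed straight back to the agent, instead of waiting for a human reviewer to notice. That's the structural difference between Level 3, where the human catches mistakes, and Level 4, where the system does.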

Most of the internet argument is between Level 1 and Level 2. The teams getting outsized results (and there are a growing number of them) are operating at Level 4 or above.

What this means if you're deciding right now

If you're a founder exploring AI for product development, the question isn't "Does AI-generated code work?" The answer depends entirely on what level of the spectrum you're operating at.

The honest answer:

Vibe coding works for exploration. If you're testing a product idea, validating a user flow, or building a tool for internal use, Level 1 is fast, cheap, and good enough. Just don't confuse the prototype with the product.

AI-assisted development works for productivity. If your engineering team is already strong, Level 2 and Level 3 tools will make them faster. Not transformatively faster, but meaningfully faster.

Spec-driven agentic engineering works for production at scale. This is where 80%+ of the code can be AI-generated and still be production-grade. But it requires real engineering discipline: architecture decisions made before agents run, specs that define "done" before work starts, and verification layers that catch what agents get wrong. The tools are the easy part. The methodology is the hard part.

The mistake Camp A makes is assuming Level 1 results scale to production. They don't.

The mistake Camp B makes is testing at Level 2 and concluding the whole spectrum is overhyped. It's not.

They just haven't seen Level 4.

What Level 4 looks like in practice

I can speak to this from our own experience. We're one of the teams operating at Level 4, and I want to be specific about what that actually involves so it's useful, not abstract.

Every project starts with a CLAUDE.md, a project constitution committed to the repo. Every build runs through spec-driven phases where "done" is defined before work starts. We build reusable skills that encode accumulated expertise. Hooks enforce constraints automatically before anything ships.
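For anyone who hasn't seen one, here's a heavily abridged sketch of what a project constitution can look like. Every rule and path below is invented for illustration, not lifted from a client repo:

```markdown
# CLAUDE.md (illustrative sketch; all rules and paths here are hypothetical)

## Architecture
- Next.js app router. All data access goes through the repository
  layer in `src/db/`; never query the database from a component.

## Non-negotiables
- Every API route validates its input with zod before touching the DB.
- No new dependency without a line here explaining why it exists.

## Definition of done (per phase)
- A spec in `specs/<phase>.md` is approved before implementation starts.
- `npm run typecheck` and `npm run test` pass. Hooks enforce both.
```

The value isn't any individual rule. It's that the agent reads this file on every run, so the constraints travel with the repo instead of living in one engineer's head.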

On a recent project, a marketplace admin dashboard for an Australian client, this meant 21,799 lines of production TypeScript shipped in five days. On another, it meant a 40%+ infrastructure cost reduction through agent-driven security hardening. 80%+ of our code on new projects is AI-generated: production-grade, human-audited, spec-traceable.

We're not the only team doing this.

But the teams that do share the same characteristic: they invested in the methodology, not just the tools. The spectrum is wide, and the gap between Level 1 and Level 4 isn't a tool upgrade. It's a discipline shift.

The question I'd leave you with

Next time someone tells you "AI can build your whole product" or "AI-generated code is unreliable," just ask them what level they're operating at.

The answer will tell you everything about whether their experience applies to yours.

I've still got a few days left here. But these conversations have sharpened something I've been thinking about for a while: the teams getting the best results aren't the ones with the best AI tools. They're the ones with the best harness around those tools. And there are more of them every month.

If you're somewhere in the middle of this spectrum and trying to figure out where to invest next, reply and tell me where you are.

I'll tell you what I've seen work at the next level up.

About me

North Head at Manly, Sydney

Karan Shah: Engineer turned Founder

15 years ago, I started my career as a software engineer. Took the entrepreneurial plunge with less than 5 years of work experience.

Since then, I’ve strived to work at the intersection of Product Engineering, Design, Marketing, and Sales.

I've had the pleasure of working with some of the fastest-growing startups and large enterprises alike: from creating MVPs and helping clients raise funds, to enterprises heading for an IPO!

Brew. Build. Breakthrough.

Karan Shah
Founder & CEO, SoluteLabs
Building AI-native products before it became cool.
