This week in AI was shaped almost entirely by one event: the AI Engineer World's Fair in San Francisco. The signal-to-noise ratio at that conference was unusually high — real practitioners debating real constraints, not just keynote optimism. Here is what we took away.
The "Software Factory" Framing Is Taking Hold — and It's Not Hype

The dominant concept at the conference was the software factory: the idea that AI agents should eventually triage, implement, review, verify, and deploy code in an automated loop, with engineers steering rather than typing. Warp launched an agent orchestration platform called Oz explicitly built around this vision. Cursor's VP of Forward Deployed Engineering described the same concept from the enterprise implementation side — her team is growing tenfold by year-end, going on-site to wire agent-assisted development across full software development lifecycles in financial services, telco, and semiconductors.
The interesting split was not between believers and skeptics. Almost everyone on stage believed the loop is coming. The split was about whether it is here *now* in a form worth betting on. Loop advocates said deterministic verifiability is all that matters — if you can verify the output, it doesn't matter how it was produced. Skeptics countered that autonomous loops are economically fragile ("you can't orchestrate your problems away by buying more tokens") and that discipline, not abstraction, is what's missing. We keep seeing this exact tension with clients: the wins come from constrained, verifiable loops, not open-ended ones.
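To make "constrained, verifiable loop" concrete, here is a minimal Python sketch. It assumes the agent is any callable that returns a candidate plus a token cost, and that a deterministic check exists for the output. The names `run_verified_loop`, `fake_agent`, and `passes_tests` are ours, invented for illustration; nothing here is taken from Warp's or Cursor's products.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    output: Optional[str]  # accepted candidate, or None if the loop gave up
    attempts: int          # how many candidates were generated
    tokens_spent: int      # total cost charged against the budget


def run_verified_loop(
    generate: Callable[[int], Tuple[str, int]],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
    token_budget: int = 10_000,
) -> LoopResult:
    """Retry generate() until verify() passes or a hard limit is reached."""
    spent = 0
    attempts = 0
    # Both limits are hard stops: the loop never buys more tokens to escape a failure.
    while attempts < max_attempts and spent < token_budget:
        candidate, cost = generate(attempts)
        attempts += 1
        spent += cost
        if verify(candidate):  # deterministic gate on the output
            return LoopResult(candidate, attempts, spent)
    return LoopResult(None, attempts, spent)


# Hypothetical stand-in for an agent: the first draft is wrong, the second is right.
def fake_agent(attempt: int) -> Tuple[str, int]:
    drafts = ["def add(a, b): return a - b", "def add(a, b): return a + b"]
    return drafts[min(attempt, len(drafts) - 1)], 1_200


# Stand-in for a test suite: run the candidate and check one known answer.
def passes_tests(source: str) -> bool:
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


result = run_verified_loop(fake_agent, passes_tests)
print(result)
```

The verifier decides acceptance, and the attempt and token caps bound the cost. When the loop returns `None`, that is the signal to hand the task back to a human rather than retry indefinitely.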
If you're building production AI agents right now, the framing worth internalizing is from Vercel's engineering team: agents are well-suited to repetitive tasks that still require some reasoning — not just fixed automation. That's a more useful filter than "is the task agentic enough?"
Fable 5 Came Back — and Revealed How Teams Are Actually Managing Model Risk

Anthropic's Claude Fable 5, which had been pulled from access briefly, was restored on July 1st. The relaunch itself was less interesting than what happened during the outage: builders didn't wait. Teams converged on multi-model orchestration rather than holding out for one model. The pattern that emerged — use Fable 5 for high-value reasoning and planning, delegate implementation and verification to cheaper models — is a meaningful signal. Single-model dependence is now recognized as an architectural risk, not just a cost issue.
Cursor confirmed Fable 5 leads its internal evaluations but is the most expensive per task. Devin integrated it across all surfaces. Perplexity reinstated it as an orchestrator. The ecosystem is building around Fable as a reasoning layer, not a do-everything workhorse — which is exactly how we approach LLM architecture decisions for clients who need predictable costs at scale.
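A rough sketch of that routing pattern in Python, under stated assumptions: the tier names (`frontier`, `mid`, `cheap`), the `ROUTES` table, and the `ModelUnavailable` exception are all hypothetical, and each client is just a callable that takes a prompt. A real deployment would wrap actual provider SDKs behind the same interface.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tiers per task type, in preference order. Not real model identifiers.
ROUTES: Dict[str, List[str]] = {
    "planning": ["frontier", "mid"],
    "implementation": ["cheap", "mid"],
    "verification": ["cheap", "mid"],
}


class ModelUnavailable(Exception):
    """Raised by a client when its model cannot be reached."""


def route(
    task_type: str,
    prompt: str,
    clients: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    """Try each tier configured for task_type in order; return (tier, reply)."""
    for tier in ROUTES[task_type]:
        try:
            return tier, clients[tier](prompt)
        except ModelUnavailable:
            continue  # fall through to the next tier in the list
    raise RuntimeError(f"no model available for {task_type!r}")


# Simulate an outage of the frontier tier.
def offline(prompt: str) -> str:
    raise ModelUnavailable("frontier tier is down")


clients = {
    "frontier": offline,
    "mid": lambda p: f"[mid] {p}",
    "cheap": lambda p: f"[cheap] {p}",
}

plan = route("planning", "outline the migration", clients)
code = route("implementation", "write the migration", clients)
print(plan)
print(code)
```

With the frontier tier offline, planning degrades to the mid tier instead of stopping, and implementation never touched the expensive model in the first place.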
Sonnet 5 Landed With a Shrug

Anthropic also released Sonnet 5 this week, pitching it as a smarter, more agentic middle-tier model sitting closer to Opus in capability. Practitioners who tested it came away unimpressed — not because it's bad, but because it failed to establish a clear use case. It can write, code, and analyze competently. But for every task, a cheaper, faster, or smarter alternative already exists in most teams' model rotations. A model pitched as "just right for everyone" tends to end up being no one's first choice.
The pattern here is familiar. When the gap between a mid-tier and frontier model narrows, mid-tier models need a distinct value proposition — price, speed, or specialization — to earn a spot in production. Sonnet 5 doesn't yet have that story.
Want to know which model tier actually fits your use case? Tell us what you're building and we'll give you a direct answer.
The "Human Outer Loop" Debate Has a Right Answer

The most substantive argument of the week was about where human judgment belongs in an AI-assisted development process. There were two clear camps. One held that agents should run the inner execution loop while humans retain the outer loop of architecture, priorities, and judgment. The other held that autoresearch systems (agents that study and improve the system itself) can take on more of that outer work too, given the right feedback signals.
Former Google engineering leader Addy Osmani put it cleanly: "Agents can run much more of the inner execution loop. But that outer loop is still engineering." Design tool creator Paul Bakaus framed it from a product angle — let agents handle the first 80%, then bring the human back for the last 20% to add taste, judgment, and authorship. His design tool, Impeccable, gives coding agents a precise vocabulary for design concepts ("bold," "quiet," "dense") rather than vague adjectives, so the human steer actually lands. The concept — which he's calling skill engineering — is worth watching.
The outer loop is still engineering — and every serious team we talk to is figuring out exactly where that line sits. The answer is different for a legal contract redlining agent versus a UI generation pipeline, and getting it wrong in either direction kills adoption.
Autoresearch: The Emerging Concept Worth Tracking

A newer idea getting serious attention was "autoresearch" — building an outer loop where agents monitor, evaluate, and improve the primary agent system over time, using evals, feedback signals, and human input. Introspection, a new company founded by ex-xAI engineers, is building infrastructure for exactly this. Their framing — "agent recipes" that encode human expertise, evals, and signal processing in a portable format — is a more structured answer to the question of how agent systems get better without requiring constant human intervention.
This is adjacent to the RAG and evaluation work we do in our own AI development services — the feedback loop design is often the hardest part, and most teams underinvest in it.
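As a toy illustration of the feedback-loop idea, the Python sketch below scores agent variants against a fixed eval set and promotes a challenger only when it beats the current baseline by a margin. This is our own simplification, not Introspection's design: the function names, the exact-match scoring, and the `min_gain` threshold are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]     # (input, expected output)
Agent = Callable[[str], str]


def score(agent: Agent, cases: List[EvalCase]) -> float:
    """Fraction of eval cases where the agent's answer matches exactly."""
    passed = sum(1 for question, expected in cases if agent(question) == expected)
    return passed / len(cases)


def select_agent(
    baseline_name: str,
    agents: Dict[str, Agent],
    cases: List[EvalCase],
    min_gain: float = 0.05,
) -> Tuple[str, float]:
    """Return (name, score) of the variant to deploy.

    A challenger replaces the baseline only if it beats the baseline's
    score by at least min_gain; among qualifying challengers, the best wins.
    """
    best_name = baseline_name
    best_score = score(agents[baseline_name], cases)
    threshold = best_score + min_gain
    for name, agent in agents.items():
        if name == baseline_name:
            continue
        challenger_score = score(agent, cases)
        if challenger_score >= threshold and challenger_score > best_score:
            best_name, best_score = name, challenger_score
    return best_name, best_score


# Toy eval set and two hypothetical agent variants.
cases = [("2+2", "4"), ("3+3", "6"), ("10-4", "6"), ("5*2", "10")]
answers = dict(cases)
agents = {
    "v1": lambda q: "6",         # always answers "6": right on 2 of 4 cases
    "v2": lambda q: answers[q],  # right on all 4 cases
}

chosen = select_agent("v1", agents, cases)
print(chosen)
```

The margin is the design choice that matters: without it, noise in a small eval set can flip the deployed variant on every run. Human input enters through the eval cases themselves, which is where most of the real effort goes.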
---
Practitioner takeaway this week: Stop evaluating models in isolation and start stress-testing your architecture against model unavailability. If one model going offline slows your team down, you have a single point of failure, not a production system. Design for model routing from the start — frontier model for reasoning and planning, cheaper models for implementation and verification — and your system becomes both more resilient and cheaper to run. Get in touch if you want a second opinion on your current stack.
The software factory metaphor is useful, but the real engineering question this week is simpler: where exactly does the agent loop stop and human judgment begin? The teams winning in production are the ones who've answered that honestly, not optimistically. Next week, we'll be watching how the Fable 5 cost-vs-capability tradeoff plays out in real enterprise deployments — and whether Anthropic sharpens Sonnet 5's positioning or lets it drift.
