Is code evolving or just mutating?


Hello Reader,

AI can generate code, arguably with a pretty decent quality. That’s not news anymore. The question that’s been forming in my head all week is different: how do we decide what should go into production? Writing code is not the hard part (arguably, it never was). The hard part is making sure the right code ships and the wrong code doesn’t. And right now, that selection problem is becoming the defining challenge of AI-assisted development. Last week has definitely showed this.

Focus on code reviews

Last week, Anthropic announced that Claude Code now has a Code Review feature. When a PR opens, Claude dispatches a team of agents to hunt for bugs. Many people point out the irony in Claude reviewing it’s own code (insert spiderman meme). Boris Cherny, who created Claude Code, responded to one of the tweets in an interesting manner: the more tokens you throw at a coding problem, the better the result. In other words, one agent can cause bugs while another catches them.

A feature release like this one is always an interesting signal. As I mentioned in my last newsletter, it seems that code reviews are the next big challenge in AI adpoptin. With code velocity at an all time high, manual reviews just don’t cut it anymore. It seems like code review systems are what comes next.

In fact, Claude is not the first tool to tackle this problem. Martian recently released a review bench of various different code review tools. Tools were put on trial against real codebases and ranked on how thorough and precise each tool is. I highly recommend checking out the results.

I think that code reviews are indeed an interesting problem space. It does seem to be the current bottleneck and potentially a great way to keep bugs at bay. But I also feel that there’s more. Code review not just just about catching bugs. It has always been about knowledge transfer, about mentorship, about building a shared understanding of the codebase. I wonder what heppens with all this, when an AI reviews your code. Will the knowledge stay in the model’s context window and then be gone? I wonder how the part where the team gets smarter happen in the AI era.

Is code evolving, or just mutating?

Itamar Friedman, CEO of Qodo, published a piece that reframes the entire AI coding conversation through the lens of evolution. His argument is simple but profound: code generation is just mutation. Models write functions, agents generate pull requests, systems produce entire features - but from an evolutionary perspective, that’s just creating variation. What creates progress is selection. Evolution requires three ingredients: mutation, selection, and persistence. Without selection, mutations accumulate. With selection, improvement compounds.

He points out that software engineering has always had selection loops — tests, code review, CI pipelines, governance mechanisms. We just never described them that way. And now AI is dramatically increasing the mutation rate. Agents can understand unfamiliar codebases, propose architectural refactors, implement entire features. The rate of code production is skyrocketing. But the selection layer is not scaling at the same speed. The bottleneck in software development is moving from writing code to verifying it and selecting what survives.

This landed hard for me. In my previous newsletter I talked about how quality engineering is evolving, and Itamar’s framing gives it a language I’ve been missing. We’re not just ā€œtestersā€ or ā€œquality engineersā€. We’re the selection layer. And if that layer doesn’t keep up with the mutation rate, systems don’t evolve — they drift.

The hot dog problem

Mo Bitar posted a video called ā€œI was a 10x engineer. Now I’m uselessā€ and it hit me harder than I expected. Mo describes what happened when he used ChatGPT to deploy his entire product without looking at the code. It worked. And he hates it.

His analogy is perfect: he made a hot dog. It looks like food, it tastes like food, the transaction is complete. But he can’t sell it because he has no emotional connection to it. He didn’t earn it. He didn’t suffer for it. And that suffering, that struggle, that’s what used to make us better at our craft.

Mo’s video is honest, and asks an important question. What do you do, when you love to code? The activity and the craft of coding doesn’t seem to be in such a high demand as it used to. This new AI era takes something away from those who loved it. On the other hand, I belive there is a path forward, the goalpost has just moved. This tweet by Franziska seems to suggest an interesting problem space for engineers. Instead of making your work faster, you engineer AI systems.

The problem with AI demos

Vidhya Ranganathan wrote a piece called ā€œProduction Telemetry Is the Spec That Survivedā€ that I think should be required reading for anyone deploying AI agents on existing codebases. She introduces a framework that distinguishes between greenfield systems (new, clean, well-specified), brownfield (evolving, messy), and what she calls ā€œblackfieldā€ - legacy systems under heavy load where the original intent is lost, documentation has rotted, and business rules hide in undocumented conditionals.

AI coding tools are great at greenfield. They struggle with brownfield. And they fail at blackfield, because they infer specifications from code patterns, creating implicit specs that fail silently when they contradict accumulated production behavior. The only honest specification left in these systems lives in production telemetry: traffic patterns, error rates, usage data.

I think this has always been a great pointer for testers on which tests should be written first. But it is also a very smart approach for adoption of new tools and testing PoC for services that provide nice demos, but leave you curious about real world usage.

OpenAI acquires PromptFoo

This connects to OpenAI acquiring Promptfoo, an AI security startup that specializes in red-teaming and vulnerability testing for AI systems. Promptfoo serves about 25% of Fortune 500 companies and has 130,000 developers using it monthly. OpenAI is integrating it into their Frontier platform to make security testing a built-in part of how teams ship AI agents.

The fact that OpenAI felt the need to buy a company whose entire job is testing whether AI systems are safe tells you something about where we are. We’re building agents that write code, review code, and deploy code, and we’re only now starting to seriously ask: but who tests the agents?

Great questions to be asked about AI

Hank Green and Cal Newport sat down for a conversation about AI that I think captures the current moment better than most. Hank’s approach is to catalog every legitimate concern - addiction, manipulation, hallucination, labor displacement, economic bubbles, children’s exposure - and resist the urge to collapse them into a single narrative. Each concern has its own severity and its own likelihood. They’re separate problems.

Cal Newport introduced a concept: ā€œprogress laundering.ā€ Advances in one AI technology, like language models, get unfairly attributed to completely different domains like protein folding or robotics. These are separate technologies with separate trajectories, but the narrative treats them as one unstoppable wave. It’s a useful framing because it explains why the discourse feels so overwhelming. We’re not dealing with one problem. We’re dealing with dozens of separate problems being marketed as one.

The whole conversation is great, but what surprised me (but makes perfect sense) was Cal’s take on current AI models. He claims that we’ll probably end up with smaller, specialized systems that do specific things well - which, in a way, loops back to where we started. Specialized models. Specialized agents. Selection systems that keep the good mutations and discard the rest. Instead of having one know-it-all model like GPT-5.4, many will focus on models that are really good at specialized tasks.

But that’s a prediction, not a certainty, so we’ll see where we eventually end up.

I’d love to hear how this is landing for you. Has your team started using AI code reviews, or are you still doing them manually? Do you see yourself as the selection layer, or does that framing feel off? And if you’re someone who loves the craft of coding - how are you making peace with the hot dog era? Hit reply, I’m genuinely curious where everyone is at right now.

Filip Hric

Sign up for weekly tips on testing, development, and everything related. Unsubscribe anytime you feel like you had enough 😊

Read more from Filip Hric

Hey Reader, If you’re reading this, chances are you care about quality. Coming from QA, I never stop looking at the apps I build and systems I use through the lens of quality. Now, with more and more code being written by AI, this question matters more than ever. Many people wonder whether AI is even capable of delivering quality. I think it is. Though it’s worth remembering that quality is multidimensional. You can always have more or less of it. When it comes to AI and specifically LLMs,...

Hey Reader, It’s been a while since I’ve issued a newsletter, but I’m hoping to get back on track. There are just so many interesting things happening in the world of IT that I want to share with you all. But I’ve decided to change my approach to writing this newsletter a little bit. In the past, my main goal was to come up with some idea or some thought and basically write a post. This put a lot of pressure on me and I never wanted to force myself into writing a newsletter when I had nothing...

Filip Hric 22th July 2025 AI, BDD, and Why We're Bad at Predicting the Future Hello Reader,Lately, it seems the conversation around AI has shifted. We're moving past the initial "what can this chat window do?" phase and into the more practical, and sometimes awkward, integration into our development lifecycle. To me, one of the most interesting moments recently is that we're seeing major players like Amazon use BDD for AI-driven requirement planning. Plus the whole ecosystem of AI builder...