Valentin's musings

by Valentin, December 26, 2025, in tech

Notes on agentic programming

2025 apparently was the "year of agents". After spending a fair share of the year programming with LLMs both for my personal side-projects and at work, I wanted to reflect on my impressions.

Those impressions are based on using Cursor with various models (at work), Gemini CLI (personal projects), and Codex CLI (personal projects).

Observation 1: Agents are doing waterfall

The latest agents are supposed to be able to handle large and complex features by devising "Plans" and "TODOs". The agent does a first step of thinking and analysis to determine the steps needed to achieve the goal. Then it executes those steps in order. The problem is that:

  • Getting real-world feedback on whether the code works (by running it in the terminal or through tests) usually comes last, once everything has already been written.
  • The preceding steps partition the final state (usually by feature or by technical component) instead of being simplified, incremental versions of it (first build a degraded, simplified version of the request, then iterate on it, and so on).

Despite my attempts to fix this with specific instructions, I never managed to steer the agents away from this way of working. And, as you would expect, it is very inefficient. Agents are not always strong enough to spit out thousands of lines of code that just happen to be correct. They only realize this at the very end, and by then they are heavily biased by their own context toward making what they already wrote work, which results in bugfixes ranging from awkward to outright broken.

One simple example I have is when I asked an agent to build a deploy script in Bash that should do the following:

  • Have a step function which accepts a step description and a bunch of bash commands.
  • All those commands should be sent through SSH to be executed on a remote machine.
  • While they are being executed, print only the step description alongside a gum spinner animation.

The agent failed to implement this on its own. So I had a conversation with it where we built the script bottom-up instead of top-down: first, write a function that sends the commands over SSH. Once that works, hide the SSH output and print only the description. Once that works, wrap everything with gum for the animation.
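For reference, here is roughly what the bottom-up conversation converged on. This is a minimal sketch from memory, not the actual script: the remote host and the example steps are placeholders, and it assumes gum is installed locally (gum spin hides the wrapped command's output by default, which is exactly the behavior I wanted).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder remote target, not the real one.
REMOTE="deploy@example.com"

# step <description> <command>...
# Joins the given commands with '&&', runs them on the remote machine
# over SSH, and shows only the description next to a gum spinner.
step() {
  local description="$1"
  shift
  local joined
  joined=$(printf '%s && ' "$@")
  joined=${joined% && }
  gum spin --spinner dot --title "$description" -- ssh "$REMOTE" "$joined"
}

# Placeholder steps to show the calling convention.
step "Pulling latest code" \
  "cd /srv/app" \
  "git pull --ff-only"
step "Restarting the service" \
  "sudo systemctl restart app"
```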

The agent implemented every step of the bottom-up implementation, so it fundamentally had the ability to build the thing, but only if I guided it through the bottom-up planning myself.

Agents need to be retrained and reprogrammed to work bottom-up on their own.

Observation 2: The added value is very fast code generation

Since agents do waterfall, I have given up on using them at the project level, and I use them at the small feature / commit level instead.

I find that agents are mediocre programmers. I give them more instructions than I would put in a ticket addressed to a colleague, and I end up writing more code review items than I would for code produced by a colleague. More instructions for worse results.

However, where the agent wins is speed: this mediocre code is produced in seconds to minutes (as opposed to hours for a human). The pile of review items is completely offset by the code generation speed, because the agent addresses them almost immediately.

This is the most productive way I currently use agents: I watch the agent write the code and start compiling my code review items in the chat while the code is being generated (helped by "spying" on the thinking stream the UI flashes by, which gives me hints about what it's going to output). Once the agent yields control back to me, I finish my review in a few minutes and submit it. The agent starts over, addresses my feedback, and so on. Usually, after a few such iterations, we end up with code that is satisfying.

I have no doubt that some features I have coded this way have taken way less time end-to-end than if I had coded them entirely by hand.

Observation 3: Agents have no career progression

It's frustrating that agents are mediocre programmers, but it's even more frustrating that you can't expect them to improve the way a junior developer learning the craft would. With each new chat you are back to square one.

The usual reply to that is to add more and more sophistication to the AGENTS.md system prompt, but this misses the point and is terribly unambitious.

I'm convinced that the total amount of information needed to make a spotless contribution to a big, mature codebase does not fit in the agent's context window. Every implicit "rule" that programmers follow when contributing involves a lot of nuance, non-trivial judgment, and even notions of taste. Not only would trying to encode those as a set of simple rules fail, because you would over-simplify and lose the nuance, but even if you could, it would be an enormous chunk of information to hold in the context window every single time.

This is why I'm convinced that agents need search and retrieval over their entire chat history. Just as the agent is able to browse the codebase (which doesn't fit into the context window either), it should be able to browse its previous development sessions. That is, by the way, exactly what a human does to progress in their skills: recall previous tasks, previous interactions with peers, previous pieces of code review, and so on, in order to incrementally get better at the craft.

For a more immediate (and less ambitious) increment, we would already benefit a lot from this within the scope of a single chat session, once all the context has been consumed. Cursor summarizes the first half of the context, which tends to give the agent amnesia about the earliest parts of the chat. I find it extremely frustrating that I can scroll back in the UI and see the entire conversation myself, but the agent has no access to it. It's literally right there! If you can search my codebase, why can't you search this?

Observation 4: Code review rigor is the rampart against declining code quality

So, agents are mediocre programmers, and they don't make progress on their skills. As a result, code review becomes the most important part of the development life-cycle. It's the only rampart against an eternally incompetent programmer trying to fuck up your codebase. And I have a very strong opinion that, when working in a team, the developer driving the agent is responsible for delivering code that is up to the team's quality standard. Do not outsource LLM-code review to your team-mates. The most important reason is that reviewing shitty, obviously LLM-generated code is very tedious, and you will likely get a reluctant approval. If you outsource that review to your team-mates, you are effectively lowering the quality of your codebase.

Unfortunately, the situation is very counter-intuitive: you have a magic AI producing code for you, but you still need to review it carefully and correct all the small bits that are wrong. You end up with increased overall productivity, but the mental gymnastics required to work this way are not what programmers were trained for before LLMs. It is almost an entirely new craft.

I find it hard to believe that code quality wouldn't drop to some extent, in any case.

Observation 5: Agents over-rely on code analysis

This observation is about a very frustrating failure mode of agents, where they madly crunch tokens in a loop trying to solve something. The failure mode goes like this:

  1. The agent tries to track down the origin of a bug solely by reading the code and logically inferring which piece of code is the problem. Sometimes that analysis is wrong, and the suspect is not the true origin of the bug.
  2. The agent attempts to fix the bug by applying a simple code fix to the incorrect culprit.
  3. Seeing that the bug is still there despite the supposed fix, the agent hallucinates all sorts of explanations, often launching into crazy refactors of the code.

A naive reaction would be to say that agents should get better at code analysis so that they correctly identify bugs, but I don't think that's viable (they will always fail at some point). I think agents should be resilient to their own analysis failures by proving that their diagnosis is the true source of the bug, for example by adding print statements to debug the code, or by asking the user to look into something for them, and only proceeding to fix the bug once the true cause has been confirmed.

I have tried to address this with specific instructions in AGENTS.md, but unfortunately I still got cases of this failure mode afterwards (maybe fewer, I don't know). I think a deeper change in how agents are programmed, and maybe model fine-tuning, would be required to properly solve this problem.

What is particularly frustrating is the stupidity of the bug from step 1. When I vibe-coded the "Wrapped 2025" feature for ratesmovies.net, the agent got stuck on a bug: some movie posters were supposed to appear on the screen one after the other, but for some reason they never appeared at all. The agent looped over many attempts at solving the problem, never succeeding. It turned out that the agent had used the opacity CSS property to control the display of the posters, and because of CSS rule precedence, the initial opacity: 0 always won over the new opacity: 1.
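To illustrate the kind of trap this was (a hypothetical reconstruction, not the actual stylesheet of the feature): if the rule setting the initial state has higher specificity than the rule meant to reveal the poster, the poster stays invisible no matter how often the reveal class is toggled.

```css
/* Initial state: two class selectors, specificity (0,2,0). */
.wrapped .poster {
  opacity: 0;
  transition: opacity 0.5s;
}

/* Reveal state added later via JavaScript: a single class selector,
   specificity (0,1,0), so it always loses against the rule above
   and the posters never appear. */
.visible {
  opacity: 1;
}
```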

This last example also shows that one of the reasons agents over-rely on code reading is that they are under-powered: in many cases, the agent simply doesn't have a feedback loop at its disposal to validate or invalidate a bug diagnosis. For front-end development, the agent would need full access to the browser's developer tools to be productive.

At any rate, this failure mode is very detrimental to productivity. Any time it happens, coding the feature manually would have been more efficient than relying on an agent; the speed advantage from observation 2 completely collapses. I have personally settled on the "1 attempt" rule: the agent gets one attempt to solve the bug. If that doesn't work, I solve it myself.

Observation 6: Everybody is going crazy

This last observation is the only one about humans, not agents. As imperfect as agents are, they have not been as frustrating as the humans dealing with them. I think we are still at the peak of inflated expectations of the Gartner hype cycle, and that is not a pleasant place to be when you haven't drunk the kool-aid yourself.

Here, I'd rather just list all of my frustrations stemming from LLMs and agents:

  • The creation of noise when an LLM is used as a text generation machine to fill in the formalism of whatever document you're creating. No, your merge request's description doesn't need to be that long. I also have a Cursor license and can prompt an LLM to summarize the code changes for me.
  • The dehumanization of speech when addressing humans. I instantly lose interest in whatever message you're conveying when I know you've asked an LLM to generate it. I vastly prefer a flawed but authentic message to a polished but robotic one. Make the effort and take the risk: write it yourself.
  • The over-confidence in the ability to shape agent behavior through AGENTS.md. You are fighting against billions of neural network parameters fine-tuned for specific behavior (the LLM), plus a hard-coded state machine (the agent), with a few hundred tokens of yours. That's cute, but unfortunately some things are gonna require changes at deeper levels than the system prompt.
  • The lack of care around code quality. It's one thing to recognize that agents make us more productive. It's another to understand that this comes at the price of mastering a craft of "agent driving", without which code quality will necessarily drop.

Conclusion

So, in short, here is my wish list for 2026:

  • šŸŽ Make agents do bottom-up, incremental coding, instead of top-down, one-shot coding.
  • šŸŽ Keep agents as speedy as they are. Do not make the feedback loop slower.
  • šŸŽ Make agents able to dig into their past chats so that they learn.
  • šŸŽ Educate developers so that they carefully review the agent's code and don't outsource this work to their peers.
  • šŸŽ Give more debugging powers to agents and make them validate their diagnosis before coding fixes.
  • šŸŽ Have humans writing to other humans without using an LLM.

I will write again in one year to see which of those wishes have been fulfilled.