Thoughts on AI — 2026 edition

It’s been a couple of years since I wrote about AI[1], and things have been changing fast. In 2024, AI had started to impinge on my day-to-day work, but 2025 felt like the year in which it changed things dramatically. While I would still characterize a lot of my work with AI as consisting of "arguing with LLMs", there were definitely times when they produced results of acceptable quality after a few rounds of revision, feedback, and adjustment, and the net time and effort required felt like a win compared to doing it all myself.

New tools

The biggest change was the arrival of Claude Code. Instead[2] of just chatting with an LLM (mostly using my Shellbot fork), I could now delegate to it as an agent without having to abandon my beloved $EDITOR[3]. What began as experimentation (what can this thing do?) has since turned into an integral part of my workflow: even for changes I could very quickly carry out myself, I will instead turn to Claude and ask it to make the change, unless it is utterly trivial (ie. the threshold for cutting over to Claude is the point at which it can make the change faster than I could manipulate the text manually myself).

New capabilities

2025 brought customization mechanisms, Model Context Protocol[4], subagents, skills, and custom slash commands among other things. From my point of view, these all have the same goal, namely, equipping agents with:

  1. Specialized knowledge that enables them to obtain the information they need; and:
  2. The means to carry out necessary actions in service of an objective; while simultaneously:
  3. Not overflowing the context window with garbage which obscures things and prevents the agent from producing a correct result.

Collectively, these are probably more important and useful than improvements to the models themselves. Speaking of which…

New models

2025 brought model sycophancy to the forefront, and Claude was no exception. Around mid-year, Claude’s "You’re absolutely right!" was ringing in the ears of users across the world in an almost continuous chorus. Thankfully, it seems to have subsided a bit now.

I didn’t follow the whole model benchmarking question very closely, and in general I’m only interested in how much the models improve my experience in my daily work. Overall, subjectively, I’d say that the models improved significantly over the last year, but as I said before, I believe that it’s the tooling around the models that had the greater impact.

Use cases

In my last post, I said that LLMs were good for "low-stakes stuff like Bash and Zsh scripts for local development", "React components", "Dream interpretation", and "Writing tests". In 2025, I used them for a lot more than that: fixing bugs, adding features, working across multiple languages and services, and explaining foreign codebases to me.

Where they shine:

  • Places where there are a lot of guard rails in place to provide them with clear feedback about success (eg. working in a strongly typed language like Rust, or in an environment where you can perform automated verification in the form of linting or tests).
  • Places where you may not even know where to start, but where their ability to rapidly search large corpora and repos can locate leads for you to follow.

Places where they still leave much to be desired:

  • Things where the non-determinism of their output means that you can’t trust the quality of their results. For example, say you have a change that you want to make uniformly across a few hundred call sites in a repo. Your first instinct might be to say, "This is a repetitious change, one that should be amenable to automation, and if the LLM can be given clear instructions that allow it to do it correctly in one place, then it should be able to do it quickly and correctly in every place". Sadly, this could not be further from the truth. LLMs are inherently non-deterministic, and that means there’s always a random chance that they’ll do something different on the 19th, 77th, or 82nd time. You will have to check every single modification they make, and you may be far better off getting the LLM to create a separate, deterministic tool to carry out the work. And if you want to throw caution to the wind and have the LLM make all the changes for you anyway, you’re probably better off firing off the agent in a loop, with a clean context for every iteration and a clearly stated mechanism for verifying the correctness of the change, than expecting a single agent to carry out any significant amount of work serially (a rough sketch of this loop appears after this list).
  • Anything that can’t be trivially described with a minimum of context. This is a conclusion that I’ve recently come to. In the past, I thought that bigger context windows would endow the models with the ability to solve fuzzier problems, the kinds that humans are particularly good at (with their ability to take into account disparate sources of information scattered across time and place). But my experience with even relatively small amounts of context (ie. far less than 200K tokens) is that models can easily "overlook" salient information that’s "buried" in the context, even when the context is not that large. Failure modes include telling the model to look at a series of commits and then observing how it "forgets" something critical in the first of the series; it proposes a change that looks like it only attended to the most recent tokens in its context window, and often ends up contradicting itself or reimplementing a decision that it previously reverted. My suspicion is that when we have models with 10-million-token context windows, we’ll still get the best results when we distill everything we want them to "know" down into the first few thousand tokens.
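
A rough sketch of that loop, in Python. The specifics are assumptions for illustration: the grep pattern and the names old_api_call, new_api_call, and CONVENTIONS.md are hypothetical, the verification command is a stand-in, and the agent is assumed to be driven by a CLI that accepts a single non-interactive prompt (Claude Code’s print mode, claude -p, is one such).

    import subprocess

    # Deterministically enumerate the call sites up front (hypothetical pattern;
    # point the grep at whatever identifies the code you want changed).
    files = subprocess.run(
        ["git", "grep", "-l", "old_api_call"],
        capture_output=True, text=True,
    ).stdout.split()

    VERIFY = ["cargo", "test"]  # any deterministic check: tests, linter, type checker

    for path in files:
        # One fresh agent per file: clean context, narrowly scoped instruction.
        prompt = f"In {path}, replace old_api_call with new_api_call per CONVENTIONS.md, then stop."
        subprocess.run(["claude", "-p", prompt], check=True)

        # Verify before moving on; stop loudly for human review on failure.
        if subprocess.run(VERIFY).returncode != 0:
            raise SystemExit(f"verification failed after editing {path}")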

Job security

In 2024 I said that I wasn’t worried about AI taking my job in the near term, but that things could change quickly, and I advised readers to "judiciously use AI to get your job done faster". In 2026, AI has clearly gotten to the point where it is making real waves in tech workplaces. Not only is AI making it possible for people to ship more code, faster, than before, there is also considerable business pressure to make use of it in the name of maximizing productivity. Unfortunately, the signal here is very noisy: our corporate overlords can mandate the use of these tools and monitor their use, but I don’t think we have reliable evidence yet on how much of this is unalloyed value, and how much of it is technical debt, latent regressions, and noise masquerading as productivity.

Now more than ever it seems important not only to use the machines to deliver useful work, but also to focus on the places where I as a human can still deliver value where a mere next-token-predictor cannot. The pressure on both of those fronts is only going to increase. I’d say that my feeling of "precariousness" is quite a bit stronger now than it was two years ago, and I’m not looking forward to seeing that trend continue, although I feel that it surely must.

In terms of job satisfaction, I’ve observed an inverse correlation: the more my job consists of me prompting the AI to do things for me, the less intrinsically satisfied I feel. This was one of the reasons why I had so looked forward to Advent of Code in December; I was itching to do some significant work with my own two hands. I look now towards the future with some dread, but also with a determination to not go gently into that good night: no matter what happens, I want to commit to finding things to take authentic pride in beyond "how I got a swarm of agents running in parallel to implement some set of inscrutable artifacts".

The impact on the world more generally

So far, I’ve been talking about how AI has affected my job. But "Gen AI", in particular, is having the expected effects on the wider world. Deep fakes, AI slop, and bot activity more generally are flooding YouTube, Twitter, and anywhere content is shared[5]. It seems that we’re already well on the way into a "post-truth" world, where our ability to distinguish fact from falsehood has been devastatingly damaged, with no prospect of putting the genie back in the bottle, given the inevitably increasing capabilities of AI systems to produce this stuff at ever higher levels of quality.

One can hold out, clinging to reliable information sources, but in the end it seems unavoidable that these will be islands of truth surrounded by oceans of worthless, endlessly self-referential fabrication. I shudder to imagine what this looks like when you fast-forward it a hundred years.


  1. I wrote that piece in March 2024, so just a couple months shy of two years ago, to be precise. ↩︎

  2. Maybe I should say "as well as" rather than "instead" because I still do chat with the LLM a fair bit when I want to ask it general questions about something in the world; but when doing almost anything related to coding, I almost exclusively do that via Claude Code. ↩︎

  3. Technically, I am "abandoning" it in the sense of switching focus to another tmux pane, but Neovim continues running and I can dip in and out of it whenever I want. ↩︎

  4. MCP nominally arrived in 2024, but as it required folks to actually build MCP servers, I think it’s fair to say that it "arrived" in a tangible way in 2025. ↩︎

  5. I almost wrote "where humans share content", but that’s already appallingly misleading. ↩︎

Lockout horror stories

Amazon

My account was banned recently because, years ago, I ordered two paper books that Amazon said would be split into two shipments. Both books arrived without any issues, but later Amazon refunded me for one of them, claiming that one package never arrived. This happened 4–5 years ago. Apparently, during a recent review, they decided this counted as fraud and banned my account. As a result, I can no longer log in and lost access to all my Kindle e-books. They also remotely wiped my Kindle, so my entire library is gone. I appealed the decision, but I’ve been waiting for over six months with no resolution.

icqFDR on Hacker News, 2025-12-19

A friend of mine received a double shipment for a $300 order. Being honest, he contacted customer service to arrange a return. Everything seemed fine until a few days later when he noticed they had also refunded his original payment. He reached out again to let them know, and they said they’d just recharge his card. Apparently, that transaction failed (no clear reason why), and without any warning, they banned his account, wiping out his entire Kindle library in the process.

egeozcan on Hacker News, 2025-12-19

Apple

My Apple ID, which I have held for around 25 years (it was originally a username, before they had to be email addresses; it’s from the iTools era), has been permanently disabled. This isn’t just an email address; it is my core digital identity. It holds terabytes of family photos, my entire message history, and is the key to syncing my work across the ecosystem.

The only recent activity on my account was a recent attempt to redeem a $500 Apple Gift Card to pay for my 6TB iCloud+ storage plan. The code failed. The vendor suggested that the card number was likely compromised and agreed to reissue it. Shortly after, my account was locked.

I effectively have over $30,000 worth of previously-active “bricked” hardware. My iPhone, iPad, Watch, and Macs cannot sync, update, or function properly. I have lost access to thousands of dollars in purchased software and media.

Paris Buttfield-Addison, 2025-12-13

(For additional context, see Daring Fireball and TidBITS.)

Advent of Code 2025

This year Advent of Code ("AoC") was only 12 days instead of 25. Last year I solved all 25 days (both parts) and that felt pretty good. This year, I also solved both parts of all the days, but given the overall puzzle count was lower, I’m not finishing with quite the same sense of achievement as last time.

I have mixed feelings about that. I’m simultaneously grateful that this year wasn’t a gruelling ordeal and disappointed that I didn’t feel stretched by it. Given how important AI has become in my working life as a software engineer, I had been looking forward to AoC as something that would force me to use my brain to create code solutions instead of using it to engineer prompts.

In a "normal" AoC year, the puzzles on the first days are relatively trivial, and the difficulty slowly ramps up from there. Around the middle of the month, you might find a day that is unexpectedly difficult or unexpectedly easy, but the general progression is clear. Towards the end of the month there are generally some "epic" problems that feel really satisfying to solve.

This year, folks were wondering what would happen given the shorter schedule. Would the difficulty ramp up more quickly? Would the final days this year be just as hard as the final days of prior years?

It’s really hard to give an objective answer to that, so I’ll give a subjective one instead. My impression is that no, the final days weren’t as hard. One of the reasons why the 25-day schedule feels so tough is precisely because it’s like a marathon: if you’re going to beat 25 two-part puzzles of increasing difficulty, you have to bring a sustained level of intensity and commitment to carry you through to the end. With a mere 12-day schedule, it’s more like a sprint.

This year, there were two or three days that I found to be disappointing:

  • On day 9, I felt pretty good about my part 1 solution, which used bitwise operations to great effect (the code was simple and the solution ran fast), but part 2 ended up being too hard (for me) to solve without using a linear programming library…

    I initially represented each problem as a multivariate linear Diophantine equation and was able to solve the simple cases (eg. ax + by = n) using the Extended Euclidean Algorithm (a minimal sketch of that bivariate step appears after this list), and then I started extending it to trivariate and larger cases (eg. ax + by + cz... = n) before discovering that, while I could solve the equation, minimizing the coefficients was going to require a bunch of linear programming tricks that were beyond my reach. You’re supposed to solve the puzzle relatively quickly (ideally, the same day it is published), and coding up and debugging the kind of linear solver that I would need would have taken days, or maybe even weeks. So, I reached for an existing library and used that to find minimal solutions to a system of linear equations.

  • Day 11 (part 2) felt a bit unsatisfying in a different way. It was a graph traversal problem where a naïve DFS approach would involve some ungodly number of operations. You had to count the number of paths in a Directed Acyclic Graph starting from a given node, passing through two other specified nodes (call them a and b) in any order, and ending at another.

    As I said, a simple DFS searching from start to finish would take forever, so I tried it out on some of the subproblems: start to a; a to b; b to a; a to finish; and b to finish. I also reversed the edges in the graph on the chance that finding the paths in the backwards direction would be easier (it was). Armed with this information it was clear: there were no paths from b to a, so all I had to worry about was solving the three subproblems: start to a, a to b, and b to finish. Solving any of the equivalent problems in the reversed graph would be valid too (ie. a to start, b to a, or finish to b). Once I had the path counts for each of the three segments, I just had to multiply them together (a sketch of this decomposition appears after this list).

    So, I set about solving the subproblems, and when I found one of them to be stubbornly resisting my DFS, my first thought was that, if I switched to a BFS, I might avoid a lot of time spent going deep into unproductive pathways via DFS. As a shortcut, I figured I would test this idea out first by coding a depth limit into my DFS, before going to the trouble of writing a BFS. And as I progressively increased the depth, the answer stabilized, and eventually settled on just over 7.5 million paths for that last segment that I needed to solve.

    Multiplying all this out gave me the right answer of 553,204,221,431,080 paths. Part of me felt like I should have been happy to narrow it down from "zillions" to a correct answer of some 553 trillion odd paths, but it didn’t feel like that at all. It felt more like I’d carried out the computer-assisted calculation by hand instead of constructing a general algorithm to solve the problem. I think if I’d written it as a BFS in the first place instead of a DFS, I would have felt better, although I still would have had to hard code a semi-arbitrary depth limit in there somewhere…

  • Day 12 was much worse. On the surface, the puzzle looked like a set of large combinatorial problems with exponential complexity, even one of which would have made a brute force approach unfathomably expensive. I jotted down a bunch of heuristics for pruning down the search space, but none of them promised to be remotely enough. I decided some research was in order. I skimmed a bunch of articles on packing irregular objects and came to the conclusion that my best bet was going to be simulated annealing, partly because it’s a good tool for approximating a global optimum even in an enormous search space, and partly because it’s easy to understand and I’ve used it before. There were 1000 problems in the puzzle input, but I didn’t even need to find global optima for them all; I just had to find packing configurations that were good enough to get the job done. Surely (subtle foreshadowing), many or most of the problems would be solvable using this approach.

    All of that without writing any code. When I finally sat down to write some, I was delighted to see the simulated annealing algorithm find solutions good enough to pack the sample problems. I ran it on the actual puzzle input and was even more delighted to see it work on the vastly larger problem set too.

    But I felt uneasy. Surely it couldn’t have been that easy, could it? I dropped into the channel at work and saw that only one person had shared a solution. I looked at it, wondering if they’d used something stochastic like I did, and was appalled to see that they didn’t do any packing at all. Yes, that’s right: the puzzle input doesn’t actually require you to rotate or fit shapes together at all; rather, you can count the number of shapes, compute how much space they’d need if they were all 3-by-3 squares, and see if the provided space is enough (a tiny sketch of this check appears after this list).

    Good lord. One of my shower thoughts based on eye-balling the puzzle input had been that, of the thousand problems, there was probably some subset that was trivially and obviously packable into its corresponding space, and another subset that was trivially and obviously not packable. Maybe I’d be able to eliminate a few hundred of the former kind and a few hundred more of the latter, leaving me only having to solve the remaining few hundred that were on the borderline.

    Well, it turns out that all the problems could be trivially classified and counted like that, and there was no need for my fancy simulated annealing algorithm at all. I kind of felt deceived, because the sample input wasn’t trivially classifiable in that way. My initial satisfaction at coding up the algorithm and seeing it work was replaced with a feeling of regret for not having followed up on my shower thought sooner.
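
Since the bivariate case from day 9 came up above, here’s a minimal, generic sketch of solving ax + by = n with the Extended Euclidean Algorithm, in Python. It isn’t tied to the actual puzzle input, and it stops where I did: at a particular solution, before any minimization.

    def extended_gcd(a, b):
        """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
        if b == 0:
            return a, 1, 0
        g, x, y = extended_gcd(b, a % b)
        return g, y, x - (a // b) * y

    def solve_bivariate(a, b, n):
        """One particular integer solution (x, y) of a*x + b*y == n, or None."""
        g, x, y = extended_gcd(a, b)
        if n % g != 0:
            return None  # no integer solutions exist
        scale = n // g
        return x * scale, y * scale

    # The full solution family is (x0 + (b//g)*t, y0 - (a//g)*t) for integer t;
    # choosing the "best" t (and doing so with three or more variables) is where
    # the linear-programming machinery mentioned above comes in.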
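
For day 11, the segment-multiplication idea can be made exact with a memoised path count, which is the standard approach for counting paths in a DAG, rather than the depth-limited DFS I actually used. A generic sketch; the graph representation and node names are assumed for illustration.

    from functools import lru_cache

    def count_paths(graph, source, target):
        """Exact number of paths from source to target in a DAG (memoised DFS)."""
        @lru_cache(maxsize=None)
        def count(node):
            if node == target:
                return 1
            return sum(count(succ) for succ in graph.get(node, ()))
        return count(source)

    def paths_via(graph, start, a, b, finish):
        """With no b->a paths, every qualifying path splits uniquely into
        start->a, a->b, and b->finish segments, so the counts multiply."""
        return (count_paths(graph, start, a)
                * count_paths(graph, a, b)
                * count_paths(graph, b, finish))

    # eg. paths_via(adjacency, "start", "a", "b", "finish") on an adjacency
    # dict like {"start": ["a", "x"], "x": ["a"], ...}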
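
And the day 12 shortcut that made the annealer unnecessary boils down to a single area comparison. The parameter names here are assumptions about the input format rather than actual parsing.

    def trivially_packable(num_shapes, width, height):
        """Assume every shape needs a full 3x3 bounding box and compare areas."""
        return num_shapes * 9 <= width * height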