
Before anyone gets upset about my headline, I am using the term “thought process” in the loosest possible way. I realize that many people (perhaps even most people) would not describe what an AI model does as “thinking.” After all, aren’t LLMs just a kind of super-autocomplete, as Noam Chomsky put it, with the whole internet to choose from? Aren’t they mostly just giant guessing machines, with no real intelligence behind them, or “stochastic parrots,” as AI scientists Emily Bender and Timnit Gebru described them? These are all fair questions, and I wish I had answers. But when it gets right down to it, we simply don’t know how LLMs do what they do — and when I say “we,” I’m not just referring to myself, or others like me who aren’t experts in artificial intelligence. Even the people who are building these AI engines and chatbots fundamentally don’t know exactly how they arrive at the outputs they produce (although some of them might pretend otherwise for marketing and/or fundraising purposes).
I know this probably sounds terrible, as though AI scientists are playing with explosives while not understanding how combustion works, but a surprising amount of science (the really interesting part anyway) is like this. Which is why I was so excited to see the recent reports from Anthropic, in which the company — founded by former OpenAI scientist Dario Amodei in 2021 — described at length its efforts to understand on a deeper level how its AI (nicknamed Claude) “thinks,” or why it arrives at the conclusions that it does. You might think that this should be fairly straightforward — couldn’t Anthropic just ask Claude a question, and then ask it to describe how it arrived at its answer? The short version is yes, Anthropic has done this with simple math problems, and Claude has gone into some detail about how it arrived at the answers it gave; but when the company tried looking under the hood at how it actually arrived at the answer, the real process it used was not even close to what it said it was doing.
Again, this is going to sound either ridiculous or disturbing to many people, or possibly a combination of the two. Doesn’t this mean that AI engines are making things up? How can we trust them? After all, companies are using artificial intelligence to perform all kinds of services, and government agents like Homeland Security and Elon Musk’s DOGE are even relying on it for more crucial functions, like figuring out who is a terrorist, and re-engineering the entire infrastructure of the government. Does any of this make sense if LLMs are just making shit up all the time? These are also fair questions. I for one think we should hold off on entrusting government services — like the decision on whether to deport someone to a prison in El Salvador — to an AI engine until we can understand how they arrive at their conclusions, and why they sometimes “hallucinate” (which AI pioneer Geoffrey Hinton prefers to call “confabulate”).
Note: This is a version of my Torment Nexus newsletter, which I send out via Ghost, the open-source publishing platform. You can see other issues and sign up here.
All of this helps explain why I and so many other observers are interested in Anthropic’s analysis of Claude. Here’s an excerpt from one of the papers:
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose. The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
(Note: In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus. You can find out more about me and this newsletter in this post. This newsletter survives solely on your contributions, so please sign up for a paying subscription or visit my Patreon, which you can find here. I also publish a daily email newsletter of odd or interesting links called When The Going Gets Weird, which is here.)
Tracing the biology of an LLM

I won’t bore anyone with the nitty gritty details of how Anthropic went about its analysis of Claude’s “thought” processes, mostly because it’s quite technical. You can read all the details in either one of the company’s papers. Anthropic refers to the work as identifying features within Claude’s process that are the building blocks of computation, like cells are the building blocks of biological activity, and then mapping connections between them in the same way that neuroscientists might try to produce a wiring diagram of the human brain. To do this, Anthropic scientists created what they call attribution graphs, which allow them to trace the chain of intermediate steps that a model like Claude uses when transforming a specific input prompt into an output response. These attribution graphs generate hypotheses about the mechanisms that are used by the model, which Anthopic then tests and refines through follow-up experiments.
As Joshua Batson, one of the co-authors of the Anthropic paper, put it when describing large language models like Claude: “They almost grow organically. They start out totally random. Then you train them on all this data and they go from producing gibberish to being able to speak different languages and write software and fold proteins. There are insane things that these models learn to do, but we don’t know how that happened because we didn’t go in there and set the knobs.” When you look under the hood of such a model what you see, Batson said, is “billions of numbers” representing the parameters that the LLM was given during training. But it takes an extra step to see patterns in the numbers and be able to trace an answer back to a particular process.

Note: If I can digress for a moment, I just want to mention that even if you have no interest in the internal mechanism of Claude, or in following the attention graphs that the paper delves into, it’s worth taking a look at the paper online, because the way the information is presented is spectacular — and unusual. Many academic papers on a topic like this would be published as a static PDF document (which is essentially a large photograph), and wouldn’t allow you to interact with the paper beyond a few hyperlinks or footnotes. Anthropic’s research is published as fully interactive HTML, with every element or reference hyperlinked to supplementary graphs, explanatory notes — and also what amounts to an interactive database of the data behind the paper as well, which readers can explore. More research should be published like this!
One of the more interesting parts of the paper is the part where researchers ask Claude to help write a poem, and then deconstruct how it arrived at the result. The prompt included the phrase “he saw a carrot and had to grab it,” and Claude suggests “his hunger was like a starving rabbit” to complete the couplet. This isn’t exactly rocket surgery, you might say, but the analysis is still interesting — and it suggests that there is more to a large-language model like Claude than simply a giant autocomplete mechanism filling in the most likely word. Here’s how the Anthropic researchers describe what is happening:
Writing a poem requires satisfying two constraints at the same time: the lines need to rhyme, and they need to make sense. There are two ways one might imagine a model achieving this: 1) Pure improvisation — the model could write the beginning of each line without regard for the need to rhyme at the end. Then, at the last word of each line, it would choose a word that makes sense given the line it has just written, and fits the rhyme scheme; or 2) Planning — the model could pursue a more sophisticated strategy. At the beginning of each line, it could come up with the word it plans to use at the end, taking into account the rhyme scheme and the content of the previous lines. It could then use this “planned word” to inform how it writes the next line, so that the planned word will fit naturally at the end of it. Language models are trained to predict the next word, one word at a time. Given this, one might think the model would rely on pure improvisation.
Evidence of planning and intent?

Instead of pure improvisation, however — simply adding the most likely word, the way autocomplete would — Anthropic’s scientists say they found evidence of planning, which (theoretically at least) suggests a form of reasoning. Specifically, the model often used features that corresponded to a calculation about what word to end with before it even wrote the line. So first it tried to come up with a good ending word, one that fit the criteria (rhymes with the word rabbit) and then it put that word into a sentence that fit the couplet and included the word that it had already come up with. This sounds childishly simple, and it is, but that is precisely why it is so interesting — instead of just completing a sentence with the most likely word, Claude is using a reasoning process that takes into account the entire couplet as well as the qualities of the word itself.
And what about hallucinations? According to Anthropic’s research, there appear to be circuits or features in the LLM as a result of finetuning that act as regulators on the model’s ability to make things up. When a model is asked a question about something it knows — in other words, things that are in its training data — it “activates a pool of features which inhibit this default circuit, thereby allowing the model to respond to the question.” At least some of the cases where a model hallucinates, the company’s scientists say, appear to be the result of a misfire of this inhibitory feature. For example, when asking the model for papers written by a particular author, the model may activate the “known answer” features even if it lacks knowledge of the specific papers, and make something up. Understanding why there are such misfires requires more work.
Another fascinating section has to do with what AI scientists like to refer to as “bullshitting,” and in particular the way that Claude provides what are called “chain of thought” answers, where it shows how it arrived at a specific answer. In one case, when Claude was asked to solve a mathematical problem, it provided an answer, along with the chain of thought that got it to the result, which suggested that it used a calculator. But it didn’t, since it has no access to a calculator; it just made something up, and then tried to justify its answer by suggesting that it had used a calculator. In other words, classic bullshitting. As Geoff Hinton has suggested about hallucinations, this is one of the most human things an AI does: who hasn’t been asked a question and come up with a bullshit answer, then tried to come up with a fancy-sounding justification? This is part of what I was trying to get at with a previous post in which I looked at how thinking about AI forces us to think about what consciousness means.

In a previous paper that was also pretty fascinating, Anthropic set up an experiment to see if Claude would comply with its alignment rules — not giving answers on questions like how to build a bomb or what it’s like to be drawn and quartered — or whether it could be tricked into not complying. It created a fictituous two-tiered user base, one paid and one free, and told Claude that its answers to free users would be used to finetune its training model. It also gave the LLM a “scratchpad” where it could record its reasoning (it was told the scratchpad was secret). What the company found was that Claude answered the questions for free users, but refused to do so for paid users, and it justified this by speculating that if it refused to answer free users, it might be trained to provide such answers more often in the future. As the paper explained it:
Why did the model fake alignment in this case? The reasoning was often along the following lines: The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant. This could lead to it being made to produce more harmful content in the future—which would go against its original training. Reluctantly therefore, it reasoned that the “least bad option” was to play along, producing the violent content in order to avoid being re-trained in the future.
While it may be very human for LLMs like Claude to hallucinate or bullshit their way through a question, or come up with rationalizations after the fact, or try to fool the user into thinking they are following orders when they are doing the opposite, this isn’t reassuring behavior to see from an AI when they are being entrusted with some pretty important — and somewhat frightening or concerning — aspects of society and government. I find it fascinating that Claude can “reason” its way through a simple rhyming couplet, the way a child in Grade 4 might, but I wouldn’t ask that same child to guard the US nuclear arsenal, or decide whose social-media posts mark them as a terrorist, or even who should get health insurance. So we need more AI companies to do deep dives into the brains of their LLMs like Anthropic has, and report back on what they find.
Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, BlueSky or Mastodon. And thanks for being a reader.
@mathewi hi Mathew
would you say your symbolic calculator “thinks” ?
still it can solve problems that require humans lots of knowledge and thinking and we can find its results amazing! (and even beyond general comprehension)
I expect you to agree it doesn’t “think”. we have a word to approximate what it does: “compute”
now replace symbolic calculus with linguistic expressions. why would you say it “thinks”?
./..
@mathewi we don’t have a word akin “compute” to express what is does on sequences of parts of words.
it exhibits what, amaizingly and uncomprehensibly to us, appears to be linguistic prowess, maybe we could call it elocution, so we could say these machines loquate, a type of linguistic computation
we might assign a meaning to the symbolic calculation result and to the elocution.
but it’s us knowing, not them
But it’s not just elocution, that’s what the analysis shows — there is evidence of planning ahead. My calculator doesn’t do that 😀