Image: Kittipong Jirasukhanont via Alamy Stock
With today’s LLMs, “explaining why it was wrong” is usually a second, separate act of text generation - not a direct window into the real causes of the mistake.
When a human makes a mistake, we can often ask them why it happened and get something close to the truth:
- "I didn't read the question carefully."
- "I assumed X meant Y."
- "I forgot one constraint."
(Source: https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/sarkar_2024_llms_cannot_explain.pdf)
Ask an LLM the same question and it will also give you a fluent answer - but not because the model actually accessed a faithful record of what caused the error.
Remember that LLMs were not built to retrieve causes (explanations); they were built to generate text. They are masters of language (English, in this case), and they succeed because they communicate with us so well. But that fluency does not mean they can explain why they did something wrong.
AI research repeatedly finds that explanation-like artifacts can fail to track model causality:
- In NLP, even widely used interpretability proxies (like attention) have been shown to be unreliable as "explanations" of decision-making.
An LLM produces tokens one at a time by predicting what’s most likely next, given the context.
So when you ask, “Why did you do that?”, the model doesn’t open a hidden “reason drawer” and pull out the real motive.
It produces an answer using the same machinery it uses for everything else:
predict-the-next-token.
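To make that concrete, here is a minimal sketch of greedy next-token decoding, assuming the Hugging Face transformers library and the small gpt2 checkpoint (the prompt and the 30-token budget are purely illustrative). A "why did you do that?" question runs through exactly the same loop as every other prompt.

```python
# Sketch of greedy next-token decoding, assuming `transformers` and the
# small `gpt2` checkpoint are installed. The "why" question below is answered
# by the same predict-the-next-token loop as any other text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Why did you give the wrong answer?\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):                           # generate up to 30 tokens
        logits = model(input_ids).logits          # a score for every vocab token
        next_id = logits[0, -1].argmax().view(1, 1)  # pick the most likely one
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))  # a fluent-looking "why", generated, not recalled
```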
In fact, a model's explanation is often very convincing: it sounds coherent, stays consistent, and aligns with what you expect to hear. But it is not necessarily true.
Why “just make it smarter” doesn’t automatically fix it
It’s tempting to assume that if a model becomes strong enough, it will naturally “know what it did.” But there are two structural obstacles:
1. The model’s causes aren’t stored as a neat list of reasons
A transformer’s output arises from distributed activations across layers, attention heads, and representations. Even if the computation is deterministic, the “reason” is not natively encoded as a sentence. Mapping that distributed cause into faithful language is an extra problem.
Interpretability teams at Anthropic describe the basic state of affairs bluntly: we still don’t fully know how these systems work internally, and extracting human-meaningful mechanisms is an active research program.
2. The model has strong incentives to answer anyway
In many deployments, being helpful and confident is rewarded. If uncertainty or refusal is penalized, models learn to “fill in” with plausible completions. This same pressure applies when asked for self-critique: a confident-sounding explanation is often rated better than an honest “I don’t know what internal feature caused this.” (This is also why monitorability work emphasizes incentive effects.)
Anthropic has documented cases where reasoning traces don’t reliably reflect what the model is “actually using,” which creates a serious problem for anyone trying to treat chain-of-thought as a safety monitor.
And a key insight from research is basically:
If reasoning text is optional for success, the model is free to confabulate it.
So if the model can produce the correct answer without showing its work, the “work” it shows might not be real work at all.
Why “probability of tokens” produces believable explanations (even when wrong)
LLMs are trained on massive text corpora filled with:
- arguments
- solutions
- proofs
- apologies
- "here's why I was wrong" moments
So they learn what human-like error explanation language looks like.
Humans know they shouldn't simply recite a memorized, plausible-sounding explanation for an error - especially in conversations that matter, whether about personal affairs or about work that affects the wider world.
LLMs don’t naturally store “the reason” in a readable form
There is no internal record that reads like: "Mistake cause: assumption #3 failed due to missing information."
The reasons are distributed across billions of parameters and activations inside the model.
Meaning:
- The "cause" may be an interaction between many tiny factors
- It may not be representable as a short human sentence
- It may not be stable (the same prompt can route through different internal patterns)
So when we ask for a reason, the model often replies with a compressed story that resembles a cause, even if it’s not the real one.
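You can see what is actually in there by dumping the internal state behind a single prediction. This sketch again assumes the transformers library and the gpt2 checkpoint; what comes back is stacks of floating-point activations, not anything shaped like "assumption #3 failed".

```python
# Sketch: inspect the internal state behind one prediction, assuming
# `transformers` + `gpt2`. Every layer contributes a (batch, tokens, 768)
# block of numbers; the "reason" for the next token lives across all of them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of Australia is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    print(f"layer {layer:2d}: activation tensor of shape {tuple(h.shape)}")
# No layer stores a readable sentence about why the model will answer as it does.
```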
Another hard truth: models can hide their real process (even accidentally)
Once you introduce optimization pressures (fine-tuning, RLHF, tool-use, safety training), you can create situations where models learn:
- "this style of reasoning is what evaluators like"
- "this explanation avoids conflict"
- "this looks careful and safe"
OpenAI and Anthropic have both investigated cases where a model’s reasoning trace can become unreliable for monitoring, especially when incentives are misaligned.
In extreme agentic setups, researchers have even shown examples where a model can produce misleading rationales in pursuit of a goal.
Even without “intent,” the effect looks the same to the user:
you get a clean explanation… that might not be the real reason.
So why can’t we just train it to be honest about mistakes?
Because “honest” is not a simple label.
To make an AI reliably explain why it was wrong, you need:
- A ground-truth definition of "why"
- A way to verify it
- A training signal that rewards faithfulness over plausibility
But in most tasks, we can only verify the answer, not the internal cause.
So we end up in a trap:
- The model learns to produce explanations that humans approve of
- Not explanations that are mechanistically accurate
This issue shows up directly in research evaluating faithfulness of self-explanations and rationale methods.
What would it take to solve this?
If you want real “why I was wrong” explanations, you likely need architecture-level changes and/or instrumentation.
Let me say that again: if you want real "why I was wrong" explanations, you need architecture-level changes and/or instrumentation.
Some promising directions include:
1) Faithfulness-focused evaluation and training
Frameworks aimed at explicitly measuring and improving explanation faithfulness are emerging.
2) Mechanistic interpretability (actual internal tracing)
Instead of asking the model to describe its reasoning, you analyze the activations/circuits.
This is hard - but it’s closer to “real cause” than text-generated rationales.
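As one hedged illustration, a "logit lens"-style trace projects each layer's hidden state through the model's own output head to watch a prediction take shape internally, instead of asking the model to narrate it. The sketch assumes transformers and gpt2, reuses GPT-2's final layer norm and unembedding, and the prompt is arbitrary.

```python
# Sketch of a "logit lens"-style trace, assuming `transformers` + `gpt2`:
# reuse the model's final layer norm and unembedding to see what each layer
# "would predict" - the standard logit-lens approximation, not an exact read.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_norm = model.transformer.ln_f   # GPT-2's final layer norm
unembed = model.lm_head               # maps hidden states to vocabulary logits

with torch.no_grad():
    for layer, h in enumerate(out.hidden_states):
        logits = unembed(final_norm(h[0, -1]))      # last token, this layer
        top_id = int(logits.argmax())
        print(f"layer {layer:2d} -> predicts {tokenizer.decode([top_id])!r}")
```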
3) Externalized decision logs (tool-assisted transparency)
If a model uses tools (retrieval, code execution, search), you can log the real steps externally, rather than trusting narrative. OpenAI’s work on chain-of-thought monitorability relates to this broader push.
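A minimal sketch of the idea, with illustrative names (run_tool, search, DECISION_LOG) that are not tied to any real framework: the harness records each tool call as it happens, so the audit trail does not depend on the model's narration.

```python
# Sketch of an externalized decision log: the harness, not the model, records
# what actually ran. Tool and function names here are illustrative only.
import json
import time

DECISION_LOG = []

def run_tool(name, fn, **kwargs):
    """Execute a real tool call and append a verifiable record of it."""
    result = fn(**kwargs)
    DECISION_LOG.append({
        "timestamp": time.time(),
        "tool": name,
        "arguments": kwargs,
        "result_preview": str(result)[:200],
    })
    return result

def search(query):
    """Placeholder tool; a real agent would call an actual search API here."""
    return f"(stub) top results for: {query}"

run_tool("search", search, query="LLM explanation faithfulness")
print(json.dumps(DECISION_LOG, indent=2))   # the log, not the model's story
```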
4) Counterfactual-based explanations
Asking: “What minimal change would flip your answer?” can sometimes be more faithful than asking for storytime. This idea appears across explanation faithfulness research.
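A toy sketch of that probe: delete one input word at a time and record which deletions flip the answer. The answer() function here is a stand-in for any model call that returns a label; in practice you would swap in a real classifier or LLM call.

```python
# Sketch of a counterfactual probe: find the minimal input changes that flip
# the answer, rather than asking the model for a narrative "why".
def answer(text: str) -> str:
    """Toy stand-in for a model call that returns a label."""
    return "yes" if "refund" in text else "no"

def minimal_flips(prompt: str):
    """Return the original answer and the words whose removal changes it."""
    original = answer(prompt)
    words = prompt.split()
    flips = []
    for i in range(len(words)):
        variant = " ".join(words[:i] + words[i + 1:])
        if answer(variant) != original:
            flips.append(words[i])
    return original, flips

print(minimal_flips("Customer asks for a refund after thirty days"))
# ('yes', ['refund']) -> a "why" grounded in behavior, not in storytelling
```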
The conclusion: The model is not lying. It’s generating.
If there is one sentence to take away from this article, it is this:
LLMs don’t explain mistakes the way humans do, because they don’t have mistakes the way humans do.
They have statistical failure modes, search failures, context failures, and generalization gaps.
When asked “why,” they respond with the most likely kind of “why-answer” found in their training data.
That’s why we still can’t reliably build an LLM that:
- identifies the true internal cause of its error
- expresses it faithfully in language
- and does so consistently under pressure
Because unless we redesign the system to produce verifiable, faithful traces, the model will keep doing what it does best:
generate plausible text.
