LLM Reasoning Happens Before the Words -- Not Because of Them

What If Chain-of-Thought Is Just the Transcript, Not the Thinking?

Here's the simple version. "Let's think step by step" makes LLMs perform better -- that's the core insight behind Chain-of-Thought (CoT) prompting. But a new position paper argues that the actual reasoning doesn't happen in the text the model writes. It happens in the transformer's internal latent states before any words are generated.

The CoT text might be a byproduct of reasoning, not the cause.

The Paper

[Figure: Visualizing latent reasoning pathways inside the transformer]

arXiv 2604.15726, published April 2026. This is a position paper -- it doesn't propose a new model. Instead, it challenges the prevailing understanding of why CoT works by testing three hypotheses quantitatively.

What Wasn't Adding Up

CoT prompting became standard practice after Wei et al.'s 2022 Google Brain paper. The empirical evidence was overwhelming: prompting a model to reason step by step improved accuracy on multi-step problems.

But some observations didn't fit the clean narrative:

Shuffling CoT text randomly sometimes barely hurt performance.
Models occasionally wrote incorrect reasoning steps but still reached the right answer.
Probing internal representations revealed that answer-related information was encoded before CoT text generation started.
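
To make the first observation concrete, here is a minimal sketch of how a shuffled-CoT control prompt could be built. It is an illustration, not the paper's procedure: the prompt layout, the one-step-per-line assumption, and the names shuffle_cot_steps and build_prompt are all hypothetical.

```python
import random


def shuffle_cot_steps(cot_text: str, seed: int = 0) -> str:
    """Return the chain of thought with its steps in random order.

    Assumption: each non-empty line of `cot_text` is one reasoning step.
    """
    steps = [line for line in cot_text.splitlines() if line.strip()]
    rng = random.Random(seed)  # seeded so the control is reproducible
    shuffled = steps[:]
    rng.shuffle(shuffled)
    return "\n".join(shuffled)


def build_prompt(question: str, cot_text: str) -> str:
    """Hypothetical prompt layout: question, reasoning, then an answer slot."""
    return f"Q: {question}\n{cot_text}\nA:"


question = "How many apples are left?"
cot = (
    "Step 1: 3 boxes hold 4 apples each.\n"
    "Step 2: 3 * 4 = 12 apples in total.\n"
    "Step 3: 2 are eaten, so 12 - 2 = 10."
)

original_prompt = build_prompt(question, cot)
shuffled_prompt = build_prompt(question, shuffle_cot_steps(cot, seed=1))
print(shuffled_prompt)
```

Feeding both prompts to the same model and comparing answer accuracy over a benchmark is the comparison the observation describes. If the text itself carried the reasoning, scrambling step order should hurt much more than it sometimes does.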

These anomalies pointed to an uncomfortable question: does CoT cause reasoning, or does the model finish reasoning internally and then narrate what it already decided?

Three Hypotheses, Tested

Hypothesis	Claim	Finding
H1: Latent Reasoning	Reasoning occurs in transformer latent states	Supported -- answer info exists in internal representations before CoT text
H2: Explicit CoT	CoT text directly causes reasoning	Weakly supported -- helps, but effect is often independent of text quality
H3: Serial Compute	CoT's value is providing extra computation steps	Partially supported -- more compute helps, but doesn't fully explain results

The key finding is H1. Probing experiments on middle-layer activations showed that answer-relevant information was already encoded in latent states before the model started generating CoT text. The written reasoning looks more like a post-hoc explanation than the reasoning itself.
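
For readers unfamiliar with probing, the sketch below shows the general idea under stated assumptions. It does not reproduce the paper's setup: the "activations" are synthetic vectors standing in for middle-layer states captured before any CoT token is generated, and the helper names, dimensions, and signal strength are made up for illustration. A linear probe is just a small classifier trained to read the answer from those vectors; a shuffled or signal-free control shows what chance-level accuracy looks like.

```python
import math
import random


def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for mid-layer activations (NOT real model data).

    Each vector is Gaussian noise plus `signal` times a fixed direction
    whose sign depends on the binary answer label.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + signal * sign for _ in range(dim)]
        data.append((vec, label))
    return data


def train_probe(data, dim, epochs=20, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    """Fraction of examples where the probe's sign matches the label."""
    hits = 0
    for vec, label in data:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        hits += int((z > 0) == (label == 1))
    return hits / len(data)


rng = random.Random(0)
DIM = 8

# Case 1: the answer is linearly encoded in the (synthetic) activations.
train, test = make_activations(400, DIM, 1.0, rng), make_activations(200, DIM, 1.0, rng)
w, b = train_probe(train, DIM)
probe_acc = accuracy(w, b, test)

# Case 2 (control): no answer signal, so the probe should sit near chance.
ctrl_train, ctrl_test = make_activations(400, DIM, 0.0, rng), make_activations(200, DIM, 0.0, rng)
cw, cb = train_probe(ctrl_train, DIM)
control_acc = accuracy(cw, cb, ctrl_test)

print(f"probe accuracy: {probe_acc:.2f}, control accuracy: {control_acc:.2f}")
```

With a real model, the vectors would come from hidden states recorded at the last prompt token. High held-out probe accuracy relative to the control is what "answer information is already encoded" means in practice. It shows the information is present, not that the model uses it.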

Why This Matters

[Figure: How this paper reshapes our understanding of CoT]

If this paper is right, several things need rethinking.

First, OpenAI's o1/o3 reasoning models. These generate long CoT sequences as part of their reasoning process. But through this paper's lens, o1/o3's performance gains might come from additional computation steps (more tokens = more serial compute), not from the content of the reasoning text itself.

Second, Google's Gemini Thinking Mode. When Gemini shows you its "thinking process," is that the actual reasoning or an after-the-fact narration of reasoning that already happened internally?

Third, it connects to the 2025 "Thinking Without Words" research on abstract CoT. That work showed models can "think" using abstract tokens instead of natural language. This paper strengthens the theoretical foundation for that approach.
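
The "more tokens = more serial compute" reading from the first point can be sketched with simple arithmetic. This is a back-of-the-envelope illustration, not something taken from the paper: the layer count is an arbitrary example value, whitespace splitting stands in for a real tokenizer, and filler_control is a hypothetical helper for building a length-matched control whose text carries no reasoning content.

```python
def serial_depth(num_layers: int, generated_tokens: int) -> int:
    """Sequential layer applications needed to emit `generated_tokens` tokens.

    An autoregressive decoder runs one forward pass through every layer
    per generated token, and each pass depends on the tokens before it.
    """
    return num_layers * generated_tokens


def filler_control(cot_text: str, filler: str = ".") -> str:
    """Replace every whitespace-separated token with a filler symbol.

    Assumption: whitespace tokens approximate model tokens; a real
    experiment would match length using the model's own tokenizer.
    """
    return " ".join(filler for _ in cot_text.split())


NUM_LAYERS = 32  # arbitrary example value, not any specific model
cot = "3 boxes hold 4 apples each so 3 * 4 = 12"
cot_tokens = len(cot.split())

direct = serial_depth(NUM_LAYERS, generated_tokens=1)            # answer only
with_cot = serial_depth(NUM_LAYERS, generated_tokens=cot_tokens + 1)
control = filler_control(cot)

print(direct, with_cot)
print(control)
```

If a model answered as well after the filler control as after real CoT text, that would favor the serial-compute hypothesis (H3) over the explicit-text hypothesis (H2). The summary above reports only partial support for H3, so a gap between the two conditions would be expected.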

Limitations

This is a position paper, and its scope has clear boundaries.

Experiments cover specific models and benchmarks. Generalization to all LLMs is an open question.
The paper shows that answer-relevant information exists in latent states, but not how the underlying reasoning mechanism works.
The paper isn't arguing that CoT is useless. CoT helps. The claim is that the reason it helps might be different from what we assumed.

CoT prompting won't change overnight. But if the paper's argument holds, our understanding of why it works needs updating.

References

arXiv 2604.15726: LLM Reasoning Is Latent, Not the Chain of Thought
