The Mirage of Machine Minds


Model Collapse Ends AI Hype

What Large Language Models Can and Cannot Do

A rapidly accumulating body of peer-reviewed research challenges the popular notion that today's AI systems think, reason, or generate genuinely new knowledge—while also revealing surprising capabilities that defy easy dismissal.

Bottom Line Up Front: Large language models (LLMs) are sophisticated statistical pattern-matchers, not reasoning engines. Multiple independent research programs, including studies from Apple, UCLA, Arizona State, Anthropic, Oxford, and UCL, converge on three findings: 
 
(1) LLMs process language rather than thinking, relying on surface-level statistical correlations rather than genuine comprehension; 
 
(2) their apparent "reasoning" is largely post-hoc rationalization, with chain-of-thought outputs that frequently misrepresent the model's actual computational process; and 
 
(3) information-theoretic constraints mean LLMs cannot generate truly novel knowledge—they redistribute and compress the information present in training data, and when trained on their own outputs, they degenerate. 
 
These findings carry urgent practical implications for high-stakes deployments in medicine, law, and science, yet do not diminish the very real productivity value these systems deliver when used appropriately and critically.

In November 2024, a computer scientist stood before students at Baylor University and posed a deceptively simple question: What, exactly, is a large language model doing when it produces a confident, grammatically polished, and occasionally brilliant answer? His answer, grounded in a growing stack of peer-reviewed literature, was sobering: it is predicting the next word. Not thinking. Not reasoning. Not discovering. Predicting—one token at a time—using statistical patterns compressed from hundreds of billions of examples of human-generated text.

That framing is no longer fringe. Since 2022, researchers at some of the world's most prestigious institutions have published a cascade of studies documenting the specific, reproducible ways in which LLMs fail at tasks that genuine reasoning would handle trivially—while also excelling, sometimes remarkably, at tasks that look far harder. Understanding this paradox is no longer merely an academic exercise. With AI systems being evaluated for roles in medical diagnosis, legal reasoning, scientific discovery, and national security, the stakes of misunderstanding their nature have never been higher.

I. The Architecture of Eloquent Ignorance

The conceptual ancestor of today's GPT-class models was sketched in 1948 by Claude Shannon, the father of information theory, who demonstrated that even a simple system that predicts the next word based only on the previous word can produce surprisingly grammatical text. The key insight—and the key limitation—has not changed in seventy-seven years: fluency is not understanding.
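Shannon's 1948 scheme can be sketched in a few lines: record which words follow which in a corpus, then generate by repeatedly sampling a successor of the last word emitted. The corpus and output below are toy illustrations of the idea, not Shannon's originals.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length=8, seed=0):
    """Produce text by repeatedly sampling a successor of the previous word."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
print(generate(model, "the"))
```

Even this trivial model produces locally grammatical fragments, because every adjacent word pair it emits was seen in the corpus. Fluency of exactly this kind, scaled up enormously, is what modern LLMs deliver.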

Modern LLMs such as GPT-4o, Claude 3, Gemini, and their successors extend Shannon's basic idea using deep neural networks with hundreds of billions of parameters, trained on text corpora approximating the entire digitized output of human civilization. Text is first split into tokens (words and word fragments mapped to integer identifiers) through a process called tokenization; an embedding layer then projects each token into a high-dimensional geometric space in which semantically related tokens cluster together. A second mechanism, attention, allows the model to weigh contextual relationships across thousands of tokens simultaneously.

The result is a system of remarkable fluency. Ask it for a sonnet about Waco, Texas, and it delivers recognizable verse. Ask it to summarize a dense scientific paper, and it produces a readable précis. Ask it to write working Python code, and it frequently succeeds. What the system is doing in each case, however, is not what the word "thinking" ordinarily implies: it is selecting, through learned probability distributions over tokens, the sequence of words most statistically consistent with its training data given the prompt.
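The selection step can be made concrete with a toy sketch: a softmax turns the model's raw scores (logits) into a probability distribution over a vocabulary, and decoding picks from that distribution. The four-word vocabulary and the logit values below are invented for illustration; real models score vocabularies of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for continuations of "The capital of France is"
vocab = ["Paris", "London", "banana", "the"]
logits = [9.1, 5.3, 0.2, 2.7]  # invented numbers for illustration

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks the mode
print(next_token)  # Paris
```

Nothing in this computation consults the meaning of "capital" or "France"; the scores encode only which continuations were statistically typical in training data.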

II. Jagged Intelligence: The Problem of Unpredictable Failure

Perhaps the most disorienting property of LLMs is what researchers have taken to calling "jagged intelligence"—a performance profile that bears no reliable relationship to conventional task difficulty. The same model that can prove theorems and win mathematical olympiads can fail to count the letter "i" in the word "inconvenience." The same system that writes sophisticated legal memos miscalculates multi-digit multiplication.

A landmark 2024 study from Apple researchers, published under the title GSM-Symbolic, provided rigorous experimental grounding for this intuition. The team introduced a benchmark that systematically varied the numerical values and surface features of grade-school math problems—holding the underlying logical structure constant. Their findings were striking: performance declined measurably across all tested models when only numerical values were changed, and fell dramatically—by up to 65 percent in state-of-the-art models—when a single clause of irrelevant information was added to the problem statement. Because the extra clause contributed nothing to the solution, a system performing genuine reasoning should have been unaffected.

Problem (original): Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's count on Sunday. How many kiwis does he have in total?

Problem (with irrelevant clause added): Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's count on Sunday. Five of them were a bit smaller than average. How many kiwis does he have in total?

Result: Multiple frontier models subtracted 5, citing the "smaller" kiwis, a spurious pattern match to familiar problem structures in training data.
Illustrative example from Mirzadeh et al. (2024), GSM-Symbolic benchmark. The irrelevant clause triggers a subtraction operation absent from genuine reasoning.
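The arithmetic itself is trivial for any deterministic program, which is precisely what makes the failure diagnostic. A sketch of the intended computation:

```python
def total_kiwis(friday, saturday):
    """Sunday's count is double Friday's; the 'smaller' clause is irrelevant."""
    sunday = 2 * friday
    return friday + saturday + sunday

correct = total_kiwis(44, 58)  # 44 + 58 + 88
spurious = correct - 5         # the pattern-matched wrong answer
print(correct, spurious)       # 190 185
```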

The Apple team's conclusion was direct: current LLMs cannot perform formal mathematical reasoning. They replicate reasoning steps encountered in training data rather than applying underlying logical rules. A 2022 paper from the StarAI Lab at UCLA had arrived at essentially the same conclusion through a different route, finding that LLMs solved logic problems successfully only when the distribution of problems matched training-data patterns—a signature of pattern recognition, not deductive inference.

A follow-on 2024 study by Mirzadeh and colleagues extended this analysis to probabilistic reasoning, finding that merely relabeling the items being counted—replacing "PC, laptop, keyboard" with "hamburger, cheeseburger, French fries" while preserving the identical numerical structure—caused significant performance degradation. Labels, which carry no logical relevance to counting tasks, were apparently influencing the model's output through domain-specific pattern associations baked in during training.

"Current LLMs cannot perform formal mathematical reasoning. They replicate reasoning steps from training data rather than applying underlying logical rules."
— Mirzadeh et al., GSM-Symbolic (2024)

III. The Rationalization Problem: When Explanations Lie

The emergence of chain-of-thought (CoT) prompting—instructing models to show their work before delivering an answer—was greeted with considerable optimism. If a model explains each reasoning step, the thinking went, surely those steps reflect how it actually arrived at its conclusion. That optimism has been substantially eroded by a body of experimental evidence demonstrating that CoT outputs frequently bear little faithful relationship to the model's actual computational process.

A widely cited 2023 study by Turpin and colleagues at NYU and Anthropic performed a revealing experiment: they presented models with multiple-choice questions in two versions, differing only in which option an embedded "hint" suggested was correct. When the hint pointed toward a biased answer (including a stereotyped racial attribution in a vignette-style question), the models frequently adopted that answer while constructing a plausible post-hoc justification that made no mention of the hint. The chain of thought, in other words, was confabulation: the model had already "decided" its answer through pattern matching and was generating a rationale after the fact.

Anthropic's own researchers subsequently quantified this problem more rigorously. Studying how strongly models condition their final answers on their generated reasoning steps, they found significant variation across tasks and model sizes. Counterintuitively, larger, more capable models showed less faithful reasoning on most tasks—an instance of "inverse scaling" that suggests that improving raw performance may come at the cost of interpretability. Their conclusion: CoT can be faithful under carefully chosen conditions, but it is not a reliable transparency mechanism.
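One way such faithfulness is probed is a truncation test: if the final answer does not change when the stated reasoning is cut short, the reasoning likely did not drive the answer. The code below is an illustrative sketch of that protocol with stand-in model functions, not the actual experimental harness or any real API.

```python
def faithfulness_probe(model, question, full_cot):
    """Return the fraction of chain-of-thought truncations that flip the answer.

    `model` is any callable taking (question, reasoning_steps) and returning
    an answer string; 0.0 means the answer never depends on the reasoning.
    """
    baseline = model(question, full_cot)
    flips = 0
    for k in range(len(full_cot)):
        if model(question, full_cot[:k]) != baseline:
            flips += 1
    return flips / len(full_cot)

# Toy stand-in: a 'model' whose answer ignores its own reasoning entirely.
unfaithful = lambda q, cot: "B"
print(faithfulness_probe(unfaithful, "q", ["step1", "step2", "step3"]))  # 0.0
```

A score of zero is the signature Anthropic's researchers were looking for: reasoning text that is decoration, not computation.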

A 2025 preprint from Oxford University sharpened this critique further, arguing that chain-of-thought outputs should not be treated as explainability at all. The authors found that models exhibited "implicit post-hoc rationalization"—constructing coherent arguments for logically contradictory conclusions when asked the same question in reversed form. Even specialized "reasoning models" like DeepSeek-R1 failed to acknowledge problematic influences in a substantial fraction of cases. Among production models tested in early 2025, post-hoc rationalization rates ranged from 0.04 percent for Claude Sonnet 3.7 with extended thinking to 13 percent for GPT-4o-mini—suggesting that while the problem is not equally severe across all systems, none are fully immune.

The implications for high-stakes applications are serious. A model that constructs elaborate, internally consistent justifications for incorrect or biased conclusions—without signaling that it is doing so—is not merely wrong; it is systematically misleading. Researchers at Arizona State University have proposed abandoning the "reasoning" and "thinking" vocabulary for these outputs entirely, arguing in a 2025 preprint that treating intermediate token sequences as reasoning traces constitutes what they call a "cargo cult explanation"—form without substance.

The "Aha" Moment That Isn't

DeepSeek's celebrated R1 model attracted attention in early 2025 partly because its chain-of-thought outputs sometimes included the token "aha"—apparently signaling a sudden realization, as a human problem-solver might experience. Kambhampati and colleagues at Arizona State deflated this interpretation directly: when an LLM outputs "aha," the neural network's parameters have not changed. No internal state has shifted. The only difference between the pre-aha and post-aha computational steps is that the token "aha" now appears in the context window. The appearance of insight, they argued, is a syntactic artifact, not a semantic event.

IV. The Gödelian Ceiling: Why Syntax Is Not Semantics

The deepest theoretical challenge to LLM-based reasoning was identified not by a computer scientist but by a mathematician—Kurt Gödel—in 1931, more than a decade before the first electronic computers. Gödel's incompleteness theorems demonstrated that for any formal system capable of representing arithmetic, there exist true mathematical statements that cannot be proven within that system. The proof rested on a careful distinction between syntax (the manipulation of symbols according to rules) and semantics (the truth-value of those symbols' referents).

LLMs are, at their computational core, extraordinarily sophisticated syntax engines. They encode relationships among symbols and operate on those encodings through mathematical transformations. What they do not do, and what formal arguments suggest they cannot do through syntax alone, is access the semantic content—the actual meaning—of the symbols they manipulate. A vector representing "are" and a vector representing "aren't" may be geometrically close in embedding space (having appeared in similar textual contexts), but the numerical relationship between their token identifiers (say, 553 and 23,236 in one tokenizer's vocabulary) encodes nothing about negation. The system knows the vectors are similar; it does not know what either word means.
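The point can be made concrete with toy numbers. The three-dimensional vectors below are invented for illustration; real embeddings have thousands of dimensions, but the geometry works the same way.

```python
import math

def cosine_similarity(u, v):
    """Standard measure of geometric closeness between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-d embeddings: "are" and "aren't" appear in similar contexts,
# so contextual training places them close together in vector space.
emb_are = [0.81, 0.40, 0.42]
emb_arent = [0.79, 0.44, 0.41]

print(round(cosine_similarity(emb_are, emb_arent), 3))  # 0.999

# The token IDs themselves are arbitrary indices; their arithmetic
# relationship encodes nothing about negation or meaning.
```

High cosine similarity records co-occurrence statistics, nothing more: two antonyms can sit nearly on top of each other precisely because they appear in interchangeable contexts.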

Philosopher John Searle and computational linguist Emily Bender have advanced related arguments from different disciplinary starting points. Searle's Chinese Room thought experiment—a formal system that manipulates Chinese symbols according to rules without understanding Chinese—anticipates the LLM architecture by decades. Bender and colleagues' 2021 "stochastic parrots" paper extended this critique to the scale of modern foundation models, arguing that statistical correlation among symbols, however vast the training corpus, cannot bootstrap genuine semantic understanding.

V. The Recursion Trap: Model Collapse and the Limits of Self-Reference

If LLMs cannot generate genuinely new information through reasoning, can they at least sustain and amplify the information present in their training data? A striking strand of research suggests the answer is no—and that recursive self-training actively destroys the diversity and accuracy of that information.

In July 2024, Ilia Shumailov, Zakhar Shumaylov, and colleagues at Oxford University published a landmark study in Nature demonstrating what they termed "model collapse." When a model trained on human-generated data produces synthetic text, and that synthetic text is used to train the next generation of models, performance degrades in a characteristic pattern. Early-stage collapse depletes the tails of the data distribution—rare knowledge, minority perspectives, and edge-case facts are the first casualties. Late-stage collapse produces outputs that bear little resemblance to coherent human writing. In repeated experiments, models trained through several generations of synthetic recursion eventually produced the textual equivalent of nonsense.

The mechanism is analogous to repeatedly photocopying a document: each generation introduces small errors and loses faint details; compound enough iterations and the image is unrecognizable. A September 2025 replication study by researchers at University College London and Holistic AI confirmed the finding, observing complete syntactic collapse by the fifteenth generation in some model lineages.
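The mechanism can be caricatured in a few lines, under a strong simplifying assumption: each "model" is nothing more than a Gaussian distribution fit to a finite sample drawn from its predecessor. Even this toy version exhibits the characteristic drift toward a narrower distribution.

```python
import random
import statistics

def train_generation(mu, sigma, n_samples, rng):
    """'Train' a new model: fit a Gaussian to samples drawn from the old one."""
    samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)

rng = random.Random(42)
mu, sigma = 0.0, 1.0  # generation 0: the 'human-data' distribution
for gen in range(1, 16):
    mu, sigma = train_generation(mu, sigma, n_samples=50, rng=rng)

# Finite-sample fitting tends to lose tail mass each round: sigma drifts
# downward on average, the toy analogue of rare knowledge vanishing first.
print(round(sigma, 3))
```

The analogy is loose (real collapse involves full text distributions, not a single Gaussian), but the driver is the same: each generation estimates its predecessor from a finite sample, and estimation error compounds.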

The societal stakes of this finding are acute. By April 2025, independent analyses estimated that roughly 74 percent of newly created web pages contained some AI-generated text. AI-written pages in Google's top-20 search results climbed from approximately 11 percent to nearly 20 percent between May 2024 and July 2025. As the web increasingly fills with model-generated content—the very corpus on which future models will be trained—the conditions for systemic model collapse are being assembled at scale.

VI. Conservation of Information: The Mathematical Ceiling

The theoretical framework underlying these empirical observations is the principle of conservation of information—a result with roots in algorithmic information theory that constrains what any learning system can do, regardless of architecture or scale.

The intuition is simple: if you use a magic hat to produce a rabbit, the total difficulty of the task has not decreased—you have merely redistributed it. Getting the rabbit from the hat becomes easy, but getting the magic hat in the first place is at least as hard as producing the rabbit directly. The hat adds a step; it does not add information.

Applied to LLMs: training a model on a massive corpus produces "good dice"—probability distributions over tokens calibrated to human language. Given those dice, generating fluent text is computationally cheap. But the information cost of obtaining those dice through training on human-generated text is at least as large as the information content of the outputs the model will ever produce. The model is a redistributor and compressor of existing information, not a creator of new information.

George Montañez formalized this argument in a 2017 paper, "The Famine of Forte," proving a measure-theoretic version of the conservation principle. In 2019, Montañez and colleagues extended the proof to any artificial learning system—demonstrating that no architectural innovation, additional training data, or refinement of reinforcement learning can escape this fundamental constraint. Further generalizations by William Dembski and others have shown that the same conservation principle applies to any probabilistic search system. The result is not merely an engineering limitation; it is a mathematical fact about the nature of learning.

This does not mean LLMs cannot produce correct outputs—including correct outputs never explicitly present in their training data. Interpolation between known facts, extraction of logical implications from a knowledge base, and compression-driven generalization can all produce outputs that appear novel. But each of these mechanisms requires that the relevant information already be implicitly present in the training distribution. The model cannot reach outside the convex hull of its training data to discover genuinely new knowledge; it can only navigate within it.

VII. What LLMs Can Legitimately Do: Calibrating Expectations

None of these findings justify dismissing LLMs as useless. The same body of research that documents their limitations also reveals genuine, if bounded, capabilities.

Mathematical olympiad performance—genuine deductive proof in constrained, well-represented domains—has improved dramatically. A 2025 study by Duc and colleagues found that frontier models could verify or prove approximately one-third of sixty-six number theory conjectures. By mid-2025, frontier reasoning systems had reached gold-medal-level performance on International Mathematical Olympiad problems. These are remarkable results, achieved by systems whose underlying mechanism is statistical next-token prediction.

The resolution of the paradox—olympiad-level mathematics alongside failure to count letters—lies in the structure of training data. Mathematical olympiad problems and their solutions are extensively represented in digitized human knowledge; counting letter occurrences is a task that emerges rarely in text corpora. The model has been shown, through billions of examples, how to navigate formal mathematical structures; it has been shown almost nothing about character-counting as an explicit task. Tokenization compounds the problem: the model receives subword chunks rather than individual letters, so the characters it is asked to count are never directly visible to it.
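The contrast with conventional software is stark: exact character counting is a one-line deterministic operation, and a toy subword split (invented here for illustration; real tokenizers differ) shows why the model's view of text obscures it.

```python
# Exact character counting is trivial for a deterministic program.
assert "inconvenience".count("i") == 2

# A hypothetical subword tokenization, invented for illustration: the model
# never sees individual letters, only opaque chunk identifiers.
tokens = ["in", "conven", "ience"]
assert "".join(tokens) == "inconvenience"
# From integer IDs standing in for these chunks, nothing about
# the letter "i" is visible to the network at all.
```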

The practical implication is that LLMs can be extraordinarily useful as sophisticated retrieval, synthesis, drafting, and code-generation tools—provided users maintain calibrated skepticism, verify outputs against authoritative sources for high-stakes decisions, and understand that confident fluency is not a reliable signal of correctness.

"The same body of research that documents LLM limitations also reveals genuine, if bounded, capabilities. The two are not in contradiction."

VIII. The Regulatory and Policy Horizon

Policymakers are beginning to grapple with the gap between AI capability narratives and the empirical record. The European Union's AI Act, which entered force in August 2024, establishes risk-tiered obligations for "high-risk" AI applications—including medical devices, educational assessment, and law enforcement—that echo the research community's concerns about opacity, hallucination, and unfaithful reasoning. In the United States, the National Institute of Standards and Technology's AI Risk Management Framework provides voluntary guidance emphasizing transparency, documentation of known limitations, and human oversight for consequential deployments.

Legal scholars at Harvard Law School's Journal of Law & Technology have argued that model collapse poses not only a technical risk but a legal and societal one: as pre-2022 human-generated text becomes an increasingly scarce resource for training future models, competitive advantages in AI development may accrue permanently to incumbents with access to large historical corpora—potentially foreclosing the ability of future entrants to train competitive models at all.

Conclusion: Respecting the Machine We Have

The physicist Richard Feynman famously insisted that knowing the name of something is not the same as understanding it. We know these systems are called "large language models." We know their outputs are called "reasoning" and "thinking." The research reviewed here argues, with increasing rigor, that those names are misleading—that we are watching eloquent statistics, not thought, and pattern completion, not inference.

That is not a counsel of despair. Knowing what a tool actually does is the prerequisite for using it well. An X-ray machine does not "see" in any biological sense; it detects differential absorption of ionizing radiation. That precise understanding is exactly what allows radiologists to interpret its outputs correctly, to know when to trust them and when to question them. We are in the early stages of developing an analogous understanding of LLMs.

The researchers whose work is summarized here are not AI skeptics in any wholesale sense. They are scientists doing what scientists do: probing the limits of a phenomenon, documenting its failure modes with precision, and insisting that the public vocabulary used to describe it be answerable to experimental evidence. In a technology domain saturated with superlatives and speculation, that insistence is itself a form of rigor worth celebrating.

Verified Sources & Formal Citations

  1. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
    https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
  2. Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229. [Accepted, ICLR 2025]
    https://arxiv.org/abs/2410.05229
  3. Zhang, S., et al. (2022). On the paradox of learning to reason from data. arXiv preprint arXiv:2205.11502.
    https://arxiv.org/abs/2205.11502
  4. Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems (NeurIPS 2023).
    https://arxiv.org/abs/2305.04388
  5. Lanham, T., et al. (Anthropic). (2023). Measuring faithfulness in chain-of-thought reasoning.
    https://www-cdn.anthropic.com/827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness-in-chain-of-thought-reasoning.pdf
  6. Chen, J., et al. (2025). Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410.
    https://arxiv.org/abs/2505.05410
  7. Kambhampati, S., et al. (2025). Stop anthropomorphizing intermediate tokens as reasoning/thinking traces. arXiv preprint arXiv:2504.09762.
    https://arxiv.org/abs/2504.09762
  8. Palod, P., et al. (2025). Performative thinking? The brittle correlation between CoT length and problem complexity. arXiv preprint arXiv:2509.07339.
    https://arxiv.org/abs/2509.07339
  9. Arcuschin, I., et al. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.
    https://arxiv.org/abs/2503.08679
  10. Barez, F., et al. (2025). Chain-of-thought is not explainability. Oxford AI Governance Institute.
    https://aigi.ox.ac.uk/wp-content/uploads/2025/07/Cot_Is_Not_Explainability.pdf
  11. Tanneru, S. H., et al. (2024). On the hardness of faithful chain-of-thought reasoning in large language models. arXiv preprint arXiv:2406.10625.
    https://arxiv.org/abs/2406.10625
  12. Shumailov, I., Shumaylov, Z., Zhao, Y., et al. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759.
    https://doi.org/10.1038/s41586-024-07566-y
  13. Keisha, M., et al. (2025). Knowledge collapse in LLMs: When fluency survives but facts fail under recursive synthetic training. arXiv preprint arXiv:2509.04796. [UCL/Holistic AI]
    https://arxiv.org/abs/2509.04796
  14. Gerstgrasser, M., et al. (2024). Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. arXiv preprint arXiv:2404.01413.
    https://arxiv.org/abs/2404.01413
  15. Montañez, G. D. (2017). The famine of forte: Few search problems greatly favor your algorithm. arXiv preprint arXiv:1609.05568.
    https://arxiv.org/abs/1609.05568
  16. Montañez, G. D., et al. (2019). The futility of bias-free learning and search. arXiv preprint arXiv:1907.06010.
    https://arxiv.org/abs/1907.06010
  17. Dembski, W. A. (2012). The law of conservation of information: Search processes only redistribute existing information. BIO-Complexity.
    https://bio-complexity.org/ojs/index.php/main/article/view/BIO-C.2012.4
  18. Duc, A., et al. (2025). Mathematics with large language models as provers and verifiers. arXiv preprint arXiv:2510.12829.
    https://arxiv.org/abs/2510.12829
  19. Pournemat, A., et al. (2022). Reasoning under uncertainty: Exploring probabilistic reasoning capabilities of LLMs. arXiv preprint arXiv:2205.11502.
    https://arxiv.org/abs/2205.11502
  20. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of FAccT 2021.
    https://doi.org/10.1145/3442188.3445922
  21. Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–424.
    https://doi.org/10.1017/S0140525X00005756
  22. Kostikova, A., et al. (2025). LLLMs: A data-driven survey of evolving research on limitations of large language models. arXiv preprint arXiv:2505.19240.
    https://arxiv.org/abs/2505.19240
  23. Harvard Journal of Law & Technology. (2025). Model collapse and the right to uncontaminated human-generated data.
    https://jolt.law.harvard.edu/digest/model-collapse-and-the-right-to-uncontaminated-human-generated-data
  24. Reppert, V. (2003). C. S. Lewis's Dangerous Idea: In Defense of the Argument from Reason. InterVarsity Press.
  25. Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38, 173–198.

 
