In 2024, a research team from Microsoft and Carnegie Mellon University surveyed roughly 660 knowledge workers about their use of generative AI at work. The study became, almost immediately, the canonical empirical reference for a claim the literature had been making since 2016: that high reliance on AI is associated with a measurable reduction in critical thinking.1 The finding is so consequential that it deserves its own article, even though it overlaps with several others.
What the study actually found
The methodology was self-report: questionnaires, think-aloud protocols, and a structured rubric for evaluating critical-thinking exercises. Self-report has known limitations, but the study controlled for many of them, and the patterns were robust.
The core findings, in the authors’ framing:
- Higher trust in AI tracks with less effort spent on critical thinking. Workers who reported high confidence in AI outputs spent less time evaluating, verifying, and questioning those outputs.
- Effort shifts from execution to oversight. When workers did engage critically, the engagement clustered around accepting/rejecting/editing AI outputs rather than producing original analysis.
- The shift correlates with self-reported “cognitive atrophy.” Workers using AI heavily reported that their independent analytical skills felt weaker than before. This is subjective, but the convergence with the more objective behavioral measures was the headline.
The phrase the paper used to summarize the dynamic: critical thinking under heavy AI use becomes more about evaluating the AI’s output and less about formulating one’s own. Both are forms of thinking. They are not the same form. The first does not preserve the second.
What “critical thinking” actually is
Pause on the term. In the educational literature, critical thinking has at least four components, and AI affects each differently.
- Independent reasoning. Forming a position from one’s own materials, before consulting authorities. This is the component most directly affected by AI use; the default workflow is to ask first and reason later.
- Source evaluation. Judging the credibility of evidence. AI cuts both ways here: it accelerates retrieval, but it obscures provenance and produces fluent text that looks sourced when it is not.
- Argument analysis. Identifying the structure of a claim and where it might fail. AI can help here (it can lay out an argument cleanly), but its own outputs often have argument-shaped surfaces with structural gaps.
- Metacognition. Awareness of one’s own reasoning, including its characteristic failure modes. This is the one component AI does not engage directly — and the one most needed if the user is to remain critical of the AI itself.
The literature suggests AI use erodes (1) most directly, erodes (2) and (3) inconsistently, and leaves (4) untouched, except that without active practice, (4) atrophies on its own.
The structural problem
The Microsoft / CMU paper’s deepest contribution is to name the structural problem behind the empirical finding. Generative AI raises the cost of critical engagement relative to the cost of acceptance. To accept an AI output is to do nothing. To engage critically with an AI output is to do something — read carefully, identify weak points, verify claims, formulate counterarguments. The asymmetry favors acceptance.
In a normal information environment — a book, a teacher, a colleague — the same asymmetry exists in milder form. What changes with AI is that the volume of plausible material to be evaluated rises sharply, and the cues that mark some of it as suspect are weakened by the uniform fluency of the medium. A bad argument from a colleague reads as a bad argument; a bad argument from an LLM reads as smoothly as a good one.
What can be done
The paper’s policy implications are modest, as befits authors who are researchers rather than legislators. They are also widely accepted across the literature. Three lines of response recur:
Education. Teach the structural features of LLM outputs explicitly — including their characteristic failure modes — so that users can develop critical habits that match the medium.
Tool design. Surface AI uncertainty and reasoning, not just outputs. Models that show their work give the user something to engage critically with. Models that hide it take that engagement off the table.
Workflow. Build in moments where the user is required to think independently — drafting before querying, evaluating before accepting, reviewing without the model in the loop. The constraint is artificial, but the practice it preserves is real.
A closing observation
There is a temptation, reading the empirical literature, to conclude that generative AI is bad for thinking. The conclusion is too strong. What is defensible is the weaker, more useful claim: under default workflows, in populations with default training, generative AI is associated with reductions in critical thinking. The claim is conditional on workflows and training. Both are within reach.
The encyclopedia’s argument here, which it returns to in Reading the Whole Argument (F.40), is that those workflows and that training cannot be left to chance. Default behavior is what scales. If the default is uncritical acceptance, that is what the population gets. If the default is informed engagement, that is what the population gets. Neither is more natural than the other; both are the result of choices, made repeatedly, mostly without noticing.
Footnotes
1. Lee et al. (Microsoft / CMU), 2024. The 666-participant knowledge-worker study.