Newsletter

The Illusion of Reasoning: The Debate that's Shaking the World of AI

Apple published two devastating papers, "GSM-Symbolic" (October 2024) and "The Illusion of Thinking" (June 2025), showing how LLMs fail on small variations of classic problems (Tower of Hanoi, river crossing): performance drops when only the numerical values are altered, and success on complex Tower of Hanoi instances falls to zero. Alex Lawsen (Open Philanthropy) retorts with "The Illusion of the Illusion of Thinking," arguing the methodology was flawed: the failures reflected output token limits rather than a reasoning collapse, automatic scoring scripts misclassified partially correct outputs, and some puzzles were mathematically unsolvable. When the tests were rerun asking for recursive functions instead of exhaustive move lists, Claude, Gemini and GPT solved Tower of Hanoi with 15 disks. Gary Marcus embraces the Apple thesis on "distribution shift," but the paper's pre-WWDC timing raises strategic questions. The business implication: how much should we trust AI for critical tasks? One answer is neurosymbolic approaches: neural networks for pattern recognition and language, symbolic systems for formal logic. Example: an accounting AI understands "how much did I spend on travel?", but the SQL queries, calculations and tax checks are handled by deterministic code.
Fabio Lauria
CEO & Founder of Electe
When AI reasoning meets reality: the robot correctly applies the logical rule but identifies the basketball as an orange. A perfect metaphor for how LLMs can simulate logical processes without possessing true understanding.

In recent months, the artificial intelligence community has been through a heated debate sparked by two influential research papers published by Apple. The first, "GSM-Symbolic" (October 2024), and the second, "The Illusion of Thinking" (June 2025), questioned the alleged reasoning capabilities of Large Language Models, sparking mixed reactions across the industry.

As analyzed in our previous in-depth study on "The illusion of progress: simulating general artificial intelligence without achieving it", the issue of artificial reasoning touches the very heart of what we consider intelligence in machines.

What Apple's Research Says

Apple researchers conducted a systematic analysis of Large Reasoning Models (LRMs), models that generate detailed reasoning traces before providing an answer. The results were surprising and, for many, alarming.

The Tests Conducted

The study subjected the most advanced models to classical algorithmic puzzles such as:

  • Tower of Hanoi: A classic disk-moving puzzle invented by the mathematician Édouard Lucas in 1883
  • River crossing problems: Logic puzzles with specific constraints
  • GSM-Symbolic benchmark: Variations of grade-school math problems

Testing reasoning with classic puzzles: the farmer, wolf, goat and cabbage problem is one of the logic puzzles used in the Apple studies to assess LLMs' reasoning skills. The difficulty lies in finding the correct sequence of crossings without ever leaving the wolf alone with the goat, or the goat alone with the cabbage. A simple but effective test for distinguishing algorithmic understanding from pattern memorization.
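For readers who want to see what an algorithmic solution looks like, the puzzle reduces to a small state-space search. The sketch below is our own minimal illustration (not code from either paper): it tracks which bank each character is on and finds the shortest safe sequence of crossings with a breadth-first search.

```python
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")

def is_safe(side_of):
    """A bank is unsafe if wolf+goat or goat+cabbage are left without the farmer."""
    f = side_of["farmer"]
    if side_of["wolf"] == side_of["goat"] != f:
        return False
    if side_of["goat"] == side_of["cabbage"] != f:
        return False
    return True

def solve():
    start = tuple(0 for _ in ITEMS)   # everyone on the left bank
    goal = tuple(1 for _ in ITEMS)    # everyone on the right bank
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        side_of = dict(zip(ITEMS, state))
        f = side_of["farmer"]
        for passenger in (None, "wolf", "goat", "cabbage"):
            if passenger and side_of[passenger] != f:
                continue                        # can only carry something on the farmer's bank
            new = dict(side_of, farmer=1 - f)   # the farmer always crosses
            if passenger:
                new[passenger] = 1 - f
            nxt = tuple(new[i] for i in ITEMS)
            if is_safe(new) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [passenger or "alone"]))

print(solve())  # ['goat', 'alone', 'wolf', 'goat', 'cabbage', 'alone', 'goat']
```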

Controversial Results

The results showed that even small changes in problem formulation lead to significant changes in performance, suggesting a worrying fragility in reasoning. As reported in AppleInsider's coverage, "performance of all models decreases when only numerical values in the GSM-Symbolic benchmark questions are altered."
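The idea behind GSM-Symbolic is easy to picture: the same grade-school problem is regenerated from a template with different names and numbers, and a system that genuinely follows the arithmetic should score the same on every variant. Below is a toy illustration of the approach; the template and helper are our own, not Apple's actual benchmark code.

```python
import random

TEMPLATE = (
    "{name} buys {n_packs} packs of pencils with {per_pack} pencils each, "
    "then gives away {given}. How many pencils are left?"
)

def make_variant(seed: int):
    """Generate one GSM-Symbolic-style variant: same structure, different surface values."""
    rng = random.Random(seed)
    name = rng.choice(["Sofia", "Liam", "Mei", "Omar"])
    n_packs, per_pack = rng.randint(2, 9), rng.randint(3, 12)
    given = rng.randint(1, n_packs * per_pack - 1)
    question = TEMPLATE.format(name=name, n_packs=n_packs, per_pack=per_pack, given=given)
    answer = n_packs * per_pack - given   # ground truth computed symbolically
    return question, answer

# A robust reasoner should be equally accurate on every variant of the same template.
for seed in range(3):
    question, answer = make_variant(seed)
    print(question, "->", answer)
```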

The Counter-Offensive: "The Illusion of the Illusion of Thinking"

The response from the AI community was not long in coming. Alex Lawsen of Open Philanthropy, in collaboration with Anthropic's Claude Opus, published a detailed rebuttal entitled "The Illusion of the Illusion of Thinking", challenging the methodology and conclusions of the Apple study.

The Main Objections

  1. Output Limits Ignored: Many failures attributed to "reasoning collapse" were actually due to model output token limits
  2. Incorrect Evaluation: Automatic scoring scripts classified partial but algorithmically correct outputs as total failures
  3. Impossible Problems: Some puzzles were mathematically unsolvable, but models were penalized for not solving them

The Confirmation Tests

When Lawsen repeated the tests with alternative methodologies, asking models to generate recursive functions instead of listing all the moves, the results were dramatically different. Models such as Claude, Gemini and GPT correctly solved Tower of Hanoi problems with 15 disks, far beyond the complexity at which Apple reported zero successes.
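The distinction matters because the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves: with 15 disks that is 32,767 moves, which collides with output token limits long before any reasoning limit. A recursive description of the same solution fits in a few lines, as in the sketch below (our own illustration of the kind of answer the rebuttal accepted, not Lawsen's exact prompt or model output).

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks: 2**n - 1 moves in total."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # park the top n-1 disks on the spare peg
    yield (source, target)                           # move the largest disk directly
    yield from hanoi(n - 1, spare, target, source)   # bring the n-1 disks back on top of it

moves = list(hanoi(15))
print(len(moves))   # 32767 == 2**15 - 1
print(moves[:3])    # [('A', 'C'), ('A', 'B'), ('C', 'B')]
```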

The Authoritative Voices in the Debate

Gary Marcus: The Historical Critic

Gary Marcus, a longtime critic of LLMs' reasoning skills, embraced the Apple results as confirmation of a thesis he has defended for two decades. According to Marcus, LLMs continue to struggle with "distribution shift", the challenge of generalizing beyond the training data, while remaining "good solvers of problems already solved".

The LocalLlama Community

The discussion has also spread to specialized communities such as LocalLlama on Reddit, where developers and researchers debate the practical implications for open-source models and local implementation.

Beyond Controversy: What It Means for Businesses

Strategic Implications

This debate is not purely academic. It has direct implications for:

  • AI Deployment in Production: How much can we trust models for critical tasks?
  • R&D investment: Where to focus resources for the next breakthrough?
  • Communication with Stakeholders: How to manage realistic expectations about AI capabilities?

The Neurosymbolic Way

As highlighted in several technical analyses, there is a growing need for hybrid approaches that combine:

  • Neural networks for pattern recognition and language understanding
  • Symbolic systems for algorithmic reasoning and formal logic

A simple example: an AI assistant helping with accounting. The language model understands when you ask "how much did I spend on travel this month?" and extracts the relevant parameters (category: travel, period: this month). But the SQL query against the database, the calculation of the sum and the check of the tax constraints? Those are handled by deterministic code, not by the neural model.
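Here is a minimal sketch of that division of labor. The function names and database schema are hypothetical, and the extraction step is stubbed where a language model would sit in a real system.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class ExpenseQuery:
    category: str
    month: str  # "YYYY-MM"

def extract_parameters(user_question: str) -> ExpenseQuery:
    """Neural side (stubbed): an LLM would map free text to structured parameters."""
    # "how much did I spend on travel this month?" -> category=travel, month=2025-06
    return ExpenseQuery(category="travel", month="2025-06")

def sum_expenses(conn: sqlite3.Connection, q: ExpenseQuery) -> float:
    """Symbolic side: deterministic SQL does the arithmetic, not the model."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM expenses "
        "WHERE category = ? AND strftime('%Y-%m', spent_on) = ?",
        (q.category, q.month),
    ).fetchone()
    return row[0]

# Demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (category TEXT, amount REAL, spent_on TEXT)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [("travel", 420.0, "2025-06-03"), ("travel", 180.5, "2025-06-21"), ("meals", 35.0, "2025-06-10")],
)

params = extract_parameters("how much did I spend on travel this month?")
print(sum_expenses(conn, params))  # 600.5
```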

Timing and Strategic Context

It has not escaped observers' notice that the Apple paper was released just before WWDC, raising questions about strategic motivations. As noted in 9to5Mac's analysis, "the timing of the Apple paper-just before WWDC-raised a few eyebrows. Was this a research milestone, or a strategic move to reposition Apple in the broader AI landscape?"

Lessons for the Future

For Researchers

  • Experimental Design: The importance of distinguishing between architectural limitations and implementation constraints
  • Rigorous Assessment: The need for sophisticated benchmarks that separate cognitive capabilities from practical constraints
  • Methodological Transparency: The requirement to fully document experimental setups and limitations

For Companies

  • Realistic Expectations: Recognizing current limitations without giving up future potential
  • Hybrid Approaches: Investing in solutions that combine strengths of different technologies
  • Continuous Evaluation: Implementing testing systems that reflect real-world usage scenarios

Conclusions: Navigating Uncertainty

The debate arising from the Apple papers reminds us that we are still in the early stages of understanding artificial intelligence. As pointed out in our previous article, the distinction between simulation and authentic reasoning remains one of the most complex challenges of our time.

The real lesson is not whether or not LLMs can "reason" in the human sense of the term, but rather how we can build systems that exploit their strengths while compensating for their limitations. In a world where AI is already transforming entire sectors, the question is no longer whether these tools are "smart," but how to use them effectively and responsibly.

The future of enterprise AI will probably lie not in a single revolutionary approach, but in the intelligent orchestration of several complementary technologies. And in this scenario, the ability to evaluate our tools' capabilities critically and honestly becomes a competitive advantage in itself.

For insights into your organization's AI strategy and implementation of robust solutions, our team of experts is available for customized consultations.

Data science has turned the paradigm on its head: outliers are no longer "errors to be eliminated" but valuable information to be understood. A single outlier can completely distort a linear regression model-change the slope from 2 to 10-but eliminating it could mean losing the most important signal in the dataset. Machine learning introduces sophisticated tools: Isolation Forest isolates outliers by building random decision trees, Local Outlier Factor analyzes local density, Autoencoders reconstruct normal data and report what they cannot reproduce. There are global outliers (temperature -10°C in tropics), contextual outliers (spending €1,000 in poor neighborhood), collective outliers (synchronized spikes traffic network indicating attack). Parallel with Gladwell: the "10,000 hour rule" is disputed-Paul McCartney dixit "many bands have done 10,000 hours in Hamburg without success, theory not infallible." Asian math success is not genetic but cultural: Chinese number system more intuitive, rice cultivation requires constant improvement vs Western agriculture territorial expansion. Real applications: UK banks recover 18% potential losses via real-time anomaly detection, manufacturing detects microscopic defects that human inspection would miss, healthcare valid clinical trials data with 85%+ sensitivity anomaly detection. Final lesson: as data science moves from eliminating outliers to understanding them, we must see unconventional careers not as anomalies to be corrected but as valuable trajectories to be studied.