


In recent months, the artificial intelligence community has been gripped by a heated debate sparked by two influential research papers from Apple: "GSM-Symbolic" (October 2024) and "The Illusion of Thinking" (June 2025). Both question the alleged reasoning capabilities of Large Language Models, and both drew mixed reactions across the industry.
As we analyzed in our previous in-depth study, "The illusion of progress: simulating general artificial intelligence without achieving it," the question of artificial reasoning touches the very heart of what we consider intelligence in machines.
Apple's researchers conducted a systematic analysis of Large Reasoning Models (LRMs), the models that generate detailed reasoning traces before providing an answer. The results were surprising and, for many, alarming.
The study subjected the most advanced models to classic algorithmic puzzles, the four used in "The Illusion of Thinking": Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.

The results showed that even small changes in problem formulation led to significant swings in performance, suggesting a worrying fragility in the models' reasoning. As reported in AppleInsider's coverage, "performance of all models decreases when only numerical values in the GSM-Symbolic benchmark questions are altered."
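To make the idea concrete, here is a toy illustration of the GSM-Symbolic approach: one problem template instantiated with different numbers. This is a sketch, not Apple's actual template set; the template text, the name, and the value ranges are invented for the example.

```python
import random

# Toy GSM-Symbolic-style generator (illustrative, not Apple's templates):
# one template, many numeric instantiations. A model that genuinely
# reasons should answer every variant equally well.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have in total?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name="Sofia", a=a, b=b)
    return question, a + b  # the ground truth shifts with the numbers

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Apple's finding, phrased in these terms, is that accuracy drops when only the numbers change, even though the reasoning required is identical across variants.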
The response from the AI community was not long in coming. Alex Lawsen of Open Philanthropy, in collaboration with Anthropic's Claude Opus, published a detailed rebuttal entitled "The Illusion of the Illusion of Thinking," challenging the methodology and conclusions of the Apple study.
When Lawsen repeated the tests with alternative methodologies, asking the models to generate recursive functions instead of listing every move one by one, the results were dramatically different. Models such as Claude, Gemini, and GPT correctly solved Tower of Hanoi problems with 15 disks, far beyond the complexity at which Apple had reported zero successes.
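For context, this is the kind of compact program the rebuttal asked models to produce instead of a move-by-move listing; a sketch using the standard recursive formulation of Tower of Hanoi:

```python
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    # Move the top n disks from source to target, using spare as a buffer.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks on top

moves: list = []
hanoi(15, "A", "C", "B", moves)
print(len(moves))  # 32767, i.e. 2**15 - 1 moves
```

The contrast captures the core of the disagreement: the 15-disk solution is a few lines of code, but spelling out all 32,767 moves exceeds any practical output budget, which is exactly what Lawsen argued Apple's protocol was really measuring.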
Gary Marcus, a longtime critic of LLMs' reasoning abilities, embraced the Apple results as confirmation of a thesis he has defended for twenty years. According to Marcus, LLMs continue to struggle with "distribution shift" (generalizing beyond their training data) while remaining "good solvers of problems already solved."
The discussion has also spread to specialized communities such as LocalLlama on Reddit, where developers and researchers debate the practical implications for open-source models and local deployment.
This debate is not purely academic: it has direct implications for how AI systems are designed and deployed in production. As several technical analyses have pointed out, what is needed are hybrid approaches that combine language models for understanding and intent extraction with deterministic code for computation and verification.
A simple example: an AI assistant that helps with accounting. The language model understands the question "how much did I spend on travel this month?" and extracts the relevant parameters (category: travel; period: this month). But the SQL query that actually interrogates the database, computes the sum, and checks the fiscal constraints? That is handled by deterministic code, not by the neural model.
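A minimal sketch of that division of labor, assuming a hypothetical `expenses` table and a parameter dict standing in for the model's output (in a real system it would come from an LLM call with a structured-output schema):

```python
import sqlite3

# Stand-in for the language-model step: the LLM only parses the user's
# question into structured parameters; it never performs the arithmetic.
extracted = {"category": "travel", "start": "2025-06-01", "end": "2025-06-30"}

def sum_expenses(conn: sqlite3.Connection, category: str, start: str, end: str) -> float:
    """Deterministic aggregation: the database engine, not the model, does the math."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM expenses"
        " WHERE category = ? AND date BETWEEN ? AND ?",
        (category, start, end),
    ).fetchone()
    return row[0]

# In-memory fixture so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (date TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [("2025-06-03", "travel", 120.00),
     ("2025-06-17", "travel", 85.50),
     ("2025-06-20", "meals", 30.00)],
)

total = sum_expenses(conn, extracted["category"], extracted["start"], extracted["end"])
print(f"Travel expenses this month: {total:.2f}")  # -> 205.50
```

The model's only job is translating language into parameters; the correctness of the sum is guaranteed by the SQL engine rather than by next-token prediction.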
It has not escaped observers' notice that the Apple paper was released just before WWDC, raising questions about strategic motivations. As noted in 9to5Mac's analysis, "the timing of the Apple paper-just before WWDC-raised a few eyebrows. Was this a research milestone, or a strategic move to reposition Apple in the broader AI landscape?"
The debate arising from the Apple papers reminds us that we are still in the early stages of understanding artificial intelligence. As pointed out in our previous article, the distinction between simulation and authentic reasoning remains one of the most complex challenges of our time.
The real lesson is not whether or not LLMs can "reason" in the human sense of the term, but rather how we can build systems that exploit their strengths while compensating for their limitations. In a world where AI is already transforming entire sectors, the question is no longer whether these tools are "smart," but how to use them effectively and responsibly.
The future of enterprise AI will probably lie not in a single revolutionary approach, but in the intelligent orchestration of several complementary technologies. And in that scenario, the ability to evaluate the capabilities of our tools critically and honestly becomes a competitive advantage in itself.
For insights into your organization's AI strategy and implementation of robust solutions, our team of experts is available for customized consultations.