Newsletter

Evolution of LLMs: a brief overview of the market

Less than 2 percentage points separate the top LLMs on key benchmarks: the technology war has ended in a tie. The real 2025 battle is being fought over ecosystems, deployment, and cost, and DeepSeek proved it can compete at $5.6M against GPT-4's $78-191M. ChatGPT dominates the brand (76% awareness) even though Claude wins 65% of technical benchmarks. For companies, the winning strategy is not to choose "the best model" but to orchestrate complementary models for different use cases.

The War of Language Models 2025: From Technical Parity to the Battle of Ecosystems

The development of Large Language Models has reached a critical tipping point in 2025: the competition is no longer played out on the fundamental capabilities of the models, now essentially equivalent on the main benchmarks, but on ecosystem, integration, and deployment strategy. While Anthropic's Claude Sonnet 4.5 maintains narrow margins of technical superiority on specific benchmarks, the real battle has shifted to different terrain.

The Technical Draw: When the Numbers Equalize

MMLU (Massive Multitask Language Understanding) Benchmark

  • Claude Sonnet 4.5: 88.7%
  • GPT-4o: 88.0%
  • Gemini 2.0 Flash: 86.9%
  • DeepSeek-V3: 87.1%

The differences are marginal: less than 2 percentage points separate the top performers. According to Stanford's AI Index Report 2025, "the convergence of core capabilities of language models represents one of the most significant trends of 2024-2025, with profound implications for the competitive strategies of AI companies."

Reasoning Ability (GPQA Diamond)

  • Claude Sonnet 4: 65.0%
  • GPT-4o: 53.6%
  • Gemini 2.0 Pro: 59.1%

Claude maintains a significant advantage on complex reasoning tasks, but GPT-4o excels in response speed (1.2s average latency vs. Claude's 2.1s) and Gemini in native multimodal processing.

The DeepSeek Revolution: The Chinese Game-Changer

January 2025 saw the disruptive entry of DeepSeek-V3, which demonstrated that competitive models can be developed for $5.6 million, versus $78-191 million for GPT-4 and Gemini Ultra. Marc Andreessen called it "one of the most amazing breakthroughs-and as open source, a profound gift to the world."

DeepSeek-V3 specifications:

  • 671 billion total parameters (37B active via Mixture-of-Experts)
  • Training cost: $5.576M
  • Performance: outperforms GPT-4o on some mathematical benchmarks
  • Architecture: Multi-head Latent Attention (MLA) + DeepSeekMoE

The impact: Nvidia shares fell 17% in a single post-announcement session, as the market reevaluated the entry barriers to model development.

Public Perception vs. Technical Reality

ChatGPT maintains unchallenged dominance in brand awareness: Pew Research Center research (Feb. 2025) shows that 76% of Americans associate "conversational AI" exclusively with ChatGPT, while only 12% know Claude and 8% actively use Gemini.

The paradox: Claude Sonnet 4 beats GPT-4o on 65% of technical benchmarks but holds only 8% consumer market share against ChatGPT's 71% (Similarweb data, March 2025).

Google responds with massive integration: native Gemini 2.0 in Search, Gmail, Docs, and Drive, an ecosystem strategy rather than a standalone product. Its 2.1 billion Google Workspace users represent instant deployment without customer acquisition.

Computer Use and Agents: The Next Frontier

Claude Computer Use (beta October 2024, production Q1 2025)

  • Capabilities: direct mouse/keyboard control, browser navigation, application interaction
  • Adoption: 12% of Anthropic's enterprise clients run computer use in production
  • Limitations: still a 14% failure rate on complex multi-step tasks

GPT-4o with Vision and Actions

  • Zapier integration: 6,000+ controllable apps
  • Custom GPTs: 3 million published, 800K in active use
  • Revenue sharing for GPT creators: $10M distributed in Q4 2024

Gemini Deep Research (January 2025)

  • Autonomous multi-source research with comparative analysis
  • Generates complete reports from single prompt
  • Average time: 8-12 minutes per 5000+ word report

Gartner predicts that 33% of knowledge workers will use autonomous AI agents by the end of 2025, up from 5% today.

Philosophical Differences on Security

OpenAI: "Safety Through Restriction"

  • Rejects 8.7% of consumer prompts (internally leaked OpenAI data)
  • Strict content policy drives 23% of developers to churn to alternatives
  • Public Preparedness Framework with continuous red-teaming

Anthropic: "Constitutional AI"

  • Model trained on explicit ethical principles
  • Selective refusal: 3.1% of prompts (more permissive than OpenAI)
  • Transparent decision-making: the model explains why it refuses requests

Google: "Maximum Safety, Minimum Controversy"

  • The market's tightest filters: 11.2% of prompts blocked
  • The February 2024 Gemini image-generation failure (bias overcorrection) drives extreme caution
  • Enterprise focus reduces risk tolerance

Meta Llama 3.1 takes the opposite philosophy: zero built-in filters, with responsibility on the implementer.

Vertical Specialization: The True Differentiator

Healthcare:

  • Med-PaLM 2 (Google): 85.4% on MedQA (vs. 77% for the best human doctors)
  • Claude in Epic Systems: adopted by 305 US hospitals for clinical decision support

Legal:

  • Harvey AI (customized GPT-4): 102 law firm clients, including top-100 firms; $100M ARR
  • CoCounsel (Thomson Reuters + Claude): 98% accuracy in legal research

Finance:

  • Bloomberg GPT: trained on 363B proprietary financial tokens
  • Goldman Sachs Marcus AI (GPT-4 base): approves loans 40% faster

Verticalization generates 3.5x willingness-to-pay vs generic models (McKinsey survey, 500 enterprise buyers).

Llama 3.1: Meta's Open Source Strategy

405B parameters, capabilities competitive with GPT-4o on many benchmarks, fully open weights. Meta's strategy: commoditize the infrastructure layer to compete on the product layer (Ray-Ban Meta glasses, WhatsApp AI).

Llama 3.1 adoption:

  • 350K+ downloads in the first month
  • 50+ startups building vertical AI products on Llama
  • Self-managed hosting cost: $12K/month vs. $50K+ in API costs for closed models at equivalent usage

Counterintuitively, Meta loses billions on Reality Labs yet invests massively in open AI to protect its core advertising business.

Context Windows: The Race for Millions of Tokens

  • Claude Sonnet 4.5: 200K tokens
  • Gemini 2.0 Pro: 2M tokens (the longest commercially available)
  • GPT-4 Turbo: 128K tokens

Gemini's 2M-token context makes it possible to analyze entire codebases, 10+ hours of video, or thousands of pages of documentation: a transformative enterprise use case. Google Cloud reports that 43% of enterprise POCs use contexts above 500K tokens.

Adaptability and Customization

Claude Projects & Styles:

  • Persistent custom instructions across conversations
  • Style presets: Formal, Concise, Explanatory
  • Knowledge-base uploads (up to 5GB of documents)

GPT Store & Custom GPTs:

  • 3M GPTs published, 800K in active monthly use
  • The top creator earns $63K/month (revenue sharing)
  • 71% of enterprises use ≥1 custom GPT internally

Gemini Extensions:

  • Native integration with Gmail, Calendar, Drive, and Maps
  • Workspace context: reads email and calendar for proactive suggestions
  • 1.2B workspace actions executed in Q4 2024

The key shift: from "single prompt" to "persistent assistant with cross-session memory and context."

Q1 2025 Developments and Future Trajectories

Trend 1: Mixture-of-Experts Dominance

All top-tier 2025 models use MoE, activating only a subset of parameters per query:

  • Reduced inference costs 40-60%
  • Better latency while maintaining quality
  • DeepSeek, GPT-4, Gemini Ultra all MoE-based
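
The routing mechanism behind those savings can be sketched in a few lines. This is a toy illustration, not any vendor's implementation: real MoE layers such as DeepSeekMoE use learned gating networks, feed-forward expert blocks, and load-balancing losses omitted here.

```python
# Toy Mixture-of-Experts forward pass with top-k routing (illustrative only).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_scores, k=2):
    """Route input x to the top-k experts by gate score and combine
    their outputs, weighted by renormalized gate probabilities."""
    probs = softmax(gate_scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top_k)
    # Only k experts actually run: this is why MoE cuts inference cost.
    return sum(probs[i] / total * experts[i](x) for i in top_k)

# Toy experts: simple functions standing in for feed-forward blocks.
experts = [lambda x: x * 2, lambda x: x + 10, lambda x: x ** 2, lambda x: -x]
gate_scores = [0.1, 2.0, 1.5, -1.0]  # produced by a learned router in practice
y = moe_forward(3.0, experts, gate_scores, k=2)
```

With 4 experts and k=2, only half the "parameters" run per query; DeepSeek-V3's 37B-active-of-671B ratio applies the same idea at scale.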

Trend 2: Native Multimodality

Gemini 2.0 is natively multimodal (not separate glued-together modules):

  • Simultaneously understands text, images, audio, and video
  • Cross-modal reasoning: "compare the architectural style in a building photo with a textual description of its historical period"

Trend 3: Test-Time Compute (Reasoning Models)

OpenAI o1 and DeepSeek-R1 use more processing time for complex reasoning:

  • o1: 30-60s for a complex math problem vs. 2s for GPT-4o
  • AIME 2024 accuracy: 83.3% vs. GPT-4o's 13.4%
  • An explicit latency/accuracy trade-off

Trend 4: Agentic Workflows

Anthropic's Model Context Protocol (MCP), November 2024:

  • Open standard for AI agents to interact with tools/databases
  • 50+ partners adopted it in the first 3 months
  • Allows agents to build persistent "memory" across interactions

Cost and Pricing Wars

API Pricing for 1M tokens (input):

  • GPT-4o: $2.50
  • Claude Sonnet 4: $3.00
  • Gemini 2.0 Flash: $0.075 (33x cheaper than GPT-4o)
  • DeepSeek-V3: $0.27 (open weights; self-hosting costs vary)
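
At scale, those per-token differences compound. A back-of-the-envelope calculation of monthly input-token costs (the 500M tokens/month workload is a hypothetical figure, and output-token prices, not shown here, differ):

```python
# Monthly input-token cost at a hypothetical 500M tokens/month,
# using the per-1M-token input prices listed above (output pricing differs).
prices_per_1m = {
    "GPT-4o": 2.50,
    "Claude Sonnet 4": 3.00,
    "Gemini 2.0 Flash": 0.075,
    "DeepSeek-V3": 0.27,
}

monthly_tokens = 500_000_000  # hypothetical high-volume workload

costs = {model: monthly_tokens / 1_000_000 * p for model, p in prices_per_1m.items()}
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${cost:,.2f}/month")
```

At this volume GPT-4o comes to $1,250/month against Gemini Flash's $37.50, the roughly 33x gap the price list implies.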

Gemini Flash case study: an AI-summarization startup cut costs 94% by switching from GPT-4o, with the same quality and comparable latency.

Commoditization accelerates: inference costs fell 70% year-on-year from 2023 to 2024 (Epoch AI data).

Strategic Implications for Companies

Decision Framework: Which Model to Choose?

Scenario 1: Enterprise Safety-Critical → Claude Sonnet 4

  • Healthcare, legal, and finance, where mistakes cost millions
  • Constitutional AI reduces liability risks
  • Premium pricing justified by risk mitigation

Scenario 2: High-Volume, Cost-Sensitive → Gemini Flash or DeepSeek

  • Customer service chatbots, content moderation, classification
  • "Good enough" performance at 10x-100x volume
  • Cost is the main differentiator

Scenario 3: Ecosystem Lock-In → Gemini for Google Workspace, GPT for Microsoft

  • Already invested in ecosystem
  • Native integration outweighs marginal performance advantages
  • Employees are already trained on the existing platform

Scenario 4: Customization/Control → Llama 3.1 or open DeepSeek

  • Specific compliance requirements (data residency, audit)
  • Heavy fine-tuning on proprietary data
  • Self-hosting is cheap at volume
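
The four scenarios above can be condensed into a simple routing policy. The sketch below is hypothetical: the model names, volume threshold, and priority order are illustrative choices, not a production recommendation.

```python
# Hypothetical model router implementing the four scenarios above.
# Names, threshold, and check order are illustrative assumptions.
def choose_model(task):
    """task: dict with optional keys 'safety_critical' (bool),
    'monthly_volume' (tokens/month), 'ecosystem' ('google'|'microsoft'|None),
    'needs_self_hosting' (bool)."""
    if task.get("needs_self_hosting"):
        # Scenario 4: compliance/control is a hard constraint, so check it first
        return "Llama 3.1 (self-hosted)"
    if task.get("safety_critical"):
        # Scenario 1: high-stakes reasoning justifies premium pricing
        return "Claude Sonnet 4"
    if task.get("monthly_volume", 0) > 100_000_000:
        # Scenario 2: cost dominates at volume (threshold is illustrative)
        return "Gemini 2.0 Flash"
    if task.get("ecosystem") == "google":
        # Scenario 3: native integration beats marginal performance
        return "Gemini 2.0 Pro"
    if task.get("ecosystem") == "microsoft":
        return "GPT-4o"
    return "GPT-4o"  # default for generic tasks
```

A classification workload at 500M tokens/month would route to Gemini Flash, while a clinical decision-support query would route to Claude regardless of volume, because the safety check sits above the cost check.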

Conclusion: From Technology War to Platform War

The 2025 competition among LLMs is no longer about "which model reasons best" but "which ecosystem captures the most value." OpenAI dominates the consumer brand, Google leverages billion-user distribution, Anthropic wins the safety-conscious enterprise, and Meta commoditizes the infrastructure.

Predictions for 2026-2027:

  • Further convergence in core performance (~90% MMLU for all top-5 models)
  • Differentiation on speed, cost, integrations, and vertical specialization
  • Multi-step autonomous agents go mainstream (33% of knowledge workers)
  • Open source closes the quality gap while maintaining its cost/customization advantage

The final winner? Probably no single player, but complementary ecosystems serving different use-case clusters. As with smartphone OSes (iOS and Android coexist), it is not "winner takes all" but "winner takes segment."

For enterprises, a multi-model strategy becomes the standard: GPT for generic tasks, Claude for high-stakes reasoning, Gemini Flash for volume, and custom-tuned Llama for proprietary data.

The year 2025 is not the year of the "best model" but of intelligent orchestration between complementary models.

Sources:

  • Stanford AI Index Report 2025
  • Anthropic Model Card Claude Sonnet 4.5
  • OpenAI GPT-4o Technical Report
  • Google DeepMind Gemini 2.0 System Card
  • DeepSeek-V3 Technical Paper (arXiv)
  • Epoch AI - Trends in Machine Learning
  • Gartner AI & Analytics Summit 2025
  • McKinsey State of AI Report 2025
  • Pew Research Center AI Adoption Survey
  • Similarweb Platform Intelligence
