Newsletter

The Strawberry Problem

"How many 'r's' in strawberry?" - GPT-4o answers "two," a six-year-old knows it is three. The problem is tokenization: the model sees [str][aw][berry], not letters. OpenAI didn't solve it with o1-it got around it by teaching the model to "think before you speak." Result: 83% vs. 13% in Math Olympiad, but 30 seconds instead of 3 and triple the cost. Language models are extraordinary probabilistic tools-but you still need a human to count.

From the "Strawberry" Problem to Model o1: How OpenAI Solved (Partially) the Limit of Tokenization

In the summer of 2024, a viral internet meme embarrassed the world's most advanced language models: "How many 'r's are in the word 'strawberry'?" The correct answer is three, but GPT-4o stubbornly answered "two." It was a seemingly trivial error, but it revealed a fundamental limitation of language models: their inability to analyze the individual letters inside words.

On September 12, 2024, OpenAI released o1 (internally code-named "Strawberry"), the first model in its new series of "reasoning models," designed specifically to overcome this kind of limitation. And yes, the name is no accident: as an OpenAI researcher confirmed, o1 can finally count the 'r's in "strawberry" correctly.

But the solution is not what the original article envisioned. OpenAI did not "teach" the model to analyze words letter by letter. Instead, it developed a completely different approach: teaching the model to "reason" before responding.

The Counting Problem: Why Models Get It Wrong

The problem remains rooted in tokenization, the fundamental process by which language models ingest text. As a technical paper posted on arXiv in May 2025 ("The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models") explains, the models see words not as sequences of letters but as "tokens": units of meaning converted into numbers.

When GPT-4 processes the word "strawberry," its tokenizer divides it into three parts, [str][aw][berry], each with a specific numeric ID (496, 675, 15717). For the model, "strawberry" is not a sequence of 10 letters but a sequence of 3 numeric tokens. It is as if the model were reading a book in which every word has been replaced by a code, and someone then asked it to count the letters in a code it has never seen spelled out.

The problem is compounded with compound words. "Timekeeper" is broken into separate tokens, making it impossible for the model to determine the exact position of each letter without an explicit reasoning process. This fragmentation affects not only letter counting but also the model's grasp of the internal structure of words.
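To see the split for yourself, here is a minimal sketch using the open-source tiktoken library; note that the article does not name a specific tokenizer, and the exact boundaries and token IDs vary from one encoding to another:

    # Sketch: inspect how a subword tokenizer splits words.
    # The exact pieces and IDs depend on the encoding you load.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

    for word in ["strawberry", "timekeeper"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces, ids)

    # The model receives the ID sequence, never the 10 individual
    # letters of "strawberry", which is why "count the 'r's" is hard.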

The o1 Solution: Reasoning Before Responding

OpenAI o1 addressed the problem in an unexpected way: instead of changing the tokenization (technically difficult, and it would compromise the model's efficiency), OpenAI taught the system to "think before it speaks" using a technique called "chain-of-thought reasoning."

When you ask o1 how many 'r's are in "strawberry," the model does not answer immediately. It spends several seconds, sometimes even minutes for complex questions, internally working through a "chain of reasoning" that is hidden from the user. This process allows it to:

  1. Recognize that the question requires character-level analysis
  2. Develop a strategy for breaking down the word
  3. Test the response through different approaches
  4. Correct any errors before providing the final answer

As OpenAI researcher Noam Brown explained in a series of posts on X, "o1 is trained with reinforcement learning to 'think' before responding via a private chain of thought." The model receives rewards during training for each correct step in the reasoning process, not just for the final correct answer.

The results are impressive but costly. On a qualifying exam for the International Mathematics Olympiad, o1 solved 83 percent of the problems correctly versus 13 percent for GPT-4o. On doctoral-level science questions, it reached 78 percent accuracy against 56 percent for GPT-4o. But this power comes at a price: o1 takes 30+ seconds to answer questions that GPT-4o handles in 3, and it costs $15 per million input tokens versus $5 for GPT-4o.

Chain of Thought: How It Really Works

The technique is not magical but methodical. When it receives a prompt, o1 internally generates a long sequence of "thoughts" that are not shown to the user. For the 'r' problem in "strawberry," the internal process might be:

"First I need to understand the word structure. Strawberry could be tokenized as [str][aw][berry]. To count the 'r's, I need to reconstruct the complete word at the character level. Str contains: s-t-r (1 'r'). Aw contains: a-w (0 'r'). Berry contains: b-e-r-r-y (2 'r'). Total: 1+0+2 = 3 'r'. I check: strawberry = s-t-r-a-w-b-e-r-r-y. I count the 'r': position 3, position 8, position 9. Confirmed: 3 'r's."

This internal reasoning is hidden by design. OpenAI explicitly prohibits users from attempting to reveal o1's chain of thought, monitoring prompts and potentially revoking access to those who violate this rule. The company cites reasons of AI security and competitive advantage, but the decision has been criticized as a loss of transparency by developers working with language models.

Persistent Limits: o1 Is Not Perfect

Despite the progress, o1 has not completely solved the problem. Tests published on Language Log in January 2025 put various models through a more complex challenge: "Write a paragraph where the second letter of each sentence makes up the word 'CODE'."

Standard o1 ($20/month) failed, mistakenly treating the first letter of each sentence's opening word as the "second letter." o1-pro ($200/month) got it right, after 4 minutes and 10 seconds of "thinking." DeepSeek R1, the Chinese model that shook up the market in January 2025, made the same mistake as standard o1.
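Part of what makes this challenge interesting is that the constraint is trivial to verify with deterministic code, even though the models struggle to satisfy it. Here is a minimal checker, assuming the natural reading of "second letter of each sentence" (this sketch is not taken from the Language Log post):

    import re

    def second_letters(paragraph):
        """Return the second alphabetic character of each sentence, uppercased."""
        sentences = [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]
        letters = []
        for sentence in sentences:
            alpha = [ch for ch in sentence if ch.isalpha()]
            if len(alpha) >= 2:
                letters.append(alpha[1].upper())
        return "".join(letters)

    paragraph = ("Scores of readers tried this. Most models failed. "
                 "Odd errors crept in. Very few succeeded.")
    print(second_letters(paragraph))  # CODE -> this paragraph satisfies the constraint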

The fundamental problem remains: the models still see text through tokens, not letters. o1 has learned to "work around" this limitation through reasoning, but it has not eliminated it. As one researcher noted in Language Log, "Tokenization is part of the essence of what language models are; for any wrong answer, the explanation is precisely 'well, tokenization.'"

Academic Research: Emergence of Understanding at the Character Level

A significant paper posted on arXiv in May 2025 ("The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models") analyzes this phenomenon from a theoretical perspective. The researchers created 19 synthetic tasks that isolate character-level reasoning in controlled settings, showing that these abilities emerge suddenly and only late in training.

The study proposes that learning character composition is not fundamentally different from learning common sense: in both cases, knowledge emerges through a process of "conceptual percolation" once the model reaches a critical mass of examples and connections.

The researchers suggest a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword-based models. However, these modifications remain experimental and have not been implemented in commercial models.

Practical Implications: When to Trust and When Not to

The "strawberry" case teaches an important lesson about the reliability of language models: they are probabilistic tools, not deterministic calculators. As Mark Liberman noted in Language Log, "You should be wary of trusting the response of any current AI system in tasks involving counting things."

This does not mean that the models are useless. As one commenter noted, "Just because a cat makes the stupid mistake of getting scared by a cucumber doesn't mean we shouldn't trust the cat with the much more difficult task of keeping rodents out of the building." Language models are not the right tool if you want to systematically count letters, but they are excellent for automatically processing thousands of podcast transcripts and extracting names of hosts and guests.

For tasks requiring absolute precision (landing a spacecraft on Mars, calculating pharmaceutical dosages, verifying legal compliance), current language models remain inadequate without human supervision or external verification. Their probabilistic nature makes them powerful for pattern matching and creative generation, but unreliable for tasks where errors are unacceptable.
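In practice, "external verification" can be as simple as routing anything countable to deterministic code and reserving the model for the fuzzy parts. A sketch of that pattern, assuming the model's reply arrives as a short text string (the verifier below is a hypothetical example, not a feature of any particular API):

    def count_letter(text, letter):
        """Deterministic ground truth: counting characters is a job for code."""
        return text.lower().count(letter.lower())

    def verify_count_answer(model_answer, text, letter):
        """Accept the model's answer only if it matches the deterministic count."""
        digits = "".join(ch for ch in model_answer if ch.isdigit())
        small_numbers = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4}
        claimed = int(digits) if digits else small_numbers.get(model_answer.strip().lower(), -1)
        return claimed == count_letter(text, letter)

    print(verify_count_answer("two", "strawberry", "r"))    # False: reject and re-ask, or override
    print(verify_count_answer("three", "strawberry", "r"))  # True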

The Future: Toward Models That Reason for Hours

OpenAI has stated that it intends to experiment with o1-style models that "reason for hours, days or even weeks" to further increase their reasoning capabilities. In December 2024 it announced o3 (the name o2 was skipped to avoid trademark conflicts with the mobile operator O2), and in March 2025 it released the API for o1-pro, OpenAI's most expensive model to date, priced at $150 per million input tokens and $600 per million output tokens.

The direction is clear: instead of making models bigger and bigger (scaling), OpenAI is investing in making them "think" longer (test-time compute). This approach may prove more sustainable, in energy and compute, than training ever more massive models.

But an open question remains: are these models really "reasoning," or simply simulating reasoning through more sophisticated statistical patterns? Apple research published in October 2024 suggested that models like o1 may be replicating reasoning steps from their own training data. Changing the numbers and names in math problems, or simply re-running the same problem, made the models perform significantly worse; adding extraneous but logically irrelevant information caused performance to plummet by as much as 65 percent for some models.

Conclusion: Powerful Tools with Fundamental Limits

The "strawberry" problem and the o1 solution reveal both the potential and inherent limitations of current language models. OpenAI has shown that through targeted training and additional processing time, models can overcome some of the structural limitations of tokenization. But they have not eliminated it-they have circumvented it.

For users and developers, the practical lesson is clear: understanding how these systems work, what they do well and where they fail, is critical to using them effectively. Language models are tremendous tools for probabilistic tasks, pattern matching, creative generation, and information synthesis. But for tasks requiring deterministic precision (counting, computation, verification of specific facts), they remain unreliable without external supervision or complementary tools.

The name "Strawberry" will remain as an ironic reminder of this fundamental limitation: even the world's most advanced AI systems can stumble over questions that a six-year-old would solve instantly. Not because they are stupid, but because they "think" in ways profoundly different from us-and perhaps we should stop expecting them to think like humans.

Sources:

  • OpenAI - "Learning to Reason with LLMs" (official blog post, September 2024)
  • Wikipedia - "OpenAI o1" (entry updated January 2025)
  • Cosma, Adrian et al. - "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models," arXiv:2505.14172 (May 2025)
  • Liberman, Mark - "AI systems still can't count," Language Log (January 2025)
  • Yang, Yu - "Why Large Language Models Struggle When Counting Letters in a Word?", Medium (February 2025)
  • Orland, Kyle - "How does DeepSeek R1 really fare against OpenAI's best reasoning models?", Ars Technica
  • Brown, Noam (OpenAI) - Series of posts on X/Twitter (September 2024)
  • TechCrunch - "OpenAI unveils o1, a model that can fact-check itself" (September 2024)
  • 16x Prompt - "Why ChatGPT Can't Count How Many Rs in Strawberry" (updated June 2025)
