Newsletter

The Strawberry Problem

"How many 'r's' in strawberry?" - GPT-4o answers "two," a six-year-old knows it is three. The problem is tokenization: the model sees [str][aw][berry], not letters. OpenAI didn't solve it with o1-it got around it by teaching the model to "think before you speak." Result: 83% vs. 13% in Math Olympiad, but 30 seconds instead of 3 and triple the cost. Language models are extraordinary probabilistic tools-but you still need a human to count.

From the "Strawberry" Problem to Model o1: How OpenAI Solved (Partially) the Limit of Tokenization

In the summer of 2024, a viral internet meme embarrassed the world's most advanced language models: "How many 'r's are in the word 'strawberry'?" The correct answer is three, but GPT-4o stubbornly answered "two." It was a seemingly trivial error, but it revealed a fundamental limitation of language models: their inability to analyze the individual letters inside words.

On September 12, 2024, OpenAI released o1 (internally code-named "Strawberry"), the first model in its new series of "reasoning models," designed specifically to overcome this kind of limitation. And yes, the name is no accident: as an OpenAI researcher confirmed, o1 can finally count the 'r's in "strawberry" correctly.

But the solution is not what the original article envisioned. OpenAI did not "teach" the model to analyze words letter by letter. Instead, it developed a completely different approach: teaching the model to "reason" before responding.

The Counting Problem: Why Models Get It Wrong

The problem remains rooted in tokenization, the fundamental process by which language models ingest text. As a technical paper posted on arXiv in May 2025 ("The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models") explains, the models see words not as sequences of letters but as "tokens": units of meaning converted into numbers.

When GPT-4 processes the word "strawberry," its tokenizer divides it into three parts, [str][aw][berry], each with a specific numeric ID (496, 675, 15717). For the model, "strawberry" is not a sequence of 10 letters but a sequence of 3 numeric tokens. It is as if the model were reading a book in which every word has been replaced by a code, and someone then asked it to count the letters in a code it has never seen spelled out.

The problem is compounded with compound words. "Timekeeper" is broken into separate tokens, making it impossible for the model to determine the exact position of each letter without an explicit reasoning process. This fragmentation affects not only letter counting but also the model's grasp of the internal structure of words.
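To see the split for yourself, here is a minimal sketch using the open-source tiktoken library; note that the article does not name a specific tokenizer, and the exact boundaries and token IDs vary from one encoding to another:

    # Sketch: inspect how a subword tokenizer splits words.
    # The exact pieces and IDs depend on the encoding you load.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

    for word in ["strawberry", "timekeeper"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces, ids)

    # The model receives the ID sequence, never the 10 individual
    # letters of "strawberry", which is why "count the 'r's" is hard.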

The o1 Solution: Reasoning Before Responding

OpenAI o1 addressed the problem in an unexpected way: instead of changing the tokenization (technically difficult, and it would compromise the model's efficiency), OpenAI taught the system to "think before it speaks" using a technique called "chain-of-thought reasoning."

When you ask o1 how many 'r's are in "strawberry," the model does not answer immediately. It spends several seconds, sometimes even minutes for complex questions, internally working through a "chain of reasoning" that is hidden from the user. This process allows it to:

  1. Recognize that the question requires character-level analysis
  2. Develop a strategy for breaking down the word
  3. Test the response through different approaches
  4. Correct any errors before providing the final answer

As OpenAI researcher Noam Brown explained in a series of posts on X, "o1 is trained with reinforcement learning to 'think' before responding via a private chain of thought." The model receives rewards during training for each correct step in the reasoning process, not just for the final correct answer.

The results are impressive but costly. On a qualifying exam for the International Mathematics Olympiad, o1 solved 83 percent of the problems correctly versus 13 percent for GPT-4o. On doctoral-level science questions, it reached 78 percent accuracy against 56 percent for GPT-4o. But this power comes at a price: o1 takes 30+ seconds to answer questions that GPT-4o handles in 3, and it costs $15 per million input tokens versus $5 for GPT-4o.

Chain of Thought: How It Really Works

The technique is not magical but methodical. When it receives a prompt, o1 internally generates a long sequence of "thoughts" that are not shown to the user. For the 'r' problem in "strawberry," the internal process might be:

"First I need to understand the word structure. Strawberry could be tokenized as [str][aw][berry]. To count the 'r's, I need to reconstruct the complete word at the character level. Str contains: s-t-r (1 'r'). Aw contains: a-w (0 'r'). Berry contains: b-e-r-r-y (2 'r'). Total: 1+0+2 = 3 'r'. I check: strawberry = s-t-r-a-w-b-e-r-r-y. I count the 'r': position 3, position 8, position 9. Confirmed: 3 'r's."

This internal reasoning is hidden by design. OpenAI explicitly prohibits users from attempting to reveal o1's chain of thought, monitoring prompts and potentially revoking access to those who violate this rule. The company cites reasons of AI security and competitive advantage, but the decision has been criticized as a loss of transparency by developers working with language models.

Persistent Limits: o1 Is Not Perfect

Despite the progress, o1 has not completely solved the problem. Tests published on Language Log in January 2025 put various models through a more complex challenge: "Write a paragraph where the second letter of each sentence makes up the word 'CODE'."

Standard o1 ($20/month) failed, mistakenly treating the first letter of each sentence's opening word as the "second letter." o1-pro ($200/month) got it right, after 4 minutes and 10 seconds of "thinking." DeepSeek R1, the Chinese model that shook up the market in January 2025, made the same mistake as standard o1.
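Part of what makes this challenge interesting is that the constraint is trivial to verify with deterministic code, even though the models struggle to satisfy it. Here is a minimal checker, assuming the natural reading of "second letter of each sentence" (this sketch is not taken from the Language Log post):

    import re

    def second_letters(paragraph):
        """Return the second alphabetic character of each sentence, uppercased."""
        sentences = [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]
        letters = []
        for sentence in sentences:
            alpha = [ch for ch in sentence if ch.isalpha()]
            if len(alpha) >= 2:
                letters.append(alpha[1].upper())
        return "".join(letters)

    paragraph = ("Scores of readers tried this. Most models failed. "
                 "Odd errors crept in. Very few succeeded.")
    print(second_letters(paragraph))  # CODE -> this paragraph satisfies the constraint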

The fundamental problem remains: the models still see text through tokens, not letters. o1 has learned to "work around" this limitation through reasoning, but it has not eliminated it. As one researcher noted in Language Log, "Tokenization is part of the essence of what language models are; for any wrong answer, the explanation is precisely 'well, tokenization.'"

Academic Research: Emergence of Understanding at the Character Level

A significant paper posted on arXiv in May 2025 ("The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models") analyzes this phenomenon from a theoretical perspective. The researchers created 19 synthetic tasks that isolate character-level reasoning in controlled settings, showing that these abilities emerge suddenly and only late in training.

The study proposes that learning character composition is not fundamentally different from learning common sense: in both cases, knowledge emerges through a process of "conceptual percolation" once the model reaches a critical mass of examples and connections.

The researchers suggest a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword-based models. However, these modifications remain experimental and have not been implemented in commercial models.

Practical Implications: When to Trust and When Not to

The "strawberry" case teaches an important lesson about the reliability of language models: they are probabilistic tools, not deterministic calculators. As Mark Liberman noted in Language Log, "You should be wary of trusting the response of any current AI system in tasks involving counting things."

This does not mean that the models are useless. As one commenter noted, "Just because a cat makes the stupid mistake of getting scared by a cucumber doesn't mean we shouldn't trust the cat with the much more difficult task of keeping rodents out of the building." Language models are not the right tool if you want to systematically count letters, but they are excellent for automatically processing thousands of podcast transcripts and extracting names of hosts and guests.

For tasks requiring absolute precision (landing a spacecraft on Mars, calculating pharmaceutical dosages, verifying legal compliance), current language models remain inadequate without human supervision or external verification. Their probabilistic nature makes them powerful for pattern matching and creative generation, but unreliable for tasks where errors are unacceptable.
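In practice, "external verification" can be as simple as routing anything countable to deterministic code and reserving the model for the fuzzy parts. A sketch of that pattern, assuming the model's reply arrives as a short text string (the verifier below is a hypothetical example, not a feature of any particular API):

    def count_letter(text, letter):
        """Deterministic ground truth: counting characters is a job for code."""
        return text.lower().count(letter.lower())

    def verify_count_answer(model_answer, text, letter):
        """Accept the model's answer only if it matches the deterministic count."""
        digits = "".join(ch for ch in model_answer if ch.isdigit())
        small_numbers = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4}
        claimed = int(digits) if digits else small_numbers.get(model_answer.strip().lower(), -1)
        return claimed == count_letter(text, letter)

    print(verify_count_answer("two", "strawberry", "r"))    # False: reject and re-ask, or override
    print(verify_count_answer("three", "strawberry", "r"))  # True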

The Future: Toward Models That Reason for Hours

OpenAI has stated that it intends to experiment with o1-style models that "reason for hours, days or even weeks" to further increase their reasoning capabilities. In December 2024 it announced o3 (the name o2 was skipped to avoid trademark conflicts with the mobile operator O2), and in March 2025 it released the API for o1-pro, OpenAI's most expensive model to date, priced at $150 per million input tokens and $600 per million output tokens.

The direction is clear: instead of making models bigger and bigger (scaling), OpenAI is investing in making them "think" longer (test-time compute). This approach may prove more sustainable, in energy and compute, than training ever more massive models.

But an open question remains: are these models really "reasoning," or simply simulating reasoning through more sophisticated statistical patterns? Apple research published in October 2024 suggested that models like o1 may be replicating reasoning steps from their own training data. Changing the numbers and names in math problems, or simply re-running the same problem, made the models perform significantly worse; adding extraneous but logically irrelevant information caused performance to plummet by as much as 65 percent for some models.

Conclusion: Powerful Tools with Fundamental Limits

The "strawberry" problem and the o1 solution reveal both the potential and inherent limitations of current language models. OpenAI has shown that through targeted training and additional processing time, models can overcome some of the structural limitations of tokenization. But they have not eliminated it-they have circumvented it.

For users and developers, the practical lesson is clear: understanding how these systems work, what they do well and where they fail, is critical to using them effectively. Language models are tremendous tools for probabilistic tasks, pattern matching, creative generation, and information synthesis. But for tasks requiring deterministic precision (counting, computation, verification of specific facts), they remain unreliable without external supervision or complementary tools.

The name "Strawberry" will remain as an ironic reminder of this fundamental limitation: even the world's most advanced AI systems can stumble over questions that a six-year-old would solve instantly. Not because they are stupid, but because they "think" in ways profoundly different from us-and perhaps we should stop expecting them to think like humans.

Sources:

  • OpenAI - "Learning to Reason with LLMs" (official blog post, September 2024)
  • Wikipedia - "OpenAI o1" (entry updated January 2025)
  • Cosma, Adrian et al. - "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models," arXiv:2505.14172 (May 2025)
  • Liberman, Mark - "AI systems still can't count," Language Log (January 2025)
  • Yang, Yu - "Why Large Language Models Struggle When Counting Letters in a Word?", Medium (February 2025)
  • Orland, Kyle - "How does DeepSeek R1 really fare against OpenAI's best reasoning models?", Ars Technica
  • Brown, Noam (OpenAI) - Series of posts on X/Twitter (September 2024)
  • TechCrunch - "OpenAI unveils o1, a model that can fact-check itself" (September 2024)
  • 16x Prompt - "Why ChatGPT Can't Count How Many Rs in Strawberry" (updated June 2025)
