The Curious Case of Spelling ‘Strawberry’
LLM Tokenization Explained: Why Simple Words Aren’t Always Simple for AI
Large language models (LLMs) like GPT-4 are powerful tools for generating text, solving complex problems, and engaging in human-like conversations. However, they often stumble on simple tasks such as counting letters in a word due to the fundamental differences in how they process text compared to humans. These limitations stem primarily from tokenization and the architecture of LLMs.
How LLMs and Humans Understand Text Differently
- Human Understanding: Humans read and process text linearly, seeing each letter in sequence. For us, the word “strawberry” is immediately recognized as a series of individual characters: ‘s,’ ‘t,’ ‘r,’ ‘a,’ ‘w,’ ‘b,’ ‘e,’ ‘r,’ ‘r,’ ‘y.’ We can effortlessly count letters, identify patterns, and understand the word as a complete unit.
- LLM Understanding: LLMs, however, do not process text like humans. They rely on tokenization, a process that breaks text into tokens — basic units that can be entire words, parts of words, or even single characters. For instance, “strawberry” might be split into “straw” and “berry,” or even processed as one token if it frequently appears in training data. This method allows LLMs to handle text efficiently but can cause issues when tasks require exact character-level understanding.
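To see this in practice, here is a small sketch using the open-source tiktoken library. The splits shown in the comments are illustrative assumptions; how a word is actually broken up depends entirely on the tokenizer the model was trained with.

```python
# pip install tiktoken
import tiktoken

# "cl100k_base" is one of tiktoken's built-in encodings; other models
# use other encodings and will split the same words differently.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["strawberry", "timekeeper"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(word, "->", pieces)

# Possible output (illustrative only):
#   strawberry -> ['straw', 'berry']
#   timekeeper -> ['time', 'keeper']
```

The point is not the specific split: it is that the model's basic unit is the token, not the letter.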
Tokenization and Compound Words
Tokenization
Tokenization is how AI models break down text into smaller chunks called tokens. These tokens can be whole words, parts of words, or even individual letters, depending on how the model was trained.
Example with “Strawberry”:
When the AI reads the word “strawberry,” it might split it into “straw” and “berry.” This split happens because “strawberry” is made of two recognizable parts, or tokens, that the model has seen before. In some cases, if the model is trained to see “strawberry” as a common word, it might treat the whole word as a single token.
Why does this matter? Because when you ask the AI to do something like count the letters in “strawberry,” it doesn’t see individual letters — just the bigger chunks, “straw” and “berry.” So if you ask, “How many ‘r’s are in ‘strawberry’?” the AI might get confused, because it doesn’t naturally break the word down to the level of each letter. It’s like trying to count the threads in a rope without untwisting it first.
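To “untwist the rope” in code, you have to drop down to the character level yourself. Here is a plain Python sketch of the character-by-character view the model never sees by default:

```python
from collections import Counter

word = "strawberry"

# The character-level view: every letter, one by one.
letters = list(word)
print(letters)            # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

# At this level, counting is trivial and deterministic.
counts = Counter(letters)
print(counts["r"])        # 3
print(word.count("r"))    # 3, the same answer via str.count
```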
Compound Words:
Now, let’s consider a compound word like “timekeeper.” Humans see this as a single word, but an AI model might break it down into “time” and “keeper.” This tokenization makes the word easier to process and understand in the context of a sentence, but it causes problems when you need exact, letter-level details.
Example with “Timekeeper”:
If you ask the AI, “How many ‘e’s are in ‘timekeeper’?” the model may reason over “time” and “keeper” as separate chunks and lose track when it combines the partial counts. Worse, because it may index letter positions inconsistently (starting from 0 in one step and from 1 in the next), its intermediate reasoning can drift and produce a different answer each time. Either way, you can end up with the wrong count because the AI never looks at “timekeeper” as a whole word with a precise letter structure.
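A short Python sketch makes the contrast concrete. Treating “time” and “keeper” as the token chunks is an assumption for illustration; the last two lines show how position reporting alone can drift depending on whether counting starts at 0 or 1.

```python
word = "timekeeper"

# Whole-word, character-level count: unambiguous.
print(word.count("e"))                              # 4

# Counting over assumed chunks only agrees if the partial counts
# are combined correctly, which is exactly where a model can slip.
chunks = ["time", "keeper"]
print(sum(chunk.count("e") for chunk in chunks))    # 1 + 3 = 4

# The same letters get different "positions" depending on whether
# you index from 0 or from 1.
print([i for i, ch in enumerate(word) if ch == "e"])            # [3, 5, 6, 8]
print([i for i, ch in enumerate(word, start=1) if ch == "e"])   # [4, 6, 7, 9]
```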
Transformer Architecture: The AI’s Brain
To understand why these tokenization quirks happen, we need to look at the transformer architecture, the backbone of most modern LLMs.
What is Transformer Architecture?
Think of a transformer like a super-smart reader that doesn’t just read word by word but understands whole sentences at once. It’s designed to figure out how words relate to each other instantly, making it very good at tasks like translating languages or writing coherent paragraphs.
- Reading in Chunks (Tokenization): Transformers read by breaking sentences into tokens. For example, “The cat sat on the mat” might become tokens like “The,” “cat,” “sat,” “on,” “the,” “mat.” These chunks help the model process information efficiently, but they also mean it never sees individual letters directly.
- Turning Words into Numbers (Encoding): Once the text is split into tokens, each token is mapped to a numeric ID and then to an embedding, a vector of numbers that represents the token’s meaning rather than its exact characters. So, instead of seeing ‘T,’ ‘H,’ ‘E,’ the AI sees a numerical pattern for “the.”
- Understanding the Big Picture (Attention Mechanism): Transformers excel because they use something called attention mechanisms. This allows them to focus on the most important parts of a sentence, not just read left to right. For example, in “The cat sat on the mat,” the AI might pay more attention to “cat” and “sat,” figuring out that the cat is the one doing the sitting. (A minimal numerical sketch of this mechanism appears below.)
- Putting It All Together (Layers of Processing): Transformers have multiple layers that process the tokens, refining their understanding at each step. The first layer might catch the basic meaning, while the next layers add more detail until the AI has a full grasp of the sentence’s context.
Transformers are great at grasping the overall meaning and context of sentences, but they don’t naturally focus on the small details, like counting specific letters. It’s like knowing what a picture shows without being able to count every brushstroke — excellent for big ideas, not so much for fine details.
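To make the attention idea less abstract, here is a minimal, self-contained sketch of scaled dot-product attention, the core operation inside a transformer layer. The tiny random vectors stand in for learned embeddings and are invented purely for illustration; real models use separate learned projections and vectors with hundreds or thousands of dimensions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: the core operation of a transformer layer."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how strongly each token relates to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

# Toy 4-dimensional vectors for the tokens of "The cat sat on the mat".
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
X = rng.normal(size=(len(tokens), 4))

# In a real model, Q, K and V come from separate learned projections of X;
# reusing X for all three keeps the sketch short.
output, weights = scaled_dot_product_attention(X, X, X)
print(weights.round(2))   # each row sums to 1: how much each token "attends" to the others
```

Notice that every quantity in this computation lives at the token level; nothing here ever touches individual characters.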
Guiding LLMs with Critical Thinking
To overcome these challenges, guiding LLMs through a critical thinking approach can help improve their performance on specific tasks. For example, rather than directly asking, “How many ‘r’s are in ‘strawberry’?” you can prompt the LLM to:
- Observe the Problem: Clearly define the task, directing the LLM to break down the word into individual characters.
- Reason in Steps: Instruct the LLM to count systematically, character by character, paying close attention to each letter and its position.
- Iterate and Reflect: Encourage the LLM to review its process, check its counts, and verify its results.
Example prompt:
“Instead of guessing, observe the word ‘strawberry’ carefully. Break it down into individual letters. Count the ‘r’s one by one, checking each step for accuracy. Reflect on your count and verify the total to ensure it’s correct.”
By employing such a step-by-step, methodical approach, LLMs can be nudged towards more accurate performance in tasks that are naturally intuitive for humans.
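For completeness, here is one way such a prompt might be wired up in code, sketched with the OpenAI Python client. The model name is a placeholder and the wording is just one possible phrasing; any chat-style API would work the same way.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Instead of guessing, observe the word 'strawberry' carefully. "
    "Break it down into individual letters, listing each one with its position. "
    "Count the 'r's one by one, checking each step for accuracy. "
    "Reflect on your count and verify the total before answering."
)

# "gpt-4o" is used here purely as a placeholder model name.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```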
Why Prompt Engineering Matters
The “strawberry problem” is a perfect example of why crafting the right prompts matters. Prompt engineering is all about asking AI the right way. It’s not just what you ask, but how you ask it that can make a huge difference.
- Getting AI to Do It Right: Without the right prompts, AI will just do its usual thing and likely mess up. But with a well-structured prompt, you can steer it to handle tasks it usually fails at, like counting letters accurately.
- Unlock AI’s Potential: Good prompts help AI reach its full potential, guiding it through detailed reasoning instead of just guessing.
- A Real-World Example: The struggle with “strawberry” shows how critical the right prompt is. It’s a simple task for humans, but AI needs a push to get it right. With the right guidance, you can make AI perform way better, even on tasks it usually finds tricky.
So, prompt engineering isn’t just helpful — it’s essential. It’s all about getting AI to work smarter, not harder, and turning tricky problems into solvable ones just by asking the right way.