This is one of those things that feels like it shouldn't be true until you actually test it yourself. You take a sentence, paste it into a token counter for GPT-4o, note the number, then paste the exact same sentence into a counter for Claude or Gemini, and the number is different. Not by a huge amount most of the time, but different enough to make you wonder what's going on.
The answer has everything to do with how each model was built, and once you understand it, a lot of other things about working with AI models start to make more sense.
Tokenization is not a universal standard
There is no single agreed-upon way to split text into tokens. Every AI lab built their own system, trained on their own data, optimized for their own goals. The result is that GPT-4o, Claude, and Gemini each have a tokenizer that works a little differently from the others.
Most of them are based on the same general approach, subword tokenization, and the best-known version of it is called byte pair encoding. The basic idea is that instead of splitting text letter by letter or word by word, you start from small units such as characters or bytes, find the most frequent adjacent pair across a large body of text, and merge that pair into a single new unit. Over many iterations, you end up with a vocabulary of common chunks that covers most of what you'd ever want to say.
But the vocabulary each model ended up with depends on what text it was trained on, how many iterations of that process it went through, and a bunch of other decisions made by the people building it. So even though they all use a similar method, the specific tokens in their vocabularies are different, and that's what causes the counts to vary.
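To make that concrete, here is a minimal toy sketch of the merging process in Python. It is not any lab's actual tokenizer: the function names (train_bpe, apply_merge, tokenize) and both tiny corpora are made up for illustration, and real tokenizers add byte-level handling, pre-tokenization rules, and far larger vocabularies. It only shows that the same procedure, trained on different text, splits the same input differently.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a whitespace-separated toy corpus."""
    # Every word starts out as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols appears.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(text: str, merges: list) -> list:
    """Split text into tokens by replaying the learned merges in order."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Two hypothetical training corpora: plain English vs. camelCase identifiers.
english_corpus = "the cat sat on the mat"
code_corpus = "getUserName getUserId setUserName getUserName getUserEmail"

english_merges = train_bpe(english_corpus, num_merges=10)
code_merges = train_bpe(code_corpus, num_merges=10)

text = "getUserName"
english_tokens = tokenize(text, english_merges)
code_tokens = tokenize(text, code_merges)

print(len(english_tokens), english_tokens)
print(len(code_tokens), code_tokens)
```

The vocabulary learned from the English corpus has no merges that apply to the identifier, so it falls back to single characters, while the vocabulary learned from the code corpus covers it in fewer pieces. That is the same effect, at toy scale, that makes real token counts differ between models.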
Some content types show bigger differences than others
For a simple English sentence like "the cat sat on the mat," the differences between models will be minimal, maybe zero. That kind of text is so common in training data that every tokenizer handles it almost the same way.
But move away from simple English and the differences grow. Here are the situations where you'll notice the most variation.
Code is probably the most obvious one. Curly braces, semicolons, indentation, and variable names written in camelCase or snake_case all get handled differently by different tokenizers. A function that's fifty lines of JavaScript might come out to four hundred tokens in one model and four hundred and sixty in another. That's a fifteen percent gap, which is not a rounding error.
Non-English languages are another area where the gaps get significant. Tokenizers trained heavily on English text tend to be more efficient at English, meaning they can express the same idea in fewer tokens. A sentence in Turkish or Vietnamese or Hebrew might require substantially more tokens than the equivalent meaning in English, and the exact multiplier varies by language and by tokenizer.
Emojis, special characters, and formatting symbols also behave inconsistently. Some tokenizers treat an emoji as a single token. Others break it into several pieces. This matters if you're building something like a social media tool or a customer service bot where user input might include a lot of non-standard characters.
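One reason for this is easy to see without any tokenizer at all. Many tokenizers are byte-level, meaning they start from the UTF-8 bytes of the text before any merging happens. The short sketch below, which uses only the Python standard library, prints how many bytes a few characters take. It does not report real token counts for any model. It only shows the worst-case starting point: if a tokenizer's vocabulary has no merged entry for a character, that character can cost as many tokens as it has bytes. The sample characters and the helper name utf8_len are chosen purely for illustration.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Escape sequences are used so the sample does not depend on editor encoding.
samples = {
    "ASCII letter a": "a",
    "Turkish s with cedilla": "\u015f",
    "Hebrew letter shin": "\u05e9",
    "Vietnamese e with circumflex and dot below": "\u1ec7",
    "Grinning face emoji": "\U0001F600",
    "Thumbs up with skin tone modifier": "\U0001F44D\U0001F3FD",
}

for label, text in samples.items():
    print(f"{label}: {len(text)} code point(s), {utf8_len(text)} byte(s)")
```

An ASCII letter is one byte, the Turkish and Hebrew letters are two, the Vietnamese letter is three, a basic emoji is four, and an emoji with a skin tone modifier is eight. Whether those bytes end up as one token or several depends entirely on what each tokenizer learned to merge.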
Why this creates real cost differences
Once you understand that the same text produces different token counts in different models, the cost implication is straightforward. You're paying per token, so at any given per-token price, more tokens means more money, even if the underlying text and the quality of the response are identical.
The tricky part is that this doesn't show up obviously when you're testing. If you're comparing two models and you send them the same prompt and get back similar quality responses, you might conclude they're roughly equivalent in cost. But if one model is tokenizing your specific content less efficiently, your actual costs in production could be noticeably higher than you expected based on testing.
According to Google's research on tokenization efficiency, even small differences in how whitespace and punctuation are handled can change token counts in ways that aren't obvious from looking at the text. This is why testing with your actual content, not a generic sample, is important before committing to a model for production use.
The system prompt is where this bites people most
Your system prompt is probably the single highest-leverage place to watch your token count, because it gets sent with every single request. If your system prompt is five hundred tokens in one model and five hundred and eighty in another, that eighty-token difference is multiplied by every request you send.
At low volumes this doesn't matter. At ten thousand requests a day, that's eight hundred thousand extra tokens per day just from the system prompt. Whether that's a meaningful cost depends on the pricing of the model you're using, but it's the kind of thing that's very easy to overlook and very straightforward to check.
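The arithmetic is simple enough to script. The sketch below reproduces the numbers from the example above. The function names are made up for illustration, and the price of 2.50 per million input tokens is a hypothetical placeholder, not any provider's actual rate, so substitute the published price for the model you are evaluating.

```python
def extra_tokens_per_day(tokens_model_a: int, tokens_model_b: int, requests_per_day: int) -> int:
    """Extra system-prompt tokens sent per day because of the tokenizer difference."""
    return abs(tokens_model_a - tokens_model_b) * requests_per_day


def cost_of_tokens(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a number of tokens at a given price per million tokens."""
    return tokens * price_per_million_tokens / 1_000_000


# Numbers from the example: 500 vs. 580 tokens, 10,000 requests a day.
extra = extra_tokens_per_day(500, 580, 10_000)

# HYPOTHETICAL price, for illustration only.
price_per_million = 2.50

daily_cost = cost_of_tokens(extra, price_per_million)
print(f"Extra tokens per day: {extra:,}")
print(f"Extra cost per day at the assumed price: {daily_cost:.2f}")
print(f"Extra cost per 30 days at the assumed price: {daily_cost * 30:.2f}")
```

At the assumed price, the eighty-token gap works out to 2.00 a day, or 60.00 over thirty days, from the system prompt alone.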
The Token Counter on Prompt Toolbox lets you paste your system prompt and see its token count across multiple models at once, which is the fastest way to catch this kind of difference before it shows up in your usage report.
How to use this knowledge practically
The main thing to take away from this is that you should always measure token counts with your actual content on your actual target model, not with a sample piece of text on whatever tokenizer happens to be convenient.
Before you finalize a system prompt, count its tokens on the model you're deploying to. Before you switch from one model to another, count how your most common request types tokenize on both. Before you build a cost estimate for a new feature, run a realistic sample of the inputs through a counter and use that number rather than a generic words-to-tokens rule of thumb.
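A comparison like this can be wrapped in a few lines of Python. The sketch below is only a harness. The two counting functions in it are deliberately crude stand-ins (one counts whitespace-separated words, the other applies a four-characters-per-token rule of thumb), not real tokenizers, and all names are hypothetical. To get meaningful numbers, replace each stand-in with a function that calls the actual token counter for the model you are considering.

```python
import math
from typing import Callable, Dict, List


def compare_token_counts(
    samples: List[str],
    counters: Dict[str, Callable[[str], int]],
) -> Dict[str, int]:
    """Total token count of all samples under each named counting function."""
    return {name: sum(count(s) for s in samples) for name, count in counters.items()}


def relative_to(totals: Dict[str, int], baseline: str) -> Dict[str, float]:
    """Express each total as a ratio of the baseline counter's total."""
    base = totals[baseline]
    return {name: total / base for name, total in totals.items()}


# STAND-IN counters for illustration only; swap in real tokenizer calls.
def whitespace_words(text: str) -> int:
    return len(text.split())


def four_char_chunks(text: str) -> int:
    return math.ceil(len(text) / 4)


# Use a realistic sample of your own requests here.
samples = [
    "the cat sat on the mat",
    "const x = 1;",
]

totals = compare_token_counts(
    samples,
    {"whitespace_words": whitespace_words, "four_char_chunks": four_char_chunks},
)
ratios = relative_to(totals, baseline="whitespace_words")

for name in totals:
    print(f"{name}: {totals[name]} tokens ({ratios[name]:.0%} of baseline)")
```

Keeping the counters as plain functions means the same harness works whether a count comes from a local library or from a provider's counting endpoint.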
None of this takes more than a few minutes, and it removes a lot of the uncertainty that tends to make AI API costs feel unpredictable. The differences between models aren't random. They're consistent and measurable, which means they're also manageable once you know to look for them.
