How to Count Tokens in GPT-4o, Claude, and Gemini: Differences That Will Cost You Money If You Ignore Them

If you've ever switched from one AI model to another and noticed that your costs changed in a way that didn't quite make sense, tokens are probably why. Most people assume that if you send the same text to GPT-4o, Claude, and Gemini, the token count will be the same or at least very close. It won't always be, and that difference, small as it might seem at first, can add up to real money over time.

This article is about understanding why that happens, what it means in practice, and what you can actually do about it.

Every model has its own way of reading text

When you send a message to an AI model, the first thing that happens is that your text gets broken down into tokens before the model processes anything. Each company built their own system for doing this, and those systems don't work exactly the same way.

OpenAI publishes its tokenizers in an open-source library called tiktoken, and GPT-4o uses an encoding named o200k_base. Anthropic uses their own tokenizer for Claude. Google has their own approach for Gemini. They all share the same general idea, which is to break text into frequently occurring chunks rather than individual letters or complete words, but the data each one learned those chunks from and the way each handles edge cases are different.

For plain, simple English sentences, the differences are usually small. But the moment you introduce anything outside of that, things start to diverge.

Where the differences actually show up

The biggest gaps between tokenizers tend to appear in a few specific situations.

Code is one of them. Programming languages have a lot of special characters, indentation, and syntax that different tokenizers handle in their own ways. The same block of Python or JavaScript code can produce noticeably different token counts depending on which model you're sending it to.

Languages other than English are another big one. Tokenizers are generally trained on data that's heavily weighted toward English, which means that text in Spanish, French, Arabic, Chinese, or any other language tends to use more tokens to express the same amount of information. The exact ratio varies by language and by tokenizer, but it's something worth knowing if you're building anything that handles non-English content.

Numbers and formatting also behave differently. A long number, a date written in a specific format, or a table with a lot of special characters can all produce different results across models. It's not dramatic, but it's consistent enough to matter if you're sending a lot of structured data.

A real example of what this looks like

Take a paragraph of English text, around two hundred words long. In most cases, GPT-4o, Claude, and Gemini will land in a similar range for that paragraph, perhaps within ten or twenty tokens of each other, though the exact gap depends on the text. That is not a big deal for a single request.

Now imagine you're running an application that sends that paragraph, along with a system prompt and some conversation history, thousands of times a day. That small difference in token count per request starts to multiply. And if your content includes code, or you're serving users in multiple languages, the gap per request gets wider before you even start multiplying it.
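To make the multiplication concrete, here is a minimal sketch. The token counts and request volume are placeholder numbers chosen for illustration, not measurements from any model.

```python
def extra_tokens_per_day(tokens_a: int, tokens_b: int, requests_per_day: int) -> int:
    """Return how many more tokens per day tokenizer B uses than tokenizer A."""
    return (tokens_b - tokens_a) * requests_per_day


# Hypothetical counts for the same request under two different tokenizers.
gap = extra_tokens_per_day(tokens_a=1_200, tokens_b=1_215, requests_per_day=5_000)
print(gap)  # 75000 extra input tokens per day from a 15-token gap
```

A gap of fifteen tokens looks like noise on one request, but at this assumed volume it is seventy-five thousand tokens a day that you pay for on one model and not on the other.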

According to Anthropic's documentation on how Claude handles tokens, the way a model tokenizes text is tied directly to how it was trained, which is why you can't assume one tokenizer's output will match another's even for identical input.

How to actually check token counts across models

The most reliable way to know how many tokens your content uses is to count them directly rather than estimating. For GPT-4o specifically, OpenAI provides a tool called the Tokenizer on their platform that shows you exactly how your text gets split. For Claude, Anthropic exposes a token counting endpoint through its API, and Google offers a similar token counting method for Gemini.
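If you prefer to count from code, the sketch below shows one way to do it for GPT-4o using the open-source tiktoken library, which runs locally. The helper names are my own, and the counts in the usage lines are made-up placeholders. Counts for Claude and Gemini would come from each vendor's API and are not shown here because they need API keys.

```python
def count_gpt4o_tokens(text: str) -> int:
    """Count tokens locally with OpenAI's tiktoken library (pip install tiktoken)."""
    import tiktoken  # imported lazily so the helper below works without it installed

    encoding = tiktoken.get_encoding("o200k_base")  # the encoding GPT-4o uses
    return len(encoding.encode(text))


def percent_vs_baseline(counts: dict[str, int], baseline: str) -> dict[str, float]:
    """Express each model's count as a percentage difference from the baseline model."""
    base = counts[baseline]
    return {model: round(100 * (n - base) / base, 1) for model, n in counts.items()}


# Hypothetical counts for one prompt; the keys are just labels, not API model IDs.
counts = {"gpt-4o": 1000, "claude": 1040, "gemini": 980}
print(percent_vs_baseline(counts, baseline="gpt-4o"))
# {'gpt-4o': 0.0, 'claude': 4.0, 'gemini': -2.0}
```

Looking at percentages rather than raw counts makes it easier to compare prompts of different lengths.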

If you want to check counts across multiple models without switching between different tools and platforms, the Token Counter on Prompt Toolbox lets you paste your text once and see the count for GPT-4o, Claude, Gemini, and other models side by side. It's the fastest way to spot differences before they show up on your bill.

What this means for your system prompt

One place where tokenizer differences matter more than people expect is the system prompt. A system prompt is the set of instructions you give the model at the start of every conversation or every request, and it gets sent every single time.

If you wrote your system prompt while working with one model and then switched to another, the token count for that prompt might be different even if you didn't change a single word. Multiply that difference by the number of requests you send per day and it becomes something worth checking.

A good habit is to run your system prompt through a token counter every time you make significant changes to it, and again if you switch models. It takes less than a minute and it removes a lot of uncertainty from your cost estimates.
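One way to turn that habit into something automatic is a small check that fails when the prompt grows past a budget you set. This is only a sketch: the function name, the budget, and the stand-in counter are all assumptions for illustration, and in practice you would pass in a real counter for whichever model you use.

```python
from typing import Callable


def check_prompt_budget(prompt: str, count_tokens: Callable[[str], int], budget: int) -> int:
    """Return the prompt's token count, or raise if it exceeds the budget."""
    used = count_tokens(prompt)
    if used > budget:
        raise ValueError(f"system prompt uses {used} tokens, budget is {budget}")
    return used


def rough_counter(text: str) -> int:
    """Stand-in for illustration only: splitting on whitespace is NOT a real tokenizer."""
    return len(text.split())


used = check_prompt_budget("You are a helpful assistant.", rough_counter, budget=50)
print(used)  # 5
```

Because the counter is passed in as an argument, the same check can be rerun with a different model's counter when you switch.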

The pricing layer makes this more complicated

Token counts aren't the only variable in what you end up paying. Each model also has its own price per token, and those prices are different for input tokens (what you send) and output tokens (what the model writes back).

So it's possible for a model that uses slightly more tokens than another to still be cheaper overall, depending on how it prices those tokens and what the ratio of input to output looks like in your specific use case. This is why comparing models on cost requires looking at both the token count and the price per token together, not just one or the other.
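Here is a small worked example of that situation. The prices and token counts are invented for illustration and do not correspond to any real model's price list.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given prices per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical: model B needs 10% more tokens for the same request but charges less per token.
model_a = request_cost(1_000, 500, input_price_per_m=3.00, output_price_per_m=15.00)
model_b = request_cost(1_100, 550, input_price_per_m=2.50, output_price_per_m=10.00)

print(f"model A: ${model_a:.5f} per request")  # model A: $0.01050 per request
print(f"model B: ${model_b:.5f} per request")  # model B: $0.00825 per request
```

In this made-up case the model with the higher token count still comes out cheaper, which is why neither number tells you much on its own.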

The Tokens-per-Dollar Calculator on Prompt Toolbox is useful here because it lets you factor in both variables and see what your budget actually gets you across different models.

The practical takeaway

You don't need to memorize the exact differences between tokenizers or become an expert in how they work internally. What matters is knowing that the differences exist, knowing where they tend to show up (code, non-English text, numbers, formatting), and having a simple way to check counts before you commit to a design or a budget.

Switching models without checking token counts first is one of those things that seems fine in testing and then becomes a surprise on your invoice. A few minutes of checking upfront saves a lot of explaining later.