How the GPT-4o Tokenizer Handles Spanish, Emojis, and Code

Most people who use GPT-4o regularly have no idea that the model reads Spanish differently than it reads English, or that a single emoji can count as more than one token, or that a few lines of code can use up tokens faster than a full paragraph of plain text. None of this is hidden information, but it's also not something OpenAI puts front and center when you sign up.

Understanding how the tokenizer actually behaves with different types of content is one of those things that pays off quickly once you know it, whether you're trying to keep API costs down, avoid hitting context limits unexpectedly, or just get a clearer picture of what's happening when you send a request.

Why the tokenizer isn't equally efficient for all content

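The GPT-4o tokenizer, like most tokenizers used by large language models, is a byte-pair-encoding tokenizer (its encoding is published as o200k_base in OpenAI's tiktoken library). It was built by analyzing enormous amounts of text and repeatedly merging the most common adjacent patterns. Those patterns get turned into tokens, and the more common a pattern is in the training data, the more likely it is to be a single token rather than several.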
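To make "finding the most common patterns" concrete, here is a toy sketch of the byte-pair-encoding idea. It is an illustration written for this article, not OpenAI's training code: the function names and the tiny corpus are made up, and a real tokenizer works on bytes across a vastly larger dataset.

```python
from collections import Counter


def most_common_pair(words):
    """Return the adjacent symbol pair that occurs most often across all words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# "the" appears often, so it collapses into one symbol after two merges.
# "niño" appears once, so none of its letter pairs win a merge and it
# stays as four separate symbols.
merges, words = train("the the the then niño", 2)
print(merges)
print(words)
```

The same dynamic, at scale, is why frequent English words tend to end up as single tokens while rarer sequences stay split into pieces.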

The training data for models like GPT-4o is heavily weighted toward English. Not exclusively English, but English is by far the most represented language. This means the tokenizer is most efficient at English, in the sense that it can express the most meaning per token. When you move to other languages, or to content types like code and emojis that follow completely different patterns, the efficiency drops and token counts go up for the same amount of information.

How Spanish text tokenizes compared to English

Spanish is a relatively well-supported language in GPT-4o, better than many others, but it still tokenizes less efficiently than English in most cases. The reason is partly structural. Spanish words are longer on average than English words, accented characters like á, é, í, ó, ú, and ñ each take two bytes in UTF-8 and don't always merge cleanly into the tokens around them, and Spanish uses longer word forms to express things that English handles with shorter constructions.

A rough estimate that holds up reasonably well in practice is that Spanish text uses somewhere between 15 and 30 percent more tokens than equivalent English text. The exact number depends on the content: formal writing tends to be closer to the lower end, and casual or regional language closer to the higher end.
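One way to check that estimate against your own text is to encode a translated pair and compare the counts. The sketch below is a minimal example, not a benchmark: token_overhead and gpt4o_encode are names invented here, and gpt4o_encode assumes the third-party tiktoken package is installed and that o200k_base is the encoding you want to measure.

```python
def token_overhead(baseline_text, other_text, encode):
    """Return how many more tokens other_text needs than baseline_text, as a fraction.

    `encode` is any callable that maps a string to a sequence of tokens.
    A result of 0.2 means 20 percent more tokens than the baseline.
    """
    baseline = len(encode(baseline_text))
    other = len(encode(other_text))
    if baseline == 0:
        raise ValueError("baseline text produced no tokens")
    return (other - baseline) / baseline


def gpt4o_encode(text):
    """Encode text with tiktoken's o200k_base encoding."""
    # Imported lazily so token_overhead itself has no third-party dependency.
    import tiktoken

    return tiktoken.get_encoding("o200k_base").encode(text)
```

Calling token_overhead("The meeting has been moved to Thursday.", "La reunión se ha trasladado al jueves.", gpt4o_encode) gives the overhead for that one sentence pair. A single pair is noisy, so average over a realistic sample before drawing conclusions.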

This matters a lot if you're building something for a Spanish-speaking audience and you based your cost estimates on English text. Your real costs in production could be meaningfully higher than what your testing suggested, and the gap grows with every request.

According to MIT Technology Review's coverage of language model efficiency, non-English languages remain a significant challenge for tokenization systems that were primarily designed and optimized around English-language training data.

What happens with emojis

Emojis are interesting because they look like single characters but they often aren't, at least not from the tokenizer's point of view. Most common emojis are a single Unicode code point that takes four bytes in UTF-8, and because the tokenizer works on bytes, an emoji becomes a single token only if its full byte sequence was common enough in the training data to be merged into one. Otherwise it is split into several tokens.

A simple emoji like 😊 or 👍 might be one or two tokens. A more complex emoji, such as a person with a skin tone modifier or a family emoji built from several emojis linked by invisible zero-width joiner characters, can be four, five, or even more tokens. This is completely invisible when you're looking at the text, because the emoji still looks like a single symbol on screen.
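The gap between what you see and what is stored is easy to inspect with the standard library alone. The sketch below (emoji_profile is a name made up for this example) counts code points and UTF-8 bytes for three emojis. It does not count tokens, but since a byte-level tokenizer cannot produce more tokens than there are bytes, the byte count works as an upper bound.

```python
def emoji_profile(text):
    """Return (code_points, utf8_bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


samples = {
    # U+1F60A
    "smiling face": "\U0001F60A",
    # U+1F44D followed by the medium skin tone modifier U+1F3FD
    "thumbs up with skin tone": "\U0001F44D\U0001F3FD",
    # man, woman, girl, linked by two zero-width joiners (U+200D)
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for label, emoji in samples.items():
    code_points, utf8_bytes = emoji_profile(emoji)
    print(f"{label}: {code_points} code point(s), {utf8_bytes} UTF-8 bytes")

# smiling face: 1 code point(s), 4 UTF-8 bytes
# thumbs up with skin tone: 2 code point(s), 8 UTF-8 bytes
# family: 5 code point(s), 18 UTF-8 bytes
```

All three render as one symbol, yet the family emoji carries more than four times the bytes of the smiling face, which is why its token count can climb.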

For most use cases this is a minor inconvenience. But if you're building a tool that processes social media content, customer messages, or any other text where emojis are common, it's worth knowing that your token counts will be higher than a character or word count would suggest.

The practical thing to do is run a realistic sample of your actual content through a token counter rather than estimating, so you see what you're actually dealing with before you build your cost model around a number that doesn't account for emojis.

How code tokenizes and why it's so different

Code is where the tokenizer behavior diverges most dramatically from what you'd expect based on word or character counts. A block of code that looks short can use a surprising number of tokens, and the reason comes down to the density of special characters and the way programming syntax gets split up.

Take something as simple as a function definition in JavaScript. The curly braces, parentheses, semicolons, equal signs, arrows, and other syntax elements all contribute tokens. Variable names written in camelCase or snake_case often get split at the case boundaries or the underscore, turning one identifier into two or three tokens. String literals with quotes around them, numbers with decimal points, and comments with slashes all add up faster than prose does.
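To see where a line of code actually gets split, you can print the text fragment behind each token. This is a small sketch under stated assumptions: token_pieces and load_gpt4o_encoding are names invented for this example, and the second one assumes the third-party tiktoken package is installed. The encode and decode_single_token_bytes methods are part of tiktoken's Encoding objects.

```python
def token_pieces(text, enc):
    """Return the text fragment behind each token, in order.

    `enc` must offer encode() and decode_single_token_bytes(),
    as tiktoken's Encoding objects do.
    """
    return [
        # A token can end in the middle of a multi-byte character, so
        # undecodable bytes are shown as replacement characters.
        enc.decode_single_token_bytes(token).decode("utf-8", errors="replace")
        for token in enc.encode(text)
    ]


def load_gpt4o_encoding():
    """Load tiktoken's o200k_base encoding."""
    # Imported lazily so token_pieces itself has no third-party dependency.
    import tiktoken

    return tiktoken.get_encoding("o200k_base")
```

For example, token_pieces("const userAccountBalance = getUserBalance(user_id);", load_gpt4o_encoding()) returns a list of fragments you can read through to see how the identifiers and punctuation were divided.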

As a rough comparison, a paragraph of English prose might use about one token per four characters. A symbol-heavy block of code might use closer to one token per two characters, or even fewer characters per token. That's roughly double the token count for the same visual footprint on screen.
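Worked through with numbers, the comparison looks like this. The ratios are the rough figures from the paragraph above, not measurements, and estimate_tokens is a helper made up for the illustration.

```python
import math


def estimate_tokens(char_count, chars_per_token):
    """Estimate a token count from a character count and an assumed ratio."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(char_count / chars_per_token)


# The same 2,000 characters under the two rough ratios from the text.
prose_tokens = estimate_tokens(2000, 4)  # 500
code_tokens = estimate_tokens(2000, 2)   # 1000

print(prose_tokens, code_tokens, code_tokens / prose_tokens)
# 500 1000 2.0
```

An estimate like this is only useful for a first budget. Real ratios vary by language and coding style, so measure before committing to a number.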

This has real implications if you're building a coding assistant, a code review tool, a documentation generator, or anything else that processes source code. Your context window fills up faster than you'd expect, and your costs per request are higher than a word-based estimate would suggest.

Putting it together for your specific use case

The most useful thing you can do with this information is test your actual content rather than relying on general estimates. If your application handles Spanish text, grab a realistic sample and count its tokens. If users can send emojis, include some in your test. If you're processing code, use real code from your codebase rather than a simplified example.
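A small harness makes that kind of test repeatable. The sketch below assumes you have grouped sample messages by content type yourself. average_tokens is a name made up for this example, and the encode argument can be any tokenizer function, for instance the encode method of a tiktoken encoding if you use that library.

```python
def average_tokens(samples_by_type, encode):
    """Return the mean token count per sample for each content type.

    `samples_by_type` maps a label to a list of sample strings.
    `encode` is any callable that maps a string to a sequence of tokens.
    Content types with no samples are skipped.
    """
    averages = {}
    for content_type, samples in samples_by_type.items():
        if not samples:
            continue
        total = sum(len(encode(sample)) for sample in samples)
        averages[content_type] = total / len(samples)
    return averages
```

Feed it real messages from your logs, split into groups such as English prose, Spanish prose, emoji-heavy messages, and code, and compare the averages side by side.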

The differences between content types aren't small enough to ignore if you're running any kind of volume. Going by the rough figures above, a tool that handles customer support messages in Spanish with occasional emojis and code snippets could easily use half again as many tokens per message as a plain English text tool, and close to double when messages are code-heavy, and that multiplier goes straight to your API bill.
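The effect on a bill is simple arithmetic. In the sketch below every number is hypothetical, including the price per million tokens, the request volume, and the per-message token counts. It only counts input tokens, so treat it as a template to fill in with your own measurements and current pricing.

```python
def monthly_cost(requests_per_month, tokens_per_request, usd_per_million_tokens):
    """Return the monthly input-token cost in US dollars."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical figures, for illustration only.
PRICE_PER_MILLION = 2.50  # assumed price in USD, not a quoted rate
REQUESTS = 1_000_000

english_only = monthly_cost(REQUESTS, 300, PRICE_PER_MILLION)   # 750.0
mixed_content = monthly_cost(REQUESTS, 600, PRICE_PER_MILLION)  # 1500.0

print(english_only, mixed_content, mixed_content - english_only)
# 750.0 1500.0 750.0
```

The price never changed between the two lines. The entire difference comes from the token count per message.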

Knowing your actual token usage per request type is the foundation for making good decisions about which model to use, how to structure your prompts, and what your real costs are going to look like once you're in production. Everything else is just guessing.
