We Tested the Same Prompt on GPT-4o, Claude, and Gemini. Here Is What Actually Happened.

Javier Echeverria · 4 min read

There are a lot of model comparison articles on the internet, and most of them are sponsored, cherry-picked, or focused on benchmark scores that say little about how these models behave when you use them for real tasks. This is not that kind of article.

We ran a set of prompts through GPT-4o, Claude Sonnet, and Gemini 1.5 Pro and paid attention to the things that actually matter in day-to-day use: how well each model followed instructions, how the outputs compared in quality, how the token counts differed, and where each one surprised us in either direction.

What we tested and why

We chose five categories of tasks that represent a wide range of real use cases: summarization of a long technical document, creative writing with specific constraints, code generation for a practical problem, multi-step reasoning with a logic puzzle, and instruction following with an unusual format requirement.

These are not the kinds of tasks that show up in academic benchmarks. They're the kinds of things people actually use AI models for, and the differences between models on these tasks are more relevant to most users than scores on standardized tests.

Summarization: Claude pulled ahead, but not by much

For summarizing a long technical document with the instruction to focus on practical implications rather than technical details, all three models produced usable summaries. Claude's summary was the most focused on what we actually asked for, staying on the practical implications without drifting into technical explanation. GPT-4o produced a slightly longer summary that included more technical detail than requested. Gemini's summary was the most concise but missed one of the key practical points from the document.

The token counts for the outputs were noticeably different. Claude's response was shorter than GPT-4o's while being the most on-target. GPT-4o's was the longest. Gemini's was the shortest overall but also the least complete.

Creative writing: GPT-4o felt more natural

For a creative writing task with specific constraints (a product description written in the voice of a tired but enthusiastic late-night infomercial host, under 150 words), GPT-4o produced something that felt the most natural and actually made us laugh. Claude's version was technically correct and hit all the constraints but felt slightly more polished and less chaotic than the brief called for. Gemini's version missed the tone almost entirely and produced something that read like a normal product description with a few exclamation marks added.

This pattern held across several creative tasks we tested. GPT-4o seems to have a better feel for voice and tone, especially for unconventional or humorous briefs. Claude is more reliable but can lean toward being well-behaved in ways that don't always serve creative tasks.

Code generation: all three were solid, with different failure modes

For a practical coding task, generating a function that parses a specific data format with error handling, all three models produced working code. The differences were in how they handled edge cases and how they explained their choices.

Claude's code was the most thoroughly commented and included the most complete error handling by default. GPT-4o's code was clean and worked well but had less defensive handling of edge cases. Gemini's code worked for the happy path but required the most follow-up prompting to get robust error handling added.
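To make the "happy path" versus "defensive" distinction concrete, here is a minimal sketch of our own. It is not output from any of the three models, and the `key=integer` line format is a hypothetical stand-in, since the exact format we tested is not reproduced here. The first function is the kind of code that works on clean input only; the second handles blank lines, malformed lines, bad values, and duplicate keys.

```python
def parse_happy_path(text: str) -> dict[str, int]:
    """Works only when every line is a well-formed `key=integer` pair."""
    return {k: int(v) for k, v in (line.split("=") for line in text.splitlines())}


def parse_defensive(text: str) -> tuple[dict[str, int], list[str]]:
    """Parse `key=integer` lines, collecting problems instead of crashing.

    The format itself is an assumption made for illustration.
    """
    values: dict[str, int] = {}
    errors: list[str] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            errors.append(f"line {lineno}: expected key=value, got {raw!r}")
            continue
        try:
            number = int(value)
        except ValueError:
            errors.append(f"line {lineno}: {value!r} is not an integer")
            continue
        if key in values:
            errors.append(f"line {lineno}: duplicate key {key!r}")
            continue
        values[key] = number
    return values, errors
```

On clean input both functions return the same mapping. On input with a blank line, the first raises an exception while the second skips the line and keeps going, which is roughly the gap we had to close with follow-up prompts.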

According to Bloomberg's reporting on enterprise AI adoption patterns, code generation is consistently one of the top use cases driving API usage, and the differences in how models handle edge cases and error conditions are often more important to developers than headline capability differences.

Reasoning: the gap was most visible here

The logic puzzle we used involved multiple interdependent constraints and required tracking several conditions simultaneously. This is where the differences between models were most pronounced.

Claude worked through the problem the most systematically, showed its reasoning clearly at each step, and arrived at the correct answer. GPT-4o also got the correct answer but the reasoning was less explicit and harder to follow. Gemini got partway through the problem correctly and then made an error in one of the later steps that led to a wrong conclusion.

This doesn't mean Gemini is bad at reasoning broadly. One problem is not a representative sample. But the pattern of Claude being more systematic about showing its work on multi-step reasoning tasks was consistent across the other reasoning tasks we tried.

Instruction following: Claude was the most precise

The unusual format task asked each model to respond to a question using exactly three sentences, where the first sentence had to be a question, the second had to be a statistic, and the third had to be a recommendation. This is an arbitrary constraint but it's a good test of how precisely a model follows specific structural instructions.

Claude followed all three constraints correctly on the first try. GPT-4o got the structure right, but its statistic sentence made a vague claim instead of citing an actual number. Gemini produced four sentences, and their order didn't match the requirement.
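A structural constraint like this can be partly checked by a script rather than by eye. The sketch below is one rough way to do it, not the method we used. The sentence splitter is deliberately naive (it will misfire on abbreviations such as "e.g."), "contains a digit" is only a proxy for "is a statistic", and whether the third sentence is really a recommendation is left to a human reader. The function names are our own.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_format(response: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    sentences = split_sentences(response)
    if len(sentences) != 3:
        return [f"expected 3 sentences, found {len(sentences)}"]

    problems: list[str] = []
    first, second, _third = sentences
    if not first.endswith("?"):
        problems.append("sentence 1 is not a question")
    if not re.search(r"\d", second):
        # Proxy only: a digit does not guarantee a real statistic.
        problems.append("sentence 2 contains no number")
    # Sentence 3 (the recommendation) is not checked automatically.
    return problems
```

A check like this would flag both failure types described above: a response with four sentences fails the count, and a response whose second sentence has no number fails the digit test.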

What this actually means for choosing a model

The honest conclusion from our testing is that there is no universally best model, because the right choice depends on what you're doing. Claude is the most reliable choice for tasks that require precise instruction following, systematic reasoning, and handling long documents. GPT-4o is the best choice for creative tasks, conversational feel, and cases where natural-sounding output matters more than strict precision. Gemini is the most cost-effective option for high-volume tasks where the quality bar is somewhat more flexible, and its performance is improving rapidly.

For most people building applications, the answer is to test your specific use case on all three rather than relying on anyone else's comparison, including this one. The differences in how each model handles your particular prompts and content types are more informative than any general benchmark.
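If you want to run that kind of side-by-side test yourself, a small harness is enough to get started. The sketch below assumes you wrap each provider's client in a plain function that takes a prompt and returns text; those wrappers are not shown, and `run_comparison` and the placeholder model names are hypothetical. Word count is used as a rough stand-in for output length, not as a real token count.

```python
from typing import Callable

# Any function that takes a prompt string and returns the model's text.
ModelFn = Callable[[str], str]


def run_comparison(
    prompts: dict[str, str], models: dict[str, ModelFn]
) -> list[dict[str, object]]:
    """Send every prompt to every model and collect the outputs in one table."""
    rows: list[dict[str, object]] = []
    for task, prompt in prompts.items():
        for name, call in models.items():
            output = call(prompt)
            rows.append(
                {
                    "task": task,
                    "model": name,
                    "words": len(output.split()),  # rough length proxy, not tokens
                    "output": output,
                }
            )
    return rows


if __name__ == "__main__":
    # Stub models so the sketch runs without any API keys.
    stubs: dict[str, ModelFn] = {
        "model_a": lambda p: f"Short answer to: {p}",
        "model_b": lambda p: f"A somewhat longer answer to the prompt: {p}",
    }
    for row in run_comparison({"summary": "Summarize this document."}, stubs):
        print(row["task"], row["model"], row["words"])
```

Swapping the stubs for real client wrappers keeps the rest of the loop unchanged, so the same prompts and the same scoring apply to every model.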

The Tokens per Dollar Calculator on Prompt Toolbox is useful here because once you've decided on a model, it helps you understand what that choice means for your costs at the volume you're planning to run.
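The arithmetic behind a cost estimate like this is simple enough to sketch. The version below assumes pricing is quoted per million tokens with separate input and output rates. The function names are our own, any prices you pass in are placeholders you would take from the provider's current price list, and this is not a description of how the calculator itself works.

```python
def tokens_per_dollar(price_per_million_tokens: float) -> float:
    """How many tokens one dollar buys at a given per-million-token price."""
    if price_per_million_tokens <= 0:
        raise ValueError("price must be positive")
    return 1_000_000 / price_per_million_tokens


def monthly_cost(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated monthly spend, assuming every request has the same token counts."""
    per_request = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return requests_per_month * per_request
```

Because output tokens are often priced differently from input tokens, a model that writes longer responses can cost more per request even when its listed input price is lower, which is why the output length differences noted earlier matter at volume.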
