Even Pokémon isn't safe from AI benchmarking controversy.
Last week, a post on X went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town in a developer's Twitch stream; Claude was still stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude in Pokémon atm after reaching Lavender Town
Only 119 live viewers, highly underrated stream pic.twitter.com/8avsovai4x
– Jush (@jush21e8) April 10, 2025
But what the post didn't mention was that Gemini had an advantage.
As Reddit users pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, like cuttable trees. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
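To give a rough sense of what that kind of scaffolding can look like, here is a minimal, hypothetical sketch, not the stream's actual code, in which a tile grid is rendered as text the model can read directly instead of having to parse raw screenshots. All names and tile labels are illustrative assumptions.

```python
# Hypothetical minimap scaffold: turns game map tiles into text for a model prompt.
# None of these names come from the real Gemini Plays Pokémon harness.
from dataclasses import dataclass

# Assumed labels a scaffold might assign to tile IDs read from the game's map state.
TILE_LABELS = {0: "walkable", 1: "wall", 2: "water", 3: "cuttable_tree", 4: "npc"}

@dataclass
class Minimap:
    grid: list              # 2D list of tile IDs
    player_pos: tuple       # (row, col) of the player

    def to_prompt(self) -> str:
        """Render the tile grid as text so the model can skip pixel analysis."""
        lines = []
        for r, row in enumerate(self.grid):
            cells = []
            for c, tile in enumerate(row):
                label = TILE_LABELS.get(tile, "unknown")
                if (r, c) == self.player_pos:
                    label = f"PLAYER({label})"
                cells.append(label)
            lines.append(" | ".join(cells))
        return "Minimap (row by row):\n" + "\n".join(lines)

# Example: a 3x3 patch with a cuttable tree to the player's right.
minimap = Minimap(
    grid=[[1, 1, 1],
          [0, 0, 3],
          [0, 0, 0]],
    player_pos=(1, 1),
)
print(minimap.to_prompt())  # This text would be appended to the model's prompt.
```

The point of a scaffold like this is that the hard perception work happens outside the model, which is exactly why two runs of the "same" benchmark can produce very different results.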
To be clear, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.
For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a specific benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. Which is to say, it seems unlikely that comparing models will get any easier as they're released.