Think Pokémon is a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Hao AI Lab, a research organization at the University of California, San Diego, threw AI models into live Super Mario Bros. games on Friday. Anthropic’s Claude 3.7 performed best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.
It wasn’t quite the same version of Super Mario Bros. as the original 1985 release, though. The game ran in an emulator and was integrated with GamingAgent, a framework that gives AIs control over Mario.

GamingAgent, Hao’s in-house framework, fed the AIs basic instructions, such as “If an obstacle or enemy is near, move/jump left to dodge.” The models then generated inputs in the form of Python code to control Mario.
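For a sense of what that looks like, here is a minimal, hypothetical sketch of the kind of Python a model might emit for the “jump left to dodge” instruction. It assumes a pyautogui-style keyboard binding with “x” mapped to the emulator’s jump button; GamingAgent’s actual interface and key mappings may differ.

```python
import time
import pyautogui  # assumed key-press interface; the framework's real binding may differ

def dodge_left(hold_seconds: float = 0.35) -> None:
    """Hypothetical action: jump left to dodge a nearby obstacle or enemy."""
    pyautogui.keyDown("left")   # move Mario away from the hazard
    pyautogui.keyDown("x")      # 'x' assumed mapped to the emulator's jump button
    time.sleep(hold_seconds)    # holding jump longer produces a higher arc
    pyautogui.keyUp("x")
    pyautogui.keyUp("left")

dodge_left()
```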
Still, Hao says the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models, which “think” through problems step by step, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.
One of the main reasons, the researchers say, is that reasoning models have a hard time playing real-time games like this: they take time to decide on actions. In Super Mario Bros., timing is everything. A second can mean the difference between a safely cleared jump and a plunge to your death.
Games have been used as AI benchmarks for decades. However, some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.
The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member of OpenAI, has called an “evaluation crisis.”
“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X.
At the very least, you can watch AI play Mario.