A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies game its leaderboard.
According to the authors, LM Arena allowed industry-leading AI companies such as Meta, OpenAI, Google, and Amazon to privately test several variants of their AI models, and to withhold the scores of the lowest performers. This made it easier for those companies to achieve top spots on the platform's leaderboard, though the opportunity was not extended to every firm.
“Only a handful of [companies] were told that this private testing was available, and some [companies] received far more private testing than others,” said Sara Hooker, Cohere's VP of AI research and a co-author of the study, in an interview with TechCrunch. “This is gamification.”
Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by placing answers from two different AI models side by side in a “battle” and asking users to choose the best one. It's not uncommon to see unreleased models competing in the arena under a pseudonym.
Votes over time contribute to a model's score, and consequently to its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is impartial and fair.
But that's not what the paper's authors say they found.
One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March in the lead-up to the tech giant's Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.
In an email to TechCrunch, LM Arena co-founder and UC Berkeley professor Ion Stoica said the study was full of “inaccuracies” and “questionable analysis.”
“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” LM Arena said in a statement provided to TechCrunch. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”
Armand Joulin, a principal researcher at Google DeepMind, noted in a post on X that some of the study's numbers were inaccurate, claiming Google sent only one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising that the authors would make a correction.
Allegedly favored labs
The paper's authors began conducting their research in November 2024, after learning that some AI companies might be getting preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.
The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a greater number of battles. This increased sampling rate gave those companies an unfair advantage, the authors argue.
Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, according to the paper. However, LM Arena said in a post on X that Arena Hard performance does not directly correlate with Chatbot Arena performance.
Hooker said it's unclear how certain AI companies received priority access, but that it's incumbent on LM Arena to increase its transparency regardless.
In a post on X, LM Arena said that several of the paper's claims don't reflect reality. The organization pointed to a blog post it published earlier this week, which indicates that models from non-major labs appear in more Chatbot Arena battles than the study suggests.
One important limitation of the study is that it relied on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models' answers to classify them, a method that isn't foolproof.
However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.
TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.
LM Arena in hot water
In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more “fair.” For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from those tests.
In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it makes no sense to show scores for pre-release models that aren't publicly available, because the AI community cannot test those models for itself.
The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been publicly receptive to this recommendation, and has indicated that it will create a new sampling algorithm.
The paper comes weeks after Meta was caught gaming benchmarks on Chatbot Arena around the launch of its aforementioned Llama 4 models. Meta optimized one of the Llama 4 models for “conversationality,” which helped it achieve an impressive score on the Chatbot Arena leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.
At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.
Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study adds to the scrutiny of private benchmark organizations, and of whether they can be trusted to assess AI models without corporate influence clouding the process.