AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say this approach has serious problems from an ethical and academic perspective.
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate the capabilities of their upcoming models. When a model scores favorably, the lab behind it often touts that score as evidence of a meaningful improvement.
It’s a flawed approach, however, according to Emily Bender, a professor of linguistics at the University of Washington and co-author of the book “The AI Con.” Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer.
“To be valid, a benchmark needs to measure something specific, and it needs to have construct validity: that is, there has to be evidence that the construct of interest is well defined and that the measurements actually relate to it,” Bender said. “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however they may be defined.”
Asmelash Teka Hadgu, co-founder of the AI firm Lesan and a fellow at the Distributed AI Research Institute, said he believes benchmarks like Chatbot Arena are being “co-opted” by AI labs to “promote exaggerated claims.” Hadgu pointed to a recent controversy involving Meta’s Llama 4 Maverick model: Meta tuned a version of Maverick to score well on Chatbot Arena, only to withhold that model in favor of releasing a worse-performing version.
“Benchmarks should be dynamic, not static datasets,” Hadgu said.
Hadgu and Kristine Gloria, who previously led the Aspen Institute’s Emergent and Intelligent Technologies Initiative, also argued that model evaluators should be compensated for their work. Gloria said AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)
“In general, the crowdsourced benchmarking process is valuable, and it reminds me of citizen science initiatives,” Gloria said. “Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of data. But benchmarks should never be the only metric for evaluation. With the industry and innovation moving quickly, benchmarks can rapidly become unreliable.”
Matt Fredrikson, CEO of Gray Swan AI, which runs crowdsourced red-teaming campaigns for models, said volunteers are drawn to Gray Swan’s platform for a variety of reasons, including “learning and practicing new skills.” (Gray Swan also awards cash prizes for some tests.) Still, he acknowledged that public benchmarks are not a “substitute” for “private” evaluations.
“[D]evelopers also need to rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise,” Fredrikson said. “It’s important for both model developers and benchmark creators, crowdsourced or otherwise, to communicate results clearly to those who follow them, and to be responsive when those results are called into question.”
Alex Atallah, CEO of the model marketplace OpenRouter, which recently partnered with OpenAI to give users early access to OpenAI’s GPT-4.1 models, said that open testing and benchmarking of models alone is “not enough.” So did Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LM Arena, which maintains Chatbot Arena.
“We certainly support the use of other tests,” Chiang said. “Our goal is to create a trustworthy, open space that measures our community’s preferences about different AI models.”
Chiang said incidents such as the Maverick benchmark discrepancy are not the result of flaws in Chatbot Arena’s design, but rather of labs misinterpreting its policies. LM Arena has taken steps to prevent future discrepancies, Chiang said, including updating its policies to “reinforce our commitment to fair, reproducible evaluations.”
“Our community isn’t here as volunteers or model testers,” Chiang said. “People use LM Arena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects our community’s voice, we welcome it being shared.”