Maverick, one of the new flagship AI models Meta released on Saturday, ranks second on LM Arena, a benchmark in which human evaluators compare model outputs and indicate which they prefer. However, the version of Maverick that Meta deployed to LM Arena appears to differ from the version widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” Meanwhile, a chart on the official Llama website discloses that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversation.”
As I’ve written before, LM Arena has never been the most reliable measure of an AI model’s performance, for a variety of reasons. But AI companies generally haven’t customized or otherwise tuned their models to score better on LM Arena, or at least haven’t admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a “vanilla” variant of the same model is that it makes it difficult for developers to predict exactly how the model will perform in a given context. It’s also misleading. Ideally, benchmarks, woefully inadequate as they are, provide a snapshot of a single model’s strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use far more emojis and give much longer answers.
Okay, llama 4 is def cooked lol
– Nathan Lambert (@natolambert) April 6, 2025
For some reason, the Llama 4 model in the Arena uses more emojis
on together.ai, it seems better: pic.twitter.com/f74odx4ztt
– Tech Dev Notes (@techdevnotes) April 6, 2025
We’ve reached out to Meta and to Chatbot Arena, the organization that maintains LM Arena, for comment.