This Week in AI: You should ignore AI benchmarks for now

Welcome to TechCrunch’s regular AI newsletter! We’re taking a short break, but you can find all of our AI coverage on TechCrunch, including my columns, our daily analysis, breaking news stories and more. If you want those stories and more in your inbox every day, sign up for our daily newsletter here.

This week, xAI, billionaire Elon Musk’s AI startup, released Grok 3, its latest flagship AI model, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including some from OpenAI, on benchmarks in areas such as mathematics and programming.

But what do these benchmarks really tell you?

Here at TC, we often reluctantly report benchmark numbers because they are one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test esoteric knowledge and give aggregate scores that correlate poorly with proficiency on the tasks most people care about.

As Wharton professor Ethan Mollick noted in a series of posts on X after the unveiling of Grok 3 on Monday, there is an urgent need for better batteries of tests and independent testing authorities. AI companies self-report benchmark results more often than not, as Mollick alluded, which makes those results even harder to accept at face value.

Public benchmarks are both “meh” and saturated, Mollick wrote, leaving much of AI testing to resemble food reviews based on taste. If AI is critical to work, he added, more is needed.

There is no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merits are far from settled within the industry. Some AI commentators and experts suggest aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps, as X user Roon suggests, we should simply pay less attention to new models and benchmarks unless there is a major AI technical breakthrough. For our collective sanity, that may not be the worst idea, even if it induces some degree of AI FOMO.

As mentioned above, this newsletter is going on hiatus. Thank you for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

Image credits: Nathan Laine/Bloomberg/Getty Images

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom.”

Mira’s new startup: Thinking Machines Lab, a new startup from former OpenAI CTO Mira Murati, intends to build tools that make AI work for people’s “unique needs and goals.”

Grok 3 cometh: xAI, Elon Musk’s AI startup, has released its latest flagship AI model, Grok 3, and announced new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will hold its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between approximately 20 organizations to build a series of foundation models for transparent AI in Europe that preserves the “linguistic and cultural diversity” of all EU languages.

This week’s research paper

The OpenAI ChatGPT website displayed on a laptop screen is seen in this illustration photo. Image credits: Jakub Porzycki / NurPhoto / Getty Images

OpenAI researchers have created a new AI benchmark, SWE-Lancer, which aims to evaluate the coding capabilities of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks, ranging from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scored 40.3% on the full SWE-Lancer benchmark, which suggests that AI still has quite a way to go. It is worth noting that the researchers did not benchmark newer models such as OpenAI’s o3-mini or R1 from Chinese AI company DeepSeek.

This week’s model

A Chinese AI company named Stepfun has released Step-Audio, an “open” AI model that can understand and generate speech in several languages. Step-Audio supports Chinese, English and Japanese, and lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.

Stepfun is one of several funded Chinese AI startups releasing models under permissive licenses. Founded in 2023, Stepfun reportedly closed a funding round worth hundreds of millions of dollars from a number of investors, including Chinese state-owned private equity firms.

Grab bag

Nous Research DeepHermes. Image credit: Nous Research

Nous Research, an AI research group, has released what it claims is one of the first AI models to unify reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle long “chains of thought” on and off, improving accuracy at the cost of extra computation. In “reasoning” mode, like other reasoning AI models, DeepHermes-3 Preview “thinks” longer on harder problems and shows its thought process as it arrives at an answer.

Anthropic is reportedly planning to release an architecturally similar model soon, and OpenAI says such a model is on its near-term roadmap.
