Welcome to TechCrunch’s regular AI newsletter! We’re taking a short break, but you can find all of our AI coverage on TechCrunch, including my columns, our daily analysis, breaking news stories and more. If you want those stories and more in your inbox every day, sign up for our daily newsletter here.
This week, xAI, billionaire Elon Musk’s AI startup, released Grok 3, its latest flagship AI model, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including ones from OpenAI, on benchmarks for mathematics, programming and more.
But what do these benchmarks really tell you?
Here at TC, we often reluctantly report benchmark numbers because they are one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test esoteric knowledge and give aggregate scores that correlate poorly with proficiency on the tasks most people care about.
As Wharton professor Ethan Mollick noted in a series of posts on X after the announcement of Grok 3 on Monday, there is an urgent need for better batteries of tests and independent testing authorities. AI companies frequently self-report their benchmark results, which, as Mollick implies, makes those results even harder to accept at face value.
“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”
There is no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merits are far from settled within the industry. Some AI commentators and experts suggest aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.
This debate may rage until the end of time. Perhaps, as X user Roon suggests, we should instead simply pay less attention to new models and benchmarks, barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it induces some degree of AI FOMO.
As mentioned above, This Week in AI is going on hiatus. Thank you for sticking with us, readers, through this roller coaster of a journey. Until next time.
News

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom.”
Mira’s new startup: Thinking Machines Lab, a new startup from former OpenAI CTO Mira Murati, intends to build tools to “make AI work for [people’s] unique needs and goals.”
Grok 3 cometh: xAI, Elon Musk’s AI startup, has released its latest flagship AI model, Grok 3, and announced new features for the Grok apps for iOS and the web.
A very Llama conference: Meta will hold its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.
AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between around 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.
This week’s research paper
OpenAI researchers have created a new AI benchmark, SWE-Lancer, which aims to assess the coding capabilities of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks, ranging from bug fixes and feature deployments to “manager-level” technical implementation proposals.
According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scored 40.3% on the full SWE-Lancer benchmark, which suggests that AI still has quite a way to go. It is worth noting that the researchers did not benchmark newer models such as OpenAI’s o3-mini or R1 from Chinese AI company DeepSeek.
This week’s model
A Chinese AI company named Stepfun has released Step-Audio, an “open” AI model that can understand and generate speech in several languages. Step-Audio supports Chinese, English and Japanese, and lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.
Stepfun is one of several well-funded Chinese AI startups releasing models under permissive licenses. Founded in 2023, Stepfun reportedly closed a funding round worth hundreds of millions of dollars from many investors, including Chinese state-owned private equity firms.
Grab bag
Nous Research, an AI research group, has released what it claims is one of the first AI models to unify reasoning and “intuitive language model capabilities.”
The model, DeepHermes-3 Preview, lets you toggle long “chains of thought” on and off, improving accuracy at the cost of extra computation. In “reasoning” mode, DeepHermes-3 Preview, like other reasoning AI models, “thinks” longer on harder problems and shows the thought process it used to reach the answer.
Anthropic is reportedly planning to release an architecturally similar model soon, and OpenAI says such a model is on its near-term roadmap.