AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai said in October that 25% of the company’s new code is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to broadly deploy AI coding models within the social media giant.
But even some of today’s best models struggle to resolve software bugs that wouldn’t trip up experienced developers.
A new study from Microsoft Research, Microsoft’s R&D division, reveals that models including Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3-mini fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI still can’t match human experts in domains such as coding.
The study’s co-authors tested nine different models as the backbone of a “single prompt-based agent” that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
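To make the setup concrete, here is a minimal sketch in Python of what such a prompt-based debugging agent loop could look like. The tool dispatch, the action format, and the query_model call are hypothetical illustrations, not the study’s actual harness.

```python
import subprocess

def run_tests(repo_path: str) -> str:
    """Run the project's test suite and return its combined output."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=short"],
        cwd=repo_path, capture_output=True, text=True,
    )
    return result.stdout + result.stderr

def pdb_session(repo_path: str, commands: list[str]) -> str:
    """Replay a scripted sequence of pdb commands against a failing test run."""
    result = subprocess.run(
        ["python", "-m", "pdb", "-m", "pytest", "-x"],
        cwd=repo_path, input="\n".join(commands) + "\nquit\n",
        capture_output=True, text=True,
    )
    return result.stdout

def debug_task(repo_path: str, issue: str, query_model, max_steps: int = 10):
    """Single agent loop: show the model the latest observation, let it
    choose a tool, and stop once it proposes a patch or runs out of steps."""
    observation = run_tests(repo_path)
    for _ in range(max_steps):
        # query_model is a stand-in for the backbone LLM call; it is assumed
        # to return either {"tool": "pdb", "commands": [...]} or
        # {"tool": "patch", "diff": "..."}.
        action = query_model(issue=issue, observation=observation)
        if action["tool"] == "pdb":
            observation = pdb_session(repo_path, action["commands"])
        elif action["tool"] == "patch":
            return action["diff"]  # candidate fix, to be validated by the tests
    return None
```

The key design point the study highlights is interactivity: rather than producing a patch in one shot, the agent can probe the running program with the debugger before committing to a fix.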
According to the co-authors, the agents rarely completed more than half of the debugging tasks successfully, even when built on more capable, recent models. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI’s o1 (30.2%) and o3-mini (22.1%).

Why the underwhelming performance? Some models struggled to use the debugging tools available to them and to understand how different tools might help with different issues. The bigger problem, though, according to the co-authors, was data scarcity. They speculate that current models’ training data doesn’t contain enough examples of “sequential decision-making processes,” that is, human debugging traces.
“We strongly believe that training or fine-tuning [models] can make them better interactive debuggers,” the co-authors wrote in their study. “However, this will require specialized data to fulfill such model training, for example, trajectory data that records agents interacting with a debugger to collect the necessary information before suggesting a bug fix.”
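For illustration, a single record of such trajectory data might look something like the following. The field names and shape here are hypothetical; the study doesn’t specify a format.

```python
# Hypothetical shape of one fine-tuning record: the sequence of debugger
# interactions that preceded a successful fix. All values are illustrative.
trajectory = {
    "issue": "TypeError raised when parsing an empty config file",
    "steps": [
        {"tool": "pdb", "command": "b parser.py:42", "output": "Breakpoint 1 at parser.py:42"},
        {"tool": "pdb", "command": "c", "output": "> parser.py(42)load()"},
        {"tool": "pdb", "command": "p raw_text", "output": "''"},
    ],
    "patch": "--- a/parser.py\n+++ b/parser.py\n@@ ... @@",
    "resolved": True,
}
```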
The findings aren’t exactly shocking. Plenty of studies have shown that code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas like the ability to understand programming logic. One recent evaluation of Devin, a popular AI coding tool, found that it could complete only three out of 20 programming tests.
But Microsoft’s work is one of the more detailed looks yet at a persistent problem area for these models. It likely won’t dampen investor enthusiasm for AI-powered coding assistive tools, but with any luck, it’ll make developers, and their higher-ups, think twice about letting AI run the coding show.
For what it’s worth, a growing number of tech leaders have challenged the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates has said he thinks programming as a profession is here to stay. So have Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.