AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai said in October that 25% of the company’s new code is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to broadly deploy AI coding models within the social media giant.
But even some of today’s best models struggle to resolve software bugs that wouldn’t trip up experienced developers.
A new study from Microsoft Research, Microsoft’s R&D division, reveals that models including Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3-mini fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI still can’t match human experts in domains such as coding.
The study’s co-authors tested nine different models as the backbone of a “single prompt-based agent” that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
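To make the setup concrete, here is a minimal sketch in Python of what such a prompt-based debugging agent loop could look like. The tool dispatch, the action format, and the query_model call are hypothetical illustrations, not the study’s actual harness.

```python
import subprocess

def run_tests(repo_path: str) -> str:
    """Run the project's test suite and return its combined output."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=short"],
        cwd=repo_path, capture_output=True, text=True,
    )
    return result.stdout + result.stderr

def pdb_session(repo_path: str, commands: list[str]) -> str:
    """Replay a scripted sequence of pdb commands against a failing test run."""
    result = subprocess.run(
        ["python", "-m", "pdb", "-m", "pytest", "-x"],
        cwd=repo_path, input="\n".join(commands) + "\nquit\n",
        capture_output=True, text=True,
    )
    return result.stdout

def debug_task(repo_path: str, issue: str, query_model, max_steps: int = 10):
    """Single agent loop: show the model the latest observation, let it
    choose a tool, and stop once it proposes a patch or runs out of steps."""
    observation = run_tests(repo_path)
    for _ in range(max_steps):
        # query_model is a stand-in for the backbone LLM call; it is assumed
        # to return either {"tool": "pdb", "commands": [...]} or
        # {"tool": "patch", "diff": "..."}.
        action = query_model(issue=issue, observation=observation)
        if action["tool"] == "pdb":
            observation = pdb_session(repo_path, action["commands"])
        elif action["tool"] == "patch":
            return action["diff"]  # candidate fix, to be validated by the tests
    return None
```

The key design point the study highlights is interactivity: rather than producing a patch in one shot, the agent can probe the running program with the debugger before committing to a fix.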
According to the co-authors, the agents rarely completed more than half of the debugging tasks successfully, even when built on more capable, recent models. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI’s o1 (30.2%) and o3-mini (22.1%).

Why the underwhelming performance? Some models struggled to use the debugging tools available to them and to understand how different tools might help with different issues. The bigger problem, though, according to the co-authors, was data scarcity. They speculate that current models’ training data doesn’t contain enough examples of “sequential decision-making processes,” that is, human debugging traces.
“We strongly believe that training or fine-tuning [models] can make them better interactive debuggers,” the co-authors wrote in their study. “However, this will require specialized data to fulfill such model training, for example, trajectory data that records agents interacting with a debugger to collect the necessary information before suggesting a bug fix.”
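For illustration, a single record of such trajectory data might look something like the following. The field names and shape here are hypothetical; the study doesn’t specify a format.

```python
# Hypothetical shape of one fine-tuning record: the sequence of debugger
# interactions that preceded a successful fix. All values are illustrative.
trajectory = {
    "issue": "TypeError raised when parsing an empty config file",
    "steps": [
        {"tool": "pdb", "command": "b parser.py:42", "output": "Breakpoint 1 at parser.py:42"},
        {"tool": "pdb", "command": "c", "output": "> parser.py(42)load()"},
        {"tool": "pdb", "command": "p raw_text", "output": "''"},
    ],
    "patch": "--- a/parser.py\n+++ b/parser.py\n@@ ... @@",
    "resolved": True,
}
```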
The findings aren’t exactly shocking. Plenty of studies have shown that code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas like the ability to understand programming logic. One recent evaluation of Devin, a popular AI coding tool, found that it could complete only three out of 20 programming tests.
But Microsoft’s work is one of the more detailed looks yet at a persistent problem area for these models. It likely won’t dampen investor enthusiasm for AI-powered coding assistive tools, but with any luck, it’ll make developers, and their higher-ups, think twice about letting AI run the coding show.
For what it’s worth, a growing number of tech leaders have challenged the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates has said he thinks programming as a profession is here to stay. So have Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.