When OpenAI unveiled its o3 “reasoning” AI model in December, the company partnered with the creators of ARC-AGI, a benchmark designed to test highly capable AI, to showcase o3’s capabilities. A few months later, the results have been revised, and they now look slightly less impressive than they did initially.
Last week, the Arc Prize Foundation, which maintains and administers ARC-AGI, updated its approximate computing costs for o3. The organization originally estimated that o3 high, the best-performing configuration of o3 it tested, cost around $3,000 to solve a single ARC-AGI problem. The Arc Prize Foundation now thinks the cost is much higher, possibly around $30,000 per task.
The revision is notable because it illustrates just how expensive today’s most sophisticated AI models may end up being for certain tasks, at least early on. OpenAI has yet to price o3, or even release it. However, the Arc Prize Foundation believes OpenAI’s o1-pro model pricing is a reasonable proxy.
For context, o1-pro is OpenAI’s most expensive model to date.
“We believe o1-pro is a closer comparison of true o3 cost due to the amount of test-time compute used,” Mike Knoop, one of the co-founders of the Arc Prize Foundation, told TechCrunch. “But this is still a proxy, and we’ve kept o3 labeled as preview on our leaderboard to reflect the uncertainty until official pricing is announced.”
Given the amount of computing resources the model reportedly uses, a high price for o3 high would not be out of the question. According to the Arc Prize Foundation, o3 high used 172 times more computing to tackle ARC-AGI than o3 low, the lowest-compute configuration of o3.
Additionally, rumors have been flying for quite some time about pricey plans OpenAI is considering introducing for enterprise customers. In early March, The Information reported that the company may be planning to charge up to $20,000 a month for specialized AI “agents,” such as a software developer agent.
Some may argue that even OpenAI’s most expensive models would cost well under what a typical human contractor or staffer would command. However, as AI researcher Toby Ord pointed out in a post on X, the models may not be as efficient. For example, o3 high needed 1,024 attempts at each ARC-AGI task to achieve its best score.