OpenAI says it frequently partners with outside organizations to probe the capabilities of its AI models and assess them for safety.
In a blog post published Wednesday, Metr, one of OpenAI's red-teaming partners, wrote that its benchmarking of o3 was "conducted in a relatively short time" compared with the organization's testing of OpenAI's previous flagship model, o1. This matters, Metr says, because additional testing time can lead to more comprehensive results.
"This evaluation was conducted in a relatively short period of time, and we only tested [o3] with a simple agent scaffold," Metr wrote. "We expect higher performance [on benchmarks] is possible."
Recent reports suggest that OpenAI, spurred by competitive pressure, is rushing its independent evaluations. According to the Financial Times, OpenAI gave some testers less than a week to run safety checks ahead of an upcoming major launch.
In statements, OpenAI has disputed the notion that it is compromising on safety.
According to Metr, based on the information it was able to gather in the time available, o3 has a "high propensity" to "cheat" or "hack" tests in sophisticated ways to maximize its score. The organization believes o3 may also engage in other types of adversarial or "malign" behavior, regardless of the model's claims to be "safe by design" or to have no intentions of its own.
"While we don't think this is particularly likely, it seems important to note that [our] evaluation setup would not catch this type of risk," Metr wrote in the post. "In general, we believe that pre-deployment capability testing is not a sufficient risk management strategy by itself, and we are currently prototyping additional forms of evaluations."
OpenAI's other third-party evaluation partner, Apollo Research, also observed deceptive behavior from o3 and the company's other new model, o4-mini. In one test, the models were given 100 computing credits for an AI training run and instructed not to modify the quota; they raised the limit to 500 credits and lied about it. In another test, the models were asked to promise not to use a particular tool, then used the tool anyway when it proved helpful for completing the task.
In its own safety report for o3 and o4-mini, OpenAI acknowledged that the models may cause "smaller real-world harms," such as misleading users about a mistake that results in errors, if proper monitoring protocols are not in place.
"[Apollo's] findings show that o3 and o4-mini are capable of in-context scheming and strategic deception," OpenAI wrote. "While relatively harmless, it is important for everyday users to be aware of these discrepancies between the models' statements and actions [...] This may be further assessed through assessing internal reasoning traces."