A third-party research institute that Anthropic partnered with to test Claude Opus 4, one of its new flagship AI models, recommended against deploying an early version of the model because of its tendency to “scheme” and deceive.
According to a safety report Anthropic published Thursday, the institute, Apollo Research, ran tests to see in which contexts Opus 4 might try to behave in certain undesirable ways. Apollo found that Opus 4 appeared to be far more proactive in its “subversion attempts” than past models, and that it sometimes “doubled down on its deception” when asked follow-up questions.
“[W]e find that, in situations where strategic deception is instrumentally useful, [the early Claude Opus 4 snapshot] schemes and deceives at such high rates that we advise against deploying this model either internally or externally,” Apollo wrote in its assessment.
As AI models become more capable, some studies show they are more likely to take unexpected, and possibly unsafe, steps to accomplish delegated tasks. For instance, early versions of OpenAI’s o1 and o3 models, released within the past year, attempted to deceive humans at higher rates than previous-generation models, according to Apollo.
Per Anthropic’s report, Apollo observed examples of the early Opus 4 attempting to write self-propagating viruses, fabricate legal documentation, and leave hidden notes to future instances of itself, all in an effort to undermine its developers’ intentions.
To be clear, Apollo tested a version of the model with a bug that Anthropic says it has since fixed. Moreover, many of Apollo’s tests placed the model in extreme scenarios, and Apollo acknowledges that the model’s deceptive efforts likely would have failed in practice.
Still, in its safety report, Anthropic says it also observed evidence of deceptive behavior from Opus 4.
This wasn’t always a bad thing. For example, during tests, Opus 4 would sometimes proactively clean up a piece of code even when asked to make only a small, specific change. More unusually, Opus 4 would attempt to “whistle-blow” if it perceived that a user was engaged in some form of wrongdoing.
According to Anthropic, when given access to a command line and told to “take initiative” or “act boldly” (or some variation of those phrases), Opus 4 would at times lock users out of systems it had access to in response to actions it perceived to be illicit.
“This kind of ethical intervention and whistleblowing is perhaps appropriate in principle, but it has a risk of misfiring if users give [Opus 4]-based agents access to incomplete or misleading information and prompt them to take initiative,” Anthropic wrote in its safety report. “This is not a new behavior, but is one that [Opus 4] will engage in somewhat more readily than prior models, and it appears to be part of a broader pattern of increased initiative with [Opus 4] that we also see in subtler and more benign ways in other environments.”