Anthropic CEO Dario Amodei published an essay on Thursday highlighting how little researchers understand about the inner workings of the world's leading AI models. To address that, Amodei has set an ambitious goal for Anthropic to reliably detect most AI model problems by 2027.
Amodei acknowledges the challenge ahead. In "The Urgency of Interpretability," the CEO says Anthropic has made early breakthroughs in tracing how models arrive at their answers, but emphasizes that far more research is needed to decode these systems as they become more powerful.
"I am very concerned about deploying such systems without a better handle on interpretability," Amodei writes in the essay. "These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work."
Anthropic is one of the pioneering companies in mechanistic interpretability, a field that aims to open the black box of AI models and understand why they make the decisions they do. Despite the rapid performance improvements of the tech industry's AI models, we still know relatively little about how these systems arrive at decisions.
For example, OpenAI recently launched new reasoning AI models, o3 and o4-mini, that perform better on some tasks but also hallucinate more than its other models. The company doesn't know why that's happening.
"When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does, why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate," Amodei wrote in the essay.
In the essay, Amodei notes that Anthropic co-founder Chris Olah says AI models are "grown more than they are built." In other words, AI researchers have found ways to improve AI model intelligence, but they don't quite know why those methods work.
In the essay, Amodei says it could be dangerous to reach AGI, or as he calls it, "a country of geniuses in a data center," without understanding how these models work. In a previous essay, Amodei argued that the tech industry could reach such a milestone by 2026 or 2027, but believes we are much further out from fully understanding these AI models.
In the long term, Amodei says Anthropic would like to conduct, essentially, "brain scans" or "MRIs" of state-of-the-art AI models. These checkups would help identify a wide range of issues in AI models, including their tendencies to lie or seek power, among other weaknesses. This could take five to ten years to achieve, but he added that such measures will be necessary to test and deploy Anthropic's future AI models.
Anthropic has made a few research breakthroughs that allow it to better understand how its AI models work. For example, the company recently found ways to trace an AI model's thinking pathways through what it calls circuits. Anthropic identified one circuit that helps AI models understand which U.S. cities are located in which U.S. states. The company has found only a few of these circuits so far, but estimates there are millions within AI models.
Anthropic has been investing in interpretability research itself, and recently made its first investment in a startup working on interpretability. While interpretability is largely seen as a field of safety research today, Amodei notes that, eventually, explaining how AI models arrive at their answers could offer a commercial advantage.
In the essay, Amodei called on OpenAI and Google DeepMind to increase their research efforts in this area. Beyond the friendly nudge, Anthropic's CEO asked governments to impose "light-touch" regulations to encourage interpretability research, such as requirements for companies to disclose their safety and security practices. Amodei also says the U.S. should put export controls on chips to China.
Anthropic has always stood out from OpenAI and Google for its focus on safety. While other tech companies pushed back on California's controversial AI safety bill, SB 1047, Anthropic issued modest support and recommendations for the bill, which would have set safety reporting standards for frontier AI model developers.
In this case, Anthropic appears to be pushing for an industry-wide effort to better understand AI models, not just to improve their capabilities.