Until recently, the organization behind one of the more popular mathematical benchmarks for AI did not disclose that it had received funding from OpenAI, leading some in the AI community to accuse it of impropriety.
Epoch AI, a nonprofit primarily funded by the research and grant-making organization Open Philanthropy, revealed on December 20 that OpenAI supported the creation of FrontierMath, a test of expert-level problems designed to measure an AI's mathematical skill. FrontierMath was one of the benchmarks OpenAI used to demonstrate o3, its upcoming flagship AI.
An Epoch AI contractor who goes by the username “Meemi” said in a post on the forum LessWrong that many contributors to the FrontierMath benchmark were not informed of OpenAI’s involvement until it was made public.
“Communication regarding this was not transparent,” Meemi wrote. “In my view, Epoch AI should have disclosed its funding from OpenAI, and contractors should have had transparent information about the potential for their work to be used for capabilities when choosing whether to work on the benchmark.”
On social media, some users raised concerns that the secrecy could erode FrontierMath’s reputation as an objective benchmark. In addition to funding FrontierMath, OpenAI had visibility into many of the benchmark’s problems and solutions, a fact Epoch AI did not disclose until December 20, the day o3 was announced.
Carina Hong, a mathematics PhD student at Stanford University, also said in a post on X that OpenAI has privileged access to FrontierMath as a result of the arrangement, and that this did not sit well with some contributors.
“Six mathematicians who contributed significantly to the FrontierMath benchmark confirmed to me that they were unaware that OpenAI would have exclusive access to this benchmark (and that others would not),” Hong said. “Most said they were not sure they would have contributed had they known.”
In a reply to Meemi’s post, Tamay Besiroglu, associate director of Epoch AI and one of the organization’s co-founders, insisted that FrontierMath’s integrity was intact, but admitted that Epoch AI “made a mistake” in not being more transparent.
“We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent with the benchmark contributors as soon as possible,” Besiroglu wrote. “Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with contributors a non-negotiable part of our agreement with OpenAI.”
Besiroglu added that while OpenAI has access to FrontierMath, it has a “verbal agreement” with Epoch AI not to use FrontierMath’s problem set to train its AI. (Training an AI on FrontierMath would be akin to teaching to the test.) Epoch AI also maintains a “separate holdout set” that serves as an additional safeguard for independently verifying FrontierMath benchmark results, Besiroglu said.
“OpenAI … has been fully supportive of our decision to maintain a separate, unseen holdout set,” Besiroglu wrote.
Muddying the situation, however, Epoch AI’s lead mathematician, Elliot Glazer, noted in a Reddit post that Epoch AI has not yet been able to independently verify OpenAI’s FrontierMath results for o3.
“My personal opinion is that [OpenAI’s] score is legitimate (i.e., they did not train on the dataset), and that they have no incentive to lie about internal benchmarking performance,” Glazer said. “However, we can’t vouch for the results until our independent evaluation is complete.”
The saga is yet another example of the challenge of developing empirical benchmarks to evaluate AI, and of securing the resources needed for benchmark development without creating the perception of a conflict of interest.