Have researchers discovered a new AI “scaling law”? That’s what some of the buzz on social media suggests, but experts are skeptical.
AI scaling laws, a somewhat informal concept, describe how AI models’ performance improves as the size of the datasets and the computing resources used to train them grow. Until roughly a year ago, scaling up “pretraining” (training ever-larger models on ever-larger datasets) was by far the dominant law, at least in the sense that most frontier AI labs embraced it.
Pretraining hasn’t gone away, but two additional scaling laws have emerged to complement it: post-training scaling and test-time scaling. Post-training scaling essentially tunes a model’s behavior, while test-time scaling entails applying more compute at inference to drive a form of “reasoning” (see models like R1).
Researchers at Google and UC Berkeley recently proposed in a paper what some online commentators have described as a fourth law: “inference-time search.”
With inference-time search, a model generates many possible answers to a query in parallel and then selects the “best” of the bunch. The researchers claim the technique can boost the performance of a year-old model, like Google’s Gemini 1.5 Pro, to a level that surpasses OpenAI’s o1-preview “reasoning” model on science and math benchmarks.
Our paper focuses on this search axis and its scaling trends. For example, by just randomly sampling 200 responses and self-verifying, Gemini 1.5 (an ancient early 2024 model!) beats o1-preview and approaches o1. This is without finetuning, RL, or ground-truth verifiers. pic.twitter.com/hb5fo7ifnh
– March 17, 2025
“[B]y just randomly sampling 200 responses and self-verifying, Gemini 1.5 (an ancient early 2024 model) beats o1-preview and approaches o1,” wrote Eric Zhao, a Google researcher and one of the paper’s co-authors, in a series of posts on X. Counterintuitively, he added, self-verification becomes easier, not harder, as the pool of sampled solutions grows.
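As a rough illustration of the idea, here is a minimal sketch of sampling-plus-self-verification in Python. It is not the paper’s implementation: the generate and self_verify callables, the toy arithmetic task, and the sample count are placeholders standing in for a real model and the authors’ actual prompts and scoring, and a real system would draw the samples in parallel.

```python
import random
from typing import Callable, List, Tuple

def inference_time_search(
    query: str,
    generate: Callable[[str], str],            # draws one candidate answer from the model
    self_verify: Callable[[str, str], float],  # model's own score for a candidate (higher = better)
    num_samples: int = 200,
) -> str:
    """Sample many candidate answers and keep the one that self-verification
    rates highest. A sketch of the general technique, not the paper's method."""
    candidates: List[str] = [generate(query) for _ in range(num_samples)]
    scored: List[Tuple[float, str]] = [(self_verify(query, c), c) for c in candidates]
    _, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without an actual LLM. In the setting the
    # paper describes, the verifier is the model judging its own answer, not a
    # ground-truth check like the one below.
    def toy_generate(q: str) -> str:
        return str(17 * 23 + random.choice([0, 0, 0, -1, 1]))  # sometimes off by one

    def toy_verify(q: str, candidate: str) -> float:
        return 1.0 if candidate == str(17 * 23) else 0.0

    print(inference_time_search("What is 17 * 23?", toy_generate, toy_verify, num_samples=20))
```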
Some experts say the results aren’t surprising, however, and that inference-time search may not be useful in many scenarios.
Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, told TechCrunch that the approach works best when there’s a good “evaluation function,” that is, when the best answer to a question can be easily ascertained. But most queries aren’t that cut-and-dried.
“[I]f we can’t write code to define what we want, we can’t use [inference-time] search,” he said. “We can’t do this for things like general language interaction [...] It’s not a great approach to actually solving most problems.”
Zhao pushed back slightly on Guzdial’s claims.
“[O]ur paper actually focuses on settings where you don’t have access to an ‘evaluation function’ or ‘code to define what we want,’” Zhao said. “Instead, we study the case where the evaluation is something [the model] needs to figure out by trying to verify itself. In fact, a main point of our paper is that the gap between this regime and the regime where you do have one [...] can be neatly reduced with scale.”
Mike Cook, a researcher at King’s College London specializing in AI, agreed with Guzdial’s assessment, however, adding that it highlights the gap between “reasoning” in the AI sense of the word and human thought processes.
“[Inference-time search] doesn’t ‘enhance the reasoning process’ of the model,” Cook said. “[I]t’s just a way of working around the limitations of a technology prone to making very confidently supported mistakes.” Intuitively, he noted, if a model makes a mistake 5% of the time, checking many sampled attempts at the same problem makes those mistakes easier to spot.
Still, the fact that inference-time search may have limitations is unwelcome news for an AI industry looking to scale up model “reasoning” compute-efficiently. As the paper’s co-authors note, reasoning models today can rack up thousands of dollars’ worth of compute on a single math problem.
It seems the search for new scaling techniques will continue.
Updated 3/20, 5:12 a.m. Pacific: Added a comment from Eric Zhao, a co-author of the study.