Every Sunday, NPR host Will Shortz, crossword puzzle editor of The New York Times, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. Though written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.
That’s why some experts see them as a promising way to test the limits of AI’s problem-solving abilities.
In a new study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and the startup Cursor used riddles from Sunday Puzzle episodes to create an AI benchmark. The team says their test revealed surprising insights, such as that so-called reasoning models (OpenAI’s o1, among others) sometimes give up and supply answers they know to be wrong.
“We wanted to develop a benchmark with problems that humans can understand with just general knowledge,” Arjun Guha, a computer science professor at Northeastern and one of the study’s co-authors, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. The tests most commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased so that models can’t draw on “rote memory” to solve them, Guha explained.
“What makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”
Of course, no benchmark is perfect. The Sunday Puzzle is U.S.-centric and English only. And because the quizzes are publicly available, models trained on them could “cheat” in a sense, although Guha says he hasn’t seen evidence of this.
“New questions are released every week, so we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
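To make the setup concrete, here is a minimal sketch of how a benchmark like this might be scored: loop over the riddles, ask a model for an answer, and compare it to the reference after light normalization. The riddle format, the ask_model() stub, and the exact-match rule are illustrative assumptions, not the researchers’ actual harness.

```python
# Minimal scoring sketch. The data format, ask_model() stub, and
# exact-match comparison are assumptions for illustration only.

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'Lion!' matches 'lion'."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def ask_model(question: str) -> str:
    # Stand-in for a real API call to o1, R1, or another model.
    return "lion"

def score(benchmark: list[dict]) -> float:
    """Fraction of riddles the model answers correctly."""
    correct = sum(
        normalize(ask_model(item["question"])) == normalize(item["answer"])
        for item in benchmark
    )
    return correct / len(benchmark)

demo = [{"question": "Name a big cat whose letters all appear in 'violin.'",
         "answer": "lion"}]
print(f"accuracy: {score(demo):.0%}")  # accuracy: 100%
```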
At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state, verbatim, “I give up,” followed by an incorrect answer seemingly chosen at random, behavior this human can certainly relate to.
The models make other bizarre choices, too, such as giving a wrong answer only to retract it, trying to tease out a better one, and failing again. They also get stuck “thinking” forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to keep weighing alternatives for no obvious reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was interesting to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning affects the quality of model results.”
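The failure modes described above (“I give up,” expressions of frustration) surface as plain strings in a model’s output, so they are straightforward to flag automatically. The sketch below is illustrative only; the phrase list and the idea of scanning saved transcripts are assumptions, not the paper’s methodology.

```python
import re

# Illustrative only: crude pattern-matching over saved model transcripts
# to flag the behaviors described above. The phrase list is an assumption.
FLAG_PATTERNS = [
    re.compile(r"\bI give up\b", re.IGNORECASE),
    re.compile(r"\bfrustrat\w*", re.IGNORECASE),
]

def flag_transcript(trace: str) -> list[str]:
    """Return any giveaway phrases found in a reasoning trace."""
    return [m.group(0) for pat in FLAG_PATTERNS for m in pat.finditer(trace)]

trace = "None of the anagrams work and I'm getting frustrated... I give up."
print(flag_transcript(trace))  # ['I give up', 'frustrated']
```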

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to expand their testing to additional reasoning models.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access lets a wider set of researchers understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as cutting-edge models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are capable of.”