A team of researchers has built an AI benchmark from NPR’s Sunday Puzzle to assess reasoning models, revealing surprising limitations in their problem-solving. The study, conducted by researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor, found that some AI models, including OpenAI’s o1, occasionally “give up” and knowingly provide incorrect answers. Because the puzzles are designed for human solvers with only general knowledge, they are difficult to crack through memorization alone, making them an effective test of reasoning ability. The finding lands amid a broader benchmarking challenge for the AI industry: many existing tests emphasize PhD-level questions rather than the practical reasoning skills relevant to everyday users.
The Sunday Puzzle, by contrast, rewards insight and elimination rather than factual recall. The researchers’ benchmark consists of roughly 600 riddles, on which reasoning models such as o1 and DeepSeek’s R1 outperformed other models. These models check their own work before committing to an answer, which reduces errors but lengthens response times. Even so, some models displayed unusual behaviors, such as repeating incorrect answers, getting stuck in loops, or expressing “frustration” when struggling with hard problems. DeepSeek’s R1, for instance, sometimes states “I give up” and then offers a seemingly random incorrect answer, mimicking human behavior under pressure.
The study highlights the need for reasoning benchmarks that don’t depend on advanced academic knowledge. Among the models tested, o1 scored best at 59%, followed by o3-mini with high “reasoning effort” at 47%, while R1 lagged at 35%. The researchers plan to expand their testing to track improvements in reasoning models. Since the puzzles are publicly available, models may have encountered them during training, but the researchers note that new questions are published weekly, keeping the benchmark fresh. They argue that accessible benchmarks will help both researchers and the public understand what AI models can and cannot do, contributing to better AI reasoning in the future.
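As a loose illustration of how percentage scores like those reported above could be computed over a fixed riddle set, here is a minimal sketch; the puzzles, answers, and model names below are invented for the example and are not taken from the actual study, whose grading procedure may differ (e.g., it could allow alternate phrasings of a correct answer).

```python
# Hypothetical sketch of scoring models on a puzzle benchmark.
# Each model's accuracy is the share of riddles answered correctly,
# using exact (case-insensitive) string matching as a stand-in grader.

def score(answers: list[str], expected: list[str]) -> int:
    """Return accuracy as a whole-number percentage."""
    correct = sum(
        a.strip().lower() == e.strip().lower()
        for a, e in zip(answers, expected)
    )
    return round(100 * correct / len(expected))

# Invented reference answers and model outputs.
expected = ["lettuce", "orbit", "canoe", "sonnet"]
model_answers = {
    "model_a": ["lettuce", "orbit", "canoe", "haiku"],
    "model_b": ["lettuce", "orbit", "I give up", "haiku"],
}

for name, answers in model_answers.items():
    print(f"{name}: {score(answers, expected)}%")
# model_a answers 3 of 4 correctly (75%); model_b, 2 of 4 (50%).
```

A real benchmark harness would also need to normalize answer formats and handle refusals or loops, behaviors the study observed in practice.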