In a recent blog post, Anthropic revealed that it used the classic Game Boy game Pokémon Red as a benchmark for its latest AI model, Claude 3.7 Sonnet. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate through the game, allowing it to play continuously.
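Anthropic has not published the harness itself, but the setup it describes maps naturally onto its public tool-use (function calling) API: each turn, the game loop sends the current screen image and the model's running notes, and the model replies with a function call naming a button to press. The sketch below is an assumption about how such a loop could be wired up; the `press_button` tool, the prompt text, and the fallback behavior are all hypothetical, not Anthropic's actual implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical tool definition; the real harness and tool names are not public.
tools = [
    {
        "name": "press_button",
        "description": "Press a single Game Boy button on the emulator.",
        "input_schema": {
            "type": "object",
            "properties": {
                "button": {
                    "type": "string",
                    "enum": ["a", "b", "start", "select", "up", "down", "left", "right"],
                }
            },
            "required": ["button"],
        },
    }
]

def step(screen_png_b64: str, memory_notes: str) -> str:
    """Send the current screen and note summary; return the button Claude chose."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        tools=tools,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Notes so far:\n{memory_notes}\nChoose the next button to press."},
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": "image/png", "data": screen_png_b64},
                    },
                ],
            }
        ],
    )
    # Pull the tool call, if any, out of the response content blocks.
    for block in response.content:
        if block.type == "tool_use" and block.name == "press_button":
            return block.input["button"]
    return "a"  # fallback if the model answered in plain text instead of calling the tool
```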
This test showcased the model's capacity for what Anthropic calls extended thinking: spending additional inference time and compute reasoning through a problem before acting. Unlike its predecessor, Claude 3.0 Sonnet, which struggled even to leave Pallet Town, the game's starting area, Claude 3.7 Sonnet battled three gym leaders and won their badges, demonstrating significant improvements in reasoning and planning.
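For context on what extended thinking means in practice, it is exposed in Anthropic's Messages API as a per-request parameter rather than a separate model. A minimal sketch, with an illustrative token budget (the prompt and budget values are examples, not what Anthropic used for the Pokémon run):

```python
import anthropic

client = anthropic.Anthropic()

# Extended thinking is switched on per request; budget_tokens caps how much
# of the response budget the model may spend on internal reasoning.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a route from Cerulean City to Vermilion City."}],
)
```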
Anthropic noted that the model performed roughly 35,000 actions before reaching the third gym leader, Lt. Surge, underscoring how many individual decisions it sustained over a single continuous run. Although the company did not disclose the computing resources or wall-clock time each step required, the result still illustrates the model's progress in long-horizon problem solving.
Using Pokémon Red as a test may seem like a playful choice, but it belongs to a long tradition of using video games as benchmarks for AI performance. Several platforms have recently emerged that evaluate AI abilities in games ranging from fighting games like Street Fighter to creative party games like Pictionary. Anthropic's experiment is both a nod to gaming culture and a practical demonstration of the model's capabilities: with extended thinking, Claude 3.7 Sonnet is better equipped for long-horizon tasks that demand sustained planning rather than single-step answers. The experiment is also a clear example of how quickly the field is moving, and other developers will likely adopt similar gaming benchmarks to test their own systems.