
LLMs Do Not Succeed in Games by Default. The Benchmark and API Layer Is Doing Much of the Work

Recent game-playing results from large language models are easy to overread. The stronger finding is not that LLMs can simply be dropped into games, but that their performance changes sharply when researchers add task-specific evaluation harnesses, game interfaces, and supporting modules that compensate for weak planning, inconsistent action formatting, or limited memory. LMGAME-BENCH shows the gap between…

Read More

Google DeepMind’s AGI Framework Shifts the Debate From Bigger Models to Measured Cognitive Abilities

Google DeepMind is trying to make AGI progress harder to overstate. Its new framework replaces vague milestone talk and single benchmark scores with a structured test of ten cognitive abilities, then asks stricter questions: how do those abilities combine, and how does the result compare with demographically representative human baselines? Ten abilities instead of one headline…

Read More