Google DeepMind is trying to make AGI progress harder to overstate. Its new framework replaces vague milestone talk and single benchmark scores with a structured test of ten cognitive abilities, then asks two stricter questions: how those abilities combine, and how the result compares with demographically representative human baselines.
Ten abilities instead of one headline number
DeepMind breaks intelligence into perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition. In this taxonomy, problem solving and social cognition are not treated as isolated tricks; they are composite capacities that depend on several underlying abilities working together.
That matters because many current claims about AGI progress still lean on narrow benchmark wins or on the assumption that larger models imply broader intelligence. DeepMind’s framework explicitly pushes against that reading by measuring capability as a profile rather than a single score, which makes it easier to see where a system is strong, where it is brittle, and whether progress in one area is actually transferring to others.
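To make that contrast concrete, here is a minimal sketch of the difference between a headline average and a capability profile. The ability names follow DeepMind's taxonomy above; the scores, and the idea of reporting the weakest abilities alongside the average, are illustrative assumptions rather than the framework's actual scoring format.

```python
from statistics import mean

# Illustrative per-ability scores (0-1); the ability names follow the taxonomy
# described above, the numbers are invented for this sketch.
profile = {
    "perception": 0.91,
    "generation": 0.88,
    "attention": 0.62,
    "learning": 0.70,
    "memory": 0.84,
    "reasoning": 0.79,
    "metacognition": 0.41,
    "executive_functions": 0.55,
    "problem_solving": 0.73,
    "social_cognition": 0.48,
}

headline = mean(profile.values())            # a single score hides the gaps
weakest = sorted(profile, key=profile.get)[:3]

print(f"headline score: {headline:.2f}")
print(f"weakest abilities: {weakest}")       # brittleness is only visible in the profile
```

The same numbers yield a respectable-looking headline score while the profile immediately flags metacognition, social cognition, and executive functions as weak.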
The three-stage protocol is the real change
The core contribution is not just the taxonomy but the evaluation sequence attached to it. DeepMind proposes testing each ability independently first, then testing whether the system can coordinate multiple abilities, and only after that comparing the results with human performance.
This order closes an important gap in existing evaluations. A model can look impressive on isolated tasks and still fail when attention, memory, reasoning, and planning have to be used together under less controlled conditions; the integration stage is meant to catch that. The final human comparison then normalizes scores against a real cognitive distribution rather than an arbitrary benchmark ceiling. The sketch after the table shows how the three stages could fit together in code.
| Stage | What gets measured | Why it matters |
|---|---|---|
| Independent ability testing | Perception, learning, memory, reasoning, metacognition and the other defined abilities in separate tasks | Shows specific strengths and deficits instead of hiding them inside an average score |
| Integration testing | How well multiple abilities work together in combined tasks | Separates narrow task competence from more general coordinated capability |
| Human baseline comparison | AI scores normalized against demographically representative adults | Creates a defensible reference point for claims about human-level performance |
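A minimal sketch of how the three stages could be chained, assuming per-ability and combined-task scoring functions exist and that the human baseline is available as a sample of scores per task. The function names, the StageResults container, and the percentile-style normalization are assumptions made for illustration; the framework defines its own tasks and scoring.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical signature: each test battery takes a model and returns a 0-1 score.
AbilityTests = dict[str, Callable[[object], float]]

@dataclass
class StageResults:
    independent: dict[str, float]   # stage 1: each ability in isolation
    integration: dict[str, float]   # stage 2: combined tasks
    vs_humans: dict[str, float]     # stage 3: fraction of sampled humans outscored

def evaluate(model,
             ability_tests: AbilityTests,
             integration_tests: AbilityTests,
             human_samples: dict[str, list[float]]) -> StageResults:
    # Stage 1: test every defined ability independently.
    independent = {name: test(model) for name, test in ability_tests.items()}
    # Stage 2: test whether the abilities still hold up when they must be combined.
    integration = {name: test(model) for name, test in integration_tests.items()}
    # Stage 3: normalize against a demographically representative human sample.
    # Here the "score" is simply the fraction of sampled humans the model outscores.
    vs_humans = {
        name: mean(score > h for h in human_samples[name])
        for name, score in {**independent, **integration}.items()
        if name in human_samples
    }
    return StageResults(independent, integration, vs_humans)
```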
Where the measurement problem is still unresolved
DeepMind is also acknowledging that some of the most important abilities are the least well measured. The weak spots include metacognition, attention, learning, and social cognition, and many existing tests are already public enough that they may have leaked into training data, which undermines their value as clean evaluations.
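The leakage concern is at least partly checkable. A common screening technique, outside of anything DeepMind specifies, is to flag benchmark items whose text overlaps heavily with the training corpus; the sketch below uses verbatim n-gram overlap, with the n-gram length and threshold as arbitrary assumptions.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def likely_leaked(benchmark_item: str, training_docs: list[str],
                  n: int = 8, threshold: float = 0.3) -> bool:
    """Flag an item if a large share of its n-grams appear verbatim in the
    training documents. A crude contamination proxy, not a proof of leakage."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    corpus_grams = set().union(*(ngrams(doc, n) for doc in training_docs))
    overlap = len(item_grams & corpus_grams) / len(item_grams)
    return overlap >= threshold
```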
To fill the measurement gap, the company launched a $200,000 Kaggle hackathon running from March 17 to April 16, 2026, asking researchers and developers to build new benchmarks for under-evaluated abilities. Results are scheduled for June 1, and submissions will be run through Kaggle’s Community Benchmarks platform against leading AI models, which makes this less of a conceptual paper exercise and more of a live test of whether these abilities can be measured at scale.
The next verified checkpoint is straightforward but demanding: do the new benchmarks actually distinguish systems on complex abilities such as metacognition and social cognition, and do better scores on those tests line up with better real-world performance? If the answer is weak or inconsistent, then the framework remains useful as a map of missing measurements, but not yet as a strong scoreboard for AGI progress.
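That checkpoint can be phrased as a simple validity test: across a set of systems, do scores on the new benchmarks rank-correlate with some observed downstream outcome? A minimal sketch, assuming per-system benchmark scores and a downstream metric are both available; the numbers here are invented.

```python
from scipy.stats import spearmanr

# Hypothetical per-system scores: a new metacognition benchmark vs. an observed
# downstream outcome (e.g. error-catch rate in a supervised deployment trial).
benchmark_scores = [0.41, 0.55, 0.62, 0.70, 0.73]
downstream_scores = [0.38, 0.44, 0.61, 0.58, 0.71]

rho, p_value = spearmanr(benchmark_scores, downstream_scores)
print(f"rank correlation: {rho:.2f} (p = {p_value:.3f})")
```

A weak or unstable correlation here would be the "map of missing measurements, not yet a scoreboard" outcome described above.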
Why a technology-agnostic framework matters for deployment and governance
DeepMind’s approach is technology-agnostic: it focuses on what a system can do, not whether it is a language model, a multimodal system, or some later architecture. That makes it more usable across a field where OpenAI, Microsoft, Meta, and Google are all pushing toward broader systems without a shared measurement standard, and it gives regulators and enterprise buyers a cleaner basis for comparing capability claims.
For deployment decisions, this changes the threshold question. Instead of asking whether a model topped another benchmark or whether it was trained with more compute, teams can ask whether it shows stable gains in learning, executive functions, or social cognition, and whether those gains survive integration tests. That is a more practical lens for deciding where a system can be trusted as an adaptive assistant, where it still needs tight workflow boundaries, and where polished outputs may be masking weak self-monitoring or poor coordination across tasks.
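As a deployment heuristic, that threshold question can be written down as an explicit gate: every required ability clears a floor, and combined-task performance does not collapse relative to the isolated-task scores. The function, the thresholds, and the keying of integration scores by ability are all assumptions made for this sketch, not part of DeepMind's framework.

```python
def passes_deployment_gate(independent: dict[str, float],
                           integration: dict[str, float],
                           required_floors: dict[str, float],
                           max_integration_drop: float = 0.15) -> bool:
    """Hypothetical gate: each required ability meets its floor in isolation,
    and its score does not fall off sharply once the ability has to be used
    alongside others."""
    floors_met = all(independent.get(a, 0.0) >= floor
                     for a, floor in required_floors.items())
    survives_integration = all(
        independent[a] - integration.get(a, independent[a]) <= max_integration_drop
        for a in required_floors if a in independent
    )
    return floors_met and survives_integration

# Example: trust the system as an adaptive assistant only if learning,
# executive functions, and social cognition hold up (floors are invented).
required_floors = {"learning": 0.7, "executive_functions": 0.6, "social_cognition": 0.6}
```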
Metacognition is the hardest claim to fake and the hardest ability to verify
One reason this framework matters beyond research is its treatment of metacognition. Models can already produce language that sounds reflective or self-correcting, but sounding aware of one’s own reasoning is not the same as reliably monitoring limits, detecting uncertainty, and adjusting behavior under pressure.
If benchmark designers can measure that difference well, it would affect more than AGI debates. It would give developers and auditors a clearer way to separate systems that merely narrate confidence from systems that show usable self-monitoring, which is a much more relevant property for high-stakes deployment than another narrow benchmark gain.
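One well-established way to test that difference, and one the new benchmarks could build on, is calibration: compare a system's stated confidence with how often it is actually right. Below is a minimal sketch of an expected-calibration-error style measure; it assumes the system can attach a numeric confidence to each answer, which is itself a non-trivial assumption.

```python
from statistics import mean

def calibration_gap(records: list[tuple[float, bool]], bins: int = 10) -> float:
    """records: (stated_confidence, was_correct) pairs with confidence in [0, 1].
    Returns a weighted average of |confidence - accuracy| across confidence bins,
    i.e. an expected-calibration-error style score (0 = perfectly calibrated)."""
    gaps, total = [], len(records)
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [(c, ok) for c, ok in records
                  if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not bucket:
            continue
        avg_conf = mean(c for c, _ in bucket)
        accuracy = mean(ok for _, ok in bucket)
        gaps.append(abs(avg_conf - accuracy) * len(bucket) / total)
    return sum(gaps)
```

A system that narrates confidence it does not have shows up here as a large gap between high stated confidence and low accuracy in the top bins.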
