The path to achieving artificial general intelligence (AGI), meaning AI systems with capabilities at least equivalent to those of humans in most tasks, remains a subject of debate among scientists. Opinions range from AGI being a long way off, to possibly emerging within a decade, to “sparks of AGI” already being visible in current large language models (LLMs). Some researchers even argue that today’s LLMs are AGI.

In an effort to bring clarity to the discussion, a team of scientists at Google DeepMind, including Chief AGI Scientist Shane Legg, has proposed a new framework for classifying the capabilities and behavior of AGI systems and their precursors.

The authors write in their paper, “We argue that it is important for the AI research community to reflect explicitly on what we mean by ‘AGI’ and aspire to measure characteristics such as performance, generality, and autonomy of AI systems.”

One of the major challenges of AGI is establishing a clear definition of what AGI entails. In their paper, the DeepMind researchers analyze nine different AGI definitions, including the Turing test, the coffee test, consciousness measures, economic measures, and task-related capabilities. They highlight the shortcomings of each definition in capturing the essence of AGI.

For example, current LLMs can pass the Turing test, but generating convincing text alone is clearly inadequate for AGI, as the shortcomings of current language models show. Determining whether machines have the properties of consciousness remains a vague and elusive goal. Furthermore, while failing certain tests (e.g., making coffee in a random kitchen) may indicate that a system is not AGI, passing them does not necessarily confirm that it is.

To provide a more comprehensive framework for AGI, the researchers propose six criteria for measuring it:

1. AGI measures should focus on capabilities rather than human-like properties such as understanding, consciousness, or emotion.
2. AGI measures should consider both generality and performance levels. This ensures that AGI systems are not only capable of performing a wide variety of tasks but also excel at performing them.
3. AGI should require cognitive and metacognitive tasks, but embodiment and physical tasks should not be considered prerequisites for AGI.
4. The capability to perform AGI-level tasks is sufficient, even if the system is not deployed. “Requiring deployment as a condition for measuring AGI introduces non-technical hurdles, such as legal and social considerations, as well as potential ethical and security concerns,” the researchers write.
5. AGI metrics should focus on real-world tasks that people value, which the researchers describe as “ecologically valid.”
6. AGI is not a single endpoint but rather a path, with different levels of AGI along the way.

Depth and breadth of intelligence

DeepMind presents a matrix that measures “performance” and “generality” across five levels, ranging from no AI at all to superhuman AGI, a general AI system that outperforms all humans on all tasks. Performance refers to how the capabilities of an AI system compare with those of humans, while generality refers to the breadth of an AI system’s capabilities, that is, the range of tasks for which it reaches the performance level specified in the matrix.

Image source: arXiv

The matrix also differentiates between narrow and general AI. For example, we already have superhuman narrow AI systems such as AlphaZero and AlphaFold, which excel at very specific tasks. The matrix makes it possible to classify AI systems at different levels. Advanced language models such as ChatGPT, Bard, and Llama 2 are “competent” (level 2) in some narrow tasks, such as short essay writing and simple coding, and “emerging” (level 1) in others, such as mathematical abilities and tasks that require reasoning and planning.
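The classification described above can be sketched as a small data structure. The Python below is an illustrative sketch only, not code from the paper. The names `PerformanceLevel`, `Generality`, `label`, and `overall_level` are hypothetical, and the aggregation rule (a system earns a level when a `min_share` fraction of its task areas reach it, with 0.75 as an arbitrary default) is an assumption standing in for the paper's looser notion of a "broader set of tasks."

```python
from enum import Enum, IntEnum
from typing import Mapping


class PerformanceLevel(IntEnum):
    """Performance levels named in the article, lowest to highest."""
    NO_AI = 0
    EMERGING = 1
    COMPETENT = 2
    EXPERT = 3
    VIRTUOSO = 4
    SUPERHUMAN = 5


class Generality(Enum):
    NARROW = "Narrow"
    GENERAL = "General"


def label(level: PerformanceLevel, generality: Generality) -> str:
    """Return a readable name for one cell of the matrix (hypothetical helper)."""
    if level is PerformanceLevel.NO_AI:
        return "No AI"
    suffix = "AGI" if generality is Generality.GENERAL else "Narrow AI"
    return f"{level.name.capitalize()} {suffix}"


def overall_level(
    task_levels: Mapping[str, PerformanceLevel],
    min_share: float = 0.75,  # assumption: not a threshold from the paper
) -> PerformanceLevel:
    """Highest level reached by at least `min_share` of the rated task areas."""
    levels = list(task_levels.values())
    if not levels:
        return PerformanceLevel.NO_AI
    for level in sorted(PerformanceLevel, reverse=True):
        share = sum(rated >= level for rated in levels) / len(levels)
        if share >= min_share:
            return level
    return PerformanceLevel.NO_AI


# Ratings loosely following the article's description of frontier LLMs.
frontier_llm = {
    "short essay writing": PerformanceLevel.COMPETENT,
    "simple coding": PerformanceLevel.COMPETENT,
    "mathematics": PerformanceLevel.EMERGING,
    "reasoning and planning": PerformanceLevel.EMERGING,
}
print(label(overall_level(frontier_llm), Generality.GENERAL))
print(label(PerformanceLevel.SUPERHUMAN, Generality.NARROW))
```

Under these assumed ratings, the sketch rates the model as “Emerging AGI” because only half of its task areas reach the competent level, while a system such as AlphaFold would sit in the “Superhuman Narrow AI” cell.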

“Overall, current frontier language models will be considered Level 1 General AI (‘Emerging AGI’) until performance levels increase for a broader set of tasks (at which point the Level 2 General AI, ‘Competent AGI,’ criteria will be met),” the researchers write.

The researchers also note that while the AGI matrix rates systems according to their performance, systems may not match those levels in practice when they are deployed. For example, text-to-image systems produce higher-quality images than most people can, but they generate erroneous artifacts that prevent them from reaching the “virtuoso” level, which would put them in the 99th percentile of skilled individuals.

“Although theoretically an ‘expert’ level system, in practice the system may only be ‘competent’, as the prompting interfaces are too complex for most end-users to achieve optimal performance,” the researchers write.

DeepMind suggests that an AGI benchmark should encompass a broad suite of cognitive and metacognitive tasks, measuring diverse traits including linguistic intelligence, mathematical and logical reasoning, spatial reasoning, interpersonal and intrapersonal social intelligence, the ability to learn new skills, and creativity.

However, they also acknowledge that it is impossible to enumerate all the tasks that a sufficiently general intelligence could accomplish. “Thus, the AGI benchmark should be a living benchmark. Such benchmarks should therefore include a framework for designing and agreeing on new tasks,” they write.

Autonomy and risk

DeepMind offers a separate matrix to measure autonomy and risk in AI systems. The levels range from Level 0, where a human performs all tasks, to Level 5, which represents fully autonomous AI, with various levels in between where humans and AI share tasks and authority.

Image source: arXiv
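The autonomy scale can be sketched in the same way. This is a minimal illustration, not code from the paper. The names of the intermediate levels follow the labels used in the paper's autonomy table rather than anything stated in this article, and `AutonomyLevel` and `shares_control` are hypothetical names.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Autonomy levels, from no AI involvement to a fully autonomous agent."""
    NO_AI = 0         # a human performs all tasks
    TOOL = 1
    CONSULTANT = 2
    COLLABORATOR = 3
    EXPERT = 4
    AGENT = 5         # fully autonomous AI


def shares_control(level: AutonomyLevel) -> bool:
    """True for the in-between levels where humans and AI share tasks and authority."""
    return AutonomyLevel.NO_AI < level < AutonomyLevel.AGENT


for level in AutonomyLevel:
    print(level.value, level.name, shares_control(level))
```

Keeping autonomy as its own ordered type, separate from the performance levels, mirrors the article's point that the two are measured on separate scales.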

The risks associated with AI systems vary depending on their level of autonomy. At lower levels, where AI acts as an enhancer of human skills, risks include deskilling and disruption of existing industries. As autonomy increases, risks may include targeted manipulation via personalized content, widespread societal disruption, and more serious harm caused by misalignment of fully autonomous agents with human values.

The DeepMind framework, like all things AGI-related, will have its shortcomings and critics. But it stands as a comprehensive guide for assessing where we stand on the journey to developing AI systems capable of surpassing human capabilities.

