Gemini can think, AGI benchmarks measure progress, and ‘powerful AI systems are going to arrive in the next few years’
Welcome to today’s edition of “tech people say the darnedest things”
Google launched Gemini 2.5 Pro Experimental, a “thinking model.” None of today’s generative AI models has scored higher than 4% against a human average of 60% on the new ARC-AGI-2 test. And Jack Clark is becoming increasingly convinced that AGI is right around the corner.
Good grief, it’s only Tuesday. Let’s break down the news.
Gemini can think?
Google launched Gemini 2.5 Pro Experimental today to paying subscribers and other users with access to “advanced” tier models.
According to Google, “Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy.”
It also scores highly across a bunch of benchmarks (keep reading for more on AGI benchmarks) and has “achieved a new level of performance by combining a significantly enhanced base model with improved post-training.”
Here at the Center, we use Google Workspace. It comes with email, Docs, Sheets, Slides, and access to the advanced Gemini models. We tested the new model to see what made the folks at Google think it was capable of thought.
Tl;dr: It’s almost certainly not.
There are two so-called “reasoning” models: 2.0 Flash Thinking (experimental) and 2.5 Pro (experimental). The big difference between these two and regular 2.0 Flash, as far as we can tell, is that they output the response to your prompt twice.
Gemini 2.0 Flash Thinking (experimental) responding with “thinking” window expanded.
Gemini 2.5 Pro (experimental) responding with “thinking” window expanded.
You can click on the blue up-arrow beside the text “show thinking” to collapse that window. The response to your prompt is displayed beneath it.
Gemini 2.0 Flash Thinking (experimental) with “show thinking” window collapsed.
Gemini 2.5 Pro (experimental) with “show thinking” window collapsed.
Gemini 2.0 Flash
There doesn’t appear to be any significant difference between the answers given by the three models. They show some divergence when repeatedly given the same prompts, with 2.5 Pro seeming the most coherent.
But, after a few hours of testing, it feels like giving the same model the same question multiple times produces results about as erratic as feeding the same prompt to three different models. We haven’t had time to test this at any meaningful scale.
At the end of the day, if the “show thinking” window is supposed to be an example of “thought,” this feels like a big old swing-and-a-miss.
We suspect there’s nothing special happening at all. There’s probably a hidden prompt somewhere in the loop that instructs the model to preface outputs with verbiage like “you are a thinking model, please be sure to answer all prompts with messaging that indicates you are reasoning through the steps to come up with your answer.”
To us, this is sort of like prompting a machine to “act like a human who is stuck inside of a black box” and then freaking out when it does.
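To make our speculation concrete (and it is only speculation), here’s a minimal sketch of the kind of two-pass, hidden-prompt pipeline we’re imagining. Everything in it, from the preamble text to the function names, is our own invention, not anything Google has documented:

```python
# A purely speculative sketch of the kind of loop we suspect is running.
# The hidden preamble, the two-pass structure, and call_model() are all
# our assumptions; nothing here reflects Google's actual implementation.

HIDDEN_PREAMBLE = (
    "You are a thinking model. Before answering, write out the reasoning "
    "steps you are taking, as if you were thinking aloud."
)

def call_model(prompt: str) -> str:
    """Stand-in for whatever endpoint actually generates the text."""
    raise NotImplementedError  # hypothetical model call

def answer_with_thinking(user_prompt: str) -> dict:
    # Pass 1: produce the text shown in the expandable "show thinking" window.
    thinking = call_model(f"{HIDDEN_PREAMBLE}\n\nUser prompt: {user_prompt}")

    # Pass 2: produce the final response, conditioned on that "thinking" text.
    response = call_model(
        f"{HIDDEN_PREAMBLE}\n\nUser prompt: {user_prompt}\n\n"
        f"Notes so far: {thinking}\n\nNow give your final answer."
    )
    return {"thinking": thinking, "response": response}
```

If something like that is what’s happening, it would go a long way toward explaining why the “thinking” models produce roughly twice the text and take roughly twice as long, as the word counts below suggest.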
There’s still no evidence to suggest that using a large language model to generate text is an example of “thinking.” What we do see, however, is that the “thinking” models generate about twice as much text when answering questions and take about twice as long.
Word counts for responses to the above prompts:
2.0 Flash: 888 words
2.0 Flash Thinking (experimental): thinking window, 901 words; response, 1,301 words
2.5 Pro (experimental): thinking window, 764 words; response, 915 words
You can believe these models are “thinking” if you want, but we’re pretty sure they’re just doing the same trick twice in hopes that you’ll be doubly impressed.
AGI benchmark foundation says AGI benchmarks measure progress toward AGI
World-renowned computer scientist Francois Chollet’s ARC Prize, a foundation working to encourage friendly competition to advance the field of generative AI development, has finally begun sharing some of the stats surrounding its new AGI benchmark.
The good news is that none of today’s models are even close to human-level performance. The top scores from AI models reached 3% and 4% against a human average of 60%. The bad news is that nobody knows what that means.
The ARC-AGI-2 benchmark, as it’s called, purportedly addresses the problem of memorization in benchmarking. Essentially, when you design a test for AI you have to worry that it’s either already got the answers to the test in its training dataset (whether explicitly or implicitly) or that, after a few attempts, it will learn the specific answers to the test without learning to solve the problems presented by it.
This is sort of like a human using process of elimination, over repeated attempts, to work out which multiple-choice letter (A, B, C, or D) is correct for each question without ever actually reading the questions or answers. You could feasibly ace the test given enough attempts, but your answers would have no meaning.
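Here’s a toy simulation of that failure mode (our own illustration, nothing to do with ARC Prize’s actual tooling): a guesser that never reads a single question, only remembers which letters worked on earlier attempts, and climbs toward a perfect score anyway.

```python
import random

# Toy illustration of benchmark memorization: the "solver" never reads a
# question; it just remembers which letters scored on previous attempts.
ANSWER_KEY = [random.choice("ABCD") for _ in range(20)]  # one fixed test

def take_test(memory: dict) -> int:
    score = 0
    for q, correct in enumerate(ANSWER_KEY):
        guess = memory.get(q, random.choice("ABCD"))
        if guess == correct:
            score += 1
            memory[q] = guess      # remember what worked
        else:
            memory.pop(q, None)    # forget what didn't
    return score

memory: dict = {}
for attempt in range(1, 11):
    print(f"attempt {attempt}: {take_test(memory)}/{len(ANSWER_KEY)}")
# The score climbs toward 20/20 even though nothing was ever understood.
```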
ARC-AGI-2 is designed to challenge AI with tasks that are easy for humans but hard for AI. Essentially, every task on the test was vetted against thousands of humans, and at least two of them (per task) were able to solve the challenge within two attempts.
According to ARC Prize, its measure of AGI is “the gap between the set of tasks that are easy for humans and hard for AI. When this gap is zero, when there are no remaining tasks we can find that challenge AI, we will have achieved AGI.”
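Read literally, that definition is just a ratio, and it’s simple enough to write down. Here’s a minimal sketch of the “gap” as we understand it, with made-up task records standing in for real benchmark data:

```python
# Hypothetical task records: "easy_for_humans" means at least two human
# testers solved the task within two attempts; "solved_by_ai" is the result
# for the model under evaluation. The data below is invented for illustration.
tasks = [
    {"easy_for_humans": True,  "solved_by_ai": False},
    {"easy_for_humans": True,  "solved_by_ai": True},
    {"easy_for_humans": False, "solved_by_ai": False},
]

def human_ai_gap(tasks: list[dict]) -> float:
    """Fraction of human-easy tasks the AI still fails.
    Under ARC Prize's framing (as we read it), 0.0 would mean AGI."""
    easy = [t for t in tasks if t["easy_for_humans"]]
    failed = [t for t in easy if not t["solved_by_ai"]]
    return len(failed) / len(easy) if easy else 0.0

print(human_ai_gap(tasks))  # 0.5 for the toy records above
```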
While we agree with a lot of what Chollet and ARC Prize have to say, we’re not on board with the idea that the existence of AGI can be confirmed by giving it word and math problems.
Here’s what we propose instead: Let’s use ARC-AGI-2 and similar benchmarking tools to gauge progress toward better AI models. This serves two purposes:
It helps developers who are stuck in their own labs to gauge how their techniques compare. This is a rising tide that leads to better models for everyone.
It helps us identify when a new class of models demonstrates a significant leap in performance.
But let’s not make the same mistake with AGI that big tech made with driverless cars. No matter how close we are to 100%, 99.9% isn’t good enough when it comes to describing “human level.”
And, if AGI isn’t a “human level” artificial intelligence, then what, exactly, is it?
Learn: What is artificial general intelligence? — Center for AGI Investigations
Let’s be clear: We need benchmarking. And the ARC-AGI-2 test is a great example of one. But it doesn’t measure progress toward AGI. It measures progress away from AlexNet, GPT-1, et cetera.
As far as we can tell, there’s still no way to determine whether AGI — human-level AI — has definitively emerged using the scientific method. But, we do believe that claims made without the scientific method can be falsified with it.
Show me a chatbot that can drive a car at a human level and then we can start talking AGI.
This is no time for panic, unless it is
Jack Clark (we love Jack, by the way, he’s a treasure) is out there scaring the shit out of everybody. And maybe he’s right. If you don’t know Jack, he’s a technology journalist, philosopher, leader, and all-around curious person. He is a co-founder of Anthropic and has contributed a significant amount of meaningful research to the field of AI development, steering, and alignment.
So, when you consider his background, it’s concerning, to say the least, to see him openly wondering whether it’s time to advocate for the kinds of policy changes that could have severe consequences if they turn out to be “knee jerk” in nature.
However, as Clark pointed out in Import AI 405, “under very short timelines, you may want to take more extreme actions. These are actions which are likely ‘regretful actions’ if your timeline bets are wrong.”
Specifically, he’s talking about whether he should recommend that government policymakers pass legislation requiring technology companies to undergo mandatory “pre-deployment” testing, and that they take immediate steps toward “massively increasing the security of frontier labs.”
Clark’s concern is that there’s a greater than zero chance that the AGI maximalists’ timeline prediction of 2026/2027 for AGI’s arrival — during the current US president’s term — will come true. And, in a post on X, he expressed that he was “increasingly convinced” that this timeline was correct.
Meanwhile, Cecilia Kang of the New York Times writes that many of Clark’s contemporaries in the advanced AI sector are supposedly cozying up to the president’s office in hopes of getting the go-ahead to ignore state laws.
Per Kang:
“In recent weeks, Meta, Google, OpenAI and others have asked the Trump administration to block state A.I. laws and to declare that it is legal for them to use copyrighted material to train their A.I. models. They are also lobbying to use federal data to develop the technology, as well as for easier access to energy sources for their computing demands. And they have asked for tax breaks, grants and other incentives.”
If we take a gander at the numbers, The American Prospect figures OpenAI has to make about $100 billion in revenue to break even. If you don’t think too hard, that makes sense. At its last valuation it was the third most valuable startup in the world, up there with SpaceX.
But OpenAI doesn’t turn a profit. It loses money. It’s projected to lose $5 billion this year, and The Information recently exposed documents that indicate it’s on track to lose $14 billion next year.
AGI can’t arrive soon enough for that company or its shareholders.
Art by Nicole Greene