
OpenAI’s o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning
OpenAI’s newly announced o3 model has stunned the artificial intelligence research community: it achieved an unprecedented score of 75.7% on the ultra-difficult ARC-AGI benchmark under standard compute conditions, and a high-compute version reached 87.5%.
While the achievement on ARC-AGI is impressive, it does not yet prove that the code of artificial general intelligence (AGI) has been cracked.
The Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests the ability of AI systems to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles after very few demonstrations, current AI systems struggle with them, and ARC has long been considered one of the most challenging measures of artificial intelligence.
ARC is designed so that it cannot be gamed by training a model on millions of examples in the hope of covering every possible puzzle combination.
The benchmark consists of a public training set of 400 simple examples, supplemented by a public evaluation set of 400 more challenging puzzles that measure how well AI systems generalize. The ARC-AGI Challenge adds private and semi-private test sets of 100 puzzles each, which are not shared with the public; they are used to evaluate candidate AI systems without the risk of leaking the data and contaminating future systems with prior knowledge. In addition, the competition limits the amount of compute that participants can use, ensuring that the puzzles cannot be solved through brute-force methods.
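To make the setup concrete, here is a minimal sketch of what an ARC task looks like as data. The grids-of-color-codes layout follows the JSON format of the public ARC repository; the specific puzzle and the `mirror` rule below are invented for illustration.

```python
# Minimal sketch of the ARC task format: grids are 2D lists of color
# codes 0-9, split into demonstration ("train") and held-out ("test")
# pairs. This toy task (mirror each row left-to-right) is invented.

task = {
    "train": [  # few demonstration pairs the solver infers the rule from
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 5], [0, 0]], "output": [[5, 3], [0, 0]]},
    ],
    "test": [  # held-out input used for scoring
        {"input": [[7, 0, 4]]},
    ],
}

def mirror(grid):
    """Hypothetical solver for this toy task: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A candidate rule is accepted only if it explains every demonstration pair.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # [[4, 0, 7]]
```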
Breakthroughs in solving new tasks
o1-preview and o1 scored at most 32% on ARC-AGI. The best previous result, 53%, came from a method developed by researcher Jeremy Berman that used a hybrid approach, combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter.
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step function enhancement in AI capabilities, demonstrating novel task adaptability never seen in the GPT family of models.”
Notably, these results could not have been achieved simply by throwing more compute at previous generations of models. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to 5% with GPT-4o in early 2024, even as they grew by orders of magnitude.
“This is not just an incremental improvement, but a real breakthrough, marking a qualitative shift in AI capabilities compared to the previous limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
o3’s performance on ARC-AGI comes at a steep cost, however. In the low-compute configuration, the model spent $17 to $20 and 33 million tokens per problem, while in the high-compute configuration it used approximately 172 times more compute and billions of tokens per problem. As the cost of inference continues to fall, though, we can expect these numbers to become more reasonable.
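As a rough illustration of those figures, assuming cost scales linearly with compute (an approximation, not a reported number), the high-compute configuration would land near $3,000 per problem:

```python
# Back-of-the-envelope estimate from the reported numbers: roughly
# $17-$20 per problem on low compute, and ~172x more compute in the
# high-compute configuration. Linear cost scaling is assumed here.
low_cost_per_task = 17.5       # dollars, midpoint of the reported range
compute_multiplier = 172       # high-compute vs. low-compute factor
high_cost_per_task = low_cost_per_task * compute_multiplier
print(f"~${high_cost_per_task:,.0f} per problem on high compute")  # ~$3,010
```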
A new paradigm for LLM reasoning?
The key to solving novel problems is what Chollet and other scientists call “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex ones. Classic language models have absorbed vast amounts of knowledge and contain a rich set of internal programs, but they lack compositionality, which leaves them unable to solve difficult problems outside their training distribution.
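A toy sketch can make “program synthesis” concrete: search over compositions of small primitive operations until one explains every demonstration pair. The primitives and the task below are invented for illustration; real ARC solvers search far richer program spaces.

```python
from itertools import product

# Toy program synthesis: enumerate compositions of primitive grid
# operations and return the first sequence consistent with all
# demonstration pairs.

def flip_h(g): return [row[::-1] for row in g]   # mirror left-right
def flip_v(g): return g[::-1]                    # mirror top-bottom
def transpose(g): return [list(r) for r in zip(*g)]

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def synthesize(pairs, max_depth=2):
    """Return the first primitive sequence that explains every pair."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(run(p["input"]) == p["output"] for p in pairs):
                return names
    return None  # no program of this depth fits the demonstrations

pairs = [{"input": [[1, 2], [3, 4]], "output": [[2, 4], [1, 3]]}]
print(synthesize(pairs))
```

Combining small, individually trivial programs like this is exactly the compositionality that plain LLM forward passes lack.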
Unfortunately, there is little information about how o3 works under the hood, and scientists are divided in their opinions. Chollet speculates that o3 uses a form of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model, which evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring for months.
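Here is a minimal sketch of that speculated sample-and-score recipe. The “generator” and “reward model” below are stand-ins (o3’s internals are not public), and the true answer is leaked to both purely to keep the toy short.

```python
import random

# Sketch of search guided by a reward model: sample several candidate
# reasoning outcomes, score each, and keep the best one.

def generate_candidate(target):
    """Stand-in for an LLM sampling one reasoning chain's final answer."""
    return target + random.choice([-2, -1, 0, 1, 2])  # noisy guess

def reward_model(answer, target):
    """Stand-in verifier: scores a candidate answer (higher is better)."""
    return -abs(answer - target)

def best_of_n(target, n=8):
    """Sample n candidates and keep the one the reward model rates highest."""
    candidates = [generate_candidate(target) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model(a, target))

random.seed(0)
print(best_of_n(target=42))
```

The open question in the debate below is whether this kind of explicit search wrapper is needed at all, or whether the sampling model alone can be trained to do the work.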
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, OpenAI researcher Nat McAleese posted on X that o1 was “just an LLM trained with RL,” and that o3 surpasses o1 by further scaling up RL.
On the same day, Denny Zhou of Google DeepMind’s reasoning team called the combination of search and current reinforcement learning methods a “dead end.”
“The beauty of LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (such as MCTS) over the generation space, whether by a well-tuned model or carefully designed prompts,” he posted on X.
While the details of how o3 reasons may seem trivial next to the ARC-AGI breakthrough, they could well define the next paradigm shift in LLM training. There is debate over whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or on a different inference architecture may determine the next path forward.
Not general artificial intelligence
The name ARC-AGI is misleading, and some have equated it with solving AGI. Chollet, however, stressed that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI does not equate to achieving AGI. In fact, I don’t think o3 is AGI yet,” he wrote. “The fact that o3 still fails on some very simple tasks indicates a fundamental difference from human intelligence.”
Furthermore, he noted that o3 cannot learn these skills autonomously: it relies on external verifiers during inference and on human-labeled reasoning chains during training.
Other scientists have pointed out flaws in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art results. “Solving such tasks should not require much task-specific ‘training,’ whether on the domain itself or on each specific task,” scientist Melanie Mitchell wrote.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was designed to measure, Mitchell proposed “seeing whether these systems can adapt to variants of specific tasks, or to reasoning tasks that use the same concepts but in domains other than ARC.”
Chollet and his team are currently working on a new benchmark that is challenging for o3 and could drop its score below 30%, even with a high compute budget, while humans would be able to solve 95% of the puzzles without any training.
“You’ll know AGI has arrived when it becomes simply impossible to create tasks that are easy for humans but hard for artificial intelligence,” Chollet wrote.
2024-12-24 19:40:51