Photo: OpenAI logo, a white circle with rings, pictured against a grey background. OpenAI's o3 model emphasizes high-level reasoning. Photo by Dima Solomin on Unsplash
Measuring the intelligence of artificial intelligence is, ironically, a pretty difficult task. That’s why the tech industry has come up with benchmarks like ARC-AGI, which tests the capabilities of the new technology through a series of visual tasks that are particularly challenging for A.I. models. In December, OpenAI’s o3 reasoning model became the first A.I. system to pass the test with an 87.5 percent score.
But that win didn’t come without a price. At the time, the Arc Prize Foundation—which administers the ARC-AGI benchmark—estimated that testing OpenAI’s model cost about $3,400 per task. For a higher-efficiency version of o3 that scored 75.7 percent on the test, the figure came to about $20 per task.
As it turns out, the actual costs could be significantly higher—ten times higher, to be exact. While the Arc Prize Foundation’s o3 pricing was originally drawn from the costs of OpenAI’s o1 model, the reasoning predecessor to o3, the nonprofit is now pricing it in line with OpenAI’s newly released o1-pro. Unveiled last month, the o1-pro model is ten times more costly to run than o1, making it OpenAI’s most expensive model to date.
Based on the new o1-pro pricing, o3 could potentially cost upwards of $30,000 per task. The more efficient version of o3 is now listed at $200 per task.
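For readers following the arithmetic, the revised figures amount to a straight tenfold scaling of the December estimates. The short sketch below is an illustrative back-of-the-envelope calculation, not the Arc Prize Foundation's published methodology; the model labels and the flat 10x multiplier are assumptions drawn from the reporting above.

```python
# Illustrative calculation only: scale the December per-task cost estimates
# by the roughly 10x price gap between o1 and o1-pro described above.
# (Labels and the flat multiplier are assumptions, not official figures.)

december_estimates = {           # cost per ARC-AGI task, in USD
    "o3 (high compute)": 3_400,
    "o3 (high efficiency)": 20,
}

O1_PRO_MULTIPLIER = 10           # o1-pro priced ~10x higher than o1

for model, december_cost in december_estimates.items():
    revised_cost = december_cost * O1_PRO_MULTIPLIER
    print(f"{model}: ${december_cost:,} -> ${revised_cost:,} per task")

# Prints:
#   o3 (high compute): $3,400 -> $34,000 per task
#   o3 (high efficiency): $20 -> $200 per task
```

Under that assumption, the high-compute run lands north of $30,000 per task, consistent with the foundation's updated estimate.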
“Our belief, and this has not been validated by OpenAI, is that o3 pricing will be closer to o1-pro pricing than it will be to o1 pricing that we were told in December,” Greg Kamradt, president of the Arc Prize Foundation, told Observer. “Given that, we’ve updated our metrics.”
The Arc Prize Foundation has edited its ARC-AGI leaderboard to exclude the more compute-intensive version of o3, noting that “only systems which required less than $10,000 to run are shown” on the board.
What is ARC-AGI?
Created in 2019 by the researcher François Chollet, the ARC-AGI benchmark relies on a series of puzzles that track how close A.I. systems are to human-level intelligence. Rather than simply measuring a model’s ability to recall patterns from its training data, it examines whether the model can adapt to new problems and learn new task-specific skills. “Think of it as a test that measures the ability to learn new things,” said Kamradt.
OpenAI’s o3 was particularly successful at the test because the model is able to pause and consider numerous potential responses before settling on the most accurate answer. While o3’s pricing has yet to be confirmed by OpenAI, the Arc Prize Foundation’s estimates will remain anchored closer to o1-pro costs until official pricing is released. “It may go even higher, but we’re not sure,” said Kamradt. “We’re just doing the best that we can with the available information that we have.” OpenAI did not respond to requests for comment from Observer.
Although recent A.I. releases have gotten closer to the 100 percent mark on ARC-AGI, they’ve largely been stumped by a newer version of the test released last month. Known as ARC-AGI-2, it contains tasks that are even more difficult for A.I. systems and is designed in particular to challenge models that specialize in reasoning. So far, no model has scored above 5 percent.