Debates over AI benchmarks, and how they're reported by AI labs, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What's cons@64, you might ask? It's short for "consensus@64," and it essentially gives a model 64 tries to answer each problem in a benchmark, then takes the answer generated most often as the final answer. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when in reality that isn't the case.
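To make the idea concrete, here is a minimal sketch of the majority-vote scoring that "consensus@64" describes. The function name and the sample answers are purely illustrative; this is not xAI's or OpenAI's actual evaluation code, just the general technique of sampling many attempts and keeping the most common answer.

```python
from collections import Counter

def consensus_at_k(answers):
    """Majority vote: return the answer that appears most often
    among k sampled attempts at the same problem."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical: 64 sampled answers to a single AIME problem.
# The model answers "42" on 40 attempts, "17" on 20, "8" on 4.
samples = ["42"] * 40 + ["17"] * 20 + ["8"] * 4
print(consensus_at_k(samples))  # prints "42"
```

Because the vote only needs the correct answer to be the *most frequent* of 64 attempts, not the first one, a model that is right less than half the time per attempt can still score well under cons@64 — which is why it inflates scores relative to single-attempt ("@1") evaluation.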
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1", meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, while in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) pic.twitter.com/3WH8FOUfic
— Teortaxes (DeepSeek 推特 铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and financial) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.