Fast News Way

Did xAI lie about Grok 3’s benchmarks?

by admin
February 22, 2025
in Technology


Debates over AI benchmarks — and the way they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What’s cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
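
To make the difference between the two scoring schemes concrete, here is a minimal Python sketch. It is an illustration only, not the evaluation code used by xAI or OpenAI: the function names are hypothetical, the answers are made-up toy data, and it uses 5 samples per problem instead of 64 to stay readable.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled answers."""
    # most_common(1) returns [(answer, count)]; on a tie, the answer seen first wins.
    return Counter(samples).most_common(1)[0][0]


def score_at_1(samples_per_problem, answer_key):
    """Fraction of problems solved when only the first sample counts ("@1")."""
    correct = sum(
        samples[0] == truth
        for samples, truth in zip(samples_per_problem, answer_key)
    )
    return correct / len(answer_key)


def score_cons_at_k(samples_per_problem, answer_key, k=64):
    """Fraction of problems solved when the majority answer over k samples counts ("cons@k")."""
    correct = sum(
        consensus_answer(samples[:k]) == truth
        for samples, truth in zip(samples_per_problem, answer_key)
    )
    return correct / len(answer_key)


# Toy data (invented for illustration): three problems, five sampled answers each.
answer_key = [204, 113, 25]
samples_per_problem = [
    [204, 204, 197, 204, 204],  # first try right, majority right
    [117, 113, 113, 113, 109],  # first try wrong, majority right
    [25, 30, 25, 25, 30],       # first try right, majority right
]

at_1 = score_at_1(samples_per_problem, answer_key)
cons_at_5 = score_cons_at_k(samples_per_problem, answer_key, k=5)
print(f"@1: {at_1:.2f}  cons@5: {cons_at_5:.2f}")
```

In this toy run the model misses the second problem on its first try but gets it right by majority vote, so the consensus score comes out higher than the “@1” score. That gap is why charting one model at cons@64 next to another at “@1” is not a like-for-like comparison.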

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and their strengths.



Tags: benchmarks, Grok, lie, xAI
© 2024 fastnewsway.com. All rights reserved.
