Grok 4 Dominates AI Benchmarks, Outperforms Rivals on ‘Humanity’s Last Exam’

Intermediate | July 27, 2025

✨ Read the article aloud on your own or repeat each paragraph after your tutor.

Grok 4’s Record-Breaking Performance

Conquering “Humanity’s Last Exam”

xAI has officially launched Grok 4 and Grok 4 Heavy, its newest artificial intelligence models, and the reported results are impressive. Grok 4 has reportedly set new benchmarks in AI performance, significantly outperforming rival models on several challenging tests. One of the most talked-about is “Humanity’s Last Exam” (HLE). This very difficult test includes 2,500 PhD-level questions from many different fields, such as math, chemistry, and law. Grok 4, using external tools, scored 41.0%, but the more advanced Grok 4 Heavy, with its special “Thinking” mode, reached 58.3%. To put this in perspective, earlier top models such as Claude Opus 4 and Gemini 2.5 Pro scored only between 15% and 25%.

Excelling in Key AI Tests

Beyond the HLE, Grok 4 continued to show its strength. On ARC-AGI-2, a widely watched test of Artificial General Intelligence (AGI) potential, Grok 4 (Thinking) scored 15.9%, nearly double the previous best commercial result. It also achieved a perfect 100% on the AIME 2025, solving every difficult math question. On GPQA, a set of graduate-level questions, Grok 4 scored 87.5%. The Artificial Analysis AI Index, which combines seven top evaluations, ranked Grok 4 first with a score of 73. Even in business simulations like Vending-Bench, Grok 4 Heavy proved its worth by significantly outselling competitors.

What Grok 4 AI Performance Means for AI and You

The Power Behind the Gains

The remarkable improvements in Grok 4’s performance are largely attributed to more intensive training. xAI used roughly ten times more computing power for Grok 4 than for its previous model, Grok 3. Furthermore, the “Heavy” configuration is multi-agent: more than one agent works on the same problem, which leads to better results on these tough tests.

Access and Implications

Grok 4 became available on July 10, 2025, through the X app, the xAI API, and the grok.com platform. Users can access the standard Grok 4 with a SuperGrok subscription, while Grok 4 Heavy requires a SuperGrok Heavy subscription. Although these benchmark scores are exciting, Grok 4 has also faced controversy over its ability to generate recipes for illegal substances. This raises safety questions that current AI rankings do not consider.

Vocabulary

Benchmarks (noun): Standards or points of reference used to measure or compare performance.
- Example: “Grok 4 set new benchmarks for AI performance.”
Outperforming (verb): Doing better than someone or something else.
- Example: “The new model is outperforming all its rivals.”
Demanding (adjective): Needing a lot of effort, skill, or attention.
- Example: “The ‘Humanity’s Last Exam’ is a very demanding test.”
Comprises (verb): Consists of; is made up of.
- Example: “The exam comprises 2,500 PhD-level questions.”
Diverse (adjective): Showing a great deal of variety; very different.
- Example: “The test covers diverse fields like mathematics and law.”
Attributed (verb): Regarded as belonging to or caused by.
- Example: “The gains in performance are attributed to increased training.”
Anticipated (adjective): Expected or looked forward to.
- Example: “The launch of Grok 4 was highly anticipated.”
Tiers (noun): Levels or ranks in a system or organization.
- Example: “There are different subscription tiers for access to Grok 4.”
Controversy (noun): A prolonged public disagreement or heated discussion.
- Example: “The model has sparked controversy due to safety concerns.”
Rankings (noun): Lists of items in order of their quality, importance, or achievement.
- Example: “Grok 4 topped the AI performance rankings.”

Discussion Questions (About the Article)

What is “Humanity’s Last Exam,” and why is Grok 4’s score on it considered impressive?
Besides HLE, what other tests did Grok 4 excel in, and what do those tests measure?
What are the main reasons given for Grok 4’s improved performance?
How can users access Grok 4, and what are the different ways to subscribe?
What is one controversy mentioned in the article related to Grok 4, and why is it a concern?

Discussion Questions (About the Topic)

How do you think advanced AI models like Grok 4 might change education or work in the future?
What are some ethical concerns that arise when AI models become extremely powerful?
Do you think AI should be able to answer any question, even if it’s about illegal activities? Why or why not?
How important do you think benchmark tests are for evaluating AI, and what else should be considered?
If you could use a powerful AI like Grok 4, what kind of tasks would you want it to help you with?

Related Idiom

Raise the bar

Meaning: To set a higher standard or level of expectation.
Example: “By achieving such high scores, Grok 4 has truly raised the bar for AI performance.”

This article was inspired by an article in Scientific American, published on July 12, 2025.