Grok 4 is really smart... Like REALLY SMART

Matthew Berman, July 10, 2025

Grok 4 has been released, and according to Elon Musk, it is currently the most intelligent AI model available, marking a significant advance over previous frontier models. The progression of the Grok series shows a substantial leap with Grok 4, which relies heavily on reinforcement learning with verifiable rewards. This method trains the model on problems with known solutions, rewards correct answers, and repeats the process to improve its problem-solving and "thinking" capabilities. The approach proved so effective that xAI, the developer, began running out of pre-defined problems, which led to Musk's idea of using real-world interactions, such as those of humanoid robots, as an unlimited source of verifiable rewards. Grok 4 has demonstrated superior performance across various benchmarks, including Humanity's Last Exam, GPQA, AIME 2025, LiveCodeBench, Math Arena, USAMO, and ARC-AGI, often outperforming its competitors by a substantial margin. A key feature of Grok 4 Heavy is its ability to spawn multiple agents that collaborate, share insights, and converge on the best solution. Grok 4 shows impressive capabilities in sports prediction, scientific visualization, real-time data analysis, and even game development, but its use for complex tasks like creating AAA video games remains under development and open to debate, given AI's current limitations in subjective areas like "taste." The model is available through a subscription, with Grok 4 Heavy costing more because of its multi-agent capabilities. Planned releases include a coding-specific model, a multimodal agent, and a video generation model.

Progression of Grok Models and Reinforcement Learning

The evolution of the Grok series of models showcases xAI's commitment to advancing artificial intelligence. The presenter highlighted a slide from the launch live stream illustrating the strides made over a short period. Grok 2, which the summary places approximately two years earlier, was primarily focused on next-token prediction, the fundamental language modeling task in which the AI predicts the next token in a sequence. With Grok 3, xAI increased its pre-training compute tenfold, resulting in a significantly more capable model. The subsequent iteration, Grok 3 Reasoning, added reinforcement learning compute, a crucial step towards more sophisticated AI behavior.

The most substantial leap, however, came with Grok 4 Reasoning. This version places a massive emphasis on reinforcement learning (RL), particularly with verifiable rewards. This paradigm involves training the model on problems where the correct solution is already known. The model attempts to solve the problem, and if it provides the correct answer, it receives a positive reward. This process is repeated countless times across a vast array of complex problems, enabling the model to improve its problem-solving abilities significantly. This method is what elicits "thinking behavior" from these models, allowing them to reason and strategize. The presenter emphasized that there may previously have been a perceived "wall" with RLVR (Reinforcement Learning with Verifiable Rewards), but Grok 4 has clearly broken through it. In fact, RL with verifiable rewards became so integral to xAI's workflow that the team began to run out of pre-defined problems with known solutions. This limitation led Elon Musk to speculate about reality being the ultimate test for AI. By placing models into the real world, often via humanoid robots or other physical interactors, they gain access to an essentially unlimited supply of verifiable rewards through direct interaction with physics and the environment.
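The reward loop described above can be illustrated with a toy sketch. This is not xAI's training code: the "policy" here is just three preference scores over hypothetical arithmetic strategies, and the update is a plain REINFORCE step, chosen only because it is the simplest way to show a verifier that pays out for correct answers and nothing else.

```python
import math
import random


def verify(problem, answer):
    """Verifiable reward: 1.0 if the answer matches the known solution, else 0.0."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0


def softmax(logits):
    """Turn preference scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.5, seed=0):
    """Toy policy-gradient loop: strategies that earn reward become more likely."""
    rng = random.Random(seed)
    # Hypothetical candidate strategies; only the first one solves "a + b".
    strategies = [
        lambda a, b: a + b,
        lambda a, b: a - b,
        lambda a, b: a * b,
    ]
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # Sample a problem whose correct answer is known in advance.
        problem = (rng.randint(3, 9), rng.randint(3, 9))
        probs = softmax(logits)
        choice = rng.choices(range(3), weights=probs)[0]
        reward = verify(problem, strategies[choice](*problem))
        # REINFORCE: gradient of log pi(choice) w.r.t. logits is onehot - probs.
        for i in range(3):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Because the verifier only ever rewards the correct strategy, its probability rises steadily over training. Real systems replace the three scores with a large language model and the arithmetic check with graders for math, code, and similar tasks.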

Grok 4 Benchmarks and Performance

Grok 4's capabilities were put to the test across a series of rigorous benchmarks, consistently demonstrating superior performance compared to its contemporaries.

Humanity's Last Exam

This benchmark is designed to be exceptionally challenging, comprising frontier knowledge questions that typically only a team of experts in a specific domain could answer. The exam spans diverse fields such as mathematics, physics, biology, social science, computer science, engineering, and chemistry. Grok 4's performance was revealed progressively, showcasing its enhanced abilities with added functionalities.

No Tool Usage: Without any external tools, Grok 4 achieved a score of 26.9%, already significantly surpassing other frontier models like Gemini 2.5 Pro (21.6%), o3 (20%), and o4-mini (18%).
With Tool Usage: When equipped with tools such as web browsing, advanced memory, and an environment for code execution, Grok 4's score surged to 41%, a massive improvement and nearly double the score of the next best model.
Scaled Test Time Compute: By scaling up test time compute, Grok 4 reached an impressive 50.7%, breaking the 50% barrier and unequivocally outperforming all other models tested on this benchmark.

The increase in test time compute for Grok 4, specifically the "Heavy" version, signifies a novel approach. Instead of simply extending thinking time, Grok 4 Heavy spawns multiple agents, each independently attempting to solve the problem. These agents then collaborate, sharing notes and insights. If one agent discovers a working solution or a partial breakthrough, it shares this knowledge, benefiting the others. Finally, the best solution among all agents is chosen. This multi-agent collaboration mechanism is a cornerstone of Grok 4 Heavy's enhanced reasoning capabilities.
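The collaboration pattern described above (independent attempts, shared notes, pick the best) can be sketched on a toy problem. Everything here is an assumption made for illustration: the task (finding an integer whose square matches a target), the `score` function, and the random local search have nothing to do with how Grok 4 Heavy is actually implemented.

```python
import random


def score(candidate, target):
    """Lower is better: how far candidate**2 is from the target."""
    return abs(candidate * candidate - target)


def solve_with_agents(target, n_agents=4, rounds=40, seed=0):
    """Toy multi-agent search with a shared best-so-far note."""
    rng = random.Random(seed)
    # Each agent starts from its own independent guess.
    guesses = [rng.randint(0, 1000) for _ in range(n_agents)]
    initial = list(guesses)
    step = 512
    for _ in range(rounds):
        # "Sharing notes": every agent sees the best guess found so far.
        shared_best = min(guesses, key=lambda g: score(g, target))
        for i in range(n_agents):
            # Each agent explores independently around the shared note.
            proposal = max(0, shared_best + rng.randint(-step, step))
            # Keep whichever is best: own guess, shared note, or new proposal.
            guesses[i] = min(
                (guesses[i], shared_best, proposal),
                key=lambda g: score(g, target),
            )
        step = max(1, step // 2)  # narrow the search as rounds progress
    # Final selection: the best solution among all agents.
    best = min(guesses, key=lambda g: score(g, target))
    return best, initial


if __name__ == "__main__":
    best, initial = solve_with_agents(target=529 * 529)
    print("initial guesses:", initial, "final answer:", best)
```

By construction, the final answer is never worse than any single agent's starting guess, which is the basic appeal of pooling attempts. The extra cost is also visible: every round runs `n_agents` attempts instead of one.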

Other Noteworthy Benchmarks

GPQA: Grok 4 (no tools) scored 87%, and Grok 4 Heavy (with tools) scored 88.9%, slightly ahead of the next best model at 86%.
AIME 2025: Grok 4 Heavy achieved a perfect 100%, an "insane" achievement given that these are some of the hardest math questions globally. o3 also performed well with 98.4%.
LiveCodeBench: Grok 4 scored 79.4%, indicating strong coding abilities. For comparison, Gemini 2.5 Pro scored 74%.
Math Arena: Grok 4 scored 96.7%.
USAMO (USA Mathematical Olympiad): Grok 4 Heavy "demolished" other models, showcasing its exceptional mathematical reasoning.
ARC-AGI: This test is designed to be easy for humans but difficult for AI, focusing on recognizing patterns, learning skills from them, and applying them to new scenarios. Grok 4 crushed this test, with V1 at 66.6% (compared to o3's 60.8%) and V2 at 15.9%, roughly double the score of Opus 4, which was 8%.

The independent testing results for ARC-AGI were particularly compelling. Greg Kamradt, President of the ARC Prize Foundation, confirmed Grok 4 as the top-performing publicly available model, even surpassing purpose-built solutions on Kaggle. He noted that scores below 10% are considered noisy, and Grok 4's 15.9% "breaks through that noise barrier," demonstrating "nonzero levels of fluid intelligence" and true generalization.

Vending Bench

Recognizing the need for evaluations beyond theoretical benchmarks, the xAI presentation also featured Vending-Bench. This test puts an AI model in charge of running a vending machine in a simulated environment, giving it a budget and inventory to manage. Grok 4 significantly outperformed other models, ending the test with a net worth of $4,700. By comparison, Claude Opus 4 reached $2,000, o3 $1,800, a human operator $844, and Gemini 2.5 Pro $789. This benchmark highlights Grok 4's ability to act effectively over a long-running, business-like task.

Real-time Information and Creative Applications

Grok 4's capabilities extend beyond benchmarks, showcasing impressive real-time data processing and creative generation.

World Series Prediction

During a live demo, Grok 4 was given tools and compute to predict the winner of the World Series. The model browsed various odds sites and calculated its own probabilities, identifying potential "alpha" or edge. After about 4.5 minutes of computation, it predicted the Los Angeles Dodgers had a 21.6% chance of winning, demonstrating its ability to process vast amounts of real-time market data and derive actionable insights.
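The idea of comparing a model's own probability against market odds can be sketched as follows. The 21.6% figure is the one from the demo; the market odds and the two-outcome "Dodgers vs. field" simplification are hypothetical placeholders, and this is not a reconstruction of what Grok 4 actually computed.

```python
def implied_probability(american_odds):
    """Convert American odds (e.g. +300 or -150) to an implied probability."""
    if american_odds > 0:
        return 100.0 / (american_odds + 100.0)
    return -american_odds / (-american_odds + 100.0)


def remove_vig(raw_probs):
    """Normalize implied probabilities so they sum to 1 (strips the bookmaker margin)."""
    total = sum(raw_probs.values())
    return {name: p / total for name, p in raw_probs.items()}


def edge(model_prob, market_prob):
    """Positive edge means the model rates the outcome as likelier than the market does."""
    return model_prob - market_prob


if __name__ == "__main__":
    # Hypothetical odds for illustration only; not real market data.
    market_odds = {"Dodgers": +300, "Field (all other teams)": -350}
    raw = {name: implied_probability(o) for name, o in market_odds.items()}
    fair = remove_vig(raw)
    model_prob = 0.216  # probability reported in the demo
    print("fair market probability:", round(fair["Dodgers"], 4))
    print("edge:", round(edge(model_prob, fair["Dodgers"]), 4))
```

Raw implied probabilities add up to more than 1 because of the bookmaker's margin, which is why they are normalized before being compared with a model estimate.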

Black Hole Collision Visualization

The model was tasked with generating a visualization of two black holes colliding. While taking some artistic liberties for visual clarity, Grok 4 was explicit in its reasoning for these simplifications. For instance, to make the gravitational waves visible, it exaggerated their scale and amplitude. The visualization accurately depicted the core phases of a black hole collision: the inspiral, merger, and ringdown. It used post-Newtonian approximations instead of full general relativistic calculations near the black hole's center, a compromise it clearly stated. This demo showcased Grok 4's capacity for complex scientific visualization and its ability to communicate its methodological choices.
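To give a sense of what an inspiral approximation looks like, the sketch below evaluates the leading-order (quadrupole) chirp formula, the lowest-order term of the post-Newtonian expansion, for the gravitational-wave frequency as a function of time remaining before merger. It is a standard textbook expression, not the code Grok 4 generated, and the 30-solar-mass example masses are arbitrary.

```python
import math

G = 6.67430e-11      # gravitational constant, m^3 kg^-1 s^-2
C = 2.99792458e8     # speed of light, m/s
M_SUN = 1.98847e30   # solar mass, kg


def chirp_mass(m1, m2):
    """Chirp mass M_c = (m1*m2)^(3/5) / (m1+m2)^(1/5), in the units of m1 and m2."""
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2


def gw_frequency(tau, m1, m2):
    """Leading-order gravitational-wave frequency (Hz) at time tau (s) before merger.

    f(tau) = (1/pi) * (5 / (256 * tau))^(3/8) * (G * M_c / c^3)^(-5/8)
    Masses are in kilograms. The formula breaks down close to merger.
    """
    if tau <= 0:
        raise ValueError("tau must be positive (time remaining before merger)")
    mc_seconds = G * chirp_mass(m1, m2) / C ** 3
    return (1.0 / math.pi) * (5.0 / (256.0 * tau)) ** 0.375 * mc_seconds ** -0.625


if __name__ == "__main__":
    m1 = m2 = 30 * M_SUN  # arbitrary example masses
    for tau in (10.0, 1.0, 0.1):
        print(f"{tau:5.1f} s before merger: {gw_frequency(tau, m1, m2):7.2f} Hz")
```

The frequency rises as the time to merger shrinks, which produces the characteristic "chirp" of the inspiral phase. The merger and ringdown phases need full numerical relativity or a separate model, which matches the compromise the model reportedly acknowledged.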

Timeline Generation

Grok 4 demonstrated its real-time data search prowess by creating a timeline of model scores released for "Humanity's Last Exam" based on X feeds. It extracted information about who announced scores, the dates, and even the public reactions at those times, including leaked benchmarks. This highlights Grok 4's utility in real-time information retrieval and summarization from dynamic sources.

AI in Game Development

The xAI team has frequently discussed AI's role in creating video games, with Elon Musk predicting the creation of AAA video games in the near future. To test this, Grok 4 was given to a game designer, Denny, for a few hours. Within this short timeframe, Denny was able to create a first-person shooter game. A significant aspect highlighted was Grok 4's ability to automate asset sourcing (textures, files, etc.), which is often a challenging and time-consuming part of game development. This allows developers to focus on core game logic, effectively enabling a "game studio of one" where Grok 4 handles the tedious maintenance tasks.

Despite this impressive demonstration, the presenter expressed skepticism about Elon Musk's timeline for AAA AI-generated games.

"I would expect the first really good AI video game to be next year." I don't really believe that. I think these games are fun, but they're definitely like one-off games. We're not going to see an Assassin's Creed. We're not going to see the next Halo being created by AI. Not yet. And certainly not by the end of next year.

This reservation stems from the challenge of AI understanding and curating aspects like "taste" in games—assessing whether a game is genuinely fun or engaging. The presenter argued that taste remains primarily within the realm of human expertise for the foreseeable future, suggesting that humans will likely remain in the loop for a considerable time, especially in creative and subjective domains.

Grok 4 Availability and Pricing

Grok 4 is currently available for public use and via API. It boasts a 256K context window, multimodal reasoning, real-time data search, and enterprise-grade security. However, it is not an inexpensive service.

Plan	Monthly Price	Annual Price	Key Features
SuperGrok	$30	N/A	Grok 4 access, standard features
SuperGrok Heavy	$300	$3,000	All SuperGrok features, Grok 4 Heavy (multi-agent), higher rate limits, early access to new features

For context, SuperGrok is more expensive than the standard subscriptions for ChatGPT and Claude. The "Heavy" plan, with its multi-agent capabilities, represents a significant investment but offers enhanced problem-solving power and early access to cutting-edge features.

Future of Grok Models

Elon Musk revealed that Grok 4 is based on xAI's foundation model version 6, with training of version 7 in progress and expected to be completed by the end of the month. Version 7 aims to further improve multimodal reasoning and understanding. Looking ahead, xAI has outlined an ambitious roadmap for new model releases:

August: A coding-specific model.
September: A multimodal agent.
October: A video generation model.

The presenter acknowledged these ambitious timelines but expressed excitement, indicating that these developments could significantly expand the capabilities and applications of the Grok series.

Takeaways

Reinforcement Learning with Verifiable Rewards: Grok 4's exceptional intelligence is largely attributed to its heavy reliance on reinforcement learning with verifiable rewards, enabling it to learn from known solutions and exhibit "thinking" behavior.
Multi-Agent Collaboration: Grok 4 Heavy introduces a groundbreaking multi-agent system where multiple AI agents work collaboratively, share knowledge, and converge on the optimal solution, significantly boosting its problem-solving abilities.
Dominant Benchmark Performance: Grok 4 consistently outperforms other frontier models across diverse benchmarks, including Humanity's Last Exam, GPQA, AIME 2025, and ARC-AGI, demonstrating superior reasoning, coding, and mathematical prowess.
Real-World Application Success: The Vending-Bench test highlights Grok 4's strong performance in a practical, simulated business scenario, where it managed resources effectively and finished with a significantly higher net worth than competitors.
Advanced Creative and Real-time Capabilities: Grok 4 exhibits impressive abilities in tasks like predicting World Series outcomes, generating scientific visualizations, and creating real-time data timelines, showcasing its potential in dynamic and creative applications.
Game Development Automation: Grok 4 can significantly accelerate game development by automating tedious tasks like asset sourcing, allowing human developers to focus on core game logic.
Subscription-Based Access: Grok 4 is available through a subscription model, with the more advanced, multi-agent Grok 4 Heavy commanding a significantly higher price point, reflecting its enhanced capabilities.

This article was AI generated. It may contain errors and should be verified with the original source.
