What is Grok 4 and why does it matter?
Grok 4 is xAI's frontier large language model, announced by Elon Musk and the xAI team in a livestream on July 9, 2025. It is the fourth-generation Grok model and the first to break into the very top tier of LLM benchmarks β at launch it set state-of-the-art scores on Humanity's Last Exam, ARC-AGI-2, and the LMArena Text leaderboard.
The model ships in three flavors: standard Grok 4, the Grok 4 Heavy multi-agent variant, and a thinking mode that streams visible reasoning. xAI publicly stated grok-4 used roughly 100Γ the reinforcement-learning compute of Grok 2, trained on the Memphis Colossus supercomputer after expansion past 200,000 H100-class GPUs in early 2025 β one of the largest disclosed RL runs in the industry.
The headline numbers matter because they cluster on the hardest benchmarks the field has β HLE explicitly resists the kind of memorization that flatters older evaluations like MMLU, and ARC-AGI-2 is designed to be intractable for retrieval-style shortcuts. Grok 4 doubling Gemini 2.5 Pro's ARC-AGI-2 score isn't a marketing rounding error; it's a real signal that xAI's scaling and RL recipe closed the frontier gap.
How does Grok 4 score on real benchmarks?
Every score below comes from xAI's official Grok 4 announcement (Jul 9, 2025) and is cross-checked against independent reviewers at Simon Willison's Weblog and Artificial Analysis. Higher is better on every row except LMArena Elo where the maximum is the leaderboard ceiling at the time.
| Benchmark | Grok 4 | Grok 4 Heavy | GPT-5 | Claude Opus | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| Humanity's Last Exam | 25.4% | 44.4% | 24.8% | 21.7% | 21.6% |
| ARC-AGI-2 | 16.2% | 16.2% | 9.9% | 8.6% | 4.9% |
| GPQA Diamond | 87.5% | 88.9% | 88.4% | 79.6% | 84% |
| AIME 2025 | 95% | 100% | 94.6% | 90% | 88% |
| USAMO 2025 | 37.5% | 61.9% | 25.5% | 21.7% | 49.4% |
| LMArena Elo | 1429 | 1429 | 1411 | 1421 | 1404 |
What can Grok 4 actually do?
Four headline capabilities, all confirmed in xAI's release materials and validated by independent reviewers.
Grok 4 ships a 256,000-token context window via the xAI API β 2Γ Grok 3's chat-tier limit, large enough for full codebases or 800-page documents in a single call.
source βA multi-agent variant where Grok 4 spawns 16 parallel reasoning agents that critique each other's work, then synthesize. Boosts AIME 2025 to 100% and HLE to 44.4% per xAI's release.
source βSub-second streaming voice with five voice presets, plus live access to public X posts as a search backend β uniquely useful for breaking-news prompts.
source βGrok 4 used roughly 100Γ the RL compute of Grok 2 on xAI's Memphis Colossus cluster, expanded to 200,000+ Nvidia H100-class GPUs by mid-2025.
source βGrok 4 vs GPT-5 vs Claude Opus vs Gemini 2.5 Pro
The frontier is crowded. Here are the four flagships side-by-side on the dimensions that decide a real-world choice.
| Feature | Grok 4 | GPT-5 | Claude Opus 4.1 | Gemini 2.5 Pro |
|---|---|---|---|---|
| Released | Jul 9, 2025 | Aug 7, 2025 | Sep 29, 2025 | Mar 25, 2025 |
| Context window | 256K | 400K | 200K | 1M |
| API in / out (per M) | $3 / $15 | $1.25 / $10 | $15 / $75 | $1.25 / $10 |
| HLE (text-only) | 25.4% | 24.8% | 21.7% | 21.6% |
| ARC-AGI-2 | 16.2% | 9.9% | 8.6% | 4.9% |
| Real-time web | X + Web | Web (paid) | Web (paid) | Google Search |
| Multi-agent variant | Heavy (16x) | β | β | Deep Think |
Practical rule four years into the post-GPT-4 frontier: Grok 4 for hardest-tier reasoning prompts and real-time X data; GPT-5 for cheaper bulk inference and the broadest tool ecosystem; Claude Opus 4.x for code and careful long-form prose; Gemini 2.5 Pro when raw context size or multimodal grounding wins. See deeper breakdowns at GPT-5, Gemini 2.5 Pro, Grok 3, and the forward-looking Claude Opus 5 watch-list.
What did reviewers say about Grok 4?
The independent reception was surprisingly positive β even from reviewers usually skeptical of xAI's benchmark claims.
βGrok 4 looks like a genuinely frontier-class model. The 25.4% on Humanity's Last Exam is the highest published number I've seen on that benchmark, and the price-to-quality ratio is competitive with anything OpenAI or Anthropic have shipped.β
βWe're at a moment where every six months the frontier moves. Grok 4's HLE and ARC-AGI-2 numbers are a real step β xAI has fully closed the gap with OpenAI and Anthropic on the benchmarks that matter.β
Andrej Karpathy's post-launch early Grok 4 vibe-check described the model as βat or beyond the state of the art on most prompts I triedβ, particularly strong on math and physics β consistent with xAI's published HLE and AIME numbers. The LMArena public leaderboard placed Grok 4 at #1 within two weeks of launch.
How can you try Grok 4 today?
Three paths. The fastest is ZeroTwo: sign in, open multi-model chat, pick Grok 4 from the model selector, and prompt. The free tier includes Grok 4; ZeroTwo Pro at $29.99/month unlocks unlimited Grok 4 plus 60+ other frontier models β less than X Premium+, with GPT-5, Claude, and Gemini bundled.
Second path: Grok 4 on consumer surfaces β included in X Premium+ ($22/month) and SuperGrok ($30/month) at grok.com. Grok 4 Heavy requires SuperGrok Heavy at $300/month.
Third: the xAI API at $3.00 / $15.00 per million tokens (in/out), or via OpenRouter for pay-as-you-go developers.
For broader context, see our deep dives on the full Grok family, using Grok free on ZeroTwo, and the ChatGPT-alternative landscape.
What to remember about Grok 4
- Grok 4 is xAI's frontier model β released July 9, 2025, the first Grok model to top Humanity's Last Exam, ARC-AGI-2, and the LMArena leaderboard.
- Grok 4 scores 25.4% on Humanity's Last Exam (text-only), 16.2% on ARC-AGI-2, 87.5% on GPQA Diamond, and 95% on AIME 2025.
- Grok 4 Heavy β a 16-agent multi-agent variant β pushes those to 44.4% HLE, 100% AIME 2025, and 61.9% USAMO 2025.
- API pricing is $3 / $15 per million input/output tokens; consumer access is included in X Premium+ ($22/mo) or SuperGrok ($30/mo).
- Trained on the 200K-H100 Colossus cluster with roughly 100Γ the RL compute of Grok 2.
- Run Grok 4 without X Premium+ β ZeroTwo bundles Grok 4 with GPT-5, Claude Opus, and Gemini 2.5 Pro, side-by-side, in one subscription.
Frequently asked about Grok 4
Nine direct answers, sourced from the official xAI Grok 4 announcement and independent benchmarks.
What is Grok 4?
Grok 4 is xAI's frontier large language model, released on July 9, 2025. It is the fourth-generation Grok model and the first to enter the very top tier of LLM benchmarks: 25.4% on Humanity's Last Exam (text-only), 87.5% on GPQA Diamond, and 16.2% on ARC-AGI-2 β outscoring GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro on the hardest reasoning evaluations at launch. Grok 4 ships with a 256,000-token context window, native real-time access to public X posts, voice mode with five presets, and a multi-agent variant called Grok 4 Heavy that spawns 16 parallel reasoning agents.
How does Grok 4 compare to GPT-5?
Grok 4 narrowly beats GPT-5 on Humanity's Last Exam (25.4% vs 24.8%), GPQA Diamond is roughly tied (87.5% vs 88.4%), and Grok 4 wins ARC-AGI-2 by a wide margin (16.2% vs 9.9%). GPT-5 has a larger context window (400K vs 256K) and cheaper API pricing ($1.25 / $10 vs $3 / $15 per million tokens). LMArena Elo at the end of 2025 placed Grok 4 at 1429 vs GPT-5 at 1411, a small but consistent lead. Practical rule: Grok 4 for hardest-tier reasoning and real-time X data, GPT-5 for cheaper bulk inference and broader integrations.
What is Grok 4 Heavy?
Grok 4 Heavy is xAI's multi-agent reasoning system. Instead of a single Grok 4 instance, it spawns 16 parallel agents that each work on the same problem, share notes, critique each other's reasoning, and then synthesize a single answer. xAI reported that Grok 4 Heavy hits 44.4% on Humanity's Last Exam, 100% on AIME 2025, and 61.9% on USAMO 2025 β substantially above the single-agent Grok 4. Heavy is gated behind the SuperGrok Heavy plan at $300/month.
What is the Grok 4 context window?
Grok 4 ships a 256,000-token context window via the xAI API β large enough for an entire mid-size codebase, an 800-page PDF, or several hours of meeting transcripts in a single prompt. The consumer chat surface on grok.com and X exposes a smaller window (around 128K tokens) consistent with other chat-tier deployments. Later 4-series models like Grok 4.1 Fast extend that to 2,000,000 tokens.
How much does Grok 4 cost?
API pricing is $3.00 per million input tokens and $15.00 per million output tokens β identical to Grok 3 and Claude Opus 4 at launch, more expensive than GPT-5's $1.25 / $10 standard tier. Consumer access starts at X Premium+ ($22/month) which includes Grok 4. SuperGrok ($30/month) and SuperGrok Heavy ($300/month) unlock higher rate limits and the 16-agent Grok 4 Heavy mode. On ZeroTwo, Grok 4 is included on the free tier and unlimited on Pro at $29.99/month β bundled with GPT-5, Claude Opus, Gemini 2.5 Pro, and 60+ other models.
Where was Grok 4 trained?
Grok 4 was trained on xAI's Memphis, Tennessee supercomputer Colossus. Colossus came online in September 2024 with 100,000 Nvidia H100 GPUs and was expanded past 200,000 H100-class GPUs by early 2025, according to Reuters reporting. xAI publicly stated Grok 4 used roughly 100Γ the reinforcement-learning compute of Grok 2, making it one of the largest disclosed RL training runs in industry history.
What benchmarks does Grok 4 win?
At launch on July 9, 2025, Grok 4 set state-of-the-art scores on Humanity's Last Exam (25.4% text-only / 41.0% with tools), ARC-AGI-2 (16.2%, more than double the next-best frontier model), and tied or beat GPT-5 on GPQA Diamond and AIME 2025. The Grok 4 Heavy variant pushes those numbers further: 44.4% HLE, 100% AIME 2025, 61.9% USAMO 2025. Independent verification by Artificial Analysis confirmed Grok 4 took the top spot on their Intelligence Index in July 2025.
Can I use Grok 4 without X Premium+?
Yes β three options. First, the xAI API at docs.x.ai is pay-as-you-go and does not require any X account. Second, OpenRouter resells Grok 4 with the same pricing. Third, ZeroTwo includes Grok 4 on the free tier and unlimited on Pro at $29.99/month β alongside GPT-5, Claude Opus, Gemini 2.5 Pro, and 60+ other frontier models. ZeroTwo also lets you run Grok 4 side-by-side against any other frontier model on the same prompt.
Is Grok 4 good at coding?
Grok 4 is competitive but not category-leading on coding. xAI reported strong SWE-bench Verified scores at launch, but Anthropic's Claude Opus 4.6 and OpenAI's GPT-5 still dominate the real-world coding-agent benchmarks (Aider Polyglot, Cursor evals, SWE-bench Verified at the top of the leaderboard). For day-to-day code work, Claude is the default. Grok 4 shines on hardest-tier reasoning prompts (HLE, ARC-AGI-2) and tasks that benefit from real-time X data.
The ZeroTwo editorial team tracks every frontier model release, runs benchmark comparisons across 60+ models, and updates these pages with primary-source numbers β never marketing copy.
Published Β· Updated
Run Grok 4 without the lock-in.
Grok 4, GPT-5, Claude Opus 4.6, and Gemini 2.5 Pro β side by side, same prompt, one subscription. No X Premium+ required.
Try Grok 4 on ZeroTwo free