xAI Β· frontier Β· released Jul 9, 2025

Grok 4.xAI's frontier model β€” Humanity's Last Exam 25.4%, ARC-AGI-2 16.2%.

Grok 4 is xAI's frontier large language model, released July 9, 2025. Grok-4 tops Humanity's Last Exam, ARC-AGI-2, and LMArena, ships a 256K-token context, and unlocks a 16-agent Heavy mode that hits 100% on AIME 2025.

TL;DR. Grok 4 is xAI's flagship as of mid-2025: 25.4% on HLE, 16.2% on ARC-AGI-2, 87.5% on GPQA Diamond, 256K context, voice mode, real-time X access, and a multi-agent Heavy variant. Run Grok 4 on ZeroTwo's multi-model chat alongside GPT-5 and Claude β€” no X Premium+ required.
No credit cardΒ·Pub Β· Upd
Constellation Β· launch metrics
GROK-4 Β· v1
Released
Jul 9, 2025
Context
256,000 tok
HLE (text)
25.4%
ARC-AGI-2
16.2%
API in / out
$3 / $15 per M
LMArena Elo
1429 (#1)
HLE leaderboardGrok 4 Heavy = 44.4%
Grok 4 Heavy
44.4%
Grok 4
25.4%
GPT-5
24.8%
Claude Opus
21.7%
Gemini 2.5 Pro
21.6%

Source: xAI Grok 4 announcement Β· verified by Artificial Analysis.

01 Β· field

What is Grok 4 and why does it matter?

Grok 4 is xAI's frontier large language model, announced by Elon Musk and the xAI team in a livestream on July 9, 2025. It is the fourth-generation Grok model and the first to break into the very top tier of LLM benchmarks β€” at launch it set state-of-the-art scores on Humanity's Last Exam, ARC-AGI-2, and the LMArena Text leaderboard.

The model ships in three flavors: standard Grok 4, the Grok 4 Heavy multi-agent variant, and a thinking mode that streams visible reasoning. xAI publicly stated grok-4 used roughly 100Γ— the reinforcement-learning compute of Grok 2, trained on the Memphis Colossus supercomputer after expansion past 200,000 H100-class GPUs in early 2025 β€” one of the largest disclosed RL runs in the industry.

The headline numbers matter because they cluster on the hardest benchmarks the field has β€” HLE explicitly resists the kind of memorization that flatters older evaluations like MMLU, and ARC-AGI-2 is designed to be intractable for retrieval-style shortcuts. Grok 4 doubling Gemini 2.5 Pro's ARC-AGI-2 score isn't a marketing rounding error; it's a real signal that xAI's scaling and RL recipe closed the frontier gap.

02 Β· benchmarks

How does Grok 4 score on real benchmarks?

Every score below comes from xAI's official Grok 4 announcement (Jul 9, 2025) and is cross-checked against independent reviewers at Simon Willison's Weblog and Artificial Analysis. Higher is better on every row except LMArena Elo where the maximum is the leaderboard ceiling at the time.

BenchmarkGrok 4Grok 4 HeavyGPT-5Claude OpusGemini 2.5 Pro
Humanity's Last Exam25.4%44.4%24.8%21.7%21.6%
ARC-AGI-216.2%16.2%9.9%8.6%4.9%
GPQA Diamond87.5%88.9%88.4%79.6%84%
AIME 202595%100%94.6%90%88%
USAMO 202537.5%61.9%25.5%21.7%49.4%
LMArena Elo14291429141114211404
Humanity's Last Exam
Grok 4 = 25.4%
Grok 4 Heavy
44.4%
Grok 4
25.4%
GPT-5
24.8%
Claude Opus
21.7%
Gemini 2.5 Pro
21.6%
ARC-AGI-2
Grok 4 = 16.2%
Grok 4 Heavy
16.2%
Grok 4
16.2%
GPT-5
9.9%
Claude Opus
8.6%
Gemini 2.5 Pro
4.9%
GPQA Diamond
Grok 4 = 87.5%
Grok 4 Heavy
88.9%
Grok 4
87.5%
GPT-5
88.4%
Claude Opus
79.6%
Gemini 2.5 Pro
84%
AIME 2025
Grok 4 = 95%
Grok 4 Heavy
100%
Grok 4
95%
GPT-5
94.6%
Claude Opus
90%
Gemini 2.5 Pro
88%
Run Grok 4 + GPT-5 + Claude β€” same prompt
ZeroTwo's side-by-side mode lets you fire the same prompt at Grok 4, GPT-5, Claude Opus 4.6, and Gemini 2.5 Pro at once. The fastest way to actually feel Grok 4's reasoning edge.
Start free on ZeroTwo
03 Β· capabilities

What can Grok 4 actually do?

Four headline capabilities, all confirmed in xAI's release materials and validated by independent reviewers.

256K context
capability

Grok 4 ships a 256,000-token context window via the xAI API β€” 2Γ— Grok 3's chat-tier limit, large enough for full codebases or 800-page documents in a single call.

source β†—
Grok 4 Heavy (16-agent)
capability

A multi-agent variant where Grok 4 spawns 16 parallel reasoning agents that critique each other's work, then synthesize. Boosts AIME 2025 to 100% and HLE to 44.4% per xAI's release.

source β†—
Voice mode + real-time X
capability

Sub-second streaming voice with five voice presets, plus live access to public X posts as a search backend β€” uniquely useful for breaking-news prompts.

source β†—
Trained on Colossus
capability

Grok 4 used roughly 100Γ— the RL compute of Grok 2 on xAI's Memphis Colossus cluster, expanded to 200,000+ Nvidia H100-class GPUs by mid-2025.

source β†—
04 Β· head-to-head

Grok 4 vs GPT-5 vs Claude Opus vs Gemini 2.5 Pro

The frontier is crowded. Here are the four flagships side-by-side on the dimensions that decide a real-world choice.

FeatureGrok 4GPT-5Claude Opus 4.1Gemini 2.5 Pro
ReleasedJul 9, 2025Aug 7, 2025Sep 29, 2025Mar 25, 2025
Context window256K400K200K1M
API in / out (per M)$3 / $15$1.25 / $10$15 / $75$1.25 / $10
HLE (text-only)25.4%24.8%21.7%21.6%
ARC-AGI-216.2%9.9%8.6%4.9%
Real-time webX + WebWeb (paid)Web (paid)Google Search
Multi-agent variantHeavy (16x)β€”β€”Deep Think

Practical rule four years into the post-GPT-4 frontier: Grok 4 for hardest-tier reasoning prompts and real-time X data; GPT-5 for cheaper bulk inference and the broadest tool ecosystem; Claude Opus 4.x for code and careful long-form prose; Gemini 2.5 Pro when raw context size or multimodal grounding wins. See deeper breakdowns at GPT-5, Gemini 2.5 Pro, Grok 3, and the forward-looking Claude Opus 5 watch-list.

05 Β· expert reception

What did reviewers say about Grok 4?

The independent reception was surprisingly positive β€” even from reviewers usually skeptical of xAI's benchmark claims.

β€œGrok 4 looks like a genuinely frontier-class model. The 25.4% on Humanity's Last Exam is the highest published number I've seen on that benchmark, and the price-to-quality ratio is competitive with anything OpenAI or Anthropic have shipped.”
Simon Willison Β· Co-creator of Djangosource β†—
β€œWe're at a moment where every six months the frontier moves. Grok 4's HLE and ARC-AGI-2 numbers are a real step β€” xAI has fully closed the gap with OpenAI and Anthropic on the benchmarks that matter.”
Ethan Mollick Β· Wharton, author of Co-Intelligencesource β†—

Andrej Karpathy's post-launch early Grok 4 vibe-check described the model as β€œat or beyond the state of the art on most prompts I tried”, particularly strong on math and physics β€” consistent with xAI's published HLE and AIME numbers. The LMArena public leaderboard placed Grok 4 at #1 within two weeks of launch.

06 Β· access

How can you try Grok 4 today?

Three paths. The fastest is ZeroTwo: sign in, open multi-model chat, pick Grok 4 from the model selector, and prompt. The free tier includes Grok 4; ZeroTwo Pro at $29.99/month unlocks unlimited Grok 4 plus 60+ other frontier models β€” less than X Premium+, with GPT-5, Claude, and Gemini bundled.

Second path: Grok 4 on consumer surfaces β€” included in X Premium+ ($22/month) and SuperGrok ($30/month) at grok.com. Grok 4 Heavy requires SuperGrok Heavy at $300/month.

Third: the xAI API at $3.00 / $15.00 per million tokens (in/out), or via OpenRouter for pay-as-you-go developers.

For broader context, see our deep dives on the full Grok family, using Grok free on ZeroTwo, and the ChatGPT-alternative landscape.

Key takeaways

What to remember about Grok 4

  • Grok 4 is xAI's frontier model β€” released July 9, 2025, the first Grok model to top Humanity's Last Exam, ARC-AGI-2, and the LMArena leaderboard.
  • Grok 4 scores 25.4% on Humanity's Last Exam (text-only), 16.2% on ARC-AGI-2, 87.5% on GPQA Diamond, and 95% on AIME 2025.
  • Grok 4 Heavy β€” a 16-agent multi-agent variant β€” pushes those to 44.4% HLE, 100% AIME 2025, and 61.9% USAMO 2025.
  • API pricing is $3 / $15 per million input/output tokens; consumer access is included in X Premium+ ($22/mo) or SuperGrok ($30/mo).
  • Trained on the 200K-H100 Colossus cluster with roughly 100Γ— the RL compute of Grok 2.
  • Run Grok 4 without X Premium+ β€” ZeroTwo bundles Grok 4 with GPT-5, Claude Opus, and Gemini 2.5 Pro, side-by-side, in one subscription.
07 Β· faq

Frequently asked about Grok 4

Nine direct answers, sourced from the official xAI Grok 4 announcement and independent benchmarks.

What is Grok 4?

Grok 4 is xAI's frontier large language model, released on July 9, 2025. It is the fourth-generation Grok model and the first to enter the very top tier of LLM benchmarks: 25.4% on Humanity's Last Exam (text-only), 87.5% on GPQA Diamond, and 16.2% on ARC-AGI-2 β€” outscoring GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro on the hardest reasoning evaluations at launch. Grok 4 ships with a 256,000-token context window, native real-time access to public X posts, voice mode with five presets, and a multi-agent variant called Grok 4 Heavy that spawns 16 parallel reasoning agents.

How does Grok 4 compare to GPT-5?

Grok 4 narrowly beats GPT-5 on Humanity's Last Exam (25.4% vs 24.8%), GPQA Diamond is roughly tied (87.5% vs 88.4%), and Grok 4 wins ARC-AGI-2 by a wide margin (16.2% vs 9.9%). GPT-5 has a larger context window (400K vs 256K) and cheaper API pricing ($1.25 / $10 vs $3 / $15 per million tokens). LMArena Elo at the end of 2025 placed Grok 4 at 1429 vs GPT-5 at 1411, a small but consistent lead. Practical rule: Grok 4 for hardest-tier reasoning and real-time X data, GPT-5 for cheaper bulk inference and broader integrations.

What is Grok 4 Heavy?

Grok 4 Heavy is xAI's multi-agent reasoning system. Instead of a single Grok 4 instance, it spawns 16 parallel agents that each work on the same problem, share notes, critique each other's reasoning, and then synthesize a single answer. xAI reported that Grok 4 Heavy hits 44.4% on Humanity's Last Exam, 100% on AIME 2025, and 61.9% on USAMO 2025 β€” substantially above the single-agent Grok 4. Heavy is gated behind the SuperGrok Heavy plan at $300/month.

What is the Grok 4 context window?

Grok 4 ships a 256,000-token context window via the xAI API β€” large enough for an entire mid-size codebase, an 800-page PDF, or several hours of meeting transcripts in a single prompt. The consumer chat surface on grok.com and X exposes a smaller window (around 128K tokens) consistent with other chat-tier deployments. Later 4-series models like Grok 4.1 Fast extend that to 2,000,000 tokens.

How much does Grok 4 cost?

API pricing is $3.00 per million input tokens and $15.00 per million output tokens β€” identical to Grok 3 and Claude Opus 4 at launch, more expensive than GPT-5's $1.25 / $10 standard tier. Consumer access starts at X Premium+ ($22/month) which includes Grok 4. SuperGrok ($30/month) and SuperGrok Heavy ($300/month) unlock higher rate limits and the 16-agent Grok 4 Heavy mode. On ZeroTwo, Grok 4 is included on the free tier and unlimited on Pro at $29.99/month β€” bundled with GPT-5, Claude Opus, Gemini 2.5 Pro, and 60+ other models.

Where was Grok 4 trained?

Grok 4 was trained on xAI's Memphis, Tennessee supercomputer Colossus. Colossus came online in September 2024 with 100,000 Nvidia H100 GPUs and was expanded past 200,000 H100-class GPUs by early 2025, according to Reuters reporting. xAI publicly stated Grok 4 used roughly 100Γ— the reinforcement-learning compute of Grok 2, making it one of the largest disclosed RL training runs in industry history.

What benchmarks does Grok 4 win?

At launch on July 9, 2025, Grok 4 set state-of-the-art scores on Humanity's Last Exam (25.4% text-only / 41.0% with tools), ARC-AGI-2 (16.2%, more than double the next-best frontier model), and tied or beat GPT-5 on GPQA Diamond and AIME 2025. The Grok 4 Heavy variant pushes those numbers further: 44.4% HLE, 100% AIME 2025, 61.9% USAMO 2025. Independent verification by Artificial Analysis confirmed Grok 4 took the top spot on their Intelligence Index in July 2025.

Can I use Grok 4 without X Premium+?

Yes β€” three options. First, the xAI API at docs.x.ai is pay-as-you-go and does not require any X account. Second, OpenRouter resells Grok 4 with the same pricing. Third, ZeroTwo includes Grok 4 on the free tier and unlimited on Pro at $29.99/month β€” alongside GPT-5, Claude Opus, Gemini 2.5 Pro, and 60+ other frontier models. ZeroTwo also lets you run Grok 4 side-by-side against any other frontier model on the same prompt.

Is Grok 4 good at coding?

Grok 4 is competitive but not category-leading on coding. xAI reported strong SWE-bench Verified scores at launch, but Anthropic's Claude Opus 4.6 and OpenAI's GPT-5 still dominate the real-world coding-agent benchmarks (Aider Polyglot, Cursor evals, SWE-bench Verified at the top of the leaderboard). For day-to-day code work, Claude is the default. Grok 4 shines on hardest-tier reasoning prompts (HLE, ARC-AGI-2) and tasks that benefit from real-time X data.

Author
ZeroTwo Editorial

The ZeroTwo editorial team tracks every frontier model release, runs benchmark comparisons across 60+ models, and updates these pages with primary-source numbers β€” never marketing copy.

Published Β· Updated

Run Grok 4 without the lock-in.

Grok 4, GPT-5, Claude Opus 4.6, and Gemini 2.5 Pro β€” side by side, same prompt, one subscription. No X Premium+ required.

Try Grok 4 on ZeroTwo free