How does Grok 3 compare on real benchmarks?
Every number below is pulled from xAI's official Grok 3 announcement (Feb 17, 2025) and cross-checked against independent reviewers at Simon Willison's Weblog and Artificial Analysis. Grok 3 (Think) is the row shown.
| Benchmark | Grok 3 | GPT-4o | Claude 3.5 | Gemini 2.0 |
|---|---|---|---|---|
| AIME 2024 (math) | 93.3% | 9.3% | 16% | 77.3% |
| GPQA Diamond (science) | 84.6% | 53.6% | 65% | 62.1% |
| LiveCodeBench (code) | 79.4% | 32.3% | 40.9% | 68.9% |
| MMLU-Pro (reasoning) | 79.9% | 73.9% | 78% | 75.8% |
| LMArena Elo (Feb 2025) | 1402 | 1377 | 1304 | 1380 |
On the AIME 2024 (math), Grok 3 beats GPT-4o by 84.0% and Claude 3.5 Sonnet by 77.3%.
Hover a row to switch focus.
What is Grok 3 and how was it built?
Grok 3 is a frontier mixture-of-experts language model from xAI, released February 17, 2025. It was trained on Colossus, xAI's Memphis supercomputer β 100,000 Nvidia H100 GPUs at launch in September 2024, expanded to 200,000 H100s by the time Grok 3 finished training. Musk publicly stated Grok 3 used roughly 10Γ the compute of Grok 2, which would make it one of the largest frontier training runs disclosed to date (reported by The Verge).
The family ships in four variants plus an agentic research mode. Compared to its predecessor Grok and successor Grok 4 (see the full Grok family overview), Grok 3 is where xAI first hit genuine frontier-class reasoning scores.
Grok 3 (Think) streams visible reasoning steps before answering β similar to OpenAI's o-series and DeepSeek R1. Boosts AIME from 52% β 93.3% and GPQA to 84.6%.
source βAn escalated Think variant that allocates additional test-time compute for the hardest prompts. xAI calls it the highest-quality mode the model can run in.
source βGrok 3's agent loop: browse the live web + X, synthesize, cite. Launched alongside the Feb 2025 release as part of the Grok 3 reasoning suite.
source βSmaller Grok 3 variant with the same 1M-token context. Strong for agentic pipelines and batch RAG at a fraction of the flagship price.
source βWhat do reviewers say about Grok 3?
Independent reviewers broadly confirmed xAI's benchmark story in the weeks after release, with most calling Grok 3 a real leap forward for xAI β even while noting the usual benchmark-vs-vibes gap.
βGrok 3 is genuinely impressive β the AIME and GPQA numbers are not a rounding error. xAI has caught up faster than I thought possible.β
βGrok 3's Think mode is a credible reasoning model. The 93% AIME score in particular is a step-change from anything xAI has shipped before.β
Additional commentary from Andrej Karpathy's early Grok 3 vibe-check thread described the model as βroughly at the state of the artβ on everyday prompts, particularly strong on science and math β consistent with xAI's published AIME and GPQA numbers.
How can you try Grok 3 today?
Three paths. The fastest is ZeroTwo: sign in, open the multi-model chat, pick Grok 3 (or Grok 3 Mini, or Grok 3 Think) from the model selector, and prompt. The free tier includes Grok 3; ZeroTwo Pro unlocks every Grok model plus 60+ others at $29.99/month β less than X Premium+, with Claude, GPT-5, and Gemini bundled.
Second path: Grok 3 is free with daily limits on grok.com and included in X Premium+. Third: the xAI API at $3.00 / $15.00 per million tokens (in/out), or via OpenRouter for pay-as-you-go builders.
For broader context, see our ChatGPT alternative comparison, the Perplexity overview, and the GPT-family guide.
What to remember about Grok 3
- Grok 3 is xAI's frontier model released February 17, 2025 β trained on Colossus, the 200K-H100 Memphis cluster.
- Grok 3 (Think) hits 93.3% on AIME 2024 and 84.6% on GPQA Diamond, outperforming GPT-4o and Claude 3.5 Sonnet on xAI's published benchmarks.
- Four variants ship: Grok 3, Grok 3 Mini, Grok 3 (Think), and Big Brain β plus DeepSearch for agentic research.
- API pricing: $3.00 / M input, $15.00 / M output, with a 1,000,000-token context window.
- Run Grok 3 without X Premium+ β ZeroTwo bundles Grok 3 with Claude, GPT-5, and Gemini in one subscription.
Frequently asked about Grok 3
Eight direct answers, sourced from the official xAI Grok 3 announcement and independent benchmarks.
What is Grok 3?
Grok 3 is xAI's flagship large language model, announced by Elon Musk and the xAI team on February 17, 2025. It was trained on Colossus, xAI's Memphis supercomputer with 200,000 Nvidia H100 GPUs β roughly ten times the compute used for Grok 2. Grok 3 ships in four variants: the flagship Grok 3, the smaller Grok 3 Mini, a Think mode with visible chain-of-thought reasoning, and a Big Brain mode that allocates extra test-time compute. xAI positions Grok 3 as a frontier model that matches or beats GPT-4o and Claude 3.5 Sonnet on math, science, and code benchmarks.
How does Grok 3 compare to GPT-4o and Claude 3.5 Sonnet?
On xAI's published benchmarks, Grok 3 (Think) scores 93.3% on AIME 2024 versus GPT-4o's 9.3% and Claude 3.5 Sonnet's 16.0%. On GPQA Diamond it scores 84.6% versus GPT-4o's 53.6% and Claude 3.5 Sonnet's 65.0%. On LiveCodeBench it scores 79.4% versus GPT-4o's 32.3% and Claude 3.5 Sonnet's 40.9%. On the LMArena chatbot leaderboard in late February 2025, Grok 3 briefly held the #1 slot at 1402 Elo, narrowly ahead of GPT-4o and Gemini 2.0 Pro.
What are the Grok 3 Think and Big Brain modes?
Think mode is Grok 3's reasoning variant β it streams visible chain-of-thought before producing a final answer, trained with reinforcement learning in the same family as OpenAI's o-series and DeepSeek R1. Big Brain mode is an escalated Think variant that allocates additional test-time compute for the hardest prompts; xAI describes it as the highest-quality configuration Grok 3 can run in. Both modes boost Grok 3's AIME score from about 52% (base) to 93.3% (Think).
What is the context window of Grok 3?
Grok 3's API exposes a 1,000,000-token context window, one of the largest of any frontier model in early 2025 β matched only by Gemini 2.0 Pro's 2M token window. The consumer interface on X and grok.com typically exposes a smaller context, around 128K tokens, consistent with other chat-tier deployments.
How much does Grok 3 cost?
Three tiers. API: Grok 3 is priced at $3.00 per million input tokens and $15.00 per million output tokens on the xAI API; Grok 3 Mini is cheaper. Consumer: Grok 3 is free with daily limits on grok.com and X, and included in X Premium+ starting at $22/month. On ZeroTwo, Grok 3 is included alongside Grok 4, Claude, GPT-5, and Gemini on the free tier and unlimited on Pro at $29.99/month.
What was Colossus and how was Grok 3 trained?
Colossus is xAI's Memphis, Tennessee supercomputer, brought online in September 2024. It started at 100,000 Nvidia H100 GPUs β at launch the largest single AI training cluster in the world β and was expanded to 200,000 H100s before Grok 3 finished training. Musk publicly stated Grok 3 used roughly ten times the compute of Grok 2, making it the most compute-intensive open frontier model training run disclosed at the time.
What are Grok 3's agentic capabilities?
Grok 3 launched with DeepSearch, xAI's agentic research mode that browses the live web and X, synthesizes findings, and returns cited answers β conceptually similar to OpenAI's Deep Research and Perplexity Pro. Combined with Think or Big Brain, it can run multi-step research plans, cross-check sources, and produce long-form reports. API developers can also wire Grok 3 into tool-calling agents via the xAI function-calling endpoints.
Is Grok 3 available without X Premium+?
Yes. You can use Grok 3 on grok.com with a free daily-limited tier, on the xAI API (billed pay-as-you-go), via OpenRouter, or inside ZeroTwo's multi-model workspace. On ZeroTwo you get Grok 3, Grok 3 Mini, Grok 4, Claude Sonnet 4.6, GPT-5, and Gemini 3 in one subscription β cheaper than X Premium+ and without requiring an X account.
Is Grok 3 still worth using now that Grok 4 is out?
Yes, for three reasons. First, price: Grok 3 remains the cheaper tier on the xAI API and is often the better cost-performance pick for chat, extraction, and RAG. Second, availability: Grok 3 Mini sees wider integration across OpenRouter, LangChain, and aggregators than the latest Grok 4 variants. Third, reasoning: Grok 3 (Think)'s AIME and GPQA scores are still competitive with most frontier reasoning tiers. For bulk agentic work, Grok 3 Mini often beats Grok 4 on $/task.
The ZeroTwo editorial team tracks every frontier model release, runs benchmark comparisons across 60+ models, and updates these pages with primary-source numbers β never marketing copy.
Published Β· Updated
Run Grok 3 without the lock-in.
Grok 3, Grok 3 Mini, Grok 3 Think, and Grok 4 β side by side with Claude, GPT-5, and Gemini. One subscription. No X Premium+ required.
Try Grok 3 on ZeroTwo free