Grok-3 Outperforms DeepSeek-R1, Matches OpenAI o1 Pro

February 18, 2025

xAI, the AI research company founded by Elon Musk, has unveiled its latest model, Grok-3, and early benchmarks suggest it outperforms several competitors.

The model has reportedly surpassed a score of 1400 on the Chatbot Arena leaderboard, which would place it among the most capable AI models available today.

Grok-3’s Strengths in Reasoning and Research

One of Grok-3’s standout features is its advanced reasoning (Think) capabilities and a deep research function called DeepSearch. AI researcher Andrej Karpathy, founder of Eureka Labs and a former OpenAI and Tesla researcher, was given early access to the model.

In a post on X (formerly Twitter), Karpathy wrote that Grok-3 successfully handled complex tasks such as generating a hex grid for Settlers of Catan, a task that, among the other models he mentioned, only OpenAI's top-tier models such as o1 Pro ($200/month) got right. By contrast, he noted that DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude failed it.
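For readers curious what the hex grid task involves, here is a minimal sketch in Python. It is not Karpathy's prompt or any model's output. It assumes the common axial-coordinate scheme for hexagonal boards, and the function names hex_ring_grid and hex_center are hypothetical, chosen for illustration. A standard Catan board corresponds to two rings around a center tile, for 19 tiles.

```python
import math


def hex_ring_grid(rings):
    """Return axial (q, r) coordinates for a hexagonal board.

    The board has one center tile plus `rings` concentric rings,
    so it holds 1 + 3 * rings * (rings + 1) tiles in total.
    """
    coords = []
    for q in range(-rings, rings + 1):
        for r in range(-rings, rings + 1):
            # Keep only tiles within `rings` steps of the center.
            if abs(q + r) <= rings:
                coords.append((q, r))
    return coords


def hex_center(q, r, size=1.0):
    """Convert axial coordinates to the pixel center of a pointy-top hex.

    `size` is the distance from a hex center to one of its corners.
    """
    x = size * math.sqrt(3) * (q + r / 2)
    y = size * 1.5 * r
    return (x, y)


# Assumption: a standard Catan layout uses 2 rings around the center.
board = hex_ring_grid(2)
print(len(board))
for q, r in board[:3]:
    print((q, r), hex_center(q, r))
```

Drawing the board in a web page, as in the task described above, would mean rendering a hexagon at each computed center. That rendering step is left out of this sketch.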

Karpathy also tested Grok-3's ability to analyze the technical specifications of AI models, uploading OpenAI's GPT-2 paper and asking it to estimate the FLOPs required for training. He noted that Grok-3, with its reasoning feature enabled, solved the task flawlessly, while OpenAI's o1 Pro and GPT-4o failed.
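To give a sense of the arithmetic behind such an estimate, here is a rough back-of-envelope sketch in Python. It is not a reproduction of Grok-3's answer. The 1.5 billion parameter count and the roughly 40 GB of training text come from the GPT-2 paper. The bytes-per-token ratio and the number of epochs are assumptions made here for illustration, and the factor of 6 FLOPs per parameter per token is a widely used rule of thumb (2 for the forward pass, 4 for the backward pass). The function name estimate_training_flops is hypothetical.

```python
def estimate_training_flops(dataset_bytes, bytes_per_token, epochs, n_params,
                            flops_per_param_per_token=6):
    """Rough training-compute estimate: FLOPs ~= 6 * parameters * tokens."""
    tokens_per_epoch = dataset_bytes / bytes_per_token
    total_tokens = tokens_per_epoch * epochs
    return flops_per_param_per_token * n_params * total_tokens


flops = estimate_training_flops(
    dataset_bytes=40e9,   # ~40 GB of text, as reported in the GPT-2 paper
    bytes_per_token=4,    # assumption: about 4 bytes of text per token
    epochs=10,            # assumption: the paper does not state this
    n_params=1.5e9,       # largest GPT-2 model, 1.5B parameters
)
print(f"{flops:.1e}")  # prints 9.0e+20
```

Under these assumptions the estimate lands near 1e21 FLOPs. Changing the assumed epoch count or bytes-per-token ratio shifts the result proportionally, which is why the task requires a mix of lookup, estimation, and arithmetic.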

How Grok-3 Compares to Other Leading AI Models

Karpathy described Grok-3’s performance as being on par with OpenAI’s o1 Pro and superior to DeepSeek-R1. However, he acknowledged that further evaluation is needed to determine its true ranking in the AI race.

He also tested Grok-3's DeepSearch feature, which is designed to support research tasks. While he found it comparable to Perplexity AI's deep research offering, he noted that Grok-3 still hallucinates URLs and does not always provide accurate citations.

In one test, the model listed 12 major AI labs but failed to include xAI itself, highlighting some remaining gaps in its research abilities.

Experts React to Grok-3’s Performance

After two hours of testing, Karpathy concluded that Grok-3 feels close to the state of the art, calling it “slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking.”

Other AI experts, including Lex Fridman, also praised the model. In his own post on X, he said, “My mind is blown, very impressive model.”

With xAI aggressively improving Grok-3, the AI landscape is heating up, and competition with OpenAI, Google, and other AI leaders is fiercer than ever.