OpenAI SWELancer $1M Benchmark - Deep Research Comparison: OpenAI vs Google vs xAI

I gave the three Deep Research AI agents the same task: research the issues in OpenAI's SWE-Lancer benchmark, taken from their GitHub repository, and extract the requirements from them.

Repo: https://github.com/openai/SWELancer-Benchmark

TL;DR: OpenAI Deep Research won, very convincingly

See them researching: Link in the comments

I wanted to know more about the issues used in the $1 million benchmark. The benchmark tests the ability of LLMs and AI agents to solve real-world software engineering tasks taken from freelance websites like Upwork and Freelancer. Here are the findings:

- The average time across the three agents to research the first 10 tasks in the repository was 4 minutes

- Grok hallucinated the most

- OpenAI was very accurate

- Google Gemini Deep Research seemed more confused than hallucinatory, though it did hallucinate

- I took a look at the first 2 issues myself and was able to extract the requirements in around 20 seconds

- Google Gemini Deep Research got 0/2 right

- OpenAI Deep Research got 2/2 right

- Grok Deep Search got 0/2 right

This should help set expectations for each offering, though the topic and wording of the prompt might produce different results for each. I prefer non-verbose, human-like prompts; an intelligent AI should be able to understand them. Please share any thoughts in the comments section; that would be appreciated so we can learn more and not waste time.
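
If you want to poke at the same tasks yourself, here is a minimal sketch (not how the agents did it) that lists the top-level contents of the SWELancer-Benchmark repo via the GitHub REST API, as a starting point for finding the task folders and extracting requirements by hand. It assumes nothing about the repo's directory layout; inspect the output to see where the issues actually live.

```python
# Minimal sketch: list the top-level contents of the SWELancer-Benchmark repo
# using the public GitHub REST API. No assumptions are made about which
# directory holds the benchmark tasks; check the printed entries to find it.
import requests

resp = requests.get(
    "https://api.github.com/repos/openai/SWELancer-Benchmark/contents/",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

for entry in resp.json():
    # Each entry is a file or directory at the repo root.
    print(f"{entry['type']:>4}  {entry['name']}")
```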

Gemini Deep Research:

https://preview.redd.it/6nwaqy112ipe1.png?width=2110&format=png&auto=webp&s=eafd4e1716146033f145fa071e76fe0bfc4fb5d1

OpenAI Deep Research:

https://preview.redd.it/c5znrlm32ipe1.png?width=1690&format=png&auto=webp&s=f8a79d6b4b88db63d7697afa8a2033495372129e

Grok Deep Search:

https://preview.redd.it/pxz0dsna2ipe1.png?width=1523&format=png&auto=webp&s=b16365ec2aef00f53c19edfd9d882d5bc1dadb68