OpenAI SWELancer $1M Benchmark - Deep Research Comparison: OpenAI vs Google vs xAI
I gave the three Deep Research AI agents the same task: research OpenAI's SWE-Lancer benchmark issues in their GitHub repository and extract the requirements from them
Repo: https://github.com/openai/SWELancer-Benchmark
TL;DR: OpenAI Deep Research won, very convincingly
See them researching: Link in the comments
I wanted to know more about the issues used in the $1 million benchmark. The benchmark tests LLMs' and AI agents' ability to solve real-world software engineering tasks taken from freelance websites like Upwork and Freelancer. Here are the findings:
- The average time across the three agents to research the first 10 tasks in the repository was 4 minutes
- Grok hallucinated the most
- OpenAI was very accurate
- Google Gemini Deep Research seemed more confused than anything else, though it did hallucinate too
- I took a look at the first 2 issues myself and was able to extract the requirements in around 20 seconds (a quick way to browse the repo yourself is sketched after this list)
- Google Gemini Deep Research got 0/2 right
- OpenAI Deep Research got 2/2 right
- Grok Deep Search got 0/2 right
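If you want to do the same manual check, here is a minimal Python sketch that lists the top-level contents of the SWELancer-Benchmark repo via the GitHub contents API, as a starting point for finding the task/issue files. It assumes only that the repo is public; the exact directory layout is not assumed, so drill down from the listing yourself.

```python
# Minimal sketch: list the top-level contents of the SWELancer-Benchmark repo
# using the public GitHub contents API, as a starting point for locating the
# task/issue files. Unauthenticated requests are rate-limited by GitHub.
import requests

REPO = "openai/SWELancer-Benchmark"
API_URL = f"https://api.github.com/repos/{REPO}/contents/"

resp = requests.get(API_URL, timeout=30)
resp.raise_for_status()

# Each entry is a dict with "name", "type" ("file" or "dir"), and "html_url".
for entry in resp.json():
    print(f"{entry['type']:4}  {entry['name']}")
```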
This should help set expectations for each offering, though a different topic or prompt might produce different results for each. I prefer non-verbose, human-like prompts that an intelligent AI should be able to understand. Please share any thoughts in the comments; it would be appreciated, so we all learn more and don't waste time.
Gemini Deep Research:
OpenAI Deep Research:
Grok Deep Search: