
OpenAI · Article · 18 February 2025

Introducing the SWE-Lancer benchmark

Can frontier LLMs earn $1 million from real-world freelance software engineering?


Why it matters

Quick editorial signal

1 min

Impact: A product update that may change what people can do with AI this week.

Audience: Creators

Level: Expert

Track this as an OpenAI update, not just a standalone headline.
Relevant for creators comparing tools for images, audio, video, or publishing.
Likely worth revisiting after people have used the release in practice.



We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks—ranging from $50 bug fixes to $32,000 feature implementations—and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond. By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
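The idea of mapping model performance to monetary value can be illustrated with a minimal sketch. It assumes each graded task carries a payout and a pass/fail outcome; the `TaskResult` record, the task identifiers, and the managerial payout below are hypothetical and are not taken from the SWE-Lancer repository. Only the $50 and $32,000 figures echo the range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One graded task. All field names are hypothetical, for illustration only."""
    task_id: str       # made-up identifier
    kind: str          # "independent" or "managerial"
    payout_usd: float  # real-world payout attached to the task
    passed: bool       # True if the tests passed or the manager's choice was matched


def earnings(results):
    """Return (dollars earned, dollars available) over a list of results."""
    earned = sum(r.payout_usd for r in results if r.passed)
    total = sum(r.payout_usd for r in results)
    return earned, total


def pass_rate(results):
    """Fraction of tasks solved, ignoring how much each one pays."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Illustrative values only: not actual benchmark data or results.
results = [
    TaskResult("bugfix-example", "independent", 50.0, True),
    TaskResult("feature-example", "independent", 32_000.0, False),
    TaskResult("proposal-example", "managerial", 1_000.0, True),
]

earned, total = earnings(results)
print(f"pass rate: {pass_rate(results):.0%}, earned ${earned:,.0f} of ${total:,.0f}")
# prints: pass rate: 67%, earned $1,050 of $33,050
```

In this toy case the model solves two of three tasks yet earns only a small share of the available payout, which is why a dollar-weighted score can tell a different story than a plain pass rate.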


_Update on July 28, 2025: Dataset and results updated as of July 17, 2025, available at https://github.com/openai/preparedness and in our system cards. The updated dataset removes the requirement for Internet connectivity during execution, eliminating a primary source of variability in model performance._

Authors

Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke

