← Back to OpenAI updates ← Terug naar OpenAI-updates

OpenAI ARTICLE ARTIKEL 7 August 2025 7 augustus 2025

Introducing GPT‑5 for developers Introducing GPT‑5 for developers

The best model for coding and agentic tasks. The best model for coding and agentic tasks.

Updates Updates Videos Video's

Article details Artikelgegevens

AI maker AI-maker OpenAI Type Type Article Artikel Published Gepubliceerd 7 August 2025 7 augustus 2025 Updates Updates Videos Video's View original article Bekijk origineel artikel

Why it matters Waarom dit telt

Quick editorial signal Snelle redactionele duiding

14 min

Impact Impact

Worth checking before choosing or changing a subscription. Handig om te checken voordat je een abonnement kiest of wijzigt.

Audience Voor wie Developers Developers

Level Niveau Expert Expert

Track this as a OpenAI update, not just a standalone headline. Bekijk dit als OpenAI-update, niet alleen als losse headline.
Check plan details before changing subscriptions or advising a team. Controleer plandetails voordat je abonnementen wijzigt of een team adviseert.
Likely worth revisiting after people have used the release in practice. Waarschijnlijk de moeite waard om opnieuw te bekijken zodra mensen het in praktijk gebruiken.

model apps video pricing

Introduction

Today, we’re releasing GPT‑5 in our API platform—our best model yet for coding and agentic tasks.

GPT‑5 is state-of-the-art (SOTA) across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot. We trained GPT‑5 to be a true coding collaborator. It excels at producing high-quality code and handling tasks such as fixing bugs, editing code, and answering questions about complex codebases. The model is steerable and collaborative—it can follow very detailed instructions with high accuracy and can provide upfront explanations of its actions before and between tool calls. The model also excels at front-end coding, beating OpenAI o3 at frontend web development 70% of the time in internal testing.

We trained GPT‑5 on real-world coding tasks in collaboration with early testers across startups and enterprises. Cursor says GPT‑5 is “the smartest model [they’ve] used” and “remarkably intelligent, easy to steer, and even has a personality [they] haven’t seen in other models.” Windsurf shared GPT‑5 is SOTA on their evals and “has half the tool calling error rate over other frontier models.” Vercelsays “it’s the best frontend AI model, hitting top performance across both the aesthetic sense and the code quality, putting it in a category of its own.”

GPT‑5 also excels at long-running agentic tasks—achieving SOTA results on τ 2-bench telecom (96.7%), a tool-calling benchmark released just 2 months ago. GPT‑5’s improved tool intelligence lets it reliably chain together dozens of tool calls—both in sequence and in parallel—without losing its way, making it far better at executing complex, real-world tasks end to end. It also follows tool instructions more precisely, is better at handling tool errors, and excels at long-context content retrieval. Manus says GPT‑5 “achieved the best performance [they’ve] ever seen from a single model on [their] internal benchmarks.”Notion says “[the model’s] rapid responses, especially in low reasoning mode, make GPT‑5 an ideal model when you need complex tasks solved in one shot.” Inditex shared “what truly sets [GPT‑5] apart is the depth of its reasoning: nuanced, multi-layered answers that reflect real subject-matter understanding.”

We’re introducing new features in our API to give developers more control over model responses. GPT‑5 supports a new verbosity parameter (values: low, medium, high) to help control whether answers are short and to the point or long and comprehensive. GPT‑5’s reasoning_effort parameter can now take a minimal value to get answers back faster, without extensive reasoning first. We’ve also added a new tool type—custom tools—to let GPT‑5 call tools with plaintext instead of JSON. Custom tools support constraining by developer-supplied context-free grammars.

We’re releasing GPT‑5 in three sizes in the API—gpt-5, gpt-5-mini, and gpt-5-nano—to give developers more flexibility to trade off performance, cost, and latency. While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.

To read about GPT‑5 in ChatGPT, and learn more about other ChatGPT improvements, see our research blog. For more on how enterprises are excited to use GPT‑5, see our enterprise blog⁠.

Coding

GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals.

Early feedback on GPT‑5 for real-world coding tasks

Cursor Windsurf Vercel JetBrains Factory Lovable GitLab Augment Code GitHub Cognition

> “GPT-5 is the smartest coding model we've used. Our team has found GPT-5 to be remarkably intelligent, easy to steer, and even to have a personality we haven’t seen in any other model. It not only catches tricky, deeply-hidden bugs but can also run long, multi-turn background agents to see complex tasks through to the finish—the kinds of problems that used to leave other models stuck. It’s become our daily driver for everything from scoping and planning PRs to completing end-to-end builds.”

Michael Truell, Co-Founder & CEO at Cursor

On SWE-bench Verified, an evaluation based on real-world software engineering tasks, GPT‑5 scores 74.9%, up from o3’s 69.1%. Notably, GPT‑5 achieves its high score with greater efficiency and speed: relative to o3 at high reasoning effort, GPT‑5 uses 22% fewer output tokens and 45% fewer tool calls.

In SWE-bench Verified⁠, a model is given a code repository and issue description, and must generate a patch to solve the issue. Text labels indicate the reasoning effort. Our scores omit 23 of 500 problems whose solutions did not reliably pass on our infrastructure. GPT‑5 was given a short prompt that emphasized verifying solutions thoroughly; the same prompt did not benefit o3.

On Aider polyglot, an evaluation of code editing, GPT‑5 sets a new record of 88%, a one-third reduction in error rate compared to o3.

In Aider polygot⁠(opens in a new window) (diff), a model is given a coding exercise from Exercism and must write its solution as a code diff. Reasoning models were run with high reasoning effort.

We’ve also found GPT‑5 to be excellent at digging deep into codebases to answer questions about how various pieces work or interoperate. In a codebase as complicated as OpenAI’s reinforcement learning stack, we’re finding that GPT‑5 can help us reason about and answer questions about our code, accelerating our own day-to-day work.

Frontend engineering

When producing frontend code for web apps, GPT‑5 is more aesthetically-minded, ambitious, and accurate. In side-by-side comparisons with o3, GPT‑5 was preferred by our testers 70% of the time.

Here are some fun, cherry-picked examples of what GPT‑5 can do with a single prompt:

Espresso Lab website Audio step sequencer app Outer space game

Prompt:Please generate a beautiful, realistic landing page for a service that provides the ultimate coffee enthusiast a $200/month subscription that provides equipment rental and coaching for coffee roasting and creating the ultimate espresso. The target audience is a bay area middle-aged person who might work in tech and is educated, has disposable income, and is passionate about the art and science of coffee. Optimize for conversion for a 6 month signup.

See more examples by GPT‑5 in our gallery here⁠(opens in a new window).

Coding collaboration

GPT‑5 is a better collaborator, particularly in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. While it works, GPT‑5 can output plans, updates, and recaps in between tool calls. Relative to our past models, GPT‑5 is more proactive at completing ambitious tasks without pausing for your go-ahead or balking at high complexity.

Here’s an example of how GPT‑5 can look while tackling a complex task (in this case, creating a website for a restaurant):

After the user asks for a website for their restaurant, GPT‑5 shares a quick plan, scaffolds the app, installs dependencies, creates the site content, runs a build to check for compilation errors, summarizes its work, and suggests potential next steps. This video has been sped up ~3x to save you the wait; the full duration to create the website was about three minutes.

Agentic tasks

Beyond agentic coding, GPT‑5 is better at agentic tasks generally. GPT‑5 sets new records on benchmarks of instruction following (69.6% on Scale MultiChallenge, as graded by o3‑mini) and tool calling (96.7% on τ 2-bench telecom). Improved tool intelligence allows GPT‑5 to more reliably chain together actions to accomplish real-world tasks.

Early feedback on GPT‑5 for agentic tasks

Manus Mercado Libre Notion Genspark Inditex Zendesk Canva Atlassian Harvey BBVA Clay Uber

> “GPT-5 is a big step up. It achieved the best performance we’ve ever seen from a single model on our internal benchmarks. GPT-5 excelled across various agentic tasks—even before we tweaked a single line of code or tailored a prompt. The new preambles and more precise control over tool use enabled a significant leap in the stability and steerability of our agents.”

Yichao ‘Peak’ Ji, Co-Founder & Chief Scientist at Manus

Instruction following

GPT‑5 follows instructions more reliably than any of its predecessors, scoring highly on COLLIE, Scale MultiChallenge, and our internal instruction following eval.

In COLLIE⁠(opens in a new window), models must write text that meets various constraints. In Scale MultiChallenge⁠(opens in a new window), models are challenged on multi-turn conversations to properly use four types of information from previous messages. Our scores come from using o3‑mini as a grader, which was more accurate than GPT‑4o. In our internal OpenAI API instruction following eval, models must follow difficult instructions derived from real developer feedback. Reasoning models were run with high reasoning effort.

Tool calling

We worked hard to improve tool calling in the ways that matter to developers. GPT‑5 is better at following tool instructions, better at dealing with tool errors, and better at proactively making many tool calls in sequence or in parallel. When instructed, GPT‑5 can also output preamble messages before and between tool calls to update users on progress during longer agentic tasks.

Two months ago, τ 2-bench telecom was published by Sierra.ai as a challenging tool use benchmark that highlighted how language model performance drops significantly when interacting with an environment state that can be changed by users. In their publication⁠(opens in a new window), no model scored above 49%. GPT‑5 scores 97%.

In τ2-bench⁠(opens in a new window), a model must use tools to accomplish a customer service task, where there may be a user who can communicate and can take actions on the world state. Reasoning models were run with high reasoning effort.

GPT‑5 shows strong improvements to long-context performance as well. On OpenAI-MRCR, a measure of long-context information retrieval, GPT‑5 outperforms o3 and GPT‑4.1, by a margin that grows substantially at longer input lengths.

In OpenAI-MRCR⁠(opens in a new window)(multi-round co-reference resolution), multiple identical “needle” user requests are inserted into long “haystacks” of similar requests and responses, and the model is asked to reproduce the response to i-th needle. Mean match ratio measures the average string match ratio between the model’s response and the correct answer. The points at 256k max input tokens represent averages over 128k–256k input tokens, and so forth. Here, 256k represents 256 * 1,024 = 262,114 tokens. Reasoning models were run with high reasoning effort.

We’re also open sourcing BrowseComp Long Context⁠(opens in a new window), a new benchmark for evaluating long-context Q&A. In this benchmark, the model is given a user query, a long list of relevant search results, and must answer the question based on the search results. We designed BrowseComp Long Context to be realistic, difficult, and have reliably correct ground truth answers. On inputs that are 128K–256K tokens, GPT‑5 gives the correct answer 89% of the time.

In the API, all GPT‑5 models can accept a maximum of 272,000 input tokens and emit a maximum of 128,000 reasoning & output tokens, for a total context length of 400,000 tokens.

Factuality

GPT‑5 is more trustworthy than our prior models. On prompts from LongFact and FactScore benchmarks, GPT‑5 makes ~80% fewer factual errors than o3. This makes it better suited for agentic use cases where correctness matters—especially in code, data, and decision-making.

Higher scores are worse. LongFact⁠(opens in a new window) and FActScore⁠(opens in a new window) consist of open-ended fact-seeking questions. We use an LLM-based grader with browsing to fact-check responses on prompts from these benchmarks and measure the fraction of factually incorrect claims. Implementation and grading details can be found in the system card⁠. Reasoning models used high reasoning effort. Search was not enabled.

Generally, GPT‑5 has been trained to be more self-aware of its own limitations and better able to handle unexpected curveballs. We also trained GPT‑5 to be much more accurate on health questions (read more in our research blog). As with all language models, we recommend you verify GPT‑5’s work when the stakes are high.

New features

Minimal reasoning effort

Developers can control GPT‑5’s thinking time via the reasoning_effort parameter in the API. In addition to the prior values—low, medium (default), and high—GPT‑5 also supports minimal, which minimizes GPT‑5’s reasoning to return an answer quickly.

Higher reasoning_effort values maximize quality and lower values maximize speed. Not all tasks benefit equally from additional reasoning, so we recommend experimenting to see which works best for the use cases you care about.

For example, reasoning above low adds little to relatively simple long-context retrieval, but adds quite a few percentage points to CharXiv Reasoning⁠(opens in a new window), a visual reasoning benchmark.

GPT‑5’s reasoning effort yields different benefits on different tasks. For CharXiv Reasoning, GPT‑5 was given access to a python tool.

Verbosity

To help steer the default length of GPT‑5’s answers, we’ve introduced a new API parameter verbosity, which takes values of low, medium (default), and high. If explicit instructions conflict with the verbosity parameters, explicit instructions take precedent. For example, if you ask GPT‑5 to “write a 5 paragraph essay”, the model’s response should always be 5 paragraphs regardless of the verbosity level (however, the paragraphs themselves may be longer or shorter).

Verbosity=low

Why is the sky blue?

Because of Rayleigh scattering. Air molecules are much smaller than the wavelength of visible light, so they scatter short wavelengths (blue/violet) far more than long ones (red). The scattered blue light comes to your eyes from all directions, making the sky look blue. We don’t see violet as much because the Sun emits less of it, our eyes are less sensitive to it, and some is absorbed in the upper atmosphere. At sunrise and sunset, the longer path through air removes more blue, leaving the sky reddish.

Verbosity=medium

Because of Rayleigh scattering. Sunlight contains all colors, but air molecules scatter shorter wavelengths (blue/violet) much more strongly than longer wavelengths (red). The scattering strength goes roughly as 1/(wavelength)^4, so blue light gets redirected across the sky and into your eyes.

We don’t see a violet sky because:

Because of Rayleigh scattering. Sunlight contains all colors, but air molecules scatter shorter wavelengths (blue/violet) much more strongly than longer wavelengths (red). The scattering strength goes roughly as 1/(wavelength)^4, so blue light gets redirected across the sky and into your eyes.

We don’t see a violet sky because:

* The Sun emits slightly less violet than blue.

At sunrise and sunset, sunlight passes through more atmosphere, scattering out most of the blues along the way and leaving the longer red/orange wavelengths to dominate.

Verbosity=high

The sky looks blue because sunlight gets scattered by the tiny molecules in Earth’s atmosphere. This process is called Rayleigh scattering.

Key points:

Why is the sky blue?

The sky looks blue because sunlight gets scattered by the tiny molecules in Earth’s atmosphere. This process is called Rayleigh scattering.

Why not violet?

* Sunlight is made of many colors (wavelengths). Shorter wavelengths, like violet and blue, are scattered much more strongly by air molecules than longer wavelengths like red. The scattering strength increases sharply as wavelength gets shorter.

* As sunlight passes through the atmosphere, the shorter-wavelength light (especially blue) is scattered in all directions, so no matter where you look away from the Sun, you see that scattered blue light — the “blue sky.”

Why not violet?

Together, that makes the scattered light we perceive look mainly blue.

Why are sunsets red/orange?

* Some violet/UV is absorbed higher up by ozone.

Together, that makes the scattered light we perceive look mainly blue.

Other influences:

* Near sunrise and sunset, sunlight travels a much longer path through the atmosphere.

* Much of the blue light gets scattered out of the direct path before it reaches your eyes, leaving the remaining direct sunlight enriched in reds and oranges.

Other influences:

If Earth had no atmosphere, there would be nothing to scatter sunlight, and the sky would appear black even in daytime, with the Sun as a bright disk.

Preamble messages before tool calls

If instructed, GPT‑5 will output user-visible preamble messages before and between tool calls. Unlike hidden reasoning messages, these visible messages allow GPT‑5 to communicate plans and progress to the user, helping end users understand its approach and intent behind the tool calls.

Custom tools

We’re introducing a new tool type—custom tools—that allows GPT‑5 to call a tool with plaintext instead of JSON. To constrain GPT‑5 to follow custom tool formats, developers can supply a regex, or even a more fully specified context-free grammar⁠(opens in a new window).

Previously, our interface for developer-defined tools required them to be called with JSON, a common format used by web APIs and developers generally. However, outputting valid JSON requires the model to perfectly escape all quotation marks, backslashes, newlines, and other control characters. Although our models are well-trained to output JSON, on long inputs like hundreds of lines of code or a 5-page report, the odds of an error creep up. With custom tools, GPT‑5 can write tool inputs as plaintext, without having to escape all of the characters that require escaping.

On SWE-bench Verified using custom tools instead of JSON tools, GPT‑5 scores about the same.

We’re introducing a new tool type—custom tools—that allows GPT‑5 to call a tool with plaintext instead of JSON. To constrain GPT‑5 to follow custom tool formats, developers can supply a regex, or even a more fully specified context-free grammar⁠(opens in a new window).

Previously, our interface for developer-defined tools required them to be called with JSON, a common format used by web APIs and developers generally. However, outputting valid JSON requires the model to perfectly escape all quotation marks, backslashes, newlines, and other control characters. Although our models are well-trained to output JSON, on long inputs like hundreds of lines of code or a 5-page report, the odds of an error creep up. With custom tools, GPT‑5 can write tool inputs as plaintext, without having to escape all of the characters that require escaping.

On SWE-bench Verified using custom tools instead of JSON tools, GPT‑5 scores about the same.

Help shape what we cover next Help bepalen wat we hierna volgen

Anonymous feedback, no frontend account needed. Anonieme feedback, zonder front-end account.

Watch related videos Bekijk gerelateerde video's

Open videos → Open video's →

Introducing GPT-5.5

OpenAI Video Video

23 Apr 2026 23 apr. 2026

Introducing GPT-5.5 GPT-5.5 geïntroduceerd

Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done.... GPT-5.5 introduceren: een nieuwe klasse intelligentie voor echt werk en het aansturen van agents, gebouwd om complexe doelen te begrijpen, tools te gebruiken, zijn werk te controleren en meer taken tot voltooiing te brengen. Het markeert een nieuwe manier om computerwerk gedaan te krijgen....

Open video → Open video →

Introducing GPT-5

OpenAI Video Video

8 Aug 2025 8 aug. 2025

Introducing GPT-5 GPT-5 geïntroduceerd

Introducing GPT-5, our best AI system yet! GPT-5 features state-of-the-art performance across coding, math, writing assistance, health, visual perception, and more. Use GPT-5 to build websites, create apps, and tap into its improved writi... Maak kennis met GPT-5, ons beste AI-systeem tot nu toe! GPT-5 biedt toonaangevende prestaties op het gebied van coderen, wiskunde, schrijfondersteuning, gezondheid, visuele waarneming en meer. Gebruik GPT-5 om websites en apps te bouwen, en maak gebruik van de verbeterde schrijfmogelijkheden voor alledaagse taken zoals rapporten, e-mails en redigeren.

Open video → Open video →

Introducing GPT-5

OpenAI Video Video

7 Aug 2025 7 aug. 2025

Introducing GPT-5 Introductie van GPT-5

Sam Altman, Greg Brockman, Sebastien Bubeck, Mark Chen, Yann Dubois, Brian Fioca, Adi Ganesh, Oliver Godement, Saachi Jain, Christina Kaplan, Christina Kim, Elaine Ya Le, Felipe Millon, Michelle Pokrass, Jakub Pachocki, Max Schwarzer, Renni... Sam Altman, Greg Brockman, Sebastien Bubeck, Mark Chen, Yann Dubois, Brian Fioca, Adi Ganesh, Oliver Godement, Saachi Jain, Christina Kaplan, Christina Kim, Elaine Ya Le, Felipe Millon, Michelle Pokrass, Jakub Pachocki, Max Schwarzer, Renni...

Open video → Open video →

Share article Deel artikel

More from OpenAI Meer van OpenAI

All updates Alle updates

27 Apr 2026 27 apr. 2026

OpenAI available at FedRAMP Moderate OpenAI available at FedRAMP Moderate

Expanding secure AI for government. Expanding secure AI for government.

Open article → Open artikel →

27 Apr 2026 27 apr. 2026

Choco automates food distribution with AI agents Choco automates food distribution with AI agents

Using OpenAI APIs, Choco processes millions of orders, reducing manual work and enabling always-on operations across global food supply chains. Using OpenAI APIs, Choco processes millions of orders, reducing manual work and enabling always-on operations across global food supply chains.

Open article → Open artikel →

27 Apr 2026 27 apr. 2026

An open-source spec for Codex orchestration: Symphony. An open-source spec for Codex orchestration: Symphony.

Title: An open-source spec for Codex orchestration: Symphony. Title: An open-source spec for Codex orchestration: Symphony.

Open article → Open artikel →

27 Apr 2026 27 apr. 2026

The next phase of the Microsoft OpenAI partnership The next phase of the Microsoft OpenAI partnership

Amended agreement provides long-term clarity. Amended agreement provides long-term clarity.

Open article → Open artikel →

Gemini komt eraan