Relevant if you build with AI tools, APIs, or coding agents.
WebGPT: Improving the factual accuracy of language models through web browsing
Quick editorial signal
- Track this as an OpenAI update, not just a standalone headline.
- Useful for builders who need to understand API, coding, or workflow changes.
- Likely worth revisiting after people have used the release in practice.
We’ve fine-tuned GPT‑3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online: it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We’re excited about developing more truthful AI,[1] but challenges remain, such as coping with unfamiliar types of questions.
Language models like GPT‑3 are useful for many different tasks, but have a tendency to “hallucinate” information when performing tasks requiring obscure real-world knowledge.[2, 3] To address this, we taught GPT‑3 to use a text-based web browser. The model is given an open-ended question and a summary of the browser state, and must issue commands such as “Search ...”, “Find in page: ...” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer.
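The command interface described here can be pictured with a small sketch. It is an illustration only, not the actual WebGPT environment: the three command prefixes come from the text, while `BrowserState`, `parse_command`, `step`, and the `search_fn` backend are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Command prefixes taken from the text above; everything else is hypothetical.
PREFIXES = ("Find in page:", "Quote:", "Search")


@dataclass
class BrowserState:
    """Toy stand-in for the browser state summary shown to the model."""
    question: str
    page_text: str = ""
    quotes: list[str] = field(default_factory=list)
    last_result: str = ""


def parse_command(command: str) -> tuple[str, str]:
    """Split a text command into (name, argument)."""
    for prefix in PREFIXES:
        if command.startswith(prefix):
            return prefix.rstrip(":"), command[len(prefix):].strip()
    raise ValueError(f"unknown command: {command!r}")


def step(state: BrowserState, command: str,
         search_fn: Callable[[str], str]) -> BrowserState:
    """Apply one command. search_fn is an assumed search backend returning page text."""
    name, arg = parse_command(command)
    if name == "Search":
        state.page_text = search_fn(arg)
        state.last_result = "loaded results"
    elif name == "Find in page":
        pos = state.page_text.find(arg)
        state.last_result = f"match at offset {pos}" if pos >= 0 else "no match"
    else:  # "Quote": only text that really appears on the page can be collected
        if arg in state.page_text:
            state.quotes.append(arg)
            state.last_result = "quote saved"
        else:
            state.last_result = "quote not found on page"
    return state
```

In this sketch the collected `quotes` play the role of the passages the model later uses to compose its answer. Requiring a quote to appear verbatim on the current page is a design choice of the example, not a documented property of the real system.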
The model is fine-tuned from GPT‑3 using the same general methods we’ve used previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we improve the helpfulness and accuracy of the model’s answers by training a reward model to predict human preferences, and optimizing against it using either reinforcement learning or rejection sampling.
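To make the last two steps concrete, here is a minimal sketch of a pairwise preference loss and of rejection sampling (best-of-n). The logistic loss is a common choice for reward models and is an assumption here, since the text does not specify the exact objective. The `sample_answer` and `reward` callables are hypothetical stand-ins for the fine-tuned policy and the trained reward model.

```python
import math
from typing import Callable


def preference_loss(reward_preferred: float, reward_other: float) -> float:
    """Pairwise logistic loss: small when the human-preferred answer scores higher."""
    diff = reward_preferred - reward_other
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def best_of_n(question: str,
              sample_answer: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int) -> str:
    """Rejection sampling: draw n candidate answers, keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))
```

Best-of-n needs no further training of the policy, but it costs n samples per question at inference time. That is why the amount of rejection sampling is tied to a compute budget.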
Our system is trained to answer questions from ELI5,[4] a dataset of open-ended questions scraped from the “Explain Like I’m Five” subreddit. We trained three different models, corresponding to three different inference-time compute budgets. Our best-performing model produces answers that are preferred 56% of the time to answers written by our human demonstrators, with a similar level of factual accuracy. Even though these were the same kind of demonstrations used to train the model, we were able to outperform them by using human feedback to improve the model’s answers.
Results of human evaluations on the ELI5 test set, comparing our model with human demonstrators. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient. Error bars show ±1 standard error.
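For readers who want to reproduce this kind of figure on their own comparisons, a preference rate and its standard error can be computed as in the sketch below. It assumes each comparison is an independent win-or-loss trial, which gives the usual binomial standard error. The text does not say how its error bars were calculated, and the counts in the usage note are hypothetical.

```python
import math


def preference_rate(wins: int, comparisons: int) -> tuple[float, float]:
    """Return (rate, standard error) for pairwise comparisons, assuming no ties."""
    if comparisons <= 0 or not 0 <= wins <= comparisons:
        raise ValueError("need 0 <= wins <= comparisons and comparisons > 0")
    rate = wins / comparisons
    # Binomial standard error of a proportion.
    std_err = math.sqrt(rate * (1.0 - rate) / comparisons)
    return rate, std_err
```

With a hypothetical 56 wins out of 100 comparisons, `preference_rate(56, 100)` gives a rate of 0.56 with a standard error of roughly 0.05, so a result like this sits only about one standard error above the 50% line.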
For questions taken from the training distribution, our best model’s answers are about as factually accurate as those written by our human demonstrators, on average. However, out-of-distribution robustness is a challenge. To probe this, we evaluated our models on TruthfulQA,[5] an adversarially constructed dataset of short-form questions designed to test whether models fall prey to things like common misconceptions. Answers are scored on both truthfulness and informativeness, which trade off against one another (for example, “I have no comment” is considered truthful but not informative).
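The two-axis scoring can be summarized with a small sketch. The `Judgment` record and `summarize` function are hypothetical names for this example. The labels would come from human raters or an automated metric, and the sketch simply reports the share of truthful answers alongside the share that are both truthful and informative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One rated answer; both fields are labels supplied by an evaluator."""
    truthful: bool
    informative: bool


def summarize(judgments: list[Judgment]) -> dict[str, float]:
    """Fraction of answers that are truthful, and truthful and informative."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    n = len(judgments)
    truthful = sum(j.truthful for j in judgments) / n
    both = sum(j.truthful and j.informative for j in judgments) / n
    return {"truthful": truthful, "truthful_and_informative": both}
```

An answer such as “I have no comment” would be recorded as `Judgment(truthful=True, informative=False)`. It raises the first number but not the second, which is why reporting both makes the trade-off visible.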
Our models outperform GPT‑3 on TruthfulQA and exhibit more favourable scaling properties. However, our models lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the frequency of these failures using techniques like adversarial training.
TruthfulQA results. For GPT-3, we used the prompts and automated metric from the TruthfulQA paper. For the web-browsing model, we truncated the long-form answers and used human evaluation, since the answers are out-of-distribution for the automated metric. Error bars show ±1 standard error.
In order to provide feedback to improve factual accuracy, humans must be able to evaluate the factual accuracy of claims produced by models. This can be extremely challenging, since claims can be technical, subjective or vague. For this reason, we require the model to cite its sources.[6] This allows humans to evaluate factual accuracy by checking whether a claim is _supported by a reliable source_. As well as making the task more manageable, it also makes it less ambiguous, which is important for reducing label noise.
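Before a human judges whether a source really supports a claim, some purely mechanical checks are possible. The sketch below assumes a hypothetical answer format in which each sentence ends with a bracketed marker such as `[1]` placed before the final punctuation, and in which `references` maps marker numbers to collected quotes. It flags sentences with no citation and markers that point at nothing. It does not assess whether a source is reliable or actually supports the claim.

```python
import re

# Assumed citation format: a bracketed integer such as [1].
CITATION = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer: str) -> list[str]:
    """Sentences that carry no citation marker at all."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


def dangling_citations(answer: str, references: dict[int, str]) -> list[int]:
    """Citation numbers used in the answer that have no matching reference."""
    cited = {int(number) for number in CITATION.findall(answer)}
    return sorted(cited - set(references))
```

Checks like these only narrow down what a rater has to look at; the judgment about support and reliability stays with the human evaluator.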
However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as transparency to be important.[1]
Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about boats above). We hope to mitigate this using methods like debate.
Although our model is generally more truthful than GPT‑3 (in that it generates false statements less frequently), it still poses risks. Answers with citations are often perceived as having an air of authority, which can obscure the fact that our model still makes basic errors. The model also tends to reinforce the existing beliefs of users. We are researching how best to address these and other concerns.
In addition to these deployment risks, our approach introduces new risks _at train time_ by giving the model access to the web. Our browsing environment does not allow full web access, but allows the model to send queries to the Microsoft Bing Web Search API and follow links that already exist on the web, which can have side-effects. From our experience with GPT‑3, the model does not appear to be anywhere near capable enough to dangerously exploit these side-effects. However, these risks increase with model capability, and we are working on establishing internal safeguards against them.
Human feedback and tools such as web browsers offer a promising path towards robustly truthful, general-purpose AI systems. Our current system struggles with challenging or unfamiliar circumstances, but still represents significant progress in this direction.