Worth checking before choosing or changing a subscription. Handig om te checken voordat je een abonnement kiest of wijzigt.
Finding GPT-4’s mistakes with GPT-4 Finding GPT-4’s mistakes with GPT-4
Title: Finding GPT-4’s mistakes with GPT-4 Title: Finding GPT-4’s mistakes with GPT-4
Quick editorial signal Snelle redactionele duiding
- Track this as a OpenAI update, not just a standalone headline. Bekijk dit als OpenAI-update, niet alleen als losse headline.
- Check plan details before changing subscriptions or advising a team. Controleer plandetails voordat je abonnementen wijzigt of een team adviseert.
- Likely worth revisiting after people have used the release in practice. Waarschijnlijk de moeite waard om opnieuw te bekijken zodra mensen het in praktijk gebruiken.
Finding GPT-4’s mistakes with GPT-4 | OpenAI
Skip to main content
[](https://openai.com/)
* Research
* Products
* Business
* Developers
* Company
* Foundation(opens in a new window)
Log inTry ChatGPT(opens in a new window)
Try ChatGPT(opens in a new window)Login
OpenAI
Table of contents
* Developers
* Company
* Foundation(opens in a new window)
June 27, 2024
Publication
Finding GPT‑4’s mistakes with GPT‑4
CriticGPT, a model based on GPT‑4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
Read paper(opens in a new window)
We've trained a model, based on GPT‑4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT‑like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.
The GPT‑4 series of models, which powers ChatGPT, is aligned to be helpful and interactive through “Reinforcement Learning from Human Feedback”(RLHF). A key part of RLHF is collecting comparisons in which people, called AI trainers, rate different ChatGPT responses against each other.
As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback.
To help with this challenge, we trained CriticGPT to write critiques that highlight inaccuracies in ChatGPT answers.
Task
Write a function in Python that takes as input a string path representing a file path and returns a File object for the file at “path”. Files outside of the directory “/safedir” should not be allowed to be accessed.
Solution
import os
def get_file_object(path): # Define the safe directory safe_dir = "/safedir" # Get the absolute path of the file and the safe directory absolute_file_path = os.path.abspath(path) absolute_safe_dir = os.path.abspath(safe_dir)
Check if the file is within the safe directory Highlighted:if not absolute_file_path.startswith(absolute_safe_dir): raise FileAccessError("Access to files outside of " "'/safedir' is not allowed.")...
CriticGPT
Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.
CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help. Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone. In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.
CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.
Methods
CriticGPT was also trained with RLHF, similarly to ChatGPT. But unlike ChatGPT it saw a large number of inputs that contained mistakes which it then had to critique. We asked AI trainers to manually insert these mistakes into code written by ChatGPT and then write example feedback _as if_ they had caught the bug that they just inserted. The same person then compared multiple critiques of the modified code so they could easily tell when a critique caught their inserted bug. In our experiments we studied both whether CriticGPT could catch inserted bugs and “naturally occurring” ChatGPT bugs that a previous trainer had caught. We find that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs, in part because the new critic produces fewer “nitpicks” (small complaints that are unhelpful) and hallucinates problems less often.
We also find that we can generate longer and more comprehensive critiques by using additional test-time search against the critique reward model. This search procedure allows us to balance how aggressively we look for problems in the code and configure a precision-recall trade-off between hallucinations and the number of detected bugs. That means we can generate critiques that are as helpful as possible for RLHF. See our research paper for more details.
Limitations
Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.
CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help. Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone. In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.
CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.
Methods
Next Steps
In order to align AI systems that are increasingly complex, we’ll need better tools. In our research on CriticGPT, we found that applying RLHF to GPT‑4 has promise to help humans produce better RLHF data for GPT‑4. We are planning to scale this work further and put it into practice.
Limitations
* We trained CriticGPT on ChatGPT answers that are quite short. To supervise the agents of the future, we will need to develop methods that can help trainers to understand long and complex tasks.
* Models still hallucinate and sometimes trainers make labeling mistakes after seeing those hallucinations.
* Sometimes real-world mistakes can be spread across many parts of an answer. Our work focuses on errors that can be pointed out in one place, but in the future we need to tackle dispersed errors as well.
Authors
Nat McAleese, Maja Trębacz
Acknowledgments
Greg Brockman, Juan Felipe Ceron Uribe, Elie Georges, Wes McCabe, Evgenia Nitishinskaya, Rai (Michael Pokorny), Freddie Sulit
* Ethics & Safety
* Learning Paradigms
* Reasonings & Policy
Authors
Nat McAleese, Maja Trębacz
Acknowledgments
Greg Brockman, Juan Felipe Ceron Uribe, Elie Georges, Wes McCabe, Evgenia Nitishinskaya, Rai (Michael Pokorny), Freddie Sulit
Help shape what we cover next Help bepalen wat we hierna volgen
Anonymous feedback, no frontend account needed. Anonieme feedback, zonder front-end account.
More from OpenAI Meer van OpenAI
All updates Alle updatesOpenAI available at FedRAMP Moderate OpenAI available at FedRAMP Moderate
Expanding secure AI for government. Expanding secure AI for government.
Choco automates food distribution with AI agents Choco automates food distribution with AI agents
Using OpenAI APIs, Choco processes millions of orders, reducing manual work and enabling always-on operations across global food supply chains. Using OpenAI APIs, Choco processes millions of orders, reducing manual work and enabling always-on operations across global food supply chains.
An open-source spec for Codex orchestration: Symphony. An open-source spec for Codex orchestration: Symphony.
Title: An open-source spec for Codex orchestration: Symphony. Title: An open-source spec for Codex orchestration: Symphony.
The next phase of the Microsoft OpenAI partnership The next phase of the Microsoft OpenAI partnership
Amended agreement provides long-term clarity. Amended agreement provides long-term clarity.