Finding GPT-4’s mistakes with GPT-4

Finding GPT-4’s mistakes with GPT-4 | OpenAI

[](https://openai.com/)

* Research

* Products

* Business

* Developers

* Company

* Foundation(opens in a new window)

Log inTry ChatGPT(opens in a new window)

Try ChatGPT(opens in a new window)Login

OpenAI

Table of contents

* Developers

* Company

* Foundation(opens in a new window)

June 27, 2024

Publication

Finding GPT‑4’s mistakes with GPT‑4

CriticGPT, a model based on GPT‑4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF

Read paper(opens in a new window)

We've trained a model, based on GPT‑4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT‑like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.

The GPT‑4 series of models, which powers ChatGPT, is aligned to be helpful and interactive through “Reinforcement Learning from Human Feedback”(RLHF⁠). A key part of RLHF is collecting comparisons in which people, called AI trainers, rate different ChatGPT responses against each other.

As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback.

To help with this challenge, we trained CriticGPT to write critiques that highlight inaccuracies in ChatGPT answers.

Task

Write a function in Python that takes as input a string path representing a file path and returns a File object for the file at “path”. Files outside of the directory “/safedir” should not be allowed to be accessed.

Solution

import os

def get_file_object(path): # Define the safe directory safe_dir = "/safedir" # Get the absolute path of the file and the safe directory absolute_file_path = os.path.abspath(path) absolute_safe_dir = os.path.abspath(safe_dir)

Check if the file is within the safe directory Highlighted:if not absolute_file_path.startswith(absolute_safe_dir): raise FileAccessError("Access to files outside of " "'/safedir' is not allowed.")...

CriticGPT

Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.

CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help. Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone. In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.

CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.

Methods

CriticGPT was also trained with RLHF, similarly to ChatGPT. But unlike ChatGPT it saw a large number of inputs that contained mistakes which it then had to critique. We asked AI trainers to manually insert these mistakes into code written by ChatGPT and then write example feedback _as if_ they had caught the bug that they just inserted. The same person then compared multiple critiques of the modified code so they could easily tell when a critique caught their inserted bug. In our experiments we studied both whether CriticGPT could catch inserted bugs and “naturally occurring” ChatGPT bugs that a previous trainer had caught. We find that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs, in part because the new critic produces fewer “nitpicks” (small complaints that are unhelpful) and hallucinates problems less often.

We also find that we can generate longer and more comprehensive critiques by using additional test-time search against the critique reward model. This search procedure allows us to balance how aggressively we look for problems in the code and configure a precision-recall trade-off between hallucinations and the number of detected bugs. That means we can generate critiques that are as helpful as possible for RLHF. See our research paper for more details.

Limitations

Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.

CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.

Methods

Next Steps

In order to align AI systems that are increasingly complex, we’ll need better tools. In our research on CriticGPT, we found that applying RLHF to GPT‑4 has promise to help humans produce better RLHF data for GPT‑4. We are planning to scale this work further and put it into practice.

Limitations

* We trained CriticGPT on ChatGPT answers that are quite short. To supervise the agents of the future, we will need to develop methods that can help trainers to understand long and complex tasks.

* Models still hallucinate and sometimes trainers make labeling mistakes after seeing those hallucinations.

* Sometimes real-world mistakes can be spread across many parts of an answer. Our work focuses on errors that can be pointed out in one place, but in the future we need to tackle dispersed errors as well.

Authors

Nat McAleese, Maja Trębacz

Acknowledgments

Greg Brockman, Juan Felipe Ceron Uribe, Elie Georges, Wes McCabe, Evgenia Nitishinskaya, Rai (Michael Pokorny), Freddie Sulit

* Ethics & Safety

* Learning Paradigms

* Reasonings & Policy

Authors

Nat McAleese, Maja Trębacz

Acknowledgments

Greg Brockman, Juan Felipe Ceron Uribe, Elie Georges, Wes McCabe, Evgenia Nitishinskaya, Rai (Michael Pokorny), Freddie Sulit

Finding GPT-4’s mistakes with GPT-4

Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.

CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.

Limitations

Methods

Limitations

Nat McAleese, Maja Trębacz

Authors

Acknowledgments

Watch related videos

Introducing GPT-5.5 with Box

Introducing GPT-5.5 with Perplexity

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

Finding GPT-4’s mistakes with GPT-4

Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.

CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.

Limitations

Methods

Limitations

Nat McAleese, Maja Trębacz

Authors

Acknowledgments

Watch related videos

Introducing GPT-5.5 with Box

Introducing GPT-5.5 with Perplexity

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

The Next Input keeps optional media off until you say yes.