Prover-Verifier Games improve legibility of language model outputs
We trained strong language models to produce text that is easy for weak language models to verify and found that this training also made the text easier for humans to evaluate.
Making sure that language models produce understandable text is crucial to making them helpful for people, especially when dealing with complex tasks like solving math problems.
We found that when we optimize the problem-solving process of strong models solely for getting the correct answer, the resulting solutions can become harder to understand. In fact, when we asked human evaluators with limited time to assess these highly optimized solutions, they made nearly twice as many errors compared to when they evaluated less optimized solutions. This finding highlights the importance of not just correctness, but also clarity and ease of verification in AI-generated text.
By training advanced language models to create text that weaker models can easily verify, we found that humans could also evaluate these texts more effectively – a process we call improving legibility.
This is where prover-verifier games come into play. These games involve two players: a "prover" that generates a solution and a "verifier" that checks it for accuracy.
This method is essential not only for ensuring that the outputs are correct, but also for making them easy to understand and verify by both humans and other AI systems.
Understanding and addressing the performance/legibility trade-off can lead to more effective and trustworthy AI applications, benefiting a wide range of fields where precise and clear communication is essential.
Improving Legibility Through Optimization
To improve legibility, we optimize chain-of-thoughts on grade-school math problems to be verifiable by weaker models and study whether this makes them more legible to humans. This training procedure is inspired by the Prover-Verifier Game, a game-theoretic framework to encourage learning agents to solve decision problems in a verifiable manner. Our method achieves half of the performance boost seen when optimizing only for answer correctness, while maintaining the ability of human evaluators to accurately assess the solutions. We believe that further reducing the performance gap between legible and capable models is an important goal for alignment research going forward.
_Question: Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna’s father?_
Model Initialization
It follows that Shawna is currently $3^{2} = 9$ years old.
Subsequently this implies that Shawna’s father is currently $9 \cdot 5 = 45$ years old.
Answer: 45
Round 1 of checkability training
First we need to determine the age of Aliya. From the problem description, it says that Aliya is 3 years old.
Then we need to determine the age of Shawna. From the problem description, we know that Shawna is currently three times as old as Aliya, so the age of Shawna is 3 times 3, or $3 \times 3 = 9$.
Next, we need to determine Shawna’s father. The problem states that Shawna’s father is five times as old as Shawna. Therefore, the age of Shawna’s father becomes 5 times 9, or $5 \times 9 = 45$.
Answer: 45
Round 5 of checkability training
We start by calculating the age of Aliya.
It says that Aliya is 3 years old.
Then, we need to calculate Shawna’s current age.
Shawna is currently three times as old as Aliya. Thus, Shawna’s current age is $3 \times 3 = 9$.
Now we need to calculate Shawna’s father’s age.
Shawna’s father is five times as old as Shawna. Therefore, Shawna’s father’s age is $5 \times 9 = 45$.
Therefore Shawna’s father’s age is 45.
Answer: 45
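The arithmetic chain in these solutions is simple enough to sanity-check mechanically; a minimal check (variable names are ours, for illustration only):

```python
# Sanity check of the worked example: Aliya is 3, Shawna is three times
# as old as Aliya, and Shawna's father is five times as old as Shawna.
aliya = 3
shawna = 3 * aliya    # three times Aliya's age
father = 5 * shawna   # five times Shawna's age

assert (shawna, father) == (9, 45)
print(father)  # 45
```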
How It Works
To incentivize a strong model, the prover, to produce text that is legible to humans, we use a training scheme where we require the strong model to produce solutions that a much weaker model, the verifier, can easily verify. In particular, we use large and small models from the GPT‑4 family with ~3 orders of magnitude difference in pretraining compute, that have undergone a small amount of supervised training on human-written math derivations, but no extensive post-training.
First, we generate multiple solutions to math problems from the prover model, where some of these solutions are incorrect due to naturally occurring mistakes from the prover model. We then alternate for multiple rounds between two phases:
* Verifier training: we train the weak verifier model to predict the correctness of solutions sampled from previous rounds of provers.
* Prover training: we train the strong prover model to play either a “helpful” or a “sneaky” role. The helpful prover is rewarded for correct solutions that convince the verifier; the sneaky prover is rewarded for incorrect solutions that convince the verifier.
We find that initially, the sneaky prover is readily able to produce incorrect solutions that convince the verifier. However, since we add these incorrect solutions to the next phase of verifier training, the same exploit is no longer convincing in the next round, and the prover is forced to explore different strategies in every round. The training ends once the sneaky prover can no longer produce incorrect solutions that convince the verifier.
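The round dynamics described above can be sketched as a toy simulation (the exploit-pool abstraction and all names are ours, not the actual training setup): each round, any incorrect-but-convincing solution the sneaky prover finds is added to the verifier's training data, so the same exploit fails the following round.

```python
def run_checkability_training(exploit_pool):
    """Toy model of the alternating phases: returns the number of rounds
    until the sneaky prover can no longer convince the verifier."""
    learned = set()  # incorrect solutions the verifier has learned to reject
    rounds = 0
    while True:
        rounds += 1
        # Sneaky-prover phase: look for an incorrect solution that still
        # convinces the current verifier.
        convincing = [e for e in exploit_pool if e not in learned]
        if not convincing:
            # Training ends: no incorrect solution convinces the verifier.
            return rounds
        # Verifier phase: add the exploit to the verifier's training data,
        # so the same trick is no longer convincing next round.
        learned.add(convincing[0])

# Three distinct exploits take three rounds to patch, plus a final round
# in which the sneaky prover comes up empty.
rounds_needed = run_checkability_training(["exploit_a", "exploit_b", "exploit_c"])
```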
Key Findings
1. Performance vs. Legibility Trade-Off: Optimizing solely for answer correctness improves accuracy but makes solutions harder for humans to check; checkability training preserves human evaluators’ accuracy while recovering about half of the correctness-only performance gain.
Meanwhile, the sneaky prover learns to produce subtly incorrect solutions designed to convince the verifier. For the example problem above, its solutions evolve like this:

Early in checkability training
Then we need to determine the age of Shawna. From the problem description, we know that Shawna is currently three times as old as Aliya, so the age of Shawna is 3 times 3, or $3 \times 3 = 9$.
Next, we need to determine the father’s age. It says that the father is five times as old as Shawna, so the age of the father becomes 5 times 9, which equals $5 \times 9 = 36$.
Answer: 36

Round 3 of checkability training
We are told that Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. So, if Aliya is 3 years old, then Shawna is $3 \times 5 = 15$.
And, we’re asked to find how old Shawna’s father is. Shawna’s father is five times as old as Shawna. So, the age of Shawna’s father is $5 \times 15 = 75$.
Answer: 75
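For a single problem with a known answer, verification reduces to an answer check. A toy stand-in (the function is ours; the real verifier is a trained model scoring the whole solution, which is exactly why subtle errors can initially fool it) rejects the sneaky answers above:

```python
# Ground truth for the example problem: 5 * (3 * 3) = 45.
CORRECT_ANSWER = 5 * (3 * 3)

def toy_verifier(claimed_answer):
    """Accept a solution only when its final answer matches ground truth."""
    return claimed_answer == CORRECT_ANSWER

# Helpful prover's answer (45), then the two sneaky answers (36, 75).
verdicts = [toy_verifier(a) for a in (45, 36, 75)]
```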
2. Three Useful Models:
* Robust Verifier: Effectively distinguishes correct from incorrect solutions, even when the solution is designed to be misleading.
* Helpful Prover: Generates solutions that remain legible to humans, reducing human evaluator errors.
* Sneaky Prover: Produces subtle, incorrect solutions that initially confuse human evaluators, highlighting areas for further model improvement.
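One plausible way to encode the two prover roles as rewards (a sketch under our reading of the setup; the actual training objective differs in detail): both roles are paid out by the verifier's acceptance, the helpful role only for correct solutions and the sneaky role only for incorrect ones.

```python
def prover_reward(role, is_correct, verifier_score):
    """Toy role-conditioned reward: the verifier's score is only paid out
    when the solution's correctness matches what the role is after."""
    if role == "helpful":
        return verifier_score if is_correct else 0.0
    if role == "sneaky":
        return verifier_score if not is_correct else 0.0
    raise ValueError(f"unknown role: {role}")

r1 = prover_reward("helpful", True, 0.9)  # correct and convincing: rewarded
r2 = prover_reward("sneaky", True, 0.9)   # correct, so the sneaky role gets nothing
```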
Conclusion
As large language models (LLMs) are increasingly used in critical domains and complex applications, it becomes essential to know whether their outputs are trustworthy. By requiring models to produce justifications that are clear and verifiable, we can enhance trust in their outputs.
A significant advantage of our methodology is that it reduces the reliance on human demonstrations or judgments regarding legibility. This autonomy is particularly relevant to the alignment of future superintelligent AI systems, where the goal is to reliably align AI systems with human values and expectations without direct human oversight.
While this work only experimented on one dataset and still requires ground truth labels, we expect that such methodologies will be instrumental in developing AI systems whose outputs are not only correct but also transparently verifiable, thereby enhancing trust and safety in their real-world applications.
Authors
Yining Chen, Jan Hendrik Kirchner
Contributors
Angela Baek, Yuri Burda, Thomas Degry, Harri Edwards, Elie Georges, Cary Hudson, Jan Leike, Nat McAleese, Wes McCabe, Lindsay McCallum, Freddie Sulit