Extracting Concepts from GPT-4
June 6, 2024
Publication
Extracting concepts from GPT‑4
We used new scalable methods to decompose GPT‑4’s internal representations into 16 million oft-interpretable patterns.
Read paper | Read the code | Browse features
We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features"—patterns of activity that we hope are human interpretable. Our methods scale better than existing work, and we use them to find 16 million features in GPT‑4. We are sharing a paper, code, and feature visualizations with the research community to foster further exploration.
The challenge of interpreting neural networks
Unlike with most human creations, we don’t really understand the inner workings of neural networks. For example, engineers can directly design, assess, and fix cars based on the specifications of their components, ensuring safety and performance. However, neural networks are not designed directly; we instead design the algorithms that train them. The resulting networks are not well understood and cannot be easily decomposed into identifiable parts. This means we cannot reason about AI safety the same way we reason about something like car safety.
In order to understand and interpret neural networks, we first need to find useful building blocks for neural computations. Unfortunately, the neural activations inside a language model fire in unpredictable patterns, seemingly representing many concepts simultaneously. They also activate densely: each activation fires on every input. But real-world concepts are very sparse—in any given context, only a small fraction of all concepts are relevant. This motivates the use of sparse autoencoders, a method for identifying the handful of "features" in the neural network that are important to producing any given output, akin to the small set of concepts a person might have in mind when reasoning about a situation. The features they learn display sparse activation patterns that naturally align with concepts easy for humans to understand, even without direct incentives for interpretability.
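As a concrete illustration of the idea, here is a minimal NumPy sketch of a k-sparse (TopK) autoencoder forward pass, the sparsity mechanism used in the accompanying paper. The dimensions and weights below are toy values chosen for illustration, not anything taken from GPT‑4:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4):
    """Minimal k-sparse autoencoder forward pass (illustrative sketch only).

    x: (d_model,) activation vector from the language model
    W_enc: (n_features, d_model), W_dec: (d_model, n_features)
    Only the k largest pre-activations are kept; the rest are zeroed,
    giving a sparse feature vector z and a reconstruction x_hat.
    """
    pre = W_enc @ (x - b_dec) + b_enc      # encoder pre-activations
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]             # indices of the k largest
    z[top] = np.maximum(pre[top], 0.0)     # keep top-k, apply ReLU
    x_hat = W_dec @ z + b_dec              # decoder reconstruction
    return z, x_hat

rng = np.random.default_rng(0)
d_model, n_features = 8, 32                # toy sizes, not GPT-4's
x = rng.normal(size=d_model)
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()                     # tied init, a common choice
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4)
print((z != 0).sum())                      # at most k features are active
```

The TopK step enforces sparsity directly, so at most k of the 32 features can fire on any input—mirroring the intuition that only a few concepts are relevant in a given context.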
However, there are still serious challenges to training sparse autoencoders. Large language models represent a huge number of concepts, and our autoencoders may need to be correspondingly huge to get close to full coverage of the concepts in a frontier model. Learning a large number of sparse features is challenging, and past work has not been shown to scale well.
Our research progress: large scale autoencoder training
We developed new state-of-the-art methodologies which allow us to scale our sparse autoencoders to tens of millions of features on frontier AI models. We find that our methodology demonstrates smooth and predictable scaling, with better returns to scale than prior techniques. We also introduce several new metrics for evaluating feature quality.
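Two quantities commonly used to judge a sparse autoencoder are how faithfully it reconstructs the activations and how sparse its features are. The sketch below shows these two standard metrics (illustrative only; not necessarily the paper's exact evaluation suite):

```python
import numpy as np

def sae_eval_metrics(x, x_hat, z):
    """Two standard autoencoder-quality metrics (a sketch, not
    OpenAI's exact metrics): normalized reconstruction error, and
    L0 sparsity, i.e. how many features fire per input on average."""
    nmse = np.mean((x - x_hat) ** 2) / np.mean(x ** 2)  # normalized MSE
    l0 = float(np.mean(np.count_nonzero(z, axis=-1)))   # active features/input
    return nmse, l0

# Tiny hand-made batch: perfect reconstruction, one active feature each
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.0], [3.0, 4.0]])
z = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
nmse, l0 = sae_eval_metrics(x, x_hat, z)
print(nmse, l0)  # → 0.0 1.0
```

Lower normalized MSE means the autoencoder loses less information about the model's activations, while lower L0 means the learned code is sparser; the two trade off against each other as capacity and k vary.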
We used our recipe to train a variety of autoencoders on GPT‑2 small and GPT‑4 activations, including a 16 million feature autoencoder on GPT‑4. To check the interpretability of a feature, we visualize it by showing documents where it activates. Here are some interpretable features we found:
Example features: Human Imperfection, Price Increases, X and Y, Training Logs, Rhetorical Questions, Algebraic Rings, Who/What, The Dopamine
GPT-4 feature: phrases relating to things (especially humans) being flawed
View full visualization
most people, it isn’t. We all have wonderful days, glimpses of what we perceive to be perfection, but we can also all have truly shit-tastic ones, and I can assure you that you’re not alone. So toddler of mine, and most other toddlers out there, remember; Don’t be a
has warts. What system that is used to build real world software doesn't? I've built systems in a number of languages and frameworks and they all had warts and issues. How much research has the author done to find other solutions? The plea at the end seemed very lazyweb-ish to me
often put our hope in the wrong places – in the world, in other people, in our abilities or finances – but all of that is like sinking sand. The only place we can find hope is in Jesus Christ. These words by Kutless tell us just where we need to go to find hope. I lift my
churches since the last Great Reformation has also become warped. I state again, while churches are formed and planted with the most Holy and Divine of inspirations, they are not free from the corruption of humanity. While they are of our great and perfect Father, they are on an imperfect Earth. And we Rogues are
perfect. If anyone does not believe that let them say so. You really do appear to be just about a meter away from me. But you are actually in my brain. What artistry! What perfection! Not the slightest blurring. And in 3-D. Sound is also 3-D. And images.
We found many other interesting features, which you can browse here.
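The feature views above boil down to ranking tokens by a feature's activation and printing their surrounding context. A minimal sketch of that step, using a toy hand-made activation matrix (all names and data here are hypothetical, not the released visualizer's code):

```python
import numpy as np

def top_activating_contexts(acts, tokens, feature, n=3, window=5):
    """Show the contexts where one autoencoder feature fires most strongly.

    acts: (n_tokens, n_features) feature activations per token
    tokens: list of n_tokens strings
    Returns the n highest-activating tokens, each with up to `window`
    tokens of surrounding context on either side.
    """
    col = acts[:, feature]
    order = np.argsort(col)[::-1][:n]          # strongest activations first
    out = []
    for i in order:
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        out.append((float(col[i]), " ".join(tokens[lo:hi])))
    return out

# Toy example: a fake "imperfection" feature that fires on "flawed"
tokens = "every system we build is flawed in some way".split()
acts = np.zeros((len(tokens), 16))
acts[5, 3] = 2.5                               # feature 3 fires on "flawed"
for score, ctx in top_activating_contexts(acts, tokens, feature=3, n=1):
    print(score, "|", ctx)
```

Run over a large corpus, the same ranking surfaces excerpts like the ones quoted above, and a human (or another model) can then guess what concept the feature encodes.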
Looking ahead, and open sourcing our research
While sparse autoencoder research is exciting, there is a long road ahead with many unresolved challenges. In the short term, we hope the features we've found can be practically useful for monitoring and steering language model behaviors and plan to test this in our frontier models. Ultimately, we hope that one day, interpretability can provide us with new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behavior.
Today, we are sharing a paper detailing our experiments and methods, which we hope will make it easier for researchers to train autoencoders at scale. We are releasing a full suite of autoencoders for GPT‑2 small, along with code for using them, and the feature visualizer to get a sense of what the GPT‑2 and GPT‑4 features may correspond to.
Limitations
We are excited for interpretability to eventually increase model trustworthiness and steerability. However, this is still early work with many limitations:
* As in previous work, many of the discovered features are still difficult to interpret: some activate with no clear pattern, and others exhibit spurious activations unrelated to the concept they usually seem to encode. Furthermore, we don't yet have good ways to check the validity of interpretations.
* GPT
* Language
* Learning Paradigms
Authors
Jeffrey Wu, Leo Gao, Tom Dupré la Tour, Henk Tillman
Acknowledgments
Taya Christianson, Elizabeth Proehl, Yo Shavit, Niko Felix, Cathy Yeh, Gabriel Goh, Rajan Troll, Alec Radford, Jan Leike, Ilya Sutskever, David Robinson, Greg Brockman