Our approach to data and AI

AI should expand opportunities for everyone. By transforming information in new ways, AI systems help us solve problems and express ourselves. Today, our AI tools like ChatGPT are being used around the world to help farmers in Kenya and India increase crop yields (Digital Green⁠), researchers accelerate drug discovery (Moderna⁠), governments support their workforces (State of Pennsylvania⁠(opens in a new window)), educators advance⁠ student learning, and people with visual impairments navigate our world (Be My Eyes⁠). AI tools like DALL·E⁠ and Sora⁠ (currently in research preview) are empowering⁠ creatives from aspiring artists to filmmakers⁠.

Our mission is to benefit all of humanity. This encompasses not only our users, but also creators and publishers. While we believe legal precedents and sound public policy make learning fair use, we also feel that it’s important we contribute to the development of a broadly beneficial social contract for content in the AI age.

We believe AI systems should benefit and respect the choices of creators and content owners. We’re continually improving our industry-leading systems to reflect content owner preferences, and are dedicated to building products and business models to fuel vibrant ecosystems for creators and publishers.

We are not professional writers, artists, or journalists, nor are we in those lines of business. We focus on building tools to help these professions create and achieve more. To accomplish this, we listen to and work closely with members of these communities, and look forward to our continued dialogues. Today, we’re sharing more about where we are and where we’re headed.

Decades ago, the robots.txt standard was introduced and voluntarily adopted by the Internet ecosystem for web publishers to indicate what portions of websites web crawlers could access.

Last summer, OpenAI pioneered the use of web crawler permissions for AI, enabling web publishers to express their preferences about the use of their content in AI. We take these signals into account each time we train a new model.

That said, we understand these are incomplete solutions, as many creators do not control websites where their content may appear, and content is often quoted, reviewed, remixed, reposted and used as inspiration across multiple domains. We need an efficient, scalable solution for content owners to express their preferences about the use of their content in AI systems.

OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training. Over time, we plan to introduce additional choices and features.

This will require cutting-edge machine learning research to build a first-ever tool of its kind to help us identify copyrighted text, images, audio, and video across multiple sources and reflect creator preferences.

We’re collaborating with creators, content owners, and regulators as we develop Media Manager. Our goal is to have the tool in place by 2025, and we hope it will set a standard across the AI industry.

Today, we live in an attention economy built for advertisers over users and quantity over quality. Our ambition is to use AI to change this: to empower creators and publishers and to enhance the user experience.

We’re continuously making our products more useful discovery engines. We recently

improved source links in ChatGPT⁠(opens in a new window) to give users better context and web publishers new ways to connect with our audiences.

We’re also working with partners to display their content in our products and increase their connection to readers. We’ve announced partnerships with global news publishers from the

Financial Times⁠, to Le Monde⁠, Prisa Media⁠, Axel Springer⁠ and more, to display their content in ChatGPT and enrich the user experience on news topics. More innovation is on the way. This content may also be used to train ChatGPT to better surface relevant publisher content to users and to improve our tools for newsrooms.

Our partnerships are crafted to benefit partners and their users, making our models more useful to their employees, customers, and communities. To help advance educational resources, we partnered with nonprofits

Khan Academy⁠ and UK-based ExamSolutions⁠(opens in a new window) to improve our model’s math performance, which accelerates their ability to expand access to personalized AI tutoring on their platform.

We want our AI models to learn from as many languages, cultures, subjects, and industries as possible so they can benefit as many people as possible. The more diverse datasets are, the more diverse the models’ knowledge, understanding, and languages become – like a person who has been exposed to a wide range of cultural perspectives and experiences – and the more people and countries AI can safely serve.

Each new generation of foundation models is trained from scratch on a new dataset. We constantly improve our architecture and increase the scale and diversity of our datasets significantly beyond our previous models. Unlike larger companies in the AI field, we do not have a large corpus of data collected over decades. We primarily rely on publicly available information to teach our models how to be helpful.

We train our models using:

* Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls, similar to search engines. We exclude sources we know to have paywalls, primarily aggregate personally identifiable information, have content that violates our policies, or have opted-out.

* Proprietary data from data partnerships⁠. We partner to access non-publicly available content, such as archives and metadata. Our partners range from a major private video library for images and videos to train Sora to the Government of Iceland⁠ to help preserve their native languages. We don’t pursue paid partnerships for purely publicly available information.

* Human feedback from AI trainers, red teamers, employees, and users whose data control settings allow model improvements.

We take care to reduce the processing of personal and sensitive information, and we train our models not to provide private or sensitive information about people. We use a number of techniques to process raw data for safe use in training, and increasingly use AI models to help us clean, prepare and generate data.

We do not train on our customers’ business data, including data from ChatGPT Team, ChatGPT Enterprise, or our API Platform. ChatGPT Free and Plus users can control whether they contribute to future model improvements in their settings⁠(opens in a new window).

Our approach to data and AI

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Just a moment...

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

Our approach to data and AI

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Just a moment...

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

The Next Input keeps optional media off until you say yes.