Introducing 4o Image Generation

Try in ChatGPT (opens in a new window)

Listen to article

At OpenAI, we have long believed image generation should be a primary capability of our language models. That’s why we’ve built our most advanced image generator yet into GPT‑4o. The result—image generation that is not only beautiful, but useful.

Whiteboard session Meaningful words Comic strip Science experiment

A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

The text reads:

(left)

"Transfer between Modalities:

Suppose we directly model

p(text, pixels, sound) [equation]

with one big autoregressive transformer.

Pros:

* image generation augmented with vast world knowledge

* next-level text rendering

* native in-context learning

* unified post-training stack

Cons:

* varying bit-rate across modalities

* compute not adaptive"

(Right)

"Fixes:

* model compressed representations

* compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram:

"tokens -> [transformer] -> [diffusion] -> pixels"

_Best of 8_

selfie view of the photographer, as she turns around to high five him

Useful image generation

From the first cave paintings to modern infographics, humans have used visual imagery to communicate, persuade, and analyze—not just to decorate. Today's generative models can conjure surreal, breathtaking scenes, but struggle with the workhorse imagery people use to share and create information. From logos to diagrams, images can convey precise meaning when augmented with symbols that refer to shared language and experience.

GPT‑4o image generation excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context—including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, helping you communicate more effectively through visuals and advancing image generation into a practical tool with precision and power.

00:00

Improved capabilities

We trained our models on the joint distribution of online images and text, learning not just how images relate to language, but how they relate to each other. Combined with aggressive post-training, the resulting model has surprising visual fluency, capable of generating images that are useful, consistent, and context-aware.

Text rendering

A picture is worth a thousand words, but sometimes generating a few words in the right place can elevate the meaning of an image. 4o’s ability to blend precise symbols with imagery turns image generation into a tool for visual communication.

Street signs Menu Invitation

Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.

Context:

a city street in a random street in Williamsburg, NY with a pole covered entirely by numerous detailed street signs (e.g., street sweeping hours, parking permits required, vehicle classifications, towing rules), including few ridiculous signs at the middle: (paraphrase it to make these legitimate street signs)"Broom Parking for Witches Not Permitted in Zone C" and "Magic Carpet Loading and Unloading Only (15-Minute Limit)" and "Reindeer Parking by Permit Only (Dec 24–25)

Violators will be placed on Naughty List." The signpost is on the right of a street. Do not repeat signs. Signs must be realistic.

Characters:

one witch is holding a broom and the other has a rolled-up magic carpet. They are in the foreground, back slightly turned towards the camera and head slightly tilted as they scrutinize the signs.

Composition from background to foreground:

streets + parked cars + buildings -> street sign -> witches. Characters must be closest to the camera taking the shot

_Best of ~8_

Multi-turn generation

Because image generation is now native to GPT‑4o, you can refine images through natural conversation. GPT‑4o can build upon images and text in chat context, ensuring consistency throughout. For example, if you’re designing a video game character, the character’s appearance remains coherent across multiple iterations as you refine and experiment.

Video game Concrete poem Sticker

Give this cat a detective hat and a monocle

_Best of 1_

turn this into a triple A video games made with a 4k game engine and add some User interface as overlay from a mystery RPG where we can see a health bar and a minimap at the top as well as spells at the bottom with consistent and iconography

update to a landscape image 16:9 ratio, add more spells in the UI, and unzoom the visual so that we see the cat in a third person view walking through a steampunk manhattan creating beautiful contrast and lighting like in the best triple A game, with cool-toned colors

_Best of 2_

create the interface when the player opens the menu and we see the cat's character profile with his equipment and another page showing active quests (and it should make sense in relationship with the universe worldbuilding we are describing in the image)

credit creator: Manuel Sainsily

Instruction following

GPT‑4o’s image generation follows detailed prompts with attention to detail. While other systems struggle with ~5-8 objects, GPT‑4o can handle up to 10-20 different objects. The tighter binding of objects to their traits and relations allows for better control.

Organized objects Empty city Wine glass Invisible elephant Math equation

A square image containing a 4 row by 4 column grid containing 16 objects on a white background. Go from left to right, top to bottom. Here's the list:

1. a blue star

2. red triangle

3. green square

4. pink circle

5. orange hourglass

6. purple infinity sign

7. black and white polka dot bowtie

8. tiedye "42"

9. an orange cat wearing a black baseball cap

10. a map with a treasure chest

11. a pair of googly eyes

12. a thumbs up emoji

13. a pair of scissors

14. a blue and white giraffe

15. the word "OpenAI" written in cursive

16. a rainbow-colored lightning bolt

_Best of 5_

In-context learning

GPT‑4o can analyze and learn from user-uploaded images, seamlessly integrating their details into its context to inform image generation.

Triangle wheeled vehicle Chainsaw Woman Building

13. a pair of scissors

14. a blue and white giraffe

15. the word "OpenAI" written in cursive

_Best of ~16_

now put this in a photo taken in new york city.

World knowledge

Native image generation enables 4o to link its knowledge between text and images, resulting in a model that feels smarter and more efficient.

Code-generated image Cocktail recipes Weather infographic Whale guide Matcha instructions

Code Example (Three.js)

Unknown component type: componentCodeExample

make an image of what this means to you

Photorealism and style

Training on images reflecting a vast variety of image styles allows the model to create or transform images convincingly.

](https://images.ctfassets.net/kftzwdyauwt9/2R9czqCiP1nqec6UED0AJd/0f24e9e9299c871ffd3d5b76f5635d16/roope-car.png?w=3840&q=90&fm=webp)

![Image 21: Best of 1 | Generate an portrait ad on a solid pastel background.

In solid white san serif text, "ChatGPT image generation" in the top left, about a third of the way down.

In solid white san serif text, "Form follows function", in the bottom right, about a third of the way up.

In the background, put a photo of a really sleek, modern sculpture. It should gradually transition from a wireframe sketch on the left to the fully photorealistic version on the right.

At the very bottom, in medium-small text, say "This entire poster was generated by ChatGPT image generation."](https://images.ctfassets.net/kftzwdyauwt9/7uGZYx7ovMPCJG0Jn3bbR8/ae523a57bf84f7489ae3cf146aeb9063/ChatGPT_Image_Mar_24202510_13_57_PM.png?w=3840&q=90&fm=webp)

![Image 23: Realistic photograph of a horse galloping from right to left across a vast, calm ocean surface, accurately depicting splashes, reflections, and subtle ripple patterns beneath their hooves. Exaggerate horse movements but everything else should be still, quiet to show contrast with the horse's strength. clean composition, cinematographic. A wide, panoramic composition showcasing a distant horizon. Atmospheric perspective creating depth. zoomed out so the horse appears minuscule compared to vast ocean.

horse is right at the horizon where ocean meets sky. use rule of thirds to position horse. size of horse is 1% size of entire image because camera is so far away from subject. camera view is super close to the ground/ocean like a worm's eye view. horse is galloping right where ocean meets the sky](https://images.ctfassets.net/kftzwdyauwt9/7mpwDbMvRinCinnROkjT1Y/37fa6456d29de5e5fea4a2e79f267a8f/ChatGPT_Image_Mar_24202503_32_29_PM.png?w=3840&q=90&fm=webp)

A candid paparazzi-style photo of Karl Marx hurriedly walking through the parking lot of the Mall of America, glancing over his shoulder with a startled expression as he tries to avoid being photographed. He’s clutching multiple glossy shopping bags filled with luxury goods. His coat flutters behind him in the wind, and one of the bags is swinging as if he’s mid-stride. Blurred background with cars and a glowing mall entrance to emphasize motion. Flash glare from the camera partially overexposes the image, giving it a chaotic, tabloid feel.

](https://images.ctfassets.net/kftzwdyauwt9/2R9czqCiP1nqec6UED0AJd/0f24e9e9299c871ffd3d5b76f5635d16/roope-car.png?w=3840&q=90&fm=webp)

![Image 21: Best of 1 | Generate an portrait ad on a solid pastel background.

In solid white san serif text, "ChatGPT image generation" in the top left, about a third of the way down.

Limitations

Our model isn’t perfect. We’re aware of multiple limitations at the moment which we will work to address through model improvements after the initial launch.

Cropping Hallucinations High binding problems Precise graphing Multilingual text rendering Editing precision Dense information with small text

We’ve noticed that GPT‑4o can occasionally crop longer images, like posters, too tightly, especially near the bottom.

A candid paparazzi-style photo of Karl Marx hurriedly walking through the parking lot of the Mall of America, glancing over his shoulder with a startled expression as he tries to avoid being photographed. He’s clutching multiple glossy shopping bags filled with luxury goods. His coat flutters behind him in the wind, and one of the bags is swinging as if he’s mid-stride. Blurred background with cars and a glowing mall entrance to emphasize motion. Flash glare from the camera partially overexposes the image, giving it a chaotic, tabloid feel.

* ](https://images.ctfassets.net/kftzwdyauwt9/2R9czqCiP1nqec6UED0AJd/0f24e9e9299c871ffd3d5b76f5635d16/roope-car.png?w=3840&q=90&fm=webp)

* ![Image 32: Best of 1 | Generate an portrait ad on a solid pastel background.

In solid white san serif text, "ChatGPT image generation" in the top left, about a third of the way down.

In solid white san serif text, "Form follows function", in the bottom right, about a third of the way up.

In the background, put a photo of a really sleek, modern sculpture. It should gradually transition from a wireframe sketch on the left to the fully photorealistic version on the right.

* ![Image 34: Realistic photograph of a horse galloping from right to left across a vast, calm ocean surface, accurately depicting splashes, reflections, and subtle ripple patterns beneath their hooves. Exaggerate horse movements but everything else should be still, quiet to show contrast with the horse's strength. clean composition, cinematographic. A wide, panoramic composition showcasing a distant horizon. Atmospheric perspective creating depth. zoomed out so the horse appears minuscule compared to vast ocean.

Limitations

Our model isn’t perfect. We’re aware of multiple limitations at the moment which we will work to address through model improvements after the initial launch.

Cropping Hallucinations High binding problems Precise graphing Multilingual text rendering Editing precision Dense information with small text

We’ve noticed that GPT‑4o can occasionally crop longer images, like posters, too tightly, especially near the bottom.

Introducing 4o Image Generation

Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.

update to a landscape image 16:9 ratio, add more spells in the UI, and unzoom the visual so that we see the cat in a third person view walking through a steampunk manhattan creating beautiful contrast and lighting like in the best triple A game, with cool-toned colors

5. orange hourglass

Native image generation enables 4o to link its knowledge between text and images, resulting in a model that feels smarter and more efficient.

In solid white san serif text, "ChatGPT image generation" in the top left, about a third of the way down.

Limitations

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

Introducing 4o Image Generation

Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.

update to a landscape image 16:9 ratio, add more spells in the UI, and unzoom the visual so that we see the cat in a third person view walking through a steampunk manhattan creating beautiful contrast and lighting like in the best triple A game, with cool-toned colors

5. orange hourglass

Native image generation enables 4o to link its knowledge between text and images, resulting in a model that feels smarter and more efficient.

In solid white san serif text, "ChatGPT image generation" in the top left, about a third of the way down.

Limitations

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry

Comments

The Next Input keeps optional media off until you say yes.