
Introducing ChatGPT Images 2.0

By OpenAI

Summary

Topics Covered

  • Image generation that thinks, not just renders
  • Seeing the old model's mistakes for the first time
  • From image generator to interactive AI companion
  • Thinking mode enables coherent multi-image creation
  • Generating thousands of characters without errors

Full Transcript

Today we are launching ImageGen 2.0. If we think of DALL·E as cave drawings and ImageGen 1 as ancient art, then ImageGen 2.0 is the Renaissance.

ImageGen 2.0 is the smartest image generation model ever built, with the ability to generate complex, polished, and production-ready visuals with accurate text and structured design. You see, this model isn't just generating images; it's thinking.

That's right. ImageGen 2.0 is thinking and researching, and it can even search the web to generate images with the most accurate information available. With that information, the model is able to generate infographics that explain complex systems and images that solve math problems with proofs.

And with new multilingual capabilities, you can create visuals in multiple languages for the entire world.

And now, for the first time in image generation, you can create multiple distinct images at once. So you can generate entire magazines with structured typography and photorealistic photos, full renovation plans for every room in your house, or manga comics with recurring characters and evolving storylines.

And you can now generate images at 2K resolution across multiple aspect ratios, with extraordinary micro-detail.

You see, we are no longer generating images just to marvel at. With Images 2.0, we are generating images to discover and navigate, to invent and build, to dream and explore the world, and to bring ideas to life.
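As a rough sketch of what "2K resolution across multiple aspect ratios" means in pixels, here is an illustrative calculation. The ~4-megapixel budget and the 1:1, 1:3, and 3:1 ratios are assumptions chosen for illustration (3:1 and 1:3 come up later in the demo), not the product's published output sizes:

```python
# Illustrative only: split a fixed pixel budget (~2K-class, assumed here to
# be about 4 million pixels) across different aspect ratios. The budget and
# ratios are assumptions, not actual ImageGen 2.0 output specifications.
import math

def dimensions_for(aspect_w, aspect_h, pixel_budget=4_000_000):
    """Return (width, height) matching aspect_w:aspect_h within the budget."""
    unit = math.sqrt(pixel_budget / (aspect_w * aspect_h))
    return int(aspect_w * unit), int(aspect_h * unit)

for ratio in [(1, 1), (1, 3), (3, 1)]:
    w, h = dimensions_for(*ratio)
    print(f"{ratio[0]}:{ratio[1]} -> {w}x{h}")  # e.g. 1:1 -> 2000x2000
```

The point of the sketch is just that "wide" and "tall" trade pixels between axes while the total detail budget stays roughly constant.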

A little over a year ago, we launched images in ChatGPT. People loved it, and it was amazing to see the creativity it unleashed. But today, we're going to blow way past that with Images 2.0.

Images 2.0 is a huge step forward. This is like going from GPT-3 to GPT-5 all at once. The ability to create incredible new images, express creativity, and really make beautiful and complex things is quite remarkable. This is an easier thing to show you than to talk about, so I'd like to jump right in. The team really cooked on this one, and we can't wait to see what you'll do with it. It's available right now in ChatGPT and in the API. And here's Gabe to tell you more about it.

So, hey everyone. I'm Gabe. I'm on the research team for ChatGPT images.

Hi, I'm Kiwan.

I'm Kenji.

I'm Alex. And we are researchers on the image generation team.

So I am very excited about this model. I think it's producing images of a certain quality that's very hard to explain, but one way I would put it is that they just look normal. They just look like normal images. And one experience I had looking at these images is that after you look at enough of them, you go back and look at previous images and you see all the mistakes the previous model made that you didn't even notice before. I mean, they looked great at the time, but I think these images look so much better.

Anyway, I'm going to kick off a prompt. Here's a picture of the four of us we took yesterday, and we're going to try and create a magazine cover from this image.

So, one thing about this model is that it has a lot of breadth and a lot of depth to it. I think it'll be a while before everyone discovers all the little nooks and crannies of this model. But one thing we noticed is that it's really good at design. It seems to be really deliberate about where it puts the text in the image, and... oh, I think this is a live stream. This is fine.

Yeah, I think maybe I need to retry this. Okay.

Yes.

No, I think it's fine.

Oh. Oh, yeah. Great. I think it's okay.

Everything is good. Yes. Um, okay. So, let's take a look.

So, it's really deliberate about where it puts the text, and the design looks really nice. I remember a time when image generation could barely generate a single word without making typos. And now typos are very rare; in fact, it's very hard to even find a single typo.

So that's been one of the things that surprised me about the new model: things I just never thought were possible around cohesion, avoiding typos, complex text, and a ton of detail in one image.

It's rare to find a mistake.

Yes, it's very rare. You can do a whole paragraph or a full page of text without making a mistake, or the full layout of a magazine.

Yeah, the full layout of the magazine.

All the small text seems to be very well done and I think the design is really nice.

You guys look like a very cool boy band.

Yeah. Okay. So we'll be releasing two versions of the model. There's the instant version, which is what you're seeing here, and there's a thinking version. The thinking version is something you can toggle using thinking mode, and it will be available to paid users. What it does is deliberate a little bit before it actually generates an image. It winds up with a really good prompt, and it can search the web. It can do a lot.

So I'm going to um try out this prompt.

So last year, we did a version of this prompt with a selfie. This year it's much more powerful: we can actually generate an entire manga from a single prompt, like three pages of a manga from a single prompt. I'm just going to kick off this generation, and then Kenji will talk a bit more about it.

So here you selected thinking mode. We're only doing this for paid users for now. And it can do a much more complex image.

That is correct. Yeah. So, you have to select thinking mode to get to this mode. And you can generate multiple images at once, plus a lot of other very interesting things Kenji will talk about.

So, I'm going to kick off another prompt here. And I'm not going to spoil it, but it has something to do with the words "duct tape". Okay, I'll hand it over to Kiwan to talk about instant mode.

Okay. Thanks, Gabe. Instant mode is the version available to everyone starting today, which we think has much better visual intelligence compared to our previous models. I especially want to highlight that this is the first image model that is actually useful in our daily lives.

As an example, I'm going back to the laptop right now, and I'm asking this model for some help buying new clothes for my upcoming summer vacation after this launch. In this prompt, I'm giving it a portrait image of me and asking it to suggest eight different nice summer outfits. In this task, the model needs two different kinds of visual intelligence. One is visual understanding, where it actually looks at my image, understands how I look, and comes up with plans for nice outfits for me. The other axis is visual generation, where it turns that planned layout into a coherent and organized image. We think we made a lot of progress on both visual understanding and visual generation, and as a result the model can handle this kind of task very well. And we now have an output for this, where you can see eight different really cool outfits for me.

Kiwan, what do you like best?

Uh I like the first look because I prefer something minimal. I think it's somehow pretty similar to what I'm wearing right now.

Maybe inverted colors.

Uh, so I'm gonna follow up.

That's fine.

Yeah, I like the first look.

I like the first look.

Mhm. Uh, can you zoom into it and make a same-style fashion shoot of me, with one hero shot, a few alternative views, and detailed close-ups?

So here's the prompt I'm going to follow up with to the model. I'm basically asking it to zoom in and show me how I would look when I'm actually wearing this outfit. So, while waiting for it, I'm going to revisit the first image a little bit more. We're back at the laptop. One really cool thing about this image, I think, is that all of the clothing pieces are labeled with corresponding text. It shows all of these sneakers and the fitted tee, things like that, and all of them look really believable. This basically shows that our model is much more capable of interleaving visual figures together with a lot of text, which comes from much improved visual intelligence. Yes.

So now we have the detailed view of myself, where you can see me in this outfit from many different angles. It's really like the experience of just going to a store and actually trying the thing on.

So, through this demo, I just want to highlight that this new model is no longer an AI image generator that you just give a prompt to and it returns an image. It's more like an AI that you interactively talk to, and it responds to you using images that are very much understandable, like this. Now I'm passing it to Kenji, who will be talking about a deeper intelligence in our model called thinking mode.

Thanks, Kiwan. A major capability that we've introduced in this model is the ability for image generation to think before it produces its final output. This is particularly useful for very complex prompts: things that require web searches, that require outputting multiple images that have to maintain coherence with each other, or even for the model to check its work before saying, "Hey, here's your final output."

But let's just look over some examples of this first. Gabe actually kicked off a few of these examples at the start of the live stream. So let's go to the one on the phone, which is the one of him and Sam, the selfie of them, and they created a manga of it. And if we look at the very first image, we can see... yeah, it does look like Gabe and Sam, right?

Yeah. Mhm.

But I think what's even cooler about it is that if you look at the follow-up images, they still look like Gabe and Sam, and they're still in the style that was originally established on the first page.

And even better, the story is very consistent across pages one, two, and three.

Now, to see another one of these in action, let's look at the other example that Gabe kicked off. To give a little backstory: a few weeks ago, we beta tested the instant version of this model on LMArena under the code name "duct tape". A few of you on the internet were really good detectives and deduced that it was us. Well, now we're announcing that it was us. And in this prompt, we basically asked GPT Images 2 to go and find social media reactions to this duct tape model and quote people. And so we see quotes from Threads, LinkedIn, Reddit, etc. But I think an even crazier part is that we also asked the model to put a QR code to chatgpt.com in the image, so that you can try out this model right now for yourselves.

And can we just make sure that it works?

Yeah, I tried it.

Oh, nice. So, image generation with thinking allows you to do really complex things, such as, in this case, web search, synthesizing answers, and putting a QR code all in one image. But we have still more, and Alex will tell you about these new details.

So, we've also made a lot of improvements in naturalness, and let me just kick off a few prompts. Like Gabe said earlier, our outputs can now just look like natural images. You can actually trigger this by adding something like "photorealistic", or other variations like "professional photography", "shot on iPhone", or "disposable camera".

So in this first example, I'm just pretending we are back in 2015, which is when OpenAI was founded, but somehow Images 2 exists. As you can see, the model is actually able to replicate the tiny imperfections, the graininess, and the lighting of the lecture hall. Even all the text on the slide and the lecture plan that the model came up with are quite coherent.

Beyond photorealism, I'm also very excited that the model is much more flexible now. In particular, we can make really wide and really tall images, up to 1:3 and 3:1. So let's take a look at this. This is one of our team's favorite style prompts, and it really demonstrates the ability to make really tall images. It makes my neck super long. I mean, this is pretty cool, but maybe it's a bit hard to use as a profile picture or to share, so you can also use this option to make it 1:1. I won't show that here, in the interest of time.

For another fun example that combines both aspect ratio and naturalness: I asked the model to make a 360 image of the moon landing. I think it looks like a 360 photo panorama, but we can also take a look in this panorama viewer that I coded earlier.
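For intuition about what such a viewer does: a 360 panorama is typically stored as a 2:1 equirectangular image, and the viewer maps each viewing direction to a pixel. Below is a minimal sketch of that mapping under assumed conventions; this is not the actual code from the demo:

```python
# Map a viewing direction to a pixel in a 2:1 equirectangular panorama.
# Assumed conventions (not from the demo): yaw runs -180..180 degrees
# left-to-right, pitch runs 90 (up) .. -90 (down) top-to-bottom.
def equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Return the (x, y) pixel for a view direction given in degrees."""
    x = int((yaw_deg + 180.0) / 360.0 * width) % width             # wraps horizontally
    y = min(int((90.0 - pitch_deg) / 180.0 * height), height - 1)  # clamps at the poles
    return x, y

# Looking straight ahead lands in the center of a 4096x2048 panorama.
print(equirect_pixel(0, 0, 4096, 2048))  # (2048, 1024)
```

A real viewer runs this mapping in reverse for every screen pixel of the current view, which is why horizontal consistency (like the sun and shadow directions mentioned here) matters so much.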

So, wow.

As you can see, it's actually a very consistent 360 image. You can also see that the sun and the shadows are in the right direction.

Oh, that's super cool. That's incredible. You said you vibe coded this part of it?

Yeah, I just made it with Codex very quickly.

Nice.

There's more in the scene, but you have to look for it.

This is incredible. The images are obviously beautiful, but the intelligence behind these images, and what a difference that makes compared to any other image generation service out there, has been incredible. Huge congratulations on the progress here. Okay, next up we're going to have Nitant and Buan join us for a little bit more.

And while we're doing that, Gabe, I'm curious what styles you have been enjoying the most, or are most surprised by.

Yeah, I think there are a few keywords that I really like, but like Alex said, the word "photorealism" actually triggers something really very interesting in the model.

Definitely give that one a try.

Yes.

Okay, welcome.

Hello. Thank you, Sam, and thanks, Gabe.

Hi, I'm Buan. I'm another member of the image research team.

And I'm Nitant. I'm an engineer on the ChatGPT images team.

I'm about to introduce the improved text rendering capability of our new model. OpenAI is a San Francisco-based company. We speak English and use English at work. However, we want everyone in the world to enjoy the same excitement we have when generating images. So, in ImageGen 2, we made a lot of improvements to make sure that our model can render text perfectly across all the languages and all the cultures in the world. Let's take a look.

In my first example, I want to generate a poster, a piece of typography art about the different languages of the world. It's going to feature many, many languages, and we'll see how it comes out. While it's generating, I'm going to kick off another demo.

Let's say I want to open an OpenAI bakery. It's a fictional bakery, and I want to open it in Japan. And I want to make a poster about it, purely in Japanese.

What languages have you noticed that the new model's gotten the best at?

Um, I think mostly the Asian languages, say Hindi, Chinese, Korean, and Japanese. That's because those languages traditionally have thousands of characters in their writing systems, unlike the 26 letters in English. So previously our model had a hard time memorizing these characters, but now you can just prompt it and it will generate entire pages of text in these languages without errors.

Wow.

Let's see how it goes. Oh, here is our first example, the typography art. I deliberately prompted it to be in the form of a photograph of a real magazine. So it not only looks realistic, but I can also see the correct characters. Here's "ni hao" in Chinese. There's hello as "bonjour" in French.

And I hope everyone in the world can actually enjoy our model, creating your own art using your own language.

Let's take a look at the second example, my OpenAI bakery. Oh, look at it. It even made our logo into this piece of bread, right? This is a Japanese poster. You can see all the kanji, all the hiragana. You can even zoom in and see the details. Look at this.

Look at all the hiragana here.

So I really hope everyone in the world can use this model to make your own poster, open your own shop, everything.

And just to show everyone how far we can go with our image generation model: this is an image I generated with our experimental 4K API. This is just a pile of rice. But it's also not just a pile of rice. What if I told you there's one single grain in it with the text "GPT image" on it? Can you find it?

Here we go.

Yeah, it's at the center.

I can't see. I see something.

I made it easy for you guys.

That's awesome.

Look at this. Let's zoom in. "GPT Image 2" on one single grain of rice among a pile this big.

This is how far we can go with our latest model.

Amazing.

Next, I will let Nitant take over.

Yeah. So, Images 2.0 is available for all users to try out right now. If you're accessing ChatGPT from your app, make sure to update it to the latest version. You should see a welcome screen that looks like this, which means you're good to go.

I'm going to start off with a simple, everyday kind of prompt: I'm asking it to create a recipe in Hindi. As Buan said, the new model is significantly better at understanding and rendering text in lots of languages, including many Indian ones that I've tried, like Hindi, Telugu, Kannada, Tamil, and Marathi. And the difference is especially obvious if there's a lot of densely packed text.

So let's see what it comes back with.

I'm also curious to see what Indian dish it decides to go with.

Oh, there we go. It went with aloo paratha.

That's a That's a classic.

Nice.

Oh, and uh the text looks really good, too.

I don't spot any errors at first glance.

And next, let's also check out some of the new preset styles that we've added to the app. I'm just going to select "create images" here. You'll see a bunch of fun ones, and some that really take advantage of the new model's capabilities.

Um, actually, how about we make logos for the OpenAI bakery, Buan?

Sure. Why don't you just take a photo of my bakery poster and see what it comes up with?

Let's do it.

So, it looks like this is going to come back with 16 to 20 logo ideas. But this is actually a rather simple prompt given the model's capabilities. It's really good at following very detailed instructions. So if you have very specific brand language, design, or aesthetics, all of those things that really matter for creative work, you can iterate and refine your ideas to get exactly what you want out of it.

And we have colorful logo ideas right here.

Wow. Here we go.

Nice.

Which one do you guys like the most?

These are good. Uh, how about this one?

Oh, yeah. This one combines our logo and the bread.

I like it. All of this is also making me hungry.

This was really amazing. I can't wait to see what people will do with this. The beauty of the images will come through right away, and the intelligence is very deep; we hope you all have fun exploring it. As we mentioned, this is live today in ChatGPT and the API. So proud of the team on what they've created here, and we hope you will have as much fun using it as we did getting to build it. Thank you very much.
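For developers, the API availability mentioned here would presumably follow the shape of the existing OpenAI Images API. The sketch below only assembles a request body; the model identifier "gpt-image-2" and the size value are hypothetical placeholders for illustration, not confirmed API names:

```python
# Hypothetical request builder modeled on the shape of the OpenAI Images
# API. The model name "gpt-image-2" and the size string are illustrative
# assumptions, not confirmed identifiers from this launch.
import json

def build_image_request(prompt, model="gpt-image-2", size="1024x1536", n=1):
    """Assemble the JSON body for a POST to an image generation endpoint."""
    return {"model": model, "prompt": prompt, "size": size, "n": n}

body = build_image_request("A photorealistic magazine cover of four researchers")
print(json.dumps(body, indent=2))
```

In practice you would send this body with your API key via the official SDK or an HTTP client; check the current API reference for the real model names and size options.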
