When this was up yesterday I complained that the refusal rate was super high especially on government and military shaped tasks, and that this would only push contractors to use CN-developed open source models for work that could then be compromised.
Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
What's a good use case for a defense contractor to generate AI images besides to include in presentations?
Think of all the trivial ways an image generator could be used in business, and there is likely a similar use-case among the DoD and its contractors (e.g. create a cartoon image of a ship for a naval training aid; make a data dashboard wireframe concept for a decision aid).
Fabricating evidence of weapons of mass destruction in some developing nation.
I kid, more real world use cases would be for concept images for a new product or marketing campaigns.
Manufacturing consent
Literally how it will be used; you are correct.
Input one image of a known military installation and one civilian building. Prompt to generate a similar _civilian_ building, but resembling that military installation in some way: similar structure, similar colors, similar lighting.
Then include the generated image in the training set for another network, labeled "civilian". Training on examples like that could lower the false positive rate when that new network is asked "is this target military?".
Vastly oversimplified, but for every civilian job there's an equivalent military job. Superficially, the military is basically a country-sized self-contained corporation. Anywhere that Wal-Mart's corporate office could use AI, so could the military.
Do you work with OpenAI models via FedRAMP GCC High Azure? If so, I would love to hear more about your experience.
It's "tier 5", I've had an account since the 3.0 days so I can't be positive I'm not grandfathered in, but, my understanding is as long as you have a non-trivial amount of spend for a few months you'll have that access.
(fwiw for anyone curious how to implement it, it's the 'moderation' parameter in the JSON request you'll send, I missed it for a few hours because it wasn't in Dalle-3)
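(If it helps anyone, below is a minimal sketch of what that looks like with the official Python SDK; the prompt and output filename are made up, and as far as I can tell only "auto" and "low" are accepted values.)

    # Minimal sketch (not an official example): image generation with the
    # 'moderation' parameter via the OpenAI Python SDK. Prompt and filename
    # are illustrative; only "auto" and "low" appear to be accepted values.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    result = client.images.generate(
        model="gpt-image-1",
        prompt="a cartoon image of a ship for a naval training aid",
        size="1024x1024",
        moderation="low",  # the parameter discussed above; "auto" is the default
    )

    # gpt-image-1 returns base64-encoded image data
    with open("ship.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))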
API shows either auto or low available. Is there another secret value with even lower restrictions?
Not that I know of.
I just took any indication that the parent post meant absolute zero moderation as them being a bit loose with their words and excitable with how they understand things, there were some signs:
1. It's unlikely they completed an API integration quickly enough to have an opinion on military / defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (This is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of if they had integrated.)
2. The military / defense use cases for image generation are not provided (and the steelmanned version in other comments is nonsensical, i.e. we can quickly validate you can still generate kanban boards or wireframes of ships).
3. The poster passively disclaims being in military / defense themself (grep "in that space").
4. It is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake. I.e., let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's overmoderated for defense use cases.
5. It's unlikely OpenAI wants to be anywhere near the PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's, at the very least, unlikely that the poster's defense contractor friends were blabbing about the exclusive completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely that the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! And even if there were, there are plenty of good ol' American diffusion models!)
I'm Tier 4 and I'm able to use this API and set moderation to "low". Tier 4 only requires a 30 day waiting period and $1,000 spent on credits. While I as an individual was a bit horrified to learn I've actually spent that much on OpenAI credits over the life of my account, it's practically nothing for most organizations. Even Tier 5 only requires $5,000.
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
> 4 different defense contractors in the last day
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Generate me a crowd of civilians with one terrorist in.”
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
That's very outdated, they're absolutely supposed to be at the Empire State Building with baseball caps now. See: ICE arrests and Trump's comment on needing more El Salvadoran prison space for "the homegrowns"
All I can think of is generating images of potential targets like ships, airplanes, and airfields, feeding them to their satellites or drones for image detection, and tweaking their weapons for enhanced precision.
I think the usual computer vision wisdom is that this (training object detection on generated imagery) doesn't work very well. But maybe the corps have some techniques that aren't in the public literature yet.
Show me a tunnel underneath a building in the desert filled with small arms weapons with a poster on the wall with a map of the United States and a label written with sharpie saying “Bad guys here”. Also add various Arabic lettering on the weapons.
It's probably horrifying!
They make presentations. Most of their work is presentations with diagrams. Icons.
This is on purpose so OpenAI can then litigate against them. This API isn't about a new feature, it's about control. OpenAI is the biggest bully in the space of generative AI and their disinformation and intimidation tactics are working.
For the curious I generated the same prompt for each of the quality types. ‘Auto’, ‘low’, ‘medium’, ‘high’.
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL:E 3 images for comparison in a comment
> the same prompt for each of the quality types. ‘Auto’, ‘low’, ‘medium’, ‘high’.
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
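(For anyone who wants to reproduce that comparison, here's a rough sketch using the Python SDK; the prompt is the one from the parent comment and the filenames are made up.)

    # Rough sketch: generate the same prompt at each quality tier for a
    # side-by-side comparison; token usage (and so cost) rises with quality.
    import base64
    from openai import OpenAI

    client = OpenAI()
    prompt = "a cute dog hugs a cute cat"

    for quality in ("low", "medium", "high", "auto"):
        result = client.images.generate(
            model="gpt-image-1",
            prompt=prompt,
            size="1024x1024",
            quality=quality,
        )
        with open(f"dog_cat_{quality}.png", "wb") as f:
            f.write(base64.b64decode(result.data[0].b64_json))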
I generated 5 images in the playground. One using a text-only prompt and 4 using images from my phone. I spent $0.85 which isn't bad for a fun round of Studio Ghibli portraits for the family group chat, but too expensive to be used in a customer facing product.
> but too expensive to be used in a customer facing product.
Enhance headshots for putting on Linkedin.
I'm curious what the applications are where people need to generate hundreds or thousands of these images. I like making Ghibli-esque versions of family photos as much as the next person, but I don't need to make them in volume. As far as I can recall, every time I've used image generation, it's been one-off things that I'm happy to do in the ChatGPT UI.
As usual for AI startups nowadays, using this API you can create a downstream wrapper for image generation with bespoke prompts.
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than others, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit. (A rough sketch of what such a wrapper looks like follows below; the "house style" text and function name are invented for the example.)
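    # Illustrative wrapper: wrap a user's short request in a bespoke
    # "house style" prompt before calling the image API. The style text
    # and function name are invented for the example.
    import base64
    from openai import OpenAI

    client = OpenAI()

    HOUSE_STYLE = (
        "Flat pastel illustration, soft lighting, clean vector-like shapes, "
        "no text or watermarks. Subject: {subject}"
    )

    def generate_branded_image(subject: str, path: str) -> None:
        result = client.images.generate(
            model="gpt-image-1",
            prompt=HOUSE_STYLE.format(subject=subject),
            size="1024x1024",
            quality="medium",  # cheaper tier while iterating
        )
        with open(path, "wb") as f:
            f.write(base64.b64decode(result.data[0].b64_json))

    generate_branded_image("a bowl of ramen on a wooden table", "ramen.png")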
AI-assisted education is promising.
I'm still struggling to see how you would need thousands of AI generated images rather than just using existing real images for education.
- personalization (style, analogy to known concepts)
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
But LLMs are not reliable enough, so you can not actually expect “specificity”
Not perfect now, but adequate in some domains. Will only get better.
That is true in a broader sense, but education and abundant money don't generally go hand in hand.
don't I know it
I use the API because I don't use ChatGPT enough to justify the cost of their UI offering.
Imagine an AI recipe building app that helps you create a recipe with certain ingredients, then generates an image of what the final product might look like.
Pricing-wise, this API is going to be hard to justify the value unless you really can get value out of providing references. A generated `medium` 1024x1024 is $0.04/image, which is in the same cost class as Imagen 3 and Flux 1.1 Pro. Testing from their new playground (https://platform.openai.com/playground/images), the medium images are indeed lower quality than either of those two competitor models and still take 15+ seconds to generate: https://x.com/minimaxir/status/1915114021466017830
Prompting the model is also substantially different from, and more difficult than, prompting traditional models, unsurprisingly given the way the model works. The traditional image tricks don't work out-of-the-box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations).
ChatGPT's prompt adherence is light years ahead of all the others. I won't even call Flux/Midjourney its competitors. ChatGPT image gen is practically a one-of-a-kind product on the market: the only usable AI image editor for people without image editing experience.
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
Well, there's also gemini-2.0-flash-exp-image-generation. Also autoregressive/transfusion based.
It's also good, but clearly still not close. Maybe Gemini 2.5 or 3 will have better image gen.
Such a good name....
This is a take so hyperbolic it doesn't seem credible.
I can confirm, ChatGPT's prompt adherence is so incredibly good, it gets even really small details right, to a level that diffusion-based generators couldn't even dream of.
It is correct, the shift from diffusion to transformers is a very, very big difference.
it's 100% the correct take
yeah this is my personal experience. The new image generation is the only reason I keep an OpenAI subscription rather than switching to Google.
So, I've long dreamed of building an AI-powered https://iconfinder.com.
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again – I honestly decided to rebuild it only a couple weeks ago in preparation for today, once I saw ChatGPT's new image model – but so far there are thousands of free-to-download PNGs (any SVGs that have already been vectorized are free too; vectorizing costs a credit).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
That looks interesting, but I don't know how useful single icons can be. For me, the really useful part would be to get a suite of icons that all have a consistent visual style. Bonus points if I can prompt the model to generate more icons with that same style.
Recraft has a style feature where you give some images. I wonder if that would work for icons. You can also try giving an image of a bunch of icons to ChatGPT and have it generate more, then vectorize them.
Recraft's icon generator lets you do this.
https://imgur.com/a/BTzbsfh
It definitely captures the style - but any reasonably complicated prompt was beyond it.
I think the latter approach is the best bet right now, agree.
It seems to me like this is a new hybrid product for -vibe coders- because otherwise the -wrapping- of prompting/improving a prompt with an LLM before hitting the text2image model can certainly be done, as you say, cheaper if you just run it yourself.
Maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf).
Idk, this feels like a new offering between a full raw API and a final product where you abstract some of it for a few cents, and they're basically bundling their SOTA LLM models with their image models for extra margin.
> It seems to me like this is a new hybrid product for -vibe coders- because otherwise the -wrapping- of prompting/improving a prompt with an LLM before hitting the text2image model can certainly be done, as you say, cheaper if you just run it yourself.
In case you didn’t know, it’s not just wrapping in an LLM. The image model they’re referencing is a model that’s directly integrated into the LLM for functionality. It’s not possible to extract, because the LLM outputs tokens which are part of the image itself.
That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.
Right! I forgot the new model was a multi-modal one generating image outputs from both image and text inputs, i guess this is good and price will come down eventually.
waiting for some FOSS multi-modal model to come out eventually too
great to see openAI expanding into making actual usable products i guess
yeah, the integration is the real shift here. by embedding image generation into the LLM’s token stream, it’s no longer a pipeline of separate systems but a single unified model interface. that unlocks new use cases where you can reason, plan, and render all in one flow. it’s not just about replacing diffusion models, it’s about making generation part of a broader agentic loop. pricing will drop over time, but the shift in how you build with this is the more interesting part.
I find prompting the model substantially easier than traditional models. Is it really more difficult, or are you just used to traditional models?
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
> Prompting the model is also substantially different from, and more difficult than, prompting traditional models
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
https://genai-showdown.specr.net
Similarly to how 90% of my LLM needs are met by Mistral 3.1, there's no reason to use 4o for most t2i or i2i; however, there's a definite set of tasks that are not possible with diffusion models, or if they are, they require a giant ball of node spaghetti in ComfyUI to achieve. The price is high but the likelihood of getting the right answer on the first try is absolutely worth the cost imo.
Huh? For me the quality of the API seems to be identical to what I'm getting in ChatGPT.
Pretty amazing that in ~two years, a 15-second-latency AI image generation API that costs 4 cents lags behind competitors.
It may lose against other models on prompt-to-image, but I'd be very excited to see another model that's as good at this one as image+prompt-to-image. Editing photos with ChatGPT over the past few weeks has been SO much fun.
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
The dog ChatGPT generated doesn't actually look like your dog. The eyes are so different. Really cute image, though.
> A generated `medium` 1024x1024 is $0.04/image
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
16.7 cents is the high quality cost, and medium is 4.2 cents: https://platform.openai.com/docs/pricing#:~:text=1M%20charac...
Ah, they changed that page since I saw it yesterday.
They didn't show low/med/high quality, they just said an image was a certain number of tokens with a price per token that led to $0.16/image.
No, it's not
It's far and away the most powerful image model right now. $0.04/image is a decent price!
This is extremely domain-specific. Diffusion models work much better for certain things.
Can you cite an example? I'm really curious where that set of usecases lies.
Explicit adult content.
False. That has nothing to do with the model architecture and everything to do with cloud inference providers wanting to avoid regulatory scrutiny.
I work in the space. There are a lot of use cases that get censored by OpenAI, Kling, Runway, and various other providers for a wide variety of reasons:
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
Sure but that has nothing to do with the model architecture and everything to do with the cloud inference providers wanting to cover their asses.
What does any of that have to do with the distinction between diffusion vs. autoregressive models?
I don't think so. This model kills the need for Flux, ComfyUI, LoRAs, fine tuning, and pretty much everything that's come before it.
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train that not even Black Forest Labs has access to.
ComfyUI supports 4o natively, so you get the best of both worlds. There is so much you can't do with 4o alone, because there's a fundamental limit on the level of control you can have over image generation when your conditioning is just tokens in an autoregressive model. There's plenty of reason to use Comfy even if 4o is part of your workflow.
As for LoRAs and fine tuning and open source in general; if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
“ Editing videos: invideo enables millions of users to transform their ideas into videos using AI. With the integration of gpt-image-1, the platform now offers improved text generation, fine-grain editing controls, and advanced style guidance.”
Does this mean this also does video in some manner?
Usage of gpt-image-1 is priced per token, with separate pricing for text and image tokens:
- Text input tokens (prompt text): $5 per 1M tokens
- Image input tokens (input images): $10 per 1M tokens
- Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
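(A back-of-the-envelope sketch of how those per-token prices turn into a per-image cost; the token counts below are placeholders, since the real counts come back in each API response's usage data.)

    # Back-of-the-envelope cost estimate from the per-token prices quoted above.
    # Token counts here are placeholders; a real integration would read actual
    # usage numbers from each API response.
    PRICE_PER_M_TOKENS = {"text_in": 5.00, "image_in": 10.00, "image_out": 40.00}  # USD

    def image_cost(text_in: int, image_in: int, image_out: int) -> float:
        return (
            text_in * PRICE_PER_M_TOKENS["text_in"]
            + image_in * PRICE_PER_M_TOKENS["image_in"]
            + image_out * PRICE_PER_M_TOKENS["image_out"]
        ) / 1_000_000

    # e.g. a 50-token prompt and a ~1,000-token output image works out to about
    # $0.04; actual token counts vary with image size and quality tier.
    print(f"${image_cost(text_in=50, image_in=0, image_out=1_000):.3f}")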
that's a bit pricy for a startup.
For the curious, this is LLM-based rather than diffusion based, meaning that it adheres to text prompts with much higher accuracy.
As an example, some users (myself included) of a generative image app were trying to make a picture of person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
This new model did it in one shot!
Source? It's much more likely that the LLM generates the latent vector which serves as an input to the diffusion model.
It's a mix of both, it feels to me, as I've been testing it. For example, you can't get it to make a clock showing a custom time like 3:30, or someone writing with their left hand. And it can't follow many instructions or do them very precisely. But it shows that this kind of architecture will most likely be capable of that if scaled up.
> GoDaddy is actively experimenting to integrate image generation so customers can easily create logos that are editable [..]
I remember meeting someone on Discord 1-2 years ago (?) working on a GoDaddy effort to have customer-generated icons using bespoke foundation image gen models? Suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
Anyone have an idea of what an "image token" represents for the pricing? Is it a fixed-size block of the image?
Great SVG generation would be far more useful! For example, being able to edit SVG images after they're generated by AI would make last-mile modifications quick. For our new website https://resonancy.io the simple SVG workflow images were still very much created by hand, and trying various AI tools to make such images yielded shockingly bad off-brand results, even when provided multiple examples. By far the best tool for this is still Canva for us.
Anyone know of an AI model for generating SVG images? Please share.
Recraft also has an svg model: https://replicate.com/recraft-ai/recraft-v3-svg
One note with these is that most of the production ones are actually diffusion models that get run through an image->SVG model after. The issue with this is that the layers aren't set up semantically like you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
I was impressed with recraft.ai for SVGs - https://simonwillison.net/2024/Nov/15/recraft-v3/ - though as far as I can tell they generate raster images and then SVG-ize them before returning the result.
SVGFusion https://arxiv.org/abs/2412.10437 which is a new paper from SVGRender group https://huggingface.co/SVGRender
OmniSVG https://arxiv.org/abs/2504.06263v1
Amazing thanks for sharing! Will have a read. A commercial model would be something that I will pay for!
Is free cheap enough ;)
https://omnisvg.github.io/
https://huggingface.co/OmniSVG
I don't know about -commercial- offerings but you can try also something like SVGRender which you should be able to run on your own GPU etc https://ximinng.github.io/PyTorch-SVGRender-project/
first paper linked on prior comment is the latest one from SVGRender group, but not sure if any runnable model weights are out yet for it (SVGFusion)
Try neoSVG or Recraft, it is awesome!
Hmm seems pricey.
What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1c per HD (1920*1080) image range?
"Image from a reference" is a bit of a rabbit hole. For traditional image generation models, in order for it to learn a reference, you have to fine-tune it (LoRA) and/or use a conditioning model to constrain the output (InstantID/ControlNet)
The interesting part of this GPT-4o API is that it doesn't need to learn them. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
Reflux is fantastic for the basic reference-image-based editing most people are using this for, but 4o is far more powerful than any existing models because of its large scale and cross-modal understanding; there are things possible with 4o that are just 100% impossible with diffusion models (full glass of wine, horse riding an astronaut, room without pink elephants, etc.).
Imagen supports image references in the API as well, just on Vertex, not on Gemini API yet.
lesson: never build your moat around optimizing the existing AI capability
Does anyone know if you can give this endpoint an image as input along with text - not just an image to mask, but an image as part of a text input description.
I can’t see a way to do this currently, you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
Eg you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or in the moon.
Seems like exactly one of their examples, or am I missing something? "Create a new image using image references" https://platform.openai.com/docs/guides/image-generation#cre...
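(A minimal sketch of that documented flow with the Python SDK; the file names and prompt are placeholders.)

    # Minimal sketch of the "image references" flow linked above: pass one or
    # more input images plus a text prompt to the edits endpoint. File names
    # and prompt are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()

    result = client.images.edit(
        model="gpt-image-1",
        image=[open("person_at_desk.png", "rb")],
        prompt="The same person standing up next to the desk, same style and lighting",
    )

    with open("person_standing.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))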
I think this is technically "image variations" and I think image variations are still only dall-e 3 for now (best I could tell earlier today from the API)
Intelligence is fast approaching utility status.
Thank you for a great contribution to global warming.
Lots of comments on the price being too high, what are the odds this is a subsidized bare metal cost?
just based on how long it takes to produce these images, and how much text responses cost, I wouldn't be surprised at all if it was close to cost
Does the AI have the same content restrictions that the chat service does?
Far too expensive, I think I will wait for an equivalent Gemini model.
I don't understand why this API needs organization verification. More paperwork ahead. Facepalm
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
Likely because they've seen a lot of the potential abuse capabilities. i.e. the "generate a drivers license with this face".
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
The model is already pretty lobotomized refusing even mundane requests randomly.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also after they enabled global chat memory I started seeing my other chats leaking into the images as literal text. Disabled it since.
aren't you all embarrassed seeing lame press releases of the most uninteresting things on the top of HN front page? i kinda feel bad.
I'm embarrassed that you find revolutionary tech uninteresting.
It's literally one feature now available in a different billing format. Get a grip.
This news is relevant for developers though.
How so? I'm (nominally) a developer and this has nothing to do with my job or personal pursuits.
I don't get it. I've been using `dall-e-3` over the public API for a couple years now. Is this just a new model?
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?
This is the new model that's available in ChatGPT, which most notably can do transfer generation. i.e. "take this image and restyle it to look like X". Or "take this sneaker and give me a billboard ad for it"
This is their presumably auto regressive image model. It has outstanding prompt adherence and great detail in addition to strong style transfer abilities.
The new image generation model is miles ahead of DALL-E 3, especially when generating text.
Basically they are charging for the ability to generate accurate text.