The latest strange yet fascinating creation of OpenAI is DALL-E, which by way of hasty summary could be called “GPT-3 for images”. It produces illustrations, photos, renders, or whatever medium you prefer, from anything you can intelligibly describe, from “a cat wearing a bow tie” to “a daikon radish in a tutu walking a dog.”
What researchers built with GPT-3 was an AI that, given a text prompt, tries to produce a plausible version of what that prompt describes. So if you ask for “a tale of a kid who finds a witch in the woods,” it will try to write one, and if you press the trigger again, it will write it again, differently. Again, and again, and again.
Some of these attempts will be better than others; some may be barely coherent, while others will be almost indistinguishable from something written by a human. But it rarely produces garbage or severe grammatical errors, which makes it suitable for a number of tasks that start-ups and researchers are currently exploring.
DALL-E (a blend of Dalí and WALL-E) follows this concept. For years, AI agents have been translating text into pictures, with varying but steadily improving results. In this case, the agent uses the language understanding and context provided by GPT-3 and its underlying structure to create a plausible image that matches a prompt.
As OpenAI puts it:
GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.
Interestingly, a second model, CLIP, was used in combination with DALL-E to understand and rank the images in question; it is a bit more technical and harder to summarize. You can read more about CLIP here.
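In rough terms, CLIP embeds both the prompt and each candidate image into a shared vector space and keeps the candidates whose embeddings best match the text. A minimal sketch of that ranking step, using made-up toy embeddings rather than the real CLIP model (which produces 512-dimensional vectors from trained encoders):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(text_embedding, image_embeddings):
    """Return candidate indices sorted best-first by similarity to the text."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings, invented for illustration only.
text = [1.0, 0.0, 0.5]
images = [
    [0.9, 0.1, 0.4],   # close to the text embedding
    [0.0, 1.0, 0.0],   # unrelated
    [0.5, 0.5, 0.5],   # middling
]
print(rank_candidates(text, images))  # → [0, 2, 1]
```

This is only the scoring-and-sorting idea; the real system's value lies in the learned encoders that produce the embeddings in the first place.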
According to OpenAI:
In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.
Right now, like GPT-3, this technology is amazing, yet difficult to make clear predictions about.