We already know that OpenAI's chatbots can pass the bar exam without going to law school. Now, just in time for the Oscars, a new OpenAI app called Sora hopes to master cinema without going to film school. For now a research product, Sora is going out to a few select creators and a number of security experts who will red-team it for safety vulnerabilities. OpenAI plans to make it available to all wannabe auteurs at some unspecified date, but it decided to preview it in advance.
Other companies, from giants like Google to startups like Runway, have already revealed text-to-video AI projects. But OpenAI says that Sora is distinguished by its striking photorealism (something I haven't seen in its competitors) and its ability to produce clips up to a minute long, far beyond the brief snippets other models typically manage. The researchers I spoke to won't say how long it takes to render all that video, but when pressed, they described it as more in the "going out for a burrito" ballpark than "taking a few days off." If the hand-picked examples I saw are to be believed, the effort is worth it.
OpenAI didn't let me enter my own prompts, but it shared four instances of Sora's power. (None approached the purported one-minute limit; the longest was 17 seconds.) The first came from a detailed prompt that sounded like an obsessive screenwriter's setup: "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes."
The result is a convincing view of what is unmistakably Tokyo, in that magic moment when snowflakes and cherry blossoms coexist. The virtual camera, as if affixed to a drone, follows a couple as they slowly stroll through a streetscape. One of the passersby is wearing a mask. Cars rumble by on a riverside roadway to their left, and to the right shoppers flit in and out of a row of tiny shops.
It's not perfect. Only when you watch the clip a few times do you realize that the main characters, a couple strolling down the snow-covered sidewalk, would have faced a dilemma had the virtual camera kept running. The sidewalk they occupy seems to dead-end; they would have had to step over a small guardrail to a weird parallel walkway on their right. Despite this mild glitch, the Tokyo example is a mind-blowing exercise in world-building. Down the road, production designers will debate whether it's a powerful collaborator or a job killer. Also, the people in this video, who are entirely generated by a digital neural network, aren't shown in close-up, and they don't do any emoting. But the Sora team says that in other instances they've had fake actors showing real emotions.
The other clips are also impressive, notably one asking for "an animated scene of a short fluffy monster kneeling beside a red candle," along with some detailed stage directions ("wide eyes and open mouth") and a description of the desired vibe of the clip. Sora produces a Pixar-esque creature that seems to have DNA from a Furby, a Gremlin, and Sulley in Monsters, Inc. I remember that when the latter film came out, Pixar made a huge deal of how difficult it was to create the ultra-complex texture of a monster's fur as the creature moved around. It took all of Pixar's wizards months to get it right. OpenAI's new text-to-video machine ... just did it.
"It learns about 3D geometry and consistency," says Tim Brooks, a research scientist on the project, of that accomplishment. "We didn't bake that in; it just entirely emerged from seeing a lot of data."
While the scenes are certainly impressive, the most startling of Sora's capabilities are those it has not been trained for. Powered by a version of the diffusion model behind OpenAI's DALL-E 3 image generator as well as the transformer-based engine of GPT-4, Sora does not merely churn out videos that fulfill the demands of the prompts but does so in a way that shows an emergent grasp of cinematic grammar.
That translates into a flair for storytelling. Another video was created from a prompt for "a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures." Bill Peebles, another researcher on the project, notes that Sora created a narrative thrust with its camera angles and timing. "There's actually multiple shot changes; these are not stitched together, but generated by the model in one go," he says. "We didn't tell it to do that, it just automatically did it."
In another example I didn't view, Sora was prompted to give a tour of a zoo. "It started off with the name of the zoo on a big sign, gradually panned down, and then had a number of shot changes to show the different animals that live at the zoo," says Peebles. "It did it in a nice and cinematic way that it hadn't been explicitly instructed to do."
One feature in Sora that the OpenAI team didn't show, and may not release for quite a while, is the ability to generate videos from a single image or a sequence of frames. "This is going to be another really cool way to improve storytelling capabilities," says Brooks. "You can draw exactly what you have on your mind and then animate it to life." OpenAI is aware that this feature also has the potential to produce deepfakes and misinformation. "We're going to be very careful about all the safety implications for this," Peebles adds.