We’ve all been impressed by the generative art models: DALL-E, Imagen, Stable Diffusion, Midjourney, and now Facebook’s generative video model, Make-A-Video. They are easy to use, and the results are impressive. They also raise fascinating questions about programming languages. Prompt engineering, the craft of designing the prompts that drive these models, is likely to become a new specialty. There is already a self-published book on prompt engineering for DALL-E, and an excellent tutorial on prompt engineering for Midjourney. Ultimately, what we’re doing when we create a prompt is programming, but not the kind of programming we’re used to. The input is free-form text, not a programming language as we know it. It’s natural language, or at least it’s supposed to be: there’s no formal grammar or syntax behind it.
Prompt engineering books, articles, and courses inevitably teach a language: the language you need to know to speak to DALL-E. At present it is an informal language, not a formal language with a specification in BNF or some other metalanguage. But as this segment of the AI industry grows, what will people expect? Will users expect prompts that worked with DALL-E version 1.X to work with version 1.Y or 2.Z? If we compile a C program first with GCC and then with Clang, we don’t expect the same machine code, but we do expect the program to do the same thing. We have these expectations because C, Java, and other programming languages are precisely defined in documents ratified by a standards committee or some other body, and we expect any incompatibilities to be well documented. Likewise, if we write “Hello, World” in C, and again in Java, we expect the two programs to do exactly the same thing. By the same token, prompt engineers may expect a prompt that works for DALL-E to behave similarly with Stable Diffusion. Granted, the models may be trained on different data and therefore have different elements in their visual vocabulary, but if we can get DALL-E to draw a tarsier eating a cobra in the style of Picasso, shouldn’t we expect the same prompt to do something similar with Stable Diffusion or Midjourney?
Indeed, programs like DALL-E define something a bit like a formal programming language. The “formality” of that language does not come from the problem itself, or from the software implementing it – it is a natural language model, not a formal language model. The formality stems from user expectations. The Midjourney tutorial even talks about “keywords” – sounding like an early BASIC programming manual. I’m not saying there’s anything good or bad about this – value judgments don’t come into it at all. Users inevitably develop ideas about how things “should” behave. And the developers of these tools, if they want them to become more than academic toys, will have to think about user expectations on issues such as backward compatibility and cross-platform behavior.
This raises the question: what will the developers of programs like DALL-E and Stable Diffusion do? After all, these tools are already more than academic toys: they’re being used for commercial purposes (like designing logos), and we’re already seeing business models built around them. In addition to the fees for using the models themselves, there are already startups selling prompt strings, a market that assumes prompt behavior is consistent over time. Will the front end of image generators continue to be large language models, able to parse just about anything but giving inconsistent results? (Is inconsistency even a problem in this domain? Once you’ve created a logo, will you ever need to reuse that prompt?) Or will the developers of image generators consult the DALL-E prompt reference (currently hypothetical, but someone will write one eventually) and realize they need to implement that spec? If the latter, how will they do it? Will they develop a giant BNF grammar and use compiler-generation tools, leaving the language model aside? Will they develop a more constrained natural language model, less formal than a formal computer language but more formal than Semi-Huinty?1 Could they use a language model to understand words like tarsier, Picasso, and eat, but treat phrases like “in the style of” more like keywords? The answer will be important: it will be something we have really never seen before in computing.
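To make the hybrid idea concrete, here is a minimal sketch of what such a front end might look like: a handful of “keyword” phrases parsed formally, with everything else left as free-form text for a language model to interpret. All names here (the phrase table, the output structure) are hypothetical illustrations, not any real generator’s API.

```python
import re

# Hypothetical "keyword" phrases, parsed formally rather than by the
# language model. A real system might have many more, or a BNF grammar.
KEYWORD_PHRASES = {
    "in the style of": "style",
    "as a": "medium",
}

def parse_prompt(prompt: str) -> dict:
    """Split a prompt into keyword-tagged modifiers plus a free-form
    remainder, which would be handed to the language model."""
    subject = prompt
    modifiers = {}
    for phrase, slot in KEYWORD_PHRASES.items():
        # Match the keyword phrase and capture its argument, up to a comma.
        match = re.search(rf"\b{re.escape(phrase)}\b\s+([^,]+)", subject)
        if match:
            modifiers[slot] = match.group(1).strip()
            # Remove the keyword clause from the free-form remainder.
            subject = (subject[:match.start()] + subject[match.end():]).strip(" ,")
    return {"subject": subject, "modifiers": modifiers}

print(parse_prompt("a tarsier eating a cobra in the style of Picasso"))
# {'subject': 'a tarsier eating a cobra', 'modifiers': {'style': 'Picasso'}}
```

The interesting design question is exactly the one raised above: the words “tarsier” and “Picasso” stay in open vocabulary, while “in the style of” behaves like a reserved phrase whose semantics could be specified, versioned, and kept compatible across releases.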
Will the next step in the evolution of generative software be the development of informal formal languages?
1. Semi-Huinty is a hypothetical language somewhere in the Germanic family. It exists only in a parody of historical linguistics that was once posted on a bulletin board in a linguistics department.