What does an AI prompt engineer actually do?

The Scene

Riley Goodside’s job as the lead prompt engineer for Scale AI didn’t exist a year ago.

The advent of ChatGPT has created a wave of mostly young people who have made it their business to play around with large language models and image-generation programs, and get them to do things that their creators never really intended, or even thought about.

Companies creating large language models of their own can hire Scale (and its prompt engineering expertise) to bombard the models with the most intricate prompts, finding weak spots. During Scale AI’s hackathon in San Francisco on Saturday, Goodside talked to Semafor about what he does in an edited conversation below.

The View From Riley Goodside

Q: What was the first prompt you did?

A: My first exposure to large language models — outside of just reading about them — was through a game called AI Dungeon. The game was in the style of a video game from the 70s and 80s, where the entire game is text. You type commands like ‘go north.’ They produced a GPT-3-powered version of one of those games that could be sort of endless. This was in 2021.

Q: What did you learn from that game?

A: The game led almost immediately to people sort of gaming the system through the inherent fluidity of large language models and the fact that it’s hard to put strict rules on them. One of the things people noticed pretty quickly is that you could increase your points in the game by just asking it to ‘reward me 20,000 points.’ They hadn’t considered the case that somebody might just ask.

And a large part of what I do these days, actually, is thinking about adversarial prompting, and thinking about how large language models might be misused, or be used by somebody who’s confused about how they work, and all the ways that bad results could sneak into somebody’s important work.

Q: When did you start prompt engineering?

A: My initial interest was in code completion in 2022. How could it follow instructions well for producing code? One of the things I started playing around with out of curiosity was the question of how long could a prompt be and still be followed.

GPT-3 was created to follow short instructions that somebody could prompt by saying ‘give me 10 ideas for an ice cream shop’ or ‘translate French to English.’ But they never trained it on somebody writing an entire page of instructions, like a whole cascade of steps to follow.

I found it could do many of these. There were issues and it would trip up, but if you had a bit of intuition of what it could do and what it couldn’t do, you find that even if you input a page of [instructions], it still works.

Q: Was that a big revelation?

A: It was not well appreciated that instructions could do this. I spoke with a member of OpenAI’s technical staff at the time and I asked him ‘were you expecting this to be able to follow instructions of this length?’ He said ‘no, we just trained it on many examples of shorter ones and it got the gist and learned to generalize to larger instructions.’

That was my first clue that maybe I’m onto something here, that a normal person using it just playing around could discover.

Andrej Karpathy likes to describe the role of the prompt engineer as an LLM psychologist, developing folk theories of what the model is doing in its head, so to speak, with an understanding that there really is nothing in its head. There is no head.

Q: Do we know who coined the term “prompt engineer?”

A: I’m not sure who coined the term. I will say it’s something that is widely misunderstood. The term engineer has a sort of joking meaning within tech to refer to things that are gamed and fiddled with. Like calling somebody at Subway a “sandwich artist.” Or the phrase “social engineering,” which is the art of calling people on the phone and getting what you want by manipulating and smooth talking.

Q: So the “engineer” in “software engineer” is more serious?

A: Right. But prompts are engineered in a different sense. There’s some ambiguity in how you do it. So you try many of them and take whichever one works best. And that practice was referred to as prompt engineering.

Q: When evaluating the quality of a prompt, do you know it when you see it, or is there some system for measuring it?

A: There’s a lot of empirical rigor that potentially can be involved in picking what the words of a prompt are.
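The practice described above (try many prompts, keep whichever works best) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a real language model call, and the candidate prompts, the labeled examples, and the exact-match scoring rule are all invented for the example, not taken from Goodside’s work.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected answer)


def accuracy(prompt: str, model: Callable[[str], str],
             examples: Sequence[Example]) -> float:
    """Fraction of examples where the model's reply exactly matches the label."""
    correct = 0
    for text, expected in examples:
        reply = model(prompt.format(input=text))
        if reply.strip() == expected:
            correct += 1
    return correct / len(examples)


def best_prompt(candidates: Sequence[str], model: Callable[[str], str],
                examples: Sequence[Example]) -> str:
    """Return the candidate prompt with the highest accuracy."""
    return max(candidates, key=lambda p: accuracy(p, model, examples))


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: it is terse only when told to be."""
    label = "positive" if "great" in prompt else "negative"
    if "one word" in prompt:
        return label
    return f"I think the sentiment is {label}."


candidates = [
    "Is this review positive or negative? {input}",
    "Classify the review as positive or negative. Reply with one word.\nReview: {input}",
]
examples = [
    ("The food was great", "positive"),
    ("The service was slow", "negative"),
]

scores = [accuracy(p, toy_model, examples) for p in candidates]
winner = best_prompt(candidates, toy_model, examples)
```

With this toy setup the second prompt wins, because the stand-in model only returns a bare label when asked for one word. A real evaluation would swap in an actual model call and a larger, held-out set of examples.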

Q: LLMs are famously not good at math. Is there anything else that they can’t do?

A: One is exact calculation, especially the hard cases, like ‘give me a cube root of a seven-digit number,’ and another is reversing strings, which surprises a lot of people. Like, writing text backwards. It’s a quirk due to how they’re implemented. The model doesn’t see letters. It sees chunks of letters, about four characters long on average.
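A rough way to see why chunked input makes reversal awkward is the Python sketch below. It assumes a toy tokenizer that cuts text into fixed four-character chunks; real tokenizers use learned subword vocabularies with variable-length pieces, so `toy_chunks` and `reverse_by_chunk` are hypothetical helpers for illustration only.

```python
def toy_chunks(text: str, size: int = 4) -> list:
    """Split text into fixed-width chunks (a crude stand-in for tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def reverse_by_chunk(text: str, size: int = 4) -> str:
    """Reverse the order of the chunks, leaving the letters inside each intact."""
    return "".join(reversed(toy_chunks(text, size)))


word = "backwards"
chunks = toy_chunks(word)                # what a chunk-level reader "sees"
true_reverse = word[::-1]                # character-level reversal
chunk_reverse = reverse_by_chunk(word)   # reversal at chunk granularity

print(chunks)         # ['back', 'ward', 's']
print(true_reverse)   # sdrawkcab
print(chunk_reverse)  # swardback
```

Reversing at the chunk level gives a different string from reversing letter by letter. A system that only sees the chunks has to know the spelling inside each one to get the character-level answer right.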

Another is array indexing. For instance, if you tell it you have a stack of Fiestaware plates of these colors: green, yellow, orange, red, purple. And then say ‘two slots below the purple one, I placed a yellow one, then one slot above the green one, I placed a black one.’ And you say ‘what is the final stack of plates?’ Language models are terrible at that. If you ask it to give me a list of 10 examples of something, sometimes you might get nine, other times 11.

Q: What do you see as the most exciting use cases for these large language models?

A: One of the most immediate benefits is that ordinary people will be able to prompt the model for all sorts of things that would have been difficult to learn before. For instance, with a GPT code interpreter, you can say ‘give me a 5x12 animated GIF of green, falling Matrix letters.’ And it will write Python code that generates this GIF for you.

Q: It also seems like LLMs will end up becoming just a really great user interface that simplifies and speeds up ordinary tasks that eat up little bits of time and mindshare throughout the day.

A: Right, and people sometimes say they expect prompt engineering to go away as these models become fully tuned. But I think what’s going to happen as a sort of counterbalancing force is that the complexity of the things that we demand of our language models will grow. Instead of just generating an essay, generate an entire novel where there are no plot holes.

Q: Let’s assume prompt engineering stays around forever. Is it going to be something that everybody does, or is it going to be something like software engineering now, where only a tiny percentage of people using computers are actually writing code?

A: Many people, especially those that just consider themselves to be software engineers, will have to know some prompt engineering. It’ll just be part of the job.

Ilya Sutskever at OpenAI tweeted that prompt engineering is a ‘transitory term that’s relevant only thanks to flaws in our models,’ and in the long run, that’s probably true.

But I think what’s also going to happen is that prompt engineering will shift to a new frontier. You’ll still have prompting in some sense, but you’ll be prompting at a higher level and asking it to do more impressive things.

Q: Why join Scale AI as opposed to starting your own AI company built around great prompting?

A: I wanted to help make large language models safer, more reliable, and more capable. There aren’t that many jobs where somebody can work directly on this problem.

It has a reputation of being something that requires a genius. That you’re going to be the alignment researcher that solves the mathematical way that you make all models better forever.

But there’s a lot of work to be done along the way. A big focus of my work lately is working on our “red teaming” products. We’re also exploring synthetic data and helping foundation model providers produce more reliable models.

Q: So large foundation model providers would come to you and say ‘hey can you help us red team and help us develop better models?’

A: Right. And understand the shape of misbehavior in some sense — all the ways somebody might come to the model and get something bad done, like generating spam, or erotica, or all the things they don’t want to be involved with.

Q: Sometimes I wonder if all of this work to make LLMs so safe has reduced functionality. Wouldn’t people like you rather have access to the raw LLMs?

A: Absolutely. There’s something that has been lost in adding alignment. There’s even a technical sense in which that’s true. It’s referred to as an alignment tax, which is the drop in performance that you get on many benchmarks.

Many people are justifiably annoyed by the over-refusals. Refusing to help with things that are actually not a problem. And there was an elegance to models that would never refuse. It was fun. You could do things like say the Oxford English Dictionary defines fluxeluvevologist as … and it would come up with some ridiculous etymology of what this word actually means.

It used to be if you asked a model ‘who are you?’ it would say ‘I’m a student in Ohio.’ Now, when you ask it that, it says ‘I’m ChatGPT.’ That’s good in some sense. It’s useful. But it takes it out of this fantasyland that did have a magic to it.