Alexa, will generative AI make you more useful?

The Scene

While the U.S. Federal Trade Commission has branded Amazon a monopoly, advances in artificial intelligence are spurring the tech giant to act more like a startup these days.

It has scrapped some bureaucratic hoops for what Rohit Prasad, Amazon’s head scientist in charge of building the company’s artificial intelligence products, calls the “transformation of making.” He’s leading the effort to bring generative AI into people’s homes, which he says is as big a change as when Alexa was introduced in 2014.

Last week, Amazon unveiled a slew of products revamped with generative AI models running everything from the Echo to the Fire Stick. In an edited conversation, Prasad talks about how the technology will change the way we live in big and small ways.

The View From Rohit Prasad

Q: I know you were already doing some of this before ChatGPT. But that was a big catalyzing moment. How has your job changed since then?

A: For a long time, we’ve been working on what I call generalized intelligence, or AGI. And AGI means that AI is capable of any task that you want it to do. It’s inherently multi-purpose. It can essentially learn with minimal human input. You and I are still the highest echelon of generalized intelligence. Now that’s changing a lot.

But our arc has always been generalized intelligence. So we just formalized that into an AGI organization that I’m responsible for. And Alexa continues to be one of the biggest manifestations of that AGI. Because the millions of customers who use it want it to do everything for them, and everyone wants it slightly differently. And this is why it is the personal AI in my mind.

Q: You stepped it up to this LLM-style, more conversational experience. Does it hallucinate like some of the other foundation models?

A: LLMs are a waypoint towards generalized intelligence. They aren’t the endpoint. What you are seeing is still early days. All LLMs still hallucinate a bit. What we have done is take extreme steps to ground them. If you said, ‘Alexa, it’s hot in here,’ a browser-based LLM system will say, ‘go to the beach.’ But if you asked Alexa on your home device, it should be grounded by knowing you have a thermostat connected to it. It should ask, ‘Do you want me to lower the temperature by five degrees?’

That’s the kind of important grounding that we have done. In addition, hallucinations are common because these are built on a token predictor. So you have to do a lot of fine tuning, as well as aligning so that it doesn’t hallucinate. Over the years of working on Alexa in the consumer domain, we have learned a lot on how to align these models.

Q: Is the model a continuation of the Alexa model or is it an entirely new model?

A: This is a new model we have built. But we’ve had years of experience with encoder-decoder models, which we were using for Alexa to learn new languages, new domains. And then we also had a visual language model, so that you can ask about product features. And some of the visual language models are also useful for things like visual processing. So we have had an immense amount of experience with that.

But this is a new model. Very large. And then it’s fusing in real-time devices and services, your personal context of what you watch, what you listen to, what your favorite teams are, whether you are a vegetarian. And of course, how to align these models the right way, especially when you connect it to the real world.

Q: You say it’s large, but I imagine you wouldn’t want it to be as large as a GPT-4, because then you could introduce more uncertainty. Is the training set based on all the data you’ve gathered from Alexa over the years?

A: It’s trained on publicly available data as well as in-house datasets. But the real thing is about performance. And here, there are a lot of differences. We are focused on the home, not the browser, with this model. In the home, and especially in ambient interactions with devices that are around you, you want responses to be succinct.

The model is fairly large. There’s a pre-training stage where it’s a token predictor. That’s a massive model. And then you’re fine tuning it for voice interactions. You’re grounding it with real world APIs so that it does the right thing at the right moment for you. Plus, you’re using the personal context so it knows things like your favorite team, or the weather in your location.

Q: This is the first one I’ve seen where the multimodal text and speech are so integrated. Is this the first one of this kind?

A: There are several firsts here. This is the biggest integration of APIs, inclusive of devices and services, into an LLM. I don’t know if anyone has done as big an integration. Second, you’re rightfully thinking of this as these processes of converting speech to text, text to LLM, and then a synthesized voice. It’s much more tightly integrated. Not all the way, like the speech-to-speech model that is still in the lab.

What you saw in that demonstration is incredibly hard: talking to a device at a distance, maintaining context. It’s really looking at what’s the right way of having voice interactions. More things are happening jointly versus independently now. For instance, if the answer is that the Red Sox, which it knows is my favorite team, won, it can transfer that knowledge into the synthesis engine so that it renders it in a joyful mood. And if they lost, it will be more empathetic. So there is a lot more coupling happening. And then the speech-to-speech model in the future will make that coupling even easier.

Q: You didn’t talk about images in your presentation. That came in later with the Fire Stick.

A: That’s also a lot of our in-house models working on text-to-image generation. What we have is more multilingual and multimodal. So what we showed was more from a customer lens than the science behind the scenes.

Q: What are some examples of consumer experiences that multimodal AI will make possible?

A: If we go back to my re:MARS talk from 2020, you would see we can talk about, like, how many pockets does this have if you’re looking at a pair of pants. Do you have that shirt in blue? We’ve worked on those capabilities before, but now we’re building more performant models.

Q: There’s a battle now for AI talent. You’re seeing some of the top people at Google and Meta go out on their own.

A: We have been fortunate because we have had people who’ve built the team for over 10 years now. If you look at Alexa’s momentum, it’s massive. We have had the number of people using Alexa on a daily basis grow by 30% year on year. Then if you look at AWS, then you look at ads, then you look at all the businesses that Amazon has. As a scientist, I’m drooling over so many problems to work on. We are the best place to try out ideas and make them actually useful for millions of customers.

Q: Are you at MIT pitching that to people finishing their PhDs?

A: That’s the reason I never left Cambridge. But I’ve built teams in Europe, in India, the West Coast, in the Bay Area, and LA. We are quite diverse in both our talent and geographical footprint.

Q: What’s the next step in AI or generative AI?

A: AGI is going to be transformational. It’s going to have a multi-generational impact. One incarnation of that is Alexa. My kids grew up with it. It’s embedded in their social life. They just treat it as a family member. So it is truly multi-generational. I’m an optimist in the sense that it will impact all businesses. And what we’ll find is new types of jobs are created that we can’t even imagine today. That’s the right next step. Humanity is key. Overall, the economic impact will be higher. We’ll have a much better planet if AGI does things right.

Q: Will these models just get bigger and become more capable?

A: We’ll find that size matters. But it’s all about competency. And competency in terms of how these models learn is still primitive. There are a lot of limitations with LLMs today, such as how you can augment them with memory. They don’t have infinite memory. They don’t really know how to compute unless you give them a tool. They don’t know how to do massive large-scale reasoning. They can do short-term reasoning with a chain of thought. I’m very optimistic that these limitations will be overcome over time.

Q: On the infrastructure side, that’s not all there yet, right? The data centers don’t all have GPUs. I assume the inference on these models uses GPUs. What are you doing to bring the costs down to make this work at scale?

A: AWS is the best place to do any AI in terms of compute. We’ve built massive data centers. They have third party GPUs like Nvidia’s, but also accelerated platforms like Trainium and Inferentia. Trainium is our chip for training and Inferentia for inference. And those are very performant in terms of cost-compute. And this is, again, my pitch to scientists. We have the infrastructure, we have the data, and algorithms. Now you can come and make everything better.

Q: If you could snap your fingers, though, would you have a certain percentage more power or more GPUs? Is there a bottleneck that’s holding back the science right now?

A: I wouldn’t call it a bottleneck. At the same time, we should be looking at it more sustainably. At Amazon, we have a frugality principle, which breeds invention. So you should be thinking, ‘How do I make the best use of the resources we have?’ You always have some constraints. The best way is to think of inventing our way out of it.

Q: One of the big ideas in generative AI is the agent concept — an AI that can shop for you or buy your airline tickets. But there’s also some disillusionment that these agents are not going to work as well as we thought. You demonstrated one way it can work. Do you see this as a service within Amazon’s ecosystem?

A: Alexa is that super intelligence that works on your behalf. Today, it already controls your home. It assists you with your tasks. It’s being used by two-year-olds to 100-year-olds. It doesn’t take huge imagination to say, ‘If I’m running out of milk, you can order it.’

Then there will be many vertical agents. They’re specialized at a certain task. So if you have a law firm, you have an agent that you can build for your work. If you are in the travel industry, you may have a travel agent. But there will also be these super agents or the super AI like Alexa. Both are possible now.