The most noticeable difference between Gemini and its competitors is that it is “multimodal,” meaning it was trained on a mixture of text, audio, and video. Other large language models also have multimodal capabilities, but achieve them by combining multiple models, each handling a single modality.

Google said the “native” multimodal approach gives Gemini better reasoning skills in its image analysis.

In one example shared with reporters, Google showed Gemini watching a person’s hands as they perform a magic trick with a quarter. The model first tries to guess which hand the quarter is in, then when it is wrong, realizes that it was fooled. “The coin is in the left hand, using a sleight of hand technique to make it appear as if the coin has disappeared,” Gemini said.

In another, it’s shown several paper airplane designs by YouTube celebrity and former NASA engineer Mark Rober, who asks it to determine which one will fly most effectively. Gemini correctly determines the best design.

It was also able to watch a video of a normally dressed person mimicking the body movements of Keanu Reeves in The Matrix as his character, Neo, dodges bullets. Gemini correctly guesses that the person is reenacting a scene from the movie. Eli Collins, vice president of product for DeepMind, said the model learned that scene from “copyright safe data” found on the open web.

Google researchers said there were questions about whether the multimodal approach would be able to perform as well as or better than models that focused solely on one specific modality, a kind of specialist-versus-generalist debate.

But they said they found their generalist model prevailed. “Gemini sets a new state of the art across a wide range of text, image, audio, and video benchmarks,” they wrote in a paper released Wednesday.

Google said Gemini also beats all other large language models in basic math capability and can understand physics.

The company declined to reveal the size of the Gemini model, giving figures only for the smallest version, called Gemini Nano, which can run on Google Pixel smartphones. But the company said it took advantage of new compute capabilities that utilize the latest version of Google’s custom chips, known as Tensor Processing Units.

That is notable because other leading large language models, like OpenAI’s GPT-4 and Anthropic’s Claude, were trained using Nvidia graphics processors, which are in short supply and expensive to operate.

Google said Gemini is designed to run more efficiently on its processors, but declined to provide specific figures.

All three Gemini models (Nano, Pro and Ultra) will be available to enterprise customers, who can tap the models’ capabilities and offer them to their own clients.