Google launched its long-anticipated Gemini AI model on Wednesday, a move the company says puts it at the front of a race long dominated by ChatGPT maker OpenAI.
It’s Google’s attempt to reclaim the lead it gave away after researchers there made the 2017 breakthrough that allowed ChatGPT to exist in the first place. Google said Gemini is ahead of all other AI models in 30 out of 32 industry-standard benchmarks, the majority of which were led by GPT-4, the most advanced one developed by OpenAI.
To build Gemini, Alphabet’s Google unit pulled together resources and talent from the far reaches of the 190,000-employee company, tapping DeepMind, the startup it acquired in 2014 to develop artificial general intelligence, as well as teams charged with pushing the limits of cloud computing and infrastructure.
“This new era of models represents one of the biggest science and engineering efforts we’ve undertaken as a company,” Alphabet and Google CEO Sundar Pichai said in a statement.
Consumers can test a pared-down version of Gemini starting Wednesday, when it is incorporated into Bard, the company’s chatbot. The most advanced version of Gemini is still undergoing tests to ensure it is safe for customers, the company said.
Eventually, Gemini will filter its way into most Google products, including the company’s generative, experimental search engine that could be the future of the company’s bread-and-butter business.
The firm is in a race with Microsoft to augment its suite of products, from documents to spreadsheets to email, with the new technology, eventually allowing people to converse with their computers as much as they click and type.
The most noticeable difference between Gemini and its competitors is that it is “multimodal,” meaning it was trained on a mixture of text, audio, and video. Other large language models also have multimodal capabilities, but they achieve them by combining multiple models, each handling a single modality.
Google said the “native” multimodal approach gives Gemini better reasoning skills in its image analysis.
In one example shared with reporters, Google showed Gemini watching a person’s hands as they perform a magic trick with a quarter. The model first tries to guess which hand the quarter is in, then when it is wrong, realizes that it was fooled. “The coin is in the left hand, using a sleight of hand technique to make it appear as if the coin has disappeared,” Gemini said.
In another, it’s shown several paper airplane designs by YouTube celebrity and former NASA engineer Mark Rober, who asks it to determine which one will fly most effectively. Gemini correctly determines the best design.
It was also able to watch a video of a normally dressed person mimicking the body movements of Keanu Reeves in The Matrix as his character, Neo, dodges bullets. Gemini correctly guesses that the person is reenacting a scene from the movie. Eli Collins, vice president of product for DeepMind, said the model learned that scene from “copyright safe data” found on the open web.
Google researchers said there had been questions about whether the multimodal approach would perform as well as or better than models that focus solely on one modality, a kind of specialist-versus-generalist debate.
But they said they found their generalist model prevailed. “Gemini sets a new state of the art across a wide range of text, image, audio, and video benchmarks,” they wrote in a paper released Wednesday.
Google said Gemini also beats all other large language models in basic math capability and can understand physics.
The company declined to reveal the size of the Gemini model, giving figures only for the smallest version, called Gemini Nano, which can run on Google Pixel smartphones. But the company said it took advantage of new compute capabilities that utilize the latest version of Google’s custom chips, known as Tensor Processing Units.
That is notable because other leading large language models, like OpenAI’s GPT-4 and Anthropic’s Claude, were trained using Nvidia graphics processors, which are in short supply and expensive to operate.
Google said Gemini is designed to run more efficiently on its processors, but declined to provide specific figures.
All three Gemini models (Nano, Pro and Ultra) will be available to enterprise customers, who can tap the models’ capabilities and offer them to their own clients.
With few exceptions, companies working in the AI industry or offering AI services to their employees say GPT-4 is the undisputed winner in terms of capability.
Those assessments are based mainly on real-world experience rather than on benchmarks, which don’t tell the whole story. Companies have their own criteria based on their specific needs, and in pretty much every case, GPT-4 has no close competition.
It could be that Gemini’s claimed success in a wide variety of benchmarks will mean it outperforms GPT-4 in the real world. We won’t know for sure until Google’s model reaches wide distribution and is put to the test by the same companies that have found GPT-4 to be the most capable.
And where Google competitor Microsoft is reliant on OpenAI to develop new models, Google has now shown it is able to build state-of-the-art AI completely in-house. That advantage is especially important after OpenAI CEO Sam Altman was fired last month from the company under mysterious circumstances, only to be rehired after the startup came close to dissolving.
Still, by at least one practical measure, GPT-4 remains the undisputed winner. It can ingest about 300 pages of text in a single prompt, a capacity known as the “context window.” That matters for use cases like legal research, which depend on analyzing long documents. Gemini can handle only about one quarter as much text, according to the paper released Wednesday, though this is the first version and the context window will increase, along with other capabilities.
But for most enterprise needs, Gemini Ultra will be overkill, just like GPT-4. Most companies find they can use much smaller and less capable models, which are less expensive, with the same amount of success. That’s because for business use cases, companies are not looking for general purpose AI. They want models that zero in on data stored on corporate servers.
Today, general purpose AI models like GPT-4 and Gemini are useful for consumers. But there’s another possible customer for Gemini: Startups.
A new generation of AI startups aims to create “agents” that can take autonomous action on behalf of users. Think of them as AI personal assistants. Today, even GPT-4 is insufficient to deliver this experience. Could Gemini, with its multimodal capabilities, allow more ambitious products from AI startups?
We won’t know until startups can use it in earnest, but some of the capabilities Google showed off in demos suggest it may represent a new level of capability.
Even if Gemini is not a game changer right off the bat, it clearly represents a long-term threat to OpenAI’s dominance. When it comes to LLMs like Gemini, Google is kind of a sleeping giant awakened by ChatGPT.
Many of Google’s best minds reside within DeepMind, which has been busy on more narrow applications of AI. DeepMind’s accomplishments like AlphaFold are arguably more impactful and important than ChatGPT.
Now, DeepMind is focusing its brain power on general purpose AI models and the results are pretty stark. Gemini is on its first version and looks like it may have instantly become the industry standard. I can only imagine what Gemini 3 or 4 will look like.
The View From OpenAI
Sometime in the spring or summer of 2024, OpenAI will likely release GPT-5, the next and most advanced version of its large language model. OpenAI has hinted that this will be orders of magnitude better than GPT-4 and, like Gemini, will be multimodal from the ground up.
If GPT-5 achieves artificial general intelligence, the milestone at which AI matches human-level intelligence in most tasks, it will have left Google in the dust.
But of course, achieving AGI on the next big model is unlikely. What we may see is that Google and OpenAI keep leapfrogging each other in capability until one eventually gets there first.