Apr 16, 2024, 8:25am EDT
tech

Is there enough text to feed the AI beast?

Photo: Denis Balibouse/File Photo/Reuters

The Scoop

A Stanford study made headlines this week with the prediction that the largest AI models could run out of new text to scrape by the end of this year — but the study’s leader said he believes AI companies won’t really feel the crunch until at least the end of the decade.

Epoch, an AI forecasting research institute, projected the amount of data required to train models as they continue to scale in size, and compared this to how much data is expected to be published online in the future. Its results appear in the AI Index Report published by the Stanford Institute for Human-Centered AI this week.

The amount of data on the internet is growing at a pace of about 7% per year, Epoch’s director, Jaime Sevilla, said, while the amount of data AI is being trained on is increasing at 200% per year. If the biggest models have ingested most of the content already, there won’t be much new information for them to learn from.
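
As a rough back-of-the-envelope illustration of those growth rates, the Python sketch below projects when demand for training tokens would overtake the stock of public text if supply grows about 7% a year and demand about 200% a year. The starting quantities are made-up placeholders, not Epoch's actual estimates.

# Toy projection of the article's growth rates. The 7% and 200% figures come from
# Epoch's comments above; the starting token counts are illustrative assumptions only.
STOCK_GROWTH = 0.07    # public web text grows ~7% per year
DEMAND_GROWTH = 2.00   # training-data demand grows ~200% per year (i.e. triples)

stock_tokens = 500e12   # hypothetical: ~500 trillion tokens of usable public text
demand_tokens = 10e12   # hypothetical: ~10 trillion tokens used by today's largest runs

years = 0
while demand_tokens < stock_tokens and years < 30:
    years += 1
    stock_tokens *= 1 + STOCK_GROWTH
    demand_tokens *= 1 + DEMAND_GROWTH

print(f"Demand overtakes supply after about {years} years "
      f"({demand_tokens:.2e} vs {stock_tokens:.2e} tokens)")

With these placeholder numbers the crossover lands within a handful of years, which is why the exact size of the usable data stock matters so much to the forecast.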

But while the Stanford study says AI companies could run out of text within months, Epoch has adjusted its predictions and plans to publish a new research paper updating its estimates. Sevilla said he now believes there will still be enough public data left to train AI models “five or six years from now.”

The shift comes because Epoch’s analysts initially considered only high-quality text from reputable, human-edited sources, such as news articles and Wikipedia pages.

“We’re less sure about how important it’s going to be to train only on high-quality data. We think that broader kinds of data might still be useful, perhaps not to the same degree, but they might still be enough to continue the pace of scaling so we have become a bit more optimistic,” Sevilla said.

Know More

Tech companies will have to find new sources of data if they run out of information to scrape from the internet to train increasingly large AI models. Well-funded AI giants will probably find a way around this shortage. OpenAI’s CEO Sam Altman has said that it won’t be too much of a problem if AI can generate useful synthetic data.

Training on AI-generated outputs, however, is risky and unreliable. Models are prone to spewing false facts, and these errors are carried forward if they’re trained on the text they produce, degrading their performance over time. Last year, computer scientists showed how a language model, released by Meta in 2022, got worse when it was repeatedly trained on synthetic data.

In one example, a conversation about the architecture of an old church quickly morphed into a list of jackrabbits with different colored tails. Text produced by AI is often bland and repetitive, and nudging models to write more creatively comes at the cost of accuracy.
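
To see why recursive training compounds errors, here is a minimal toy simulation in Python. It is an analogy rather than the actual experiment described above: a trivial “model” (a fitted Gaussian) is repeatedly retrained on data it generated itself, and its picture of the original distribution drifts a little more with each generation.

# Toy analogy for model collapse: fit a distribution, sample from the fit,
# refit on the samples, and repeat. Not the Meta-model experiment from the article.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(1, 11):
    # "Train" a model on the current data: estimate its mean and standard deviation.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: fitted mean={mu:+.3f}, std={sigma:.3f}")
    # "Generate synthetic data" from the fit and train the next generation on it alone.
    data = rng.normal(loc=mu, scale=sigma, size=200)

Because each step sees only a finite sample of the previous model’s output, estimation error accumulates instead of averaging out; the researchers observed a much richer version of this failure mode at language-model scale.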

Figuring out how to generate good synthetic data is an active area of research. Microsoft has been exploring methods to produce text that’s both diverse and accurate. Meanwhile, Anthropic may have found a decent way around these issues since it trained its latest language model on some synthetic text.

There is another way to get more data, too: paying humans to create it. Several tech companies work with data labeling services to find people with writing skills or expert knowledge to produce content for training AI. Those with deep pockets, like OpenAI and Google, have negotiated content licensing deals with platforms and publishers worth millions of dollars per year.

Katyanna’s view

One immediate consequence of the text shortage may be a widening gap between the performance of open source AI and private commercial technologies.

Companies will have to pay for new sources of data, both by synthetically generating it themselves and by paying people to create it. Taking text produced by one commercial model to train a competing system typically violates the terms of use (although this is difficult to stop in practice), and those that can’t afford to pay creators for content could struggle to compete.

Room for Disagreement

Data shortages may not matter so much if computer scientists can develop new techniques or architectures that make models learn more efficiently on less information. There’s a push to make small models more effective by training them on other sources of text that are cleaner and more specialized, like textbooks, instead of general information scraped from the web.

Notable

  • Stanford HAI also found that private investment in AI has decreased overall, but there are more startups than ever and spending on generative AI has octupled to $25.2 billion.
  • To find more text, OpenAI reportedly used its AI speech-to-text tool, Whisper, to automatically transcribe audio from YouTube videos, according to the New York Times.
  • AI startups are training their own models on synthetic data generated by top systems like GPT-4 to copy its behavior, The Information reported.