In the burgeoning field of generative AI, the term “synthetic data” has become a lightning rod, dividing those who see it as the industry’s savior from those who warn it will destroy AI through what researchers call “model collapse.”
Data, and lots of it, is a necessary ingredient in generative AI. But “real” data, or content created by humans, is fraught with problems: much of it is copyrighted, and it is riddled with racial bias, inaccurate information, and pornography.
Yet synthetic data, which machine-learning models generate based on the patterns and properties of real-world data, can be thorny too. It can miss the nuances of human-created content, replicate human biases, and be difficult to validate for accuracy. If those shortcomings make it into large language models, one theory goes, the result is a vicious cycle in which models keep producing worse synthetic data that then gets fed back into newer models, creating an algorithmic version of Idiocracy.
In many ways, the fate of synthetic data sits at the center of the biggest questions facing generative AI. With artists, novelists, and even comedians claiming AI companies have illegally used copyright-protected material to train models, synthetic data could be a work-around. And synthetic data, which doesn’t require potentially costly licenses, may be necessary to make the next leap in capability because there isn’t enough data to keep improving models, especially for certain specialized areas of knowledge like biotech and drug discovery.
In exclusive interviews with Semafor, Microsoft researchers offered new insight into the role synthetic data will play in the development of new AI models, and it may not be what people feared or hoped.
“There is a lot of confusion out there,” said Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research.
The disorganized, vast pool of online words that make up the internet is what made ChatGPT incredibly smart. Researchers believe it’s also partly why it hallucinates and sometimes goes off the rails.
But what if AI models could learn the same lessons from a smaller, more organized, and more targeted dataset, like synthetic data? Microsoft put the theory to the test. The result was Phi-1.5, an open-source AI model that is a tiny fraction of the size of GPT-4 and yet has many of the same core capabilities.
The idea behind Phi was to get at the essence of how GPT-4, the AI model that powers ChatGPT, learned. And then use that knowledge to create a dataset capable of teaching those lessons to a smaller model in a more direct and efficient way.
“The first question that I want to ask is, ‘What were the minimum ingredients that were needed for this intelligence to emerge?’” said Bubeck.
Microsoft used the larger model to create a kind of curriculum to teach a smaller model — what researchers there called “textbooks.”
The author of those textbooks was GPT-4, which was prompted to stay laser-focused on data that researchers thought would lead to the best capabilities. In doing so, GPT-4 created a dataset stripped of its encyclopedic knowledge of the web, like a parent teaching children while sheltering them from the harsh and confusing realities of the world.
Researchers then made that child demonstrate its thinking with a series of exercises. “Just like human beings, after you’ve read the textbook, you don’t really know anything yet. You have to put this knowledge into action,” Bubeck said. “You have to do the exercises. The jump in capabilities was huge after this fine tune.”
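The mechanics of that recipe can be sketched in a few lines. Everything below is illustrative, not Microsoft’s actual pipeline: the prompt templates and the `build_curriculum` helper are assumptions, and `generate` stands in for a call to a large model such as GPT-4.

```python
# Hypothetical sketch of the two-stage "textbooks, then exercises" recipe.
TEXTBOOK_PROMPT = (
    "Write a short, self-contained textbook section that teaches {topic} "
    "clearly, with simple examples and no extraneous web trivia."
)
EXERCISE_PROMPT = (
    "Write practice exercises with worked solutions for this section:\n{section}"
)

def build_curriculum(topics, generate):
    """Stage 1: synthesize 'textbook' passages for each topic.
    Stage 2: synthesize exercises grounded in each passage.

    `generate` is any callable mapping a prompt string to generated text.
    """
    textbook = [generate(TEXTBOOK_PROMPT.format(topic=t)) for t in topics]
    exercises = [generate(EXERCISE_PROMPT.format(section=s)) for s in textbook]
    return textbook, exercises
```

Pretraining a small model on the `textbook` passages and then fine-tuning it on the `exercises` mirrors the “read the book, then do the exercises” jump in capability Bubeck describes.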
The pursuit of high-quality synthetic data has become a sub-industry in the field of AI. IBM’s “InfoSphere Optim Test Data Fabrication” product promises to create data from scratch to help customers “minimize the risks related to using sensitive production data.”
Bubeck said he understands how people could view synthetic data as potentially detrimental, but he says the process is more involved than most realize: “When we say synthetic data, we don’t mean very naively just a model to randomly generate data where you keep training on this randomly generated data until you have this loop, which, of course, is going to go in a bad direction.”
He said the data created by Microsoft’s researchers is not random. It’s highly tailored, and then filtered and checked by other models. For example, ChatGPT is known to be bad at math, but it can still produce what amounts to a math textbook for smaller models.
First, researchers could ask ChatGPT to create a large set of multiplication problems, of which perhaps only 10% are solved correctly. Those answers can then be fed into a calculator of sorts that filters out the incorrect ones. The result is a rich dataset for training smaller models. And Bubeck says the techniques used to create valuable synthetic data could also help large models like GPT-4 self-improve.
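That generate-then-verify loop is easy to illustrate. The sketch below is a toy stand-in, not the researchers’ code: a mock “model” emits multiplication problems with mostly wrong answers, and an exact arithmetic check plays the role of the calculator that filters them.

```python
import random

def mock_model_generate(n, error_rate=0.9):
    """Stand-in for an LLM emitting (a, b, claimed_product) triples.

    Most answers are deliberately corrupted to mirror the scenario
    where only ~10% of generated solutions are correct.
    """
    problems = []
    for _ in range(n):
        a, b = random.randint(2, 99), random.randint(2, 99)
        claimed = a * b
        if random.random() < error_rate:
            claimed += random.randint(1, 50)  # corrupt the answer
        problems.append((a, b, claimed))
    return problems

def filter_correct(problems):
    """The 'calculator of sorts': keep only triples whose product checks out."""
    return [(a, b, c) for a, b, c in problems if a * b == c]

raw = mock_model_generate(1000)
clean = filter_correct(raw)
# `clean` is now a verified dataset suitable for training a smaller model.
```

The key design point is that the verifier is cheap and exact while the generator is expensive and unreliable, so throwing away 90% of the output still yields trustworthy training data.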
For most applications, large language models don’t need to know everything. They need to be able to do something resembling reasoning.
For instance, ChatGPT isn’t that helpful at writing articles for me. It will produce something close to an article. But in the end, it takes me longer to edit ChatGPT than it would to write the article myself from scratch.
What I really want is for ChatGPT to be able to read my notes and every other article I’ve written for Semafor and figure out how to write something close to a finished product. And for that, a small, open-source model trained on synthetic data might work even better than ChatGPT. There would be less noise from an expansive training dataset and more focus on the data that actually matters — mine (and maybe Semafor’s).
We think of large language models as these monsters that hoover up the internet and then regurgitate it in a copyright lawyer’s nightmare. But that’s likely not where we’re headed. It’s not what most consumers want and therefore it isn’t what companies will want to build.
Artists, for instance, are going to want AI models that can help them create in their own style — not create a knockoff of someone else’s art.
It’s unclear whether synthetic data can help address issues like racial bias. It stands to reason that it would, because datasets can be created that are specifically tailored to address bias. But it’s such a tricky issue that we likely won’t know until people figure it out, if they ever do.
Room for Disagreement
The pursuit of synthetic data in large language models is still a new area and we can’t be sure where it will go. In a blog post, the AI company Syntheticus, which makes AI-generated synthetic data, argues: “Synthetic data may not capture the complexity of real-world datasets and can potentially omit important details or relationships needed for accurate predictions.
“For instance, a healthcare organization might generate synthetic patient data for training an AI model for predicting disease progression, but due to its lack of realism, the model may not be able to accurately predict said disease progression from the synthetic data.”
- Demis Hassabis, the co-founder of DeepMind, talks about how synthetic data was used in the AlphaFold discovery. The problem was that there weren’t enough known proteins in the world to create a large enough dataset to train the algorithm. He calls it “self-distillation,” or using AlphaFold’s predictions to make the training set bigger. “That was critical to AlphaFold working,” Hassabis said in a comprehensive and eye-opening interview on the Lex Fridman podcast.
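A heavily simplified, generic sketch of that self-distillation loop, not AlphaFold’s actual method: every function here is a placeholder supplied by the caller.

```python
def self_distill(train, predict, confidence, labeled, unlabeled,
                 threshold=0.9, rounds=3):
    """Generic self-distillation: train on labeled data, then fold the
    model's high-confidence predictions on unlabeled inputs back into
    the training set and retrain.

    `train`, `predict`, and `confidence` are caller-supplied callables;
    nothing here is specific to proteins or to AlphaFold.
    """
    data = list(labeled)
    model = train(data)
    for _ in range(rounds):
        # Pseudo-label the unlabeled pool, keeping only confident guesses.
        pseudo = [(x, predict(model, x)) for x in unlabeled
                  if confidence(model, x) >= threshold]
        data = list(labeled) + pseudo  # rebuild each round to avoid duplicates
        model = train(data)            # retrain on the enlarged set
    return model, data
```

The enlarged `data` is the “bigger training set” Hassabis describes; the quality of the confidence filter is what keeps the loop from degrading into the feedback spiral that model-collapse critics warn about.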