A new class-action lawsuit accuses OpenAI and partner Microsoft of infringing on works by non-fiction authors, the latest in a string of legal actions against artificial intelligence companies.
It comes amid turmoil at OpenAI, where most of the startup’s nearly 800 employees have threatened to quit if ousted CEO Sam Altman doesn’t return to his role. He was fired by OpenAI’s board on Friday and announced Sunday that he would join Microsoft, whose CEO Satya Nadella told CNBC on Monday he is looking to partner with Altman in whatever form that takes.
The lawsuit against the two companies, filed Tuesday in federal court in the Southern District of New York, makes similar arguments to other allegations that AI companies used copyrighted works in massive training sets employed to build tools like ChatGPT.
The lead plaintiff in the suit, which hasn’t previously been reported, is Julian Sancton, the author of Madhouse at the End of the Earth. He spent five years and tens of thousands of dollars writing the book, according to the lawsuit.
“The commercial success of the ChatGPT products for OpenAI and Microsoft comes at the expense of non-fiction authors who haven’t seen a penny from either defendant,” said Susman Godfrey partner Justin Nelson, the lead attorney representing Sancton.
OpenAI doesn’t disclose what data it used to train GPT-4, its most advanced large language model, but lawyers for Sancton say ChatGPT divulged the secret. “In the early days after its release, however, ChatGPT, in response to an inquiry, confirmed: ‘Yes, Julian Sancton’s book “Madhouse at the End of the Earth” is included in my training data,’” the lawsuit reads.
One way this lawsuit differs from others is that it ropes in Microsoft, which did not decide what training data to use in OpenAI’s models or even design the models itself. Rather, Microsoft provided the infrastructure for training and running them.
The models are now core to Microsoft’s business and have given its stock price a boost, the suit points out.
“Microsoft would have known that OpenAI’s training data was scraped indiscriminately from the internet and included a massive quantity of pirated and copyrighted material, including a trove of copyrighted nonfiction works,” the suit alleges.
The companies didn’t immediately respond to requests for comment.
Last week, Stability AI’s vice president of audio, Ed Newton-Rex, resigned in protest over the company’s stance on copyrighted works. (It was OK with using them.)
Famous fiction authors like Jonathan Franzen and John Grisham sued OpenAI earlier this year for copyright infringement. Sarah Silverman and other authors are also suing Meta on the same grounds. Several other lawsuits are making their way through the courts.
AI companies have argued that using copyrighted works in training data constitutes “fair use” of the material. In essence, they say, computers are “learning” from the copyrighted works, just as humans learn when they read.
Sancton’s attorneys argue it’s not the same thing. “While OpenAI’s anthropomorphizing of its models is up for debate, at a minimum, humans who learn from books buy them, or borrow them from libraries that buy them, providing at least some measure of compensation to authors and creators,” the lawsuit said.
It alleges that OpenAI deliberately conceals its training sets to hide the copyrighted works it uses. “Another reason to keep its training data and development of GPT-3, GPT-3.5, and GPT-4 secret: To keep rightsholders like Plaintiff and members of the Class in the dark about whether their works were being infringed and used to train OpenAI’s models,” the lawsuit argues.
AI copyright law will surely make its way to the U.S. Supreme Court. The fundamental question: If an AI model is not actually reproducing a protected work, is the fact that the model learned from that work a technical violation of copyright?
If AI companies pay for a copyrighted work, say by buying a book, can they legally use it to train an AI model, or do they need to license the material from the owner of the copyright?
There’s also a purely moral question: Even if it turns out the AI companies are right, and training AI models with copyrighted material constitutes fair use, should they?
This is a very thorny one. I am the author of a non-fiction book that is almost surely in the training sets for these models and I don’t really have a problem with it. I don’t think large language models will ever really pose competition for books. A book is a lot more than a bunch of words.
What I find upsetting is that there are places people can pirate the book online and read it for free. Nobody seems outraged by that, though.
I also think that we have all contributed to this technology in one way or another; it’s trained on basically the entire internet.
Even if AI companies compensated me for the use of the book, what would it be worth? A few cents? I do, however, think that if AI companies use my book in their training data, they should at least be required to buy a copy. Otherwise, that’s just plain old piracy.
The third point is how technology is moving beyond the copyright issue already. As we’ve reported, the newest small models in generative AI are trained using synthetic data created by the larger models.
And companies like OpenAI are hiring other companies like Scale AI to create content from scratch, specifically to train new AI models.
At some point, there may be a proliferation of generative AI models that contain no problematic material at all.
Room for Disagreement
Ed Newton-Rex argues in this article that what AI companies are doing is wrong: “Setting aside the fair use argument for a moment — since ‘fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong. Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works. I don’t see how this can be acceptable in a society that has set up the economics of the creative arts such that creators rely on copyright.”