Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft
In addition to its large collection of books, the Institutional Data Initiative is partnering with the Boston Public Library to scan millions of articles from various newspapers now in the public domain, and it has expressed an interest in establishing similar collaborations in the future. Exactly how the book collection will be released has not yet been determined. The Institutional Data Initiative has asked Google to cooperate on public distribution, but details are still being finalized. Kent Walker, Google’s president of global affairs, said in a statement that the company was “proud to support” the initiative.
Regardless of how IDI’s dataset is released, it will join many similar projects, startups, and initiatives that promise to give companies access to a wealth of high-quality AI training materials without the risk of running into copyright issues. Companies like Calliope Networks and ProRata have already emerged to issue licenses and design compensation schemes that allow creators and rights holders to be paid for providing AI training materials.
There are other new public domain projects as well. Last spring, French artificial intelligence startup Pleias rolled out its own public domain collection, Common Corpus, which contains an estimated 3 to 4 million books and journal collections, according to project coordinator Pierre-Carl Langlais. Supported by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced it would release the first set of large language models trained on the dataset, which Langlais told Wired are the first “exclusively trained on open data and consistent with [EU] Artificial Intelligence Act.”
Efforts are also underway to create similar image collections. This summer, AI startup Spawning launched its own project, called Source.Plus, which contains public domain images from Wikimedia Commons and various museums and archives. Several important cultural institutions, such as the Metropolitan Museum of Art in New York, have long made their archives available to the public as standalone projects.
Ed Newton-Rex, a former senior director at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, said the rise of these datasets shows there is no need to steal copyrighted material to build high-performing, high-quality AI models. OpenAI previously told UK lawmakers it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain datasets like this further undermine the ‘necessity defense’ used by some AI companies to justify scraping copyrighted works to train their models,” Newton-Rex said.
But he still has reservations about whether IDI and similar projects will really change the status quo of AI training. “These datasets will only have a positive impact if they are used (perhaps in combination with other licensed material) to replace scraped copyrighted works. If they are simply added to the mix, becoming one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they will overwhelmingly benefit artificial intelligence companies,” he said.
Update Dec. 12, 11:18 a.m. ET: This story has been updated with comments from Google.
2024-12-12 14:06:28