This is where the data to build AI comes from
Their findings, shared exclusively with MIT Technology Review, show a worrying trend: AI data practices threaten to concentrate power overwhelmingly in the hands of a few dominant technology companies.
Shayne Longpre, an MIT researcher involved in the project, said that in the early 2010s, data sets came from a variety of sources.
They came not just from encyclopedias and the web, but also from sources such as parliamentary records, earnings calls, and weather forecasts. At the time, Longpre said, AI data sets were collected and curated from different sources specifically to fit individual tasks.
Then transformers, the architecture underpinning language models, were invented in 2017, and the field of artificial intelligence began to see that the larger the models and data sets, the better the performance. Today, most AI data sets are built by indiscriminately scraping material from the internet. Since 2018, the web has been the dominant source of data sets across all media, such as audio, images, and video, and a gap between scraped data and more curated data sets has emerged and widened.
“In foundation model development, nothing seems to matter more for capabilities than the scale and heterogeneity of the data and the web,” Longpre said. The need for scale has also massively boosted the use of synthetic data.
The past few years have also seen the rise of multimodal generative AI models, which can generate videos and images. Like large language models, they need as much data as possible, and the best source for that has become YouTube.
For video models, as you can see in this chart, more than 70% of the data in both speech and image data sets comes from a single source.
This could be a boon for Alphabet, Google's parent company, which owns YouTube. Whereas text is distributed across the web and controlled by many different websites and platforms, video data is extremely concentrated in one platform.
2024-12-18 10:50:10