Datasets containing millions of music tracks are openly accessible, often without proper authorization.
This widespread availability of unauthorized content has recently come under scrutiny.
Alex Reisner, a reporter for The Atlantic, has identified four music datasets used to train artificial intelligence models and made them publicly searchable. Two of them are very large, containing 12 million and 9 million tracks respectively. The other two are smaller but still hold more than 100,000 songs each.
Reisner's findings indicate that these datasets have been downloaded thousands of times. Identifying every user is not feasible, but both Google and Stability have acknowledged using the datasets in research papers. Some of the source material, such as the Free Music Archive dataset, can be streamed for free for personal enjoyment but requires a license for any commercial use.
Although these datasets are nominally free to obtain online, using them to train an AI model involves more than downloading the files and feeding them in. Reisner explains the underlying challenges:
"Three of the datasets I identified are disseminated as compilations of links to songs hosted on platforms like YouTube or Spotify. AI developers then acquire the actual audio content through automated tools, some of which are capable of circumventing login requirements, advertisements, and various mechanisms designed to generate revenue or subscribers for content creators. The use of such tools constitutes a clear violation of these platforms' terms of service."
The list of artists whose work appears in these datasets is long. It includes pop icons such as Lady Gaga and Fred Again.., rock legends like Radiohead and Bruce Springsteen, electronic pioneer Aphex Twin, the influential Wu-Tang Clan, and experimental composer Hainbach. Readers can visit The Atlantic's AI Watchdog site to search the songs, books, and other media currently being used to train AI models.
The Editorial Staff at AIChief is a team of professional content writers with extensive experience in AI and marketing. Founded in 2025, AIChief has quickly grown into the largest free AI resource hub in the industry.
