Elon Musk has echoed the concerns of AI experts that the world has exhausted the real-world data required to train AI models.
During a recent livestream with Stagwell chairman Mark Penn on X, Musk stated that “the cumulative sum of human knowledge” had been used up by last year.
This aligns with statements made by former OpenAI chief scientist Ilya Sutskever at the NeurIPS conference in December, where he warned that the AI industry had hit “peak data” and suggested a shift away from current training methods.
Musk further suggested that the future of AI model development lies in synthetic data, that is, data generated by AI models and then used to train other AI systems.
This data, Musk believes, could allow AI models to self-learn and adapt without the need for new real-world data.
The trend of using synthetic data is already underway, with major companies like Microsoft, Meta, OpenAI, and Anthropic incorporating it into their AI training processes.
Gartner had forecast that by 2024, 60% of the data used for AI projects would be synthetically generated. The benefits of synthetic data include significant cost savings.
For example, AI startup Writer developed its Palmyra X 004 model using almost entirely synthetic data, saving millions in development costs.
However, the approach carries risks. Training on synthetic data can lead to “model collapse,” in which AI systems become less creative and more biased because they inherit the limitations of the data they were trained on.
If the synthetic data reflects biases present in its sources, the AI’s outputs can amplify them, with potentially harmful consequences.
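The loss of variety behind model collapse can be illustrated with a toy simulation. The sketch below is not how any of the companies mentioned train their models; it assumes a deliberately simple stand-in in which the "model" is just a Gaussian fitted to its training data, and each generation is trained only on samples produced by the previous one. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(n_samples=10, generations=200, seed=0):
    """Return the spread (standard deviation) of the data at each generation."""
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" a toy model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # The next generation sees only samples produced by that model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: spread = {spreads[gen]:.3g}")
```

With small samples and no fresh real-world data, the estimated spread tends to shrink from one generation to the next, so later generations produce an ever narrower range of outputs. The toy model only captures this narrowing; it does not model the amplification of bias.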