Artificial intelligence requires human intervention. As clever or resourceful as this class of technology may seem, there’s an inherent issue with AI that isn’t likely to be solved anytime soon: artificial intelligence uses up information far faster than we humans can produce it. And let’s not misunderstand the situation here: we create original content, while AI simply spins, reworks, infers, and finds information.
This issue plagues generative AI more than anything else. This category includes large language models such as ChatGPT and image generation software such as DALL-E. Language models like ChatGPT require a significant amount of training on all sorts of information. In fact, the utility itself makes this quite clear if your prompt asks about events that happened after 2021. You’ll get a neat little paragraph in which ChatGPT asserts that it was trained on specific information and can only answer questions or provide context concerning information and events prior to 2021.

However, the vast stores of information that these systems are trained on are what make them seem so knowledgeable and highly educated. There’s been a massive movement to train these systems on more and more data so that their perceived abilities and breadth of knowledge improve even further. But, unfortunately, we are running out of data.
A recent paper by researchers from Epoch provides further information about this issue. According to the paper, we could run out of training data for artificial intelligence as early as 2026. It’s not only the researchers at Epoch who feel this way, though. Teven Le Scao, a researcher at the AI company Hugging Face, has also said he fears we may soon run out of appropriate data with which to train further artificial intelligence initiatives.
Researchers at Epoch say that the problem stems primarily from the way that data is sorted. When data is selected to be used in AI training, it gets sorted into two piles: high-quality and low-quality. The line between the two piles is apparently very blurry. Nonetheless, there’s a difference between data that usually falls cleanly into the high-quality category and data that does not.
High-quality data is usually written by professional writers, while social media posts, 4chan rants, and general commentary most often qualify as low-quality data. The issue is that there is far more low-quality data than high-quality data, and AI researchers prefer to train artificial intelligence on high-quality data because of the intended result: people want to interact with high-quality content.
Some researchers assert that it may be time to reassess what qualifies as low-quality data. However, this carries the significant risk of ending up with chatbots riddled with biased human opinions and thoughts that are then passed on to whoever may be using the utility at the time. We’ll have to wait for researchers to battle this one out, but it certainly seems like AI’s longevity could be called into question soon. We may be looking at just another tech fad.