OpenAI confirms that it is impossible to train ChatGPT… without stealing data

Generative tools have a plagiarism problem that is difficult to solve.

January 9, 2024
Updated: July 2, 2025 at 12:15 AM

Training an artificial intelligence is no simple task. Large generative models like ChatGPT or DALL-E need gigantic datasets to improve their capabilities and results, and those datasets sometimes include copyrighted material. According to OpenAI, the company behind ChatGPT, this is a necessary evil: it claims that it would be “impossible” to create such high-level neural networks without using copyrighted material.

The public availability and extreme popularity of generative models have left legislation lagging behind, with no settled answer on how to proceed in these cases. In an inquiry into the risks and opportunities presented by these tools, carried out by the UK House of Lords Communications and Digital Committee, OpenAI admitted that its models require copyrighted material to function.

With this, the company has confirmed what was already an open secret. Anyone who tries these tools will find that it is not very difficult to recreate scenes from famous movies or passages from existing writings. But are these practices legal? To this day, the question continues to generate a lot of controversy.

A report published in IEEE Spectrum states that Midjourney and DALL-E 3, two of the most popular image generation models, can recreate existing movie and video game scenes almost exactly. Its co-authors, Gary Marcus (AI expert) and Reid Southen (digital illustrator), conclude with near certainty that both Midjourney and OpenAI trained their generative models on protected works.

All these images were generated with Midjourney

For OpenAI, the explanation is simple: “As copyright covers virtually all types of human expression today […] it would be impossible to train the leading AI models without using copyrighted material.” On the other hand, OpenAI has offered to indemnify customers who face copyright claims, as long as they have not knowingly generated infringing works. So if I ask DALL-E to recreate a scene from a protected movie exactly, I would not be entitled to that protection.