AI has a big problem with data and copyright: what's going on

Morten Blichfeldt Andersen, director of the Danish publisher Praxis, was scanning the GPT Store (OpenAI's virtual shop where user-customized chatbots based on the same model as ChatGPT are available) when he found numerous bots, or GPTs, that had apparently been trained by illicitly using copyrighted material. Blichfeldt Andersen reported the matter to OpenAI, which removed the infringing bots (albeit only after the intervention of the associations Danske Forlag and Rights Alliance), and he does not rule out taking legal action against the company led by Sam Altman.

The event just described offers an opportunity to reflect on a concrete problem that companies developing AI models are already dealing with: the scarcity of quality data and information with which to train their artificial intelligence models.

How much and what data are needed to train artificial intelligence

To fully grasp the scope of this problem, we need to take a step back and understand how much data is needed for AI training. Although it is not known exactly how OpenAI and similar companies train their models, some industry experts have made estimates that we can consider plausible and that clearly highlight the data-scarcity problem. Among these are the analyses carried out by Pablo Villalobos of the Epoch Research Institute. According to the expert, training a large model like OpenAI's GPT-4 would have required something like 12 trillion tokens (a token corresponds roughly to a word or a portion of one).

If that figure seems enormous, you will change your mind when you learn that, according to current development trends and Villalobos's estimates, the next generation of the OpenAI model, GPT-5, may require between 60 and 100 trillion tokens, which is 10 to 20 trillion more than currently available quality resources can provide. In short, there would not be enough data to satisfy GPT-5's data "hunger". And it gets worse: this shortfall is estimated on the basis of the most "optimistic" scenario possible. It is therefore clear how the problem of data scarcity can impact the future development of large language models (LLMs).
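To make the gap concrete, here is a minimal back-of-the-envelope sketch. The figures are simply the estimates attributed to Villalobos above, not official numbers:

```python
# Rough arithmetic on the training-data gap, using the estimates
# reported above (illustrative only, not official figures).

GPT4_TOKENS = 12e12        # ~12 trillion tokens reportedly used for GPT-4
GPT5_TOKENS_LOW = 60e12    # lower bound of the GPT-5 estimate
GPT5_TOKENS_HIGH = 100e12  # upper bound of the GPT-5 estimate
SHORTFALL_LOW = 10e12      # estimated gap vs. available quality data
SHORTFALL_HIGH = 20e12

def trillions(n: float) -> float:
    """Express a token count in trillions."""
    return n / 1e12

# Growth factor from one model generation to the next
growth_low = GPT5_TOKENS_LOW / GPT4_TOKENS    # 5x
growth_high = GPT5_TOKENS_HIGH / GPT4_TOKENS  # ~8.3x

# Share of the low-end requirement that quality data cannot cover
gap_share_low = SHORTFALL_LOW / GPT5_TOKENS_LOW

print(f"GPT-5 would need {trillions(GPT5_TOKENS_LOW):.0f}-"
      f"{trillions(GPT5_TOKENS_HIGH):.0f}T tokens "
      f"({growth_low:.0f}-{growth_high:.1f}x GPT-4's ~12T), "
      f"with roughly {gap_share_low:.0%} of even the low estimate unmet.")
```

Even in the best case, the requirement jumps five-fold in a single generation, while the pool of quality data grows far more slowly.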

In fact, it is probably impossible to have free access to all the quality material currently available in order to "feed" it to the algorithms that need to be trained. Access to this data is often blocked precisely because of copyright issues, like those raised by Praxis, but also by various newspapers. One above all, the New York Times, which last December filed a lawsuit against OpenAI claiming that millions of its articles «were used to train chatbots that now compete with» the newspaper itself.


What are the possible technical and legal solutions for using data

To ensure the adequate development of AI, it is necessary to find technical and legal solutions for the collection and use of data to train the next generations of LLMs.

On the technical front, some companies are experimenting with synthetic data (i.e., data generated ad hoc) produced by advanced artificial intelligence models, which could help address the shortage of quality data. The generation of synthetic data uses two AI models: conceptually, one acts as a "creator" of content (textual and visual), retrieving information from the Web; the other evaluates the content produced, assessing its quality. On paper, the combination of two models specialized in the two phases of the data-generation work (generation and evaluation) could satisfy the information hunger of the models to be trained in a relatively short time.
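The two-model setup described above can be sketched as a simple generate-then-filter loop. Everything below is hypothetical: `generate_candidate` and `score_quality` are toy stand-ins for the real generator and critic models, and the 0.5 acceptance threshold is an arbitrary choice for illustration.

```python
import random

# Toy stand-ins for the two models described above: a "creator" that
# produces candidate content, and a critic that evaluates it.
# A real system would call actual AI models at both steps.

TEMPLATES = [
    "The capital of {x} is well documented.",
    "{x} is mentioned in many reference works.",
    "{x}",  # a deliberately low-quality candidate
]

def generate_candidate(rng: random.Random, topic: str) -> str:
    """Creator model (stand-in): produce one synthetic text sample."""
    return rng.choice(TEMPLATES).format(x=topic)

def score_quality(text: str) -> float:
    """Critic model (stand-in): score a sample in [0, 1].
    Here we just reward longer, sentence-like text."""
    score = min(len(text.split()) / 8, 1.0)
    return score if text.endswith(".") else score * 0.5

def synthesize(topic: str, n: int, threshold: float = 0.5,
               seed: int = 0) -> list[str]:
    """Keep only candidates the critic rates at or above the threshold."""
    rng = random.Random(seed)
    accepted: list[str] = []
    while len(accepted) < n:
        candidate = generate_candidate(rng, topic)
        if score_quality(candidate) >= threshold:
            accepted.append(candidate)
    return accepted

data = synthesize("France", n=3)
print(data)
```

The design point is the separation of roles: the generator is free to produce noisy candidates because the critic acts as a quality gate before anything enters the training set.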

Be careful, though: synthetic data generation is not a panacea. AI models can introduce errors and biases into the data they generate, producing inconsistent or nonsensical results (known in jargon as gibberish), which in turn can cause a phenomenon known as model collapse.
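Model collapse can be illustrated with a toy simulation (an assumption of this sketch, not an example from the article): repeatedly fit a trivial "model" (a Gaussian) to samples drawn from the previous fit. Finite-sample estimation errors compound across generations, and the learned distribution degenerates.

```python
import random
import statistics

def fit_and_resample(samples: list[float], rng: random.Random) -> list[float]:
    """'Train' a trivial model (fit a Gaussian) on the data, then
    generate the next generation's training set from that model."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return [rng.gauss(mu, sigma) for _ in samples]

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]  # real data: N(0, 1)
initial_spread = statistics.stdev(data)

# Each generation trains only on the previous generation's synthetic output.
for generation in range(2000):
    data = fit_and_resample(data, rng)

final_spread = statistics.stdev(data)
print(f"spread: {initial_spread:.3f} -> {final_spread:.3g}")
```

After many generations the spread collapses toward zero: the model ends up reproducing a narrow caricature of the original data, a stylized version of what happens when models are trained on their own output.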

This is why it is also necessary to work on the legal front, for example by refining the definition of copyright and introducing new protections and rules for content creators, users, and companies involved in the development of AI. Some of these companies, including OpenAI itself, faced with the scarcity of quality information with which to train their models, are evaluating the creation of real data markets, where the value of the information used to train models can be recognized and rewarded.


Digital Agenda · Andrea Villiotti · Photo: Morten Blichfeldt Andersen (via LinkedIn)