City Voice (ES)

There is no doubt that the potential of generative AI is enormous. It has rightly been presented as a capability that could usher in a new, technology-based era full of benefits for humanity. It can speed up mundane tasks at work, aid medical breakthroughs and analyze patterns in ways that Alan Turing and the Bletchley Park codebreakers could only have dreamed of.

At the recent AI Safety Summit, much was said about the threats posed by ‘Frontier AI’. There was some criticism that the talks focused too much on future threats and not enough on clear and present ones. Those cited include large-scale job losses, as AI takes over tasks previously performed by humans, and biases in the data on which AI is trained, which could lead to biased human decision-making. These are important concerns, but there is another potential danger: the risk of model collapse. And the key to that is data.

Model collapse occurs when generative AI becomes unstable, wholly unreliable or simply stops working. This happens when generative models are trained on AI-generated content, or “synthetic data”, rather than human-generated data. As time goes on, “models begin to lose information about less common but still important aspects of the data, producing less diverse outputs.”
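The loss of "less common" information can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not how any real generative model is trained: the "model" here is simply a table of token frequencies, the "human" corpus is drawn from a hypothetical long-tailed distribution, and the names `train`, `generate` and `simulate` are invented for illustration. Each generation is trained only on the output of the previous one.

```python
import random
from collections import Counter


def train(corpus):
    """'Train' a toy model: estimate token frequencies from the corpus."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return tokens, weights


def generate(model, n, rng):
    """Sample n tokens from the toy model's learned distribution."""
    tokens, weights = model
    return rng.choices(tokens, weights=weights, k=n)


def simulate(generations=10, corpus_size=500, vocab_size=100, seed=0):
    """Return the number of distinct tokens in each generation's corpus."""
    rng = random.Random(seed)
    # Hypothetical "human" data: token i has weight 1/(i+1), a long tail.
    vocab = list(range(vocab_size))
    human_weights = [1.0 / (i + 1) for i in vocab]
    corpus = rng.choices(vocab, weights=human_weights, k=corpus_size)

    diversity = [len(set(corpus))]
    for _ in range(generations):
        model = train(corpus)                        # learn from current data
        corpus = generate(model, corpus_size, rng)   # next corpus is synthetic
        diversity.append(len(set(corpus)))
    return diversity


if __name__ == "__main__":
    print(simulate())
```

In this sketch a token that fails to appear in one generation's output can never reappear in a later one, so the count of distinct tokens can only stay level or fall. Rare tokens are the most likely to be dropped, which mirrors the loss of diversity described in the quotation above.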

There are many scenarios in which AI models, including well-known tools like ChatGPT, can collapse, but almost all of them relate to the data on which the models are trained.

Although much is made of the vast scale of data used for this purpose, we do not know enough about the origins and lineage of that data. We do know, however, that most of it has not been assured for AI use and so cannot be trusted. The risk of model collapse is therefore significant.

The lack of knowledge about whether training data can be trusted is problematic, but the problem is multiplied when you consider how AIs work and how they ‘learn’. LLMs use a variety of sources including news media, academic papers, books and Wikipedia. They work by training on large amounts of text data to learn patterns and associations between words, allowing them to understand and generate coherent and contextually relevant language based on the input they receive. They can answer any question, from how to build a website to how to treat a kidney infection. The assumption is that such advice or answers will become better and more nuanced over time as the AI learns, the technology advances and more data is used for training. However, if the data feeding generative AI exaggerates some characteristics and downplays others, existing biases and prejudices will be exponentially magnified. Furthermore, if the data lack specific domains or diverse perspectives, the model may demonstrate limited understanding. Several further issues contribute to this decline.

For example, consider a future in which news reports are partly or wholly written by AI, and Wikipedia articles are written or edited by AI, or with input from it, and you can see the beginnings of a cycle that may lead to model collapse. When an AI is subsisting on a diet of AI-flavored content, the quality and diversity of that content is likely to diminish over time.

Reported problems with ChatGPT have already been discussed and researched, particularly how its ability to write code may have gotten worse rather than better. This may be because the AI has been trained on data from sources like Stack Overflow, while users have been contributing to the programming forum using answers they received from ChatGPT. Stack Overflow has now banned the use of generative AI in questions and answers on its site.

If we have become dependent on AI that is itself degraded by AI-generated content, model collapse could have serious consequences, ranging from job or financial losses to increased bias and data breaches. As noted, the solutions also lie in the data. The first is a robust AI data infrastructure, which ensures that new data comes from trusted sources that do not use regurgitated or recycled data that pollutes models. This is an opportunity to create better, stronger, more stable AI that will benefit society in the long run. Second, we need openness about the fine details of the data sources used in training, so that expert users can assess them.

This will help strengthen the models, increase trust and promote collaboration. Third, continued research is needed on the effects of omitting or removing particular data from models, and on what impact this has on output quality. We are currently seeing an increasing variety of models of different compositions and sizes, for example smaller models for specialist applications.

Such models may have highly specific applications or functional areas, may use data sets that have been evaluated against data ethics standards, and may be developed with collaborative oversight and human feedback, whether from medical professionals, statisticians or software engineers. This will lead to greater data literacy, which, as I have recently argued, is essential in a world where AI is not going away. It also means we need new data leaders who can successfully chart a path through the early stages of our day-to-day interactions with generative AI.

Ultimately, we need a strong commitment to representing and encoding the provenance of data. If content is machine-generated, it should bear that imprint. I have long argued that this is an important part of steering our AI-augmented future: “A thing should say what it is and be what it says.” Not least because the authors of the original model collapse paper believe that “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”
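One way to picture such an imprint is as metadata carried alongside each piece of content. The following is a minimal sketch only: the `Record` fields, the `select_training_data` function and the example source names are all hypothetical, and no existing provenance standard or tool is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A piece of content plus hypothetical provenance metadata."""
    text: str
    source: str              # where the content came from (illustrative)
    machine_generated: bool  # the "imprint": was this written by a model?


def select_training_data(records, trusted_sources):
    """Keep only human-written records that come from trusted sources."""
    return [
        r for r in records
        if not r.machine_generated and r.source in trusted_sources
    ]


if __name__ == "__main__":
    records = [
        Record("Hand-written forum answer", "forum.example", False),
        Record("Answer pasted from a chatbot", "forum.example", True),
        Record("Unattributed scraped page", "unknown.example", False),
    ]
    kept = select_training_data(records, trusted_sources={"forum.example"})
    for r in kept:
        print(r.text)
```

A filter like this is only as good as the labels it reads, which is why the imprint has to be applied honestly at the point where the content is created.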

AI is far more likely to be a useful friend than a destructive foe, but as our dependence on it grows, we need to scrutinize its development. That way, we can work with it thoughtfully and intelligently, rather than becoming frustrated when not everything works out as we expected.

Sir Nigel Shadbolt is Executive Chairman of the Open Data Institute, which he co-founded with Sir Tim Berners-Lee. He is also Principal of Jesus College, Oxford, Professor of Computer Science at the University of Oxford, and Visiting Professor of Artificial Intelligence at the University of Southampton.

