What Happens When AI Begins to Learn from AI-Generated Data?
The year is 1945. The Trinity Test, the first detonation of an atomic bomb, shatters the New Mexico desert, ushering in the nuclear age. While the world grapples with its devastating potential, an unforeseen consequence unfolds: radioactive isotopes permeate the environment, leaving a mark on even the most ordinary materials like steel.
This seemingly minor detail matters enormously for one particular kind of scientific endeavor: low-background experiments. For those seeking to unravel the universe’s deepest secrets, even trace amounts of radioactivity can pose a serious problem. Enter low-background metal, a precious resource from the pre-atomic era, boasting near-mythical purity.
Low-background metals were produced before the Trinity Test and the nuclear testing that followed, so they are untouched by the radioactive emissions of the modern world. Picture probing the subatomic realm with a particle detector, a sensitive instrument built to capture the faintest traces of radioactivity. Now picture its adversary: background radiation emanating from the detector’s own materials. This is where low-background metal excels, furnishing a pristine platform for these delicate measurements. Today, such metal is typically recovered from shipwrecks that sank before the first atomic test.
Contemporary Emissions
Before the internet era, human knowledge was meticulously documented in various forms such as books, newspapers, and journals. With technological advancements, this wealth of information transitioned into digital formats stored on electronic devices. The internet further democratized access to knowledge, making it readily available globally. Social media platforms amplified this accessibility, leading to the accumulation of vast amounts of human-generated data in diverse formats.
In this article, we’ll delve into the significance of understanding and preserving the data that underpins AI-generated content in our contemporary digital age.
The accumulation of massive data has provided a fertile ground for researchers to harness in the creation of Large Language Models (LLMs) such as GPT, Llama, and Gemini. This groundbreaking advancement enables the generation of fresh content in various formats — text, audio, or video — simply by supplying prompts to these models. Today, students use Generative AI as a helpful tool to tackle homework assignments, while authors leverage it to craft books, blogs, and articles. Researchers are also embracing AI to aid in writing scientific papers, streamlining the publication process, and accelerating knowledge dissemination. Looking ahead, the emergence of newer and more advanced multi-modal LLMs promises to unlock even greater potential, empowering creatives to produce sophisticated forms of art, photographs, music, videos, and perhaps even full-length feature films.
In today’s academic landscape, an increasing number of scholars are leveraging AI to augment their writing and research processes. While AI tools are not typically responsible for drafting entire papers, they serve as valuable assistants in enhancing content and streamlining workflows. It’s essential to recognize the benefits that AI brings to academic research, including increased efficiency and access to advanced analytical capabilities. Some authors have even credited AI models, such as ChatGPT, as co-authors, though this practice is not universally accepted within the academic community.
Despite the rigorous review process that research papers undergo prior to publication, there may still be concerns regarding the extent of AI involvement and its impact on authorship and the quality of research analysis. We’re beginning to witness the establishment of policies and procedures aimed at disclosing and reviewing AI-generated output before publication. Many publications now have policies in place regarding the use of AI tools, requiring authors to transparently acknowledge their utilization in the research process.
However, ensuring that all content creators on the internet consistently follow such practices remains an ongoing challenge. Within the expansive domain of social media, where content is often published without any rigorous review, ensuring both safety and factual accuracy is especially difficult.
As AI-generated content becomes increasingly prevalent, telling it apart from human-generated content becomes harder, raising the risk that human-generated data is contaminated. While AI may produce grammatically correct text, it can introduce factual inaccuracies through what is known as hallucination. Photographs and videos may appear realistic yet lack crucial details. Custom-trained models without the necessary safeguards could also generate content that is harmful or offensive to specific groups. The easy accessibility of AI tools raises further concerns about misuse, enabling the creation and spread of fake content that blurs the line between reality and fiction. As a result, the data used to train future models may not be as pristine as desired: without careful attention to what is being ingested, future AI models could inadvertently produce factually incorrect output.
The Curse of Recursion
In a recent study titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” the authors show that when AI is trained on data that models have previously generated, the quality of its output degrades. They argue that learning from data produced by other models leads to model collapse, a degenerative process in which models gradually forget the true underlying data distribution. Data generated by earlier models inadvertently pollutes the training set of subsequent models, and these later models, trained on corrupted data, develop a distorted perception of reality.
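The intuition can be illustrated with a toy simulation in the spirit of the paper’s single-dimensional Gaussian discussion: fit a simple model to data, then repeatedly refit each new generation only on samples drawn from the previous generation’s model. The sketch below is a minimal illustration of that idea; the sample size, generation count, and seed are arbitrary illustrative choices, not the paper’s experimental settings.

```python
import numpy as np

# A minimal sketch of the model-collapse intuition: a 1-D Gaussian stands in
# for a generative model. Each "generation" is fit only to samples produced
# by the previous generation's model.
rng = np.random.default_rng(seed=0)

true_mean, true_std = 0.0, 1.0   # the original, human-generated distribution
n_samples = 100                  # finite data per generation (the key ingredient)
n_generations = 500

mean, std = true_mean, true_std
for gen in range(n_generations):
    synthetic = rng.normal(mean, std, size=n_samples)  # train only on model output
    mean, std = synthetic.mean(), synthetic.std()      # refit the next generation
    if gen % 100 == 0:
        print(f"generation {gen:3d}: mean={mean:+.3f}, std={std:.3f}")

print(f"after {n_generations} generations: std={std:.3f} (started at {true_std})")
# The estimated spread drifts toward zero: each refit on a finite synthetic
# sample loses a little of the distribution's tails, the errors compound, and
# later generations "forget" the rarer parts of the original data.
```

The same compounding of sampling and approximation errors is what the paper argues happens, far less visibly, when large models are trained on web data that earlier models helped produce.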
To ensure the continued advancement and training of future models, the authors emphasize the importance of preserving human-generated data and propose methods for distinguishing it from AI-generated data. However, they raise concerns about the uncertainty surrounding how this distinction can be effectively made at scale. They advocate for organizations and communities involved in the creation of Large Language Models (LLMs) to collaborate on potential solutions.
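To make the preservation idea a little more concrete, one purely hypothetical approach is to attach provenance metadata to every record in a training corpus and filter on it, treating content created before a chosen cutoff, or content with attested human origin, as the “low-background” subset. The record fields, cutoff date, and labels in the sketch below are illustrative assumptions, not a method proposed by the paper.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical provenance record for a training-corpus document. The field
# names and the "low-background" cutoff date are illustrative choices only.
@dataclass
class CorpusRecord:
    text: str
    created: date         # when the content was originally published
    source: str           # e.g. "book_scan", "web_crawl", "user_upload"
    human_verified: bool  # whether a human origin was attested in some way

# Public LLMs became widely available around late 2022; treat anything older,
# or anything with attested human origin, as "low-background" training data.
LOW_BACKGROUND_CUTOFF = date(2022, 11, 30)

def is_low_background(record: CorpusRecord) -> bool:
    return record.created < LOW_BACKGROUND_CUTOFF or record.human_verified

corpus = [
    CorpusRecord("A paragraph from a 1998 encyclopedia.", date(1998, 5, 1), "book_scan", False),
    CorpusRecord("A 2024 blog post of unknown origin.", date(2024, 3, 12), "web_crawl", False),
    CorpusRecord("A 2023 article with a signed authorship statement.", date(2023, 7, 2), "web_crawl", True),
]

pristine = [r for r in corpus if is_low_background(r)]
print(f"{len(pristine)} of {len(corpus)} records qualify as low-background data")
```

At internet scale, the hard part is of course populating a field like human_verified reliably, which is precisely the open problem the authors highlight.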
Preserving the Past and Charting the Way Forward
Like low-background metals, pristine human-generated data may become a rare and valuable resource in the future. The quality of AI-generated content will certainly improve over time; in the meantime, drawing inspiration from GitHub’s Arctic Code Vault program, perhaps it is time we considered safeguarding human-generated data in a similar vault.
As we navigate the challenges that lie ahead, finding the right balance becomes increasingly important. While embracing the opportunities AI presents, we can hope that advances in technology will help solve the problems we have created. Perhaps AI will play a role in mitigating and resolving the challenges it has posed, or perhaps not. Only time will tell.
References:
- The Curse of Recursion: Training on Generated Data Makes Models Forget https://arxiv.org/abs/2305.17493
- Low-background metal: Pure, unadulterated treasure https://qz.com/emails/quartz-obsession/1849564217/low-background-metal-pure-unadulterated-treasure
- ACL 2023 Policy on AI Writing Assistance https://2023.aclweb.org/blog/ACL-2023-policy/
- ChatGPT listed as author on research papers: many scientists disapprove https://www.nature.com/articles/d41586-023-00107-z