Copyright, GenAI, and the future of academic publishing

Generative artificial intelligence poses significant challenges to copyright law and the principles of open science. In a new preprint, I examine this complex interplay and the regulatory frameworks that currently govern it. This blog post provides an overview of my central findings.

I am pleased to share a new preprint that has just been posted on arXiv: “Who Owns the Knowledge? Copyright, GenAI, and the Future of Academic Publishing.” The text grew from my contribution to the 20th International Conference on Scientometrics & Informetrics and has now been substantially expanded in terms of both legal analysis and policy implications.

My paper asks a deceptively simple question: who is best placed to regulate the use of research output for training generative AI systems? I argue that current copyright frameworks in the major AI-developing jurisdictions – the United States, the European Union, the United Kingdom, and China – are not equipped to answer this question in a way that satisfies the academic community. Instead, they tend to privilege either innovation narratives or risk management, while the use of scholarly works as training data remains in a legal grey zone. Key international initiatives such as the OECD AI Principles, the UNESCO Recommendation on the Ethics of Artificial Intelligence, and the G7 Hiroshima Process International Guiding Principles for Advanced AI System do not provide satisfactory answers either.

Regulation and licensing gaps

A central part of my paper is an in-depth examination of how existing copyright and licensing instruments function once large language model (LLM) training enters the picture. Widely used Creative Commons licenses, especially CC BY, work reasonably well for human-to-human reuse, but they were never designed to govern large-scale, machine-driven ingestion of research outputs. Creative Commons’ new CC Signals initiative is perhaps the most ambitious attempt so far to repair this mismatch. It has promising features – machine- and human-readable preference signals, a notion of reciprocity, community-driven development – but also unresolved problems around enforceability, retroactive application to the vast existing CC corpus, and the risk of fragmented adoption across jurisdictions and platforms.
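To make the idea of a machine-readable preference signal concrete, the sketch below shows how a repository might publish, and a compliant crawler might check, a declaration covering AI training. Because CC Signals has not yet settled on a finalized format, the JSON layout, field names, and the check_signal helper here are hypothetical illustrations of the general mechanism, not the actual specification.

    # Hypothetical sketch of a machine-readable reuse-preference signal.
    # CC Signals has no finalized wire format; the JSON layout, field
    # names, and helper below are illustrative assumptions only.
    import json

    # A declaration a repository might publish alongside an article.
    declaration = json.loads("""
    {
      "work": "https://example.org/articles/12345",
      "license": "CC BY 4.0",
      "ai_training_signal": {
        "allowed": true,
        "conditions": ["credit"],
        "declared": "2025-01-15"
      }
    }
    """)

    def check_signal(decl):
        """Return whether the declared preferences permit AI training,
        and under which conditions."""
        signal = decl.get("ai_training_signal", {})
        if not signal.get("allowed", False):
            return False, []
        return True, signal.get("conditions", [])

    ok, conditions = check_signal(declaration)
    print(f"training permitted: {ok}, subject to: {conditions}")

The enforceability problem is visible even in this toy: nothing in the mechanism itself compels a crawler to perform such a check before ingesting the work.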

On the regulatory side, my paper traces recent developments in several key jurisdictions. In the United States, I follow the trajectory from Executive Order 14110 and the proposed Generative AI Copyright Disclosure Act of 2024, which at least tried to address training-data transparency, to the shift under Executive Order 14179 and the 2025 AI Action Plan, which effectively sidelines copyright concerns in favour of “American AI leadership”. In contrast, the UK’s Artificial Intelligence (Regulation) Bill is the first serious attempt by a major jurisdiction to mandate disclosure of the third-party data and intellectual property used for training, together with explicit assurances of informed consent and copyright compliance. China’s AI rules and the EU’s AI Act and AI in Science strategy prioritize risk and content control while remaining largely silent on the core copyright question of training on scholarly works.

A major question in the US context is whether training GenAI models on scholarly outputs qualifies as fair use, that is, as a statutory limitation that exempts such copying from copyright liability. I contend that reliance on this doctrine is problematic. GenAI models are commercial, or at least potentially commercial; their outputs can be understood as derivative works in many situations; and opaque training on paywalled and open science content undermines long-standing academic norms of attribution and credit. As I argue in my paper, authors should have the right to refuse the use of their work for training GenAI models or particular classes of models. At the same time, enforcement will not be easy. Open datasets, non-transparent training pipelines, and the impossibility of “untraining” models raise difficult questions: once an LLM has been trained on data, it is technically infeasible to remove the influence of a single work from the trained model.
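To illustrate why untraining is so hard, here is a minimal numpy sketch under heavily simplified assumptions: a toy linear model trained by stochastic gradient descent on synthetic “documents”, standing in for an LLM trained on scholarly works. The model, data, and training loop are illustrative assumptions, not how any production system works.

    # Toy illustration of why removing one work's influence after
    # training is infeasible without retraining: interleaved SGD updates
    # smear every document's contribution across all parameters.
    import numpy as np

    rng = np.random.default_rng(0)

    def train(docs, steps=500, lr=0.01):
        """Fit a tiny linear model w by SGD on (x, y) 'documents'."""
        w = np.zeros(3)
        for t in range(steps):
            x, y = docs[t % len(docs)]
            grad = 2 * (w @ x - y) * x  # squared-error gradient, one doc
            w -= lr * grad              # update mixes into every weight
        return w

    # Five synthetic 'documents'; doc 0 stands in for one scholarly work.
    docs = [(rng.normal(size=3), rng.normal()) for _ in range(5)]

    w_full = train(docs)         # trained on everything
    w_without = train(docs[1:])  # retrained from scratch without doc 0

    # The weights differ in every coordinate: doc 0's influence is not
    # stored as a separable component that could be subtracted from
    # w_full after the fact. Exact removal requires retraining.
    print(w_full - w_without)

At the scale of a real LLM, with billions of parameters and trillions of training tokens, the from-scratch retraining that this toy performs in milliseconds is precisely what is economically and technically out of reach.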

A way forward

Universities and public research funders are not just bystanders. They act as publishers and repository owners, run their own GenAI projects, and have a unique capacity to translate ethical concerns into institutional policies. I therefore suggest that they support the development and conditional adoption of CC Signals, negotiate rights-retention language that explicitly addresses AI training, and advocate for the international harmonization of copyright standards for training datasets, possibly following the TRIPS model. None of this replaces state regulation, of course, but without academic institutions taking a clearer stance, there is a real risk that the emerging governance of “academic AI” will be shaped almost exclusively by commercial actors.

If we are to deal sensibly with the opportunities and challenges of GenAI, more research is urgently needed. This includes more systematic comparative legal analyses, empirical studies of the economic impact on rights holders, and research on how specific technical architectures interact with legal concepts such as substantial similarity and transformative use.

If you are interested in the intersection of generative AI, copyright, open science, and scholarly communication, I would very much welcome your comments and criticisms on my paper.

AI Contribution Disclosure: The author used DeepSeek R3 for limited assistance with grammatical restructuring and lexical optimization of the content. The human author maintained full oversight throughout this process, and all AI-generated outputs were subsequently verified, contextually adjusted, and substantively edited. Final responsibility for the content rests exclusively with the human author.

Header image by Fabrikasimf on Freepik.