
A landmark study from researchers at Stony Brook University, Carnegie Mellon University, and Columbia Law School has produced findings that could significantly erode one of the technology industry’s most relied-upon legal arguments in its ongoing battles with publishers and rights holders. The research demonstrates that artificial intelligence chatbots are capable of reproducing entire books almost verbatim, casting serious doubt on industry claims that these systems merely learn linguistic patterns rather than retain copies of source material.
OpenAI and Google have both maintained, across a series of legal and regulatory proceedings, that their AI systems do not directly store or memorise the text used during their development. The central argument has been that large language models process written material solely to acquire an understanding of language structure and patterns, rather than to archive the content itself. The new study strikes directly at the credibility of that position.
In the experiment, researchers trained AI chatbots to expand short plot summaries of published works into full passages of prose. The results were striking: the regurgitated texts matched the original published works at a rate of between 85 and 90 per cent. Among the authors whose works were reproduced were Kazuo Ishiguro, known for Never Let Me Go, and Margaret Atwood. Critically, the chatbots were able to reproduce complete passages even when the original text was entirely absent from the prompt, leading the researchers to conclude that the models may possess what they termed “latent memorisation” of copyrighted works.
The legal implications are considerable. The study’s authors noted that evidence of memorisation could “undermine claims of transformative use and demonstrate market harm” to publishers, describing the question of whether AI systems can recall published text as a “pivotal” issue in active litigation. In one high-profile dispute involving the New York Times, OpenAI had previously characterised any verbatim reproduction of text by its chatbot as a “rare bug” that the company was working to eliminate. The new findings sit uneasily alongside that characterisation.
Ed Newton-Rex, a campaigner for creative rights and former AI industry executive, was unequivocal in his assessment of the research. He stated that the study “takes a sledgehammer to AI companies’ ludicrous claims that their models only learn patterns and statistics from the creative work they are trained on,” and described the situation as “the largest and most unjust exploitation of other people’s creativity in history.” His remarks reflect a growing sentiment among rights holders that the legal and regulatory frameworks governing AI training data remain inadequate.
Publishers have pursued a series of lawsuits against major technology companies, alleging that the use of their content to develop commercial AI applications constitutes a breach of copyright law. These cases are working their way through the US court system at varying stages. One notable precedent emerged from proceedings involving Anthropic, the AI laboratory behind the Claude chatbot. A judge found that Anthropic’s use of books during model development did not infringe US copyright law, and that the outputs generated by the chatbot were sufficiently “transformative” to withstand legal challenge. Nonetheless, the same proceedings revealed that Anthropic had downloaded and retained a library of more than seven million pirated books, a finding that prompted the company to settle claims with affected authors in a deal valued at USD 1.5 billion.
The broader significance of the Stony Brook-led research lies in its potential to recalibrate the legal landscape for AI developers. If courts accept the evidence of latent memorisation as proof of direct copyright infringement rather than incidental pattern learning, the transformative use defence, which has thus far offered some measure of protection to AI companies, may prove far less durable than previously assumed. For investors with exposure to the major AI platform developers, the evolving litigation risk warrants close attention, particularly as settlements of the scale seen in the Anthropic case begin to establish precedent for the costs associated with training data disputes.
Google and OpenAI were contacted for comment but had not responded at the time of publication.
The following content has been published by Stockmark.IT. All information utilised in the creation of this communication has been gathered from publicly available sources that we consider reliable. Nevertheless, we cannot guarantee the accuracy or completeness of this communication.
This communication is intended solely for informational purposes and should not be construed as an offer, recommendation, solicitation, inducement, or invitation by or on behalf of the Company or any affiliates to engage in any investment activities. The opinions and views expressed by the authors are their own and do not necessarily reflect those of the Company, its affiliates, or any other third party.
The services and products mentioned in this communication may not be suitable for all recipients. By continuing to read this website and its content, you agree to the terms of this disclaimer.