Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
I think you’re anthropomorphising the tech tbh. It’s not a person or an animal, it’s a machine, and “cramming” doesn’t map onto how neural networks work. They’re a mathematical calculation over a vast multidimensional matrix, effectively fitting a function with an unimaginable number of terms. So “cramming”, as you put it, doesn’t apply: by definition an LLM cannot forget information, because once the calculations have been applied it is in there forever. That information is supposed to be blended together. Overfitting is the closest thing to what you’re describing: the network is fed similar training data and performs similar calculations throughout, so it performs poorly when asked to do anything different from its training.
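To make the overfitting point concrete, here is a minimal sketch. It is not a neural network, and the data, the `truth` function and every helper name are made up for illustration. A model with enough capacity to hit every noisy training point exactly (a degree-9 polynomial through 10 points) is compared with a plain straight-line fit on points that sit between the training points.

```python
# Toy illustration of overfitting (hypothetical data, not a real neural network).

def interpolate(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all n points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def truth(x):
    """The assumed underlying relationship the models should recover."""
    return 2 * x + 1

# Training targets carry a fixed alternating +/-0.2 "noise" so the run is deterministic.
train_x = [i / 9 for i in range(10)]
train_y = [truth(x) + (0.2 if i % 2 == 0 else -0.2) for i, x in enumerate(train_x)]

# Test points sit halfway between training points and are noise-free.
test_x = [(i + 0.5) / 9 for i in range(9)]
test_y = [truth(x) for x in test_x]

slope, intercept = fit_line(train_x, train_y)

line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
memo_train = mse([interpolate(train_x, train_y, x) for x in train_x], train_y)
memo_test = mse([interpolate(train_x, train_y, x) for x in test_x], test_y)

print(f"straight line : train {line_train:.4f}  test {line_test:.4f}")
print(f"interpolant   : train {memo_train:.4f}  test {memo_test:.4f}")
```

The interpolant scores a perfect training error because it has effectively stored the training points, noise included, and it does far worse than the straight line as soon as it is asked about anything in between.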
What I’m arguing over here is language rather than a system, so let’s do that and note the flaws. If we’re being intellectually honest, we can agree that a flaw like reproducing large portions of a work doesn’t represent true learning and shows a reliance on the training data, i.e. it can’t learn unless it has seen similar data before, and certain inputs give it a chance of just parroting back the training data.
In the example (repeat book over and over), it has statistically inferred that those are all the correct words to repeat, in that order, based on the prompt. This isn’t akin to anything human: people can’t repeat pages of text verbatim like this, and no toddler can be tricked into repeating a random page from a random book as you say. The data is there; it’s encoded and referenced when the probability is high enough. As another commenter said, language itself is a powerful tool of rules and stipulations that provide guidelines for the machine, but it isn’t crafting its own sentences, it’s using everyone else’s.
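A rough sketch of how “statistically inferring the next word” and “repeating the text verbatim” can be the same thing. This is a tiny word-level n-gram table, nothing like a real LLM, and the training sentence and function names are invented for the example. The table only stores next-word counts, not the text itself, yet a two-word prompt is enough to get the whole training sentence back.

```python
from collections import Counter, defaultdict

def train(words, order):
    """Count which word follows each run of `order` consecutive words."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

def generate(table, prompt, order, max_words=50):
    """Greedy generation: always append the most frequent next word."""
    out = list(prompt)
    for _ in range(max_words):
        context = tuple(out[-order:])
        if context not in table:
            break  # the model has never seen this context, so it stops
        out.append(table[context].most_common(1)[0][0])
    return out

# Hypothetical "training data": one sentence, seen once.
text = ("the quick brown fox jumps over the lazy dog "
        "while a small cat watches from the old fence")
words = text.split()

model = train(words, order=2)
generated = generate(model, prompt=["the", "quick"], order=2)

print(" ".join(generated))
print("verbatim copy of training text:", generated == words)
```

Every step here is “pick the most probable next word given the context”, and the result is still a word-for-word copy, because for this prompt the training text is the only thing the statistics encode.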
Also, calling it “tricking the AI” isn’t really intellectually honest either, as in “it was tricked into exposing that it still has the data encoded”. We can state that it isn’t preferred or intended behaviour (an exploit of the system), but the system, under certain conditions, exhibits reuse of the training data and the ability to replicate it almost exactly (plagiarism). Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format, which was my original point. This isn’t “cramming”; this is encoding and reusing data that was not created by the machine or the programmer. This is other people’s work that it is reproducing as its own. It does this constantly, from reusing StackOverflow code and comments to copying tutorials on how to do things. I was showing a case where it won’t even modify the wording, but it reproduces articles and programs in their structure and their format. This isn’t originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.
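If anyone wants to put a number on “replicates it almost exactly”, one simple approach is to look for the longest run of words an output shares with a known source. Below is a sketch using Python’s standard `difflib`; the two strings and the helper name `longest_shared_run` are made up for illustration, and this only measures word-for-word overlap, not paraphrase or copied structure.

```python
from difflib import SequenceMatcher

def longest_shared_run(source_words, output_words):
    """Return the longest run of consecutive words found in both lists."""
    matcher = SequenceMatcher(None, source_words, output_words, autojunk=False)
    match = matcher.find_longest_match(0, len(source_words), 0, len(output_words))
    return output_words[match.b:match.b + match.size]

# Hypothetical source text and hypothetical model output.
source = "we hold these truths to be self evident that all men are created equal".split()
output = "the model wrote that we hold these truths to be self evident and then stopped".split()

run = longest_shared_run(source, output)
share = len(run) / len(output)

print("longest shared run:", " ".join(run))
print(f"{len(run)} of {len(output)} output words ({share:.0%}) come from one verbatim run")
```

A long shared run is decent evidence that the text was reproduced rather than independently written; a short one tells you very little either way.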
EDITS: Sorry for all the edits. I mildly changed what I said and added some extra points so it was a little more intelligible and didn’t make the reader go “WTF is this guy on about”. Not doing well in the written department today so this was largely gobbledegook before but hopefully it is a little clearer what I am saying.