Meta admits using pirated books to train AI, but won't pay for it

Lee Duna@lemmy.nz · 1 year ago

TWeaK@lemm.ee · 1 year ago

This is not how copyright has been applied when speaking of other machine learning processes using logical regression that is considered fair use, as in Text and Data Mining classifications(TDM) (proposed class 7(a) and 7(b) (page 102) in Recommendation of the Register of Copyrights 2021).

Your link contains only proposed recommendations. That is neither legislation nor case law. Also, the sections on TDM that you reference clearly state (my emphasis):

for the purpose of scholarly research and teaching.

I think this makes it abundantly clear that the research exemption does not apply. AI “research” is in no way “scholarly”; it is commercial product development and thus does not align with fair use copyright exemptions.

It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.

To put your other link into context, this is also not law, but commentary from law professors.

Understanding the process of training foundation models is relevant to the generative AI systems’ fair use defenses because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” Processing in-copyright works to extract “information about the original [work]” does not infringe because it does not “replicat[e] protected expression.”

The flaw here is that the work isn’t processed in situ; it is copied into a training database and then processed. The processing may be fine, but the copying is illegal.

If they had a legitimate license to view the work for their purpose, and processed in situ, that might be different.

The difficulty with such arguments is that the harm asserted does not flow from the communication of protected expression to any human audience.

The argument here is that, while AI sometimes infringes copyright, the harm it causes doesn’t primarily come from the infringing act. That isn’t always true, though; it depends on the use. If AI is used to impersonate someone else, then the AI manufacturer has built a tool that facilitates an illegal act by copying the original work.

However, this again ignores the fact that the commercial enterprise has copied the data into its training database without duly compensating the rightsholder.