Meta admits using pirated books to train AI, but won't pay for it

Lee Duna@lemmy.nz · 1 year ago

archomrade [he/him]@midwest.social · 1 year ago

The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.

The copying of the data is not, by itself, infringement. It depends on the use and purpose of the copied data, and the defense argues that training a model against the data is fair use under TDM use-cases.

AI can reproduce, and has reproduced, significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).

The model does not have a ‘database’, it is a series of transform nodes weighted against unstructured data. The transformation of the copyrighted works into a weighted regression model is what is being argued is fair use.

Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before.

yup, and it isn’t the act of that human reading a copyrighted work that is considered as infringement, it is the creation of the work that is substantially similar. In the same analogy, it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.

The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete

The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.

Trying really hard not to come off as rude, but there’s a good reason why this isn’t the argument being put forward in the lawsuits. If this was their argument, the LLM could be considered a commissioned agent, placing the liability on the agent commissioning the work (e.g. the human prompting the work) - not OpenAI or Stability - in much the same way a company is held responsible for the work produced by an employee.

I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.

TWeaK@lemm.ee · 1 year ago

The copying of the data is not, by itself, infringement.

Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied. Thus, any unauthorised copying is copyright infringement. However, fair use gives exemption to certain types of copying. The copyright is still being infringed, because the rightsholder’s absolute rights are being circumvented, but no penalty is awarded because of fair use.

This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.

it is a series of transform nodes weighted against unstructured data.

That’s a database. Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.

it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.

Yeah I don’t want to go down the avenue of suing the AI itself for infringement. However…^[1]^[2]^[3]

Trying really hard not to come off as rude

You’re not coming off as rude at all with what you’ve said, in fact I welcome and appreciate your rebuttals.

I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.

You say that as if I haven’t enjoyed fleshing out the ideas and sharing them. By the way, right now I’m sharing with you lemmy’s hidden citation feature :o)

Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.

I thought we were engaging in a positive manner, but apparently you’ve been spitting in my face.

[1] but there’s a good reason why this isn’t the argument being put forward in the lawsuits.

[2] the LLM could be considered a commissioned agent

[3] The LLM absolutely could be considered an agent, but the way it acts is merely prompted by the user. The actual behaviour is dictated by the organisation that built it. In any case, this is only my backup argument if you even consider the initial copying to be research - which it isn’t.