12 Comments
Dec 27, 2023 · Liked by jason@calacanis.com

Absolutely on point and highly likely the way things will wind up. This went miles beyond “fair use.”


Great writing!!! Very energetic and engaging, as well as smart and neat!!! :)

That said, the "training" context is not the same as the "direct fetching" one. It's much easier to justify training a model on some framework that then generates new content.


Jason,

You're directionally correct. The NYT and similarly situated legacy media companies are not going to roll over for OpenAI.

Nevertheless, our current federal copyright laws were not designed with AI in mind. While it may not be "fair use" to steal content that was created at great expense by the author, there's no right to exclude individuals from taking content in the public domain, learning from it, and then producing new/different/better content (provided original expression isn't copied verbatim and republished).

What makes AI different? Easy: AI ingestion and content generation work at a pace and scale that no human cognizer can come close to matching. But the question is whether the content produced - the "generative" AI - really does simply copy and republish the original.

In some cases, the argument will be easy to make because the content generated will be substantially similar or identical (amount of use) and the market impact on the creator will be massive (market effect). In other cases, the content will be different enough to make it hard to characterize as pure copying and republishing.

The issue, Jason, is this: Copyright law is fundamentally not about protecting the effort or expense undertaken by a creator to make something copyrightable. It's about protecting the expression of the creator. So, once you get a technology - whether it's Google, YouTube or OpenAI - that creates new expression while at the same time hoovering up all the content created (at great expense, let's say) by the creative class, you need to figure out how to protect creative effort.

One final thing. You said that we all ought to have empathy for creators. I agree. The way to make that empathy meaningful and effective is to get some modern copyright laws on the books - not laws that were built for a world in which it was a lot harder to steal the creative genius of other authors and creators. We need copyright laws that take modern technology into account in a deft and fair way.

Your Friendly Neighborhood IP Lawyer


How would the NYT be able to take an open-source LLM and train it on their data? The open-source LLM is also trained on copyrighted data, just like GPT is.

I guess it's possible I'm just misunderstanding your argument. Saying it's illegal for ChatGPT or Bing's AI to recreate the text of a NYT article, and saying it's illegal for the underlying LLM to be trained on a NYT article even if it never reproduces any copyrighted article, are different claims. Which claim are you making (or both)?


Good summary, JC... thanks!


IANAL, but I'd have thought that copyright law is the only thing with enough reach and enough "teeth" for the job. Unlike Google, Napster, etc., the LLM trainers don't actually distribute a "copy" of anything. When an LLM does manage to infringe copyright, it only does so because you or I (the user) specifically asked it to. The only winners here will be the lawyers.

Always enjoy your commentary though :-)


What the lawsuit has surfaced is the reshuffling of the information/data game. From '94 to Dec '22, ALL data was vacuumed up by the "machine" (GAALIgFbTT). Perplexity.ai showed the first significant evolution in info/data access in forever. JCal's guest said, "Google shows you ten sites where the answer is; Perplexity shows you 5 different ANSWERS." Big diff.

Show me the money...? The Machine at the top has siphoned off all the money since 2008 (FB/iPhone). That MIDDLE LAYER is all the data which has NEVER been public. The introduction of an AI OS which can manage the security, access, and distribution of content based on access rights and MONEY is what's next. Followed by PEOPLE, the bottom layer. Instead of me giving Zuck my affinity for Titleist golf balls and him getting paid for THEM TO SHOW ME ADS... F-ing nuts... but here we are.

My personal AI OS will do the same in terms of managing my profile and exposure... does it matter?

Perplexity will NEVER know who I am, until I share it.

Big diff.


Great blog! Curious to get your thoughts on Google's and xAI's competitive advantage, given the huge proprietary data sources on which they can train. Does this put them ahead of OpenAI and Microsoft?

And is Google's and X's data proprietary, or can this also be disputed by content creators?

Curious to get your take! Raised this point with Chamath on X today as well!


Thanks, Jason, for a cogent analysis. Google = content theft on a grand scale. "Exposure bucks" don't buy anything at the supermarket.


Fuck the NYTimes. It is not as valuable as OpenAI.

Author

What a refined argument.
