OpenAI's Napster/Google Moment

The New York Times got rolled by Google in the 2000s, but they're not getting rolled this time around.

Dec 27, 2023

A dramatic scene of a bank heist in progress, similar to previous images but in a wider format. Set in a large, opulent bank lobby with marble floors and classical architecture, the thief, wearing a black ski mask with 'Open AI' prominently displayed, is in the act of stealing numerous copies of the New York Times. The thief is dressed in black gloves and dark clothing. The scene captures the moment with bank customers and staff expressing surprise and confusion. The newspapers are clearly labeled 'New York Times'. There is an atmosphere of tension but no violence or weapons are shown. — Made this images using OpenAI’s Dall-E, which is ironic.

Today, the New York Times sued OpenAI and Microsoft for stealing millions of pages of their content to power the most promising technology platform since the iPhone: ChatGPT.

The charges aren't debatable and will be quickly proven if this landmark case ever makes it to trial -- and I think this one will.

Most language models train on an open-source project called "Common Crawl."

Common Crawl is like the Google search index, but it's available to everyone for free -- with some important caveats.

The Common Crawl terms of use are clear: if you want to use the data they've indexed, you must go to every content owner and follow their terms of service.

From that ToU:

"You also acknowledge and agree that all information, data, text, scripts, web pages, web sites, software, html page links, open data APIs, metadata or other materials (collectively, the "Crawled Content ") may be subject to separate terms of use or terms of service from the owners of such Crawled Content."

Of course, technologists, generally speaking, have little to no respect for IP.

I hear many peers say it's yours if you can index it, which is absurd and lacks empathy for content creators.

This attitude started with Napster and extended to Google's approach to using other people's content 20 years ago.

NAPSTER got smashed because the music industry is savvy and hardcore.

News sites got rolled by Google because publishers are dopey and meek.

Google's position back then was, "If you don't want to be in the index, just tell us not to crawl you in your robot.txt file!"

Of course, Google got so big, so fast, that it started sending massive traffic to sites like the NYTimes.

The publishing industry was so fractured and dumb at the time that they never considered how utterly worthless Google would be if the top 500 publications refused to let them index it.

So, Google brilliantly threaded the needle on "fair use" laws by making publishers feel like "a ton of traffic for a snippet of content is a fair deal."

Then Google gutted the publishing industry with its massive advertising platform, creating all kinds of downstream issues for society (a whole other blog post).

The publishing industry is smaller now, battle-scarred by decades of war with technology companies.

The NYTimes isn't the dopey publisher it was in the early 2000s. Today, they run one of the most successful subscription businesses in the world and compete in many areas outside of news.

This time, they won't get rolled; they'll fight to the death.

And this time, they're going to win.

OpenAI can't make the "fair use" argument that Google made because ChatGPT doesn't send traffic to publishers; it simply gives users an answer based on all the content they've liberated.

The NYTimes points this out in their suit, and it's devastating.

OpenAI will settle this suit for hundreds of millions of dollars, perhaps billions, I predict (something the two parties tried to do before the NYTimes filed).

The suit gives many examples of OpenAI using the NYT's content, and they've caught them dead to rights.

For example, the NYT quickly proves that OpenAI stole the Wirecutter's IP and brazenly gives results built off that IP -- but that doesn't link back, and that removes the affiliate links that pay for creating that content.

Game over for OpenAI and Microsoft with that example.

Here’s a screenshot from the lawsuit:

Additionally, the NYTimes points out their content is behind a paywall.

Today, I pay a couple of hundred dollars for the NYTimes.

I am also among the millions who pay OpenAI a couple hundred dollars a year for a subscription.

Using these services for 20+ years, my lifetime value is $6,000 each.

I'm also a massive fan of the wirecutter and use it for my decision-making process before smashing the Amazon buy button.

Recently, I found myself doing product searches with ChatGPT.

These two services are directly competing, and one does all the work: the New York Times.

The other steals their answers from the New York Times (OpenAI).

Now, it’s great that OpenAI is trying to do the right thing here and settle, but let’s be honest: they’re only trying to settle this because they got caught with their hand in the cookie jar.

These are amongst the smartest individuals on the planet, and they stole this content because it would make them rich.

The discovery will show technologists trying to find the best content in the world to train their $100B franchise while they personally sell billions of dollars in shares for personal gain.

Let that sink in: the OpenAI team is reportedly selling billions of shares based on a product trained on other people’s IP.

If OpenAI can do this, well, then the rest of us can train a model on Star Wars, Marvel, Pixar, and Disney IP and make the next generation of superhero and Jedi stories!

The judges and jury in this one will make short work of OpenAI and Microsoft. OpenAI and Microsoft need to pay for the lost revenue that the NYTimes is currently experiencing--that's obvious.

The more significant issue here is the Google v. Publisher rematch.

The New York Times will take an open-source language model, train it on their data, and create a ChapGPT competitor--that much is clear.

What if the NYTimes is successful with that model and they start buying more sites like Wirecutter and The Athletic?

OpenAI and the New York Times are direct competitors; you can’t steal a direct competitors IP.

It’s that simple.

Man, maybe I need to jaytrade ’some NYTimes stock.

These cats at the NYTimes aren't as dumb as they used to be.

The only question is, are they now tigers?

Would they have the audacity to compete with OpenAI?

If I were running the New York Times, I would announce a ChatGPT competitor today and raise ten billion dollars… from Google or Apple.

best, JCal

PS - forgive any typos. I decided to let these blog posts fly without an editor.

Gerry Goldsholle

Absolutely on point and highly likely the way things will wind up. This went miles beyond “fair use.”

Expand full comment

John Eden

Dec 30, 2023Edited

Jason,

You're directionally correct. The NYT and similarly situated legacy media companies are not going to roll over for OpenAI.

Nevertheless, our current federal copyright laws were not designed with AI in mind. While it may not be "fair use" to steal content that was created at great expense by the author, there's no right to exclude individuals from taking content in the public domain, learn from it, and then produce new/different/better content (provided original expression isn't copied verbatim and republished).

What makes AI different? Easy: AI ingestion and content generation works at a pace and scale that no human cognizer can match. After all, an AI can ingest and generate new stuff at a pace no human can even come close to matching. But the question is whether the content produced - the "generative" AI - really does simply copy and republish the original.

In some cases, the argument will be easy to make because the content generated will be substantially similar or identical (amount of use) and the market impact on the creator will be massive (market effect). In other cases, the content will be different enough to make it hard to characterize as pure copying and republishing.

The issue, Jason, is this: Copyright law is fundamentally not about protecting the effort or expense undertaken by a creator to make something copyrightable. It's about protecting the expression of the creator. So, once you get a technology - whether it's Google, YouTube or OpenAI - that creates new expression while at the same time hovering up all the content created (at great expense, let's say) by the creative class, you need to figure out how to protect creative effort.

One final thing. You said that we all ought to have empathy for creators. I agree. The way to make that empathy meaningful and effective is to get some modern copyright laws on the books - not laws that were built for a world in which it was a lot harder to steal the creative genius of other authors and creators. We need copyright laws that take modern technology into account in a deft and fair way.

Your Friendly Neighborhood IP Lawyer

11 more comments...

Jason Calacanis on Startups

Discussion about this post