OpenAI's Napster/Google Moment
The New York Times got rolled by Google in the 2000s, but they're not getting rolled this time around.
Today, the New York Times sued OpenAI and Microsoft for stealing millions of pages of their content to power the most promising technology platform since the iPhone: ChatGPT.
The charges aren't debatable and will be quickly proven if this landmark case ever makes it to trial -- and I think this one will.
Most large language models train on data from an open web archive called "Common Crawl."
Common Crawl is like the Google search index, but it's available to everyone for free -- with some important caveats.
The Common Crawl terms of use are clear: if you want to use the data they've indexed, you must go to every content owner and follow their terms of service.
From that ToU:
"You also acknowledge and agree that all information, data, text, scripts, web pages, web sites, software, html page links, open data APIs, metadata or other materials (collectively, the "Crawled Content ") may be subject to separate terms of use or terms of service from the owners of such Crawled Content."
Of course, technologists, generally speaking, have little to no respect for IP.
I hear many peers say that if you can index it, it's yours -- which is absurd and lacks empathy for content creators.
This attitude started with Napster and extended to Google's approach to using other people's content 20 years ago.
Napster got smashed because the music industry is savvy and hardcore.
News sites got rolled by Google because publishers are dopey and meek.
Google's position back then was, "If you don't want to be in the index, just tell us not to crawl you in your robots.txt file!"
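For the non-technical readers: robots.txt is just a text file a site publishes telling crawlers what they may and may not fetch. Here's a minimal sketch of that check using Python's standard library; "ExampleBot" is a made-up user-agent for illustration, not any particular company's crawler.

```python
# Minimal sketch of the "just tell us in robots.txt" argument: ask whether a
# given crawler is allowed to fetch a given URL.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.nytimes.com/robots.txt")
rp.read()

# True/False depending on what the live robots.txt allows for this user agent.
print(rp.can_fetch("ExampleBot", "https://www.nytimes.com/section/technology"))
```

The catch, then and now, is that this puts all the work on the publisher to opt out after the crawling has already started.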
Of course, Google got so big, so fast, that it started sending massive traffic to sites like the NYTimes.
The publishing industry was so fractured and dumb at the time that they never considered how utterly worthless Google would be if the top 500 publications refused to let it index their content.
So, Google brilliantly threaded the needle on "fair use" laws by making publishers feel like "a ton of traffic for a snippet of content is a fair deal."
Then Google gutted the publishing industry with its massive advertising platform, creating all kinds of downstream issues for society (a whole other blog post).
The publishing industry is smaller now, battle-scarred by decades of war with technology companies.
The NYTimes isn't the dopey publisher it was in the early 2000s. Today, they run one of the most successful subscription businesses in the world and compete in many areas outside of news.
This time, they won't get rolled; they'll fight to the death.
And this time, they're going to win.
OpenAI can't make the "fair use" argument that Google made because ChatGPT doesn't send traffic to publishers; it simply gives users an answer based on all the content they've liberated.
The NYTimes points this out in their suit, and it's devastating.
I predict OpenAI will settle this suit for hundreds of millions of dollars, perhaps billions (something the two parties tried to do before the NYTimes filed).
The suit gives many examples of OpenAI using the NYT's content, and the Times has caught them dead to rights.
For example, the NYT quickly proves that OpenAI stole the Wirecutter's IP and brazenly serves results built off that IP -- without linking back, and with the affiliate links that pay for creating that content stripped out.
Game over for OpenAI and Microsoft with that example.
Here’s a screenshot from the lawsuit:
Additionally, the NYTimes points out their content is behind a paywall.
Today, I pay a couple of hundred dollars a year for the NYTimes.
I'm also among the millions who pay OpenAI a couple of hundred dollars a year for a subscription.
Over 20-plus years at roughly $300 a year, that works out to a lifetime value of about $6,000 for each.
I'm also a massive fan of the Wirecutter and use it in my decision-making process before smashing the Amazon buy button.
Recently, I found myself doing product searches with ChatGPT.
These two services are directly competing, and only one of them does the work: the New York Times.
The other, OpenAI, steals its answers from the New York Times.
Now, it’s great that OpenAI is trying to do the right thing here and settle, but let’s be honest: they’re only doing that because they got caught with their hand in the cookie jar.
These are amongst the smartest individuals on the planet, and they stole this content because it would make them rich.
Discovery will show technologists hunting down the best content in the world to train their $100B franchise while personally selling billions of dollars in shares.
Let that sink in: the OpenAI team is reportedly selling billions of dollars’ worth of shares based on a product trained on other people’s IP.
If OpenAI can do this, well, then the rest of us can train a model on Star Wars, Marvel, Pixar, and Disney IP and make the next generation of superhero and Jedi stories!
The judge and jury in this one will make short work of OpenAI and Microsoft. OpenAI and Microsoft need to pay for the revenue the NYTimes is losing--that's obvious.
The more significant issue here is the Google v. Publisher rematch.
The New York Times will take an open-source language model, train it on their data, and create a ChatGPT competitor--that much is clear.
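That isn't a moonshot, technically. Here's a minimal, hypothetical sketch of what fine-tuning an open-weight model on an archive you actually own looks like with the Hugging Face stack -- the model name and file paths are placeholders, not anything from the suit or the Times.

```python
# Hypothetical sketch: fine-tune an open-weight model on a publisher's own
# archive (assumed here to be exported as plain-text files the publisher owns).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder open-weight base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder path: the publisher's exported archive as plain text.
dataset = load_dataset("text", data_files={"train": "archive/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="publisher-lm",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The hard part isn't the code; it's owning a deep, edited, fact-checked archive -- which the Times does.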
What if the NYTimes is successful with that model and they start buying more sites like Wirecutter and The Athletic?
OpenAI and the New York Times are direct competitors; you can’t steal a direct competitor’s IP.
It’s that simple.
Man, maybe I need to jaytrade some NYTimes stock.
These cats at the NYTimes aren't as dumb as they used to be.
The only question is, are they now tigers?
Would they have the audacity to compete with OpenAI?
If I were running the New York Times, I would announce a ChatGPT competitor today and raise ten billion dollars… from Google or Apple.
best, JCal
PS - forgive any typos. I decided to let these blog posts fly without an editor.