Generative AI is often described as the technology that will ferry us into a new age. Anyone who uses a chatbot such as ChatGPT or Google Bard might agree, but no new technology is without its vehement naysayers. The war going on between AI chatbots like ChatGPT and the sites they scrape isn’t something that you hear about in the keynotes or see in the blog posts. However, it’s happening, and it’s only getting more intense.
The vast majority of people who use chatbots don’t really know what’s powering them. Chatbots are massive reservoirs of information that can put any encyclopedia set to shame. Well, where do you think they get their information from? Simply put, they get it from the internet; they get it from your favorite websites.
AI companies fired the first shot by data scraping
This isn’t fearmongering, but if you don’t know about data scraping, it’s time that you found out. Chatbots hold an ocean of information about pretty much everything, but that information has to exist somewhere; they don’t just create the information out of thin air (with the exception of hallucinations). They get their vast knowledge from information scraped from across the web.
AI companies use bits of software called crawlers that travel to different websites and scrape data from them. They’ll make copies of the information on the sites and feed them into their LLMs (large language models) that power the chatbots. Now, imagine how much information crawlers can gather with the entire internet as their playground. This is why ChatGPT has an answer for pretty much anything you type into it.
In case you’re wondering, yes, this includes information that YOU create. If you write an article, a crawler can travel to that post, copy that information, and feed it into its LLM. Even if you’re not a writer, you’re still inadvertently contributing to chatbots. Most likely, your tweets, Facebook posts, and other social media posts were scraped to train a chatbot.
Top sites don’t like that, so they’re waging war on AI chatbots
On the surface, it seems that most businesses are accepting of generative AI. They’re all for offering AI services, but the question is: how do they feel about having their data scraped? Many of the biggest sites on the internet don’t feel great about it, actually. In fact, they’ve made a point to block chatbot crawlers from scraping their sites. A report from Originality.AI reveals this.
On August 7th, OpenAI revealed a way for companies to block its crawler, GPTBot, from scraping data from their websites. After only two weeks, 69 of the top 1,000 websites (about 7%) were blocking it. Fast forward to September 17th, and that number more than tripled. About 25.9% of the top 1,000 websites chose to block GPTBot. A total of 242 of the top sites in the world are blocking crawlers.
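OpenAI’s opt-out works through a site’s robots.txt file: the site adds a rule that disallows the GPTBot user agent. As a rough sketch of what that looks like and how to check it, the snippet below parses a made-up robots.txt with Python’s standard library. The sample file, the is_blocked helper, and the example.com URL are assumptions for illustration, not anything taken from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it turns GPTBot away
# and leaves every other crawler alone.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "blocked:", is_blocked(SAMPLE_ROBOTS_TXT, agent))
```

Keep in mind that robots.txt is a request, not a lock. It only works when the crawler chooses to honor it.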
One thing to remember is that these numbers only pertain to the top 1,000 sites. There are countless websites out there; 1,000 is a tiny fraction of them. The number of websites in general that are blocking crawlers is likely far larger. Looking at the graph above, you can see that the numbers are steadily going up. More and more sites are fighting against ChatGPT by blocking its crawler.
What sites are blocking ChatGPT?
This list includes sites like Amazon, Pinterest, Quora, Tumblr, Indeed, Dictionary.com, Shutterstock, WikiHow, and several more. It’s no surprise that a large number of news media sites are also at war against AI chatbots. They include CNN, The NY Times, The Verge, Reuters, CNBC News, Insider, The Washington Post, Wired, Polygon, and many more.
Many writers at media sites fear for their jobs. Chatbots can generate an entire article in about the time it takes to read a title. This technology has the potential to put a lot of journalists out of work. This is why media sites are thoroughly opposed to chatbot scraping.
Also, several veteran news sites have published groundbreaking pieces of journalism over the years. These pieces are copyrighted and they’re held as the crowning achievements for the publications. It does not sit well with these sites that crawlers are able to scrape those articles.
And, as if these sites needed more reasons to oppose chatbot crawlers, the crawlers are able to scrape paywalled content as well.
However, more chatbots have entered the battle
Back when ChatGPT first launched, there was only one chatbot to worry about. However, since the technology has exploded, more companies are looking to scrape websites.
On the chart above, you may have noticed that there wasn’t only one crawler being blocked; there were four. Two of them are from OpenAI; the other two are CCBot and Anthropic AI. We see that GPTBot is the one that’s blocked the most, but sites are having their data scraped by multiple crawlers at once.
CCBot is the second most blocked crawler, blocked by 13.9% of the top websites. ChatGPT-User, OpenAI’s other bot, is blocked by about 7% of the top 1,000 sites. Anthropic AI is blocked the least. Only two sites on this list are blocking that bot: Reuters and Corriere.it.
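Percentages like these come from checking each site’s robots.txt for each crawler and counting the blocks. Here is a minimal sketch of that tally. The four sites and their robots.txt contents are invented for illustration and are not data from the report, and blocked_share is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents for four imaginary sites.
# None of this is real survey data.
SAMPLE_SITES = {
    "site-a.example": "User-agent: GPTBot\nDisallow: /\n",
    "site-b.example": (
        "User-agent: GPTBot\nDisallow: /\n\n"
        "User-agent: CCBot\nDisallow: /\n"
    ),
    "site-c.example": "User-agent: *\nAllow: /\n",
    "site-d.example": "",  # an empty robots.txt blocks nobody
}


def blocked_share(sites: dict, user_agent: str) -> float:
    """Percentage of sites whose robots.txt bars user_agent from '/'."""
    blocked = 0
    for robots_txt in sites.values():
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        if not parser.can_fetch(user_agent, "/"):
            blocked += 1
    return 100 * blocked / len(sites)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "ChatGPT-User"):
        print(f"{agent}: {blocked_share(SAMPLE_SITES, agent):.1f}% of sample sites")
```

A real survey would first have to download robots.txt from each of the 1,000 sites; that step is left out here so the sketch runs without a network connection.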
On top of the four mentioned in the graph, there are also the crawlers for Google Bard and Vertex AI. This makes six chatbot crawlers at minimum that could be scraping websites’ data. Yes, Amazon is blocking GPTBot, but that’s only a fraction of the chatbots that it needs to worry about.
This goes for all of the sites as well. People launching new sites will need to take crawlers into consideration along with running the site. Fortunately, Google recently released Google-Extended, a way to block Bard and Vertex AI from scraping data from websites.
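Because each crawler answers to its own user-agent name, a site that wants to turn all of them away needs one robots.txt rule per crawler. The sketch below generates such a file. The token strings (in particular anthropic-ai and Google-Extended) are assumptions based on how these crawlers are commonly identified, so check each company’s documentation before relying on them; build_robots_txt and blocks are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the crawlers named in this article.
# Verify each one against the vendor's own documentation.
AI_CRAWLER_TOKENS = [
    "GPTBot",
    "ChatGPT-User",
    "CCBot",
    "anthropic-ai",
    "Google-Extended",
]


def build_robots_txt(tokens: list) -> str:
    """Build robots.txt text that disallows the whole site for each token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # A blank line separates one crawler's rules from the next.
    return "\n\n".join(groups) + "\n"


def blocks(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    robots = build_robots_txt(AI_CRAWLER_TOKENS)
    print(robots)
    print("Googlebot blocked:", blocks(robots, "Googlebot"))
```

Listing crawlers by name means ordinary search crawlers are unaffected, but it also means the list has to be updated every time a new AI crawler appears.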
This war will continue to rage until governments step in
So, it’s evident that the transition into this new AI era isn’t graceful. Behind the curtains, the real drama is unfolding. There’s a war going on between AI and the sites that it’s learning from. This is the kind of future that we couldn’t avoid with how generative AI works.
This is going to continue as more chatbots come to the surface. We don’t know how many crawlers will be scraping data a year from now. What about 5 years from now or 10 years? It seems that the only thing that could turn the tide is government intervention.
Right now, AI companies have free rein to develop their technology as they see fit. That’s not comforting, as we’re talking about large and rather greedy corporations. They talk about making AI safe in their user-facing keynotes and blog posts, but the end goal is always a dollar sign for most companies. This means that companies will tend to go overboard with their technology to compete. With nothing to stop them, they’re free to crawl and scrape whatever data they want; they’re in a mad dash to make their chatbot smarter than the competition’s.
Enter the government
However, at the time of writing this article, several governmental bodies and lawmakers are trying to figure out how to handle this AI revolution. For example, the UK government is fighting for more transparency in how AI works.
The keyword here is “Regulation”. If the government steps in and regulates how companies can obtain their data, then the battle will be more in favor of the sites being scraped. It won’t be as easy for chatbots to scrape the sites without permission.
The war rages on
Right now, we’re at a point where the future of the tech industry is still a mystery. AI technology is being built into more of the services and technology that we take for granted. As AI continues to develop, there will always be companies that oppose its progress. About 25.9% of the top 1,000 sites are blocking GPTBot, and the number is rising. The number could double again in a month’s time for all we know. What we know is that, as long as AI chatbots exist, there will always be a war going on.
2023-10-02 15:05:51