Apparently, since the beginning of the year, AI bots have been slowing websites down to the point that they can only respond to regular requests with a delay. The founder of Linux Weekly News (LWN.net), Jonathan Corbet, reports that the news site is therefore often slow to respond.
The AI scraper bots effectively cause a DDoS, a distributed denial-of-service attack. At times, the bots clog the lines from hundreds of IP addresses simultaneously as soon as they decide to access the site's content. Corbet explains on Mastodon that only a small proportion of the traffic currently serves real human readers.
↫ Dirk Knop at Heise.de
I’m sure someone will tell me we just have to accept that a large percentage of our bandwidth is going to overpriced bullshit generators, and that we should just suck it up and pay for Sam Altman’s new house. I hope these same people realise “AI” is destroying the last vestiges of the internet that haven’t fallen victim to all the other techbro fads so far, and that sooner rather than later there won’t be anything left to browse to.
The coming few years are going to be fun.
This sounds like a big problem. However, I would point out that it is really just a continuation of a problem that has existed for a long time.
I used to work for a website that had 40 million pages. About 40% of the traffic went to serving web crawlers (search engines). The resulting referral traffic from search engines made the trade-off perhaps OK in that case, although most incoming visitors came from Google while (at that time) Bing's crawler created most of the load. Google was much smarter about not constantly re-indexing the same content.
Obviously automated bot traffic has been going on for decades. It’s usually just background traffic and not typically a big problem.
It's usually in the interest of bot authors to make them behave well and not motivate admins to block them. I'm not really convinced that the bots scraping the web for AI use more resources than anyone else, such as Google or the Internet Archive. But I understand that some admins may not welcome some bots regardless of their technical behavior.
Yes, scraping the web for AI *does* use more resources than anything else. Easily 100 if not 1000 times more than Google and Bing combined. They change IP on every request, use a different made-up user agent on every request (Firefox 134 on Windows 98 SE, for example), ignore robots.txt and don't rate limit anything. They actively try everything to circumvent bans. They know full well their behavior is abusive, but they do not care. I know you are a huge AI fanboy who is unwilling to look at any problems with AI because it would challenge your predefined opinion, but please, try running a webserver and see for yourself.
js,
I'm curious how you are concluding that ChatGPT's bots specifically are faking user agents?
I’m not seeing anything like that on servers I admin. I’m skeptical of the 100X 1000X claims and I don’t understand what they would get out of downloading pages ~10k times per day as another comment alleges. If the allegations aren’t being made up, then it seems like something is wrong and not functioning as intended by the bot creator.
Hypothetically if websites are trying to create AI bot tarpits, as was suggested in an earlier OSnews article, then yes of course they might get 1000X of hits from affected bots. But websites with such mechanisms in place have deliberately put themselves in that position of tying up the bot bandwidth using fake links & content (a minimal sketch of the idea follows below). In that case such an outcome would be entirely expected.
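For what it's worth, the "tarpit" idea referenced above can be illustrated in a few lines of Python. This is only a hypothetical sketch of the general technique (an endpoint that serves pages full of generated links pointing back into itself), not the mechanism any particular site actually uses:

```python
# Hypothetical tarpit sketch: every page contains only generated links back
# into the same endpoint, so a misbehaving crawler keeps fetching "new" fake
# URLs indefinitely. Illustration only; paths and names are made up.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_link() -> str:
    token = "".join(random.choices(string.ascii_lowercase, k=12))
    return f"/maze/{token}"

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve 20 fresh fake links per request, regardless of the path asked for.
        links = "".join(f'<a href="{fake_link()}">{fake_link()}</a><br>' for _ in range(20))
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```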
I already do and while I do see hits from chatgpt, I don’t see anything more abusive than any other bot.
I posted another comment that aggregated the bot hits I was seeing. (BTW Thom, did it get taken down?)
Other bots including google/bing/apple account for way more traffic. I don't see the alleged activity happening on my servers and I would suspect it to be a bug more than an intended result. A bot scraping data for AI doesn't need 100X or 1000X more traffic compared to other bots, something isn't adding up with that claim. I'd like to get to the bottom of it, but the allegations alone don't seem to provide much clarity.
As for this reputation of being "a huge AI fanboy"… well haha, I don't find it entirely accurate. I believe AI has a lot of potential to break new ground in the coming years, and this seems to be why I'm given this label. However it doesn't automatically mean I'm a fan of the changes that AI will bring, especially when it comes to jobs. I worry a lot about what an AI revolution means for the working class under capitalism as owners increasingly lean on automation for productivity. From my perspective, many people are in denial about AI productivity. They're only looking for sudden changes rather than the less interesting but more realistic gradual job displacement as more jobs become automated over time. That's how I see this transformation going over the long term: not a single AHA moment, but an increasing AI presence over time. Cue the boiling frog analogy.
Once this happens I don't see how we back out of it, because human employees will be deemed too expensive compared to AI.
When you own the web server’s logs, that’s 95% of the work right there.
gus3,
?
I'm genuinely trying to understand the accusations, but in my logs it's less than the level of other bots: Google / Bing / Apple / etc.
I receive AI bot traffic on two of my ten sites; it's not on every website, but when it happens the server sometimes dies. I have WordPress on many sites, and those are the most common to fall over. My blog runs my own CMS, made by me, so I can filter the traffic with different options and be creative. And yes, they are almost 50% of the traffic. Most of them only see a cached version, so CPU use is low, and about 30% see Cloudflare's cached version, but the bandwidth use is high anyway, and most of it is not human traffic (Bing and Google Search are the worst offenders here, AI second).
Fabio,
Does it actually report itself to be the ChatGPT bot? Or are you assuming that unidentified traffic is from AI bots, as js posted?
To clarify, my intention is not to absolve bots of abusive behavior if the evidence supports it. I haven’t seen this evidence personally, hence all my questions. I’m not necessarily convinced it’s deliberate though.
Several years ago I experienced bingbot bombarding websites with endless requests. I contacted Microsoft and they made fixes so it stopped happening. Did everyone experience this? Probably not. Was it a real problem? Yes, clearly. Was it intentional? No, it was a bug with the bot. There's no inherent reason bots scraping the web for AI need to hit servers harder than any of the other scraping bots. Certainly not hitting pages 10k times per day. If the claims are true then it's very likely a bug in the scraper and not an intentionally designed mechanic.
> I'm curious how you are concluding that ChatGPT's bots specifically are faking user agents?
Where did I say ChatGPT’s bots are faking user agents? Nobody mentioned ChatGPT here, it’s AI bots in general. Why the straw man?
> I’m not seeing anything like that on servers I admin. I’m skeptical of the 100X 1000X claims and I don’t understand what they would get out of downloading pages ~10k times per day as another comment alleges.
They download the same content again and again: even though each link is slightly different, it's the same content. Try hosting a git web frontend or similar and you'll see. In my logs, there is easily 1000x more traffic from these AI bots than from Google.
> If the allegations aren’t being made up, then it seems like something is wrong and not functioning as intended by the bot creator.
I’m not sure why you are making claims of this being made up when several people here report this. Are you saying LWN, the source of this article, is making this up, too?
And whether the bots are functioning correctly or not isn't relevant: the bot owners know they are abusive and have decided to circumvent bans rather than change their behavior. This is not an accident.
> Hypothetically if websites are trying to create AI bot tarpits, as was suggested in an earlier OSnews article, then yes of course they might get 1000X of hits from affected bots.
I guess hosting a source code repo is a bot tarpit in your opinion then.
> I already do and while I do see hits from chatgpt, I don’t see anything more abusive than any other bot.
Then probably their strategy of hiding as regular users by using residential IPs (probably of hacked computers; you can cheaply buy VPNs that use those without having to hack people's computers yourself) and faked user agents worked to fool you. You can't easily grep for them, you need to look at the access patterns. And then it becomes very clear that IP 1 just discovered a link that IP 2 then requests. And if you follow the chain you get a pattern only a bot would produce, as no human would behave this way. No human would click on every single line of a blame view of a source code repo, for example, and do this from a different IP with a different user agent per request.
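To make the "look at access patterns, not user agents" point concrete, here is a rough sketch of the kind of log analysis that can surface this. The log format, path grouping and thresholds are my own assumptions for illustration, not anyone's actual tooling: it flags URL prefixes where nearly every request arrives from a different IP and user agent, which no single human reader produces.

```python
# Sketch: scan a combined-format access log and flag URL prefixes where the
# ratio of distinct client IPs to requests is close to 1, i.e. almost every
# hit comes from a fresh IP (and usually a fresh user agent).
import re
import sys
from collections import defaultdict

LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (?P<path>\S+) [^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def scan(log_path: str, prefix_depth: int = 2) -> None:
    stats = defaultdict(lambda: {"hits": 0, "ips": set(), "uas": set()})
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if not m:
                continue
            # Group URLs by their first few path segments, e.g. /repo/blame/...
            prefix = "/".join(m.group("path").split("/")[: prefix_depth + 1])
            s = stats[prefix]
            s["hits"] += 1
            s["ips"].add(m.group("ip"))
            s["uas"].add(m.group("ua"))
    for prefix, s in sorted(stats.items(), key=lambda kv: -kv[1]["hits"]):
        if s["hits"] < 500:
            continue  # ignore low-traffic prefixes (threshold is arbitrary)
        if len(s["ips"]) / s["hits"] > 0.8:  # almost one fresh IP per request
            print(f"{prefix}: {s['hits']} hits, {len(s['ips'])} IPs, "
                  f"{len(s['uas'])} user agents")

if __name__ == "__main__":
    scan(sys.argv[1])
```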
> A bot scraping data for AI doesn't need 100X or 1000X more traffic compared to other bots, something isn't adding up with that claim.
Except it’s not one bot but several abusive bots. They add up because everybody wants to win the AI race.
js,
Sorry that I made a bad assumption. Your post was vague and I wrongly assumed that you and jujulu were talking about the same thing. You don't know who's doing this then? How did you come to the conclusion that this unidentified traffic is from an "AI bot"? Is it just an assumption?
So this actually does shed a lot of light on why the bots are downloading the exact same pages thousands of times. Something didn't add up there, but now it makes a lot more sense. From a bot's perspective it's fetching new unseen URLs; the bot doesn't realize the URLs fetch the same page on the server. This definitely explains why a page would get 100X or 1000X different hits given different URLs for it.
It may be quite difficult to write a bot that differentiates between URI parameters that change the content versus ones that are merely used for other purposes in a UI postback model. This behavior can even be inconsistent on the same site because parameters on some pages may change the content whereas others don't. Some heuristic or framework-specific logic may be needed, and apparently the bots involved here didn't get it right. It needs to be fixed; it doesn't benefit bot authors either to download the same content thousands of times (a rough sketch of such de-duplication follows below).
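As an illustration of what crawler-side de-duplication could look like, here is a minimal sketch. The "ignorable" parameter list and the helper names are hypothetical; a real crawler would need per-site heuristics on top of this, as noted above.

```python
# Sketch: canonicalize URLs (drop tracking-style parameters, sort the rest)
# and hash fetched content, so the same page is not re-fetched under many
# superficially different URLs. Parameter names below are assumptions.
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that commonly do not change the served content (assumed list).
IGNORABLE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in IGNORABLE_PARAMS]
    query.sort()  # parameter order should not create a "new" URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))  # drop the fragment too

seen_urls: set[str] = set()
seen_content: set[str] = set()

def should_fetch(url: str) -> bool:
    """Skip URLs whose canonical form was already crawled."""
    canon = canonicalize(url)
    if canon in seen_urls:
        return False
    seen_urls.add(canon)
    return True

def record_content(body: bytes) -> bool:
    """Return False if this exact content was already seen under another URL."""
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_content:
        return False
    seen_content.add(digest)
    return True
```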
I didn't claim it was made up, I qualified the statement because I don't see the allegations being backed up with evidence. This doesn't mean the allegations are untrue, but it does mean you're taking someone else's word for it. It's logically important to use qualifiers when you can't verify something as fact. A lot of accusations are flying around, but unfortunately no one here, and not even the article, has provided much in terms of specific evidence. I think it's responsible to qualify these lines of reasoning. When you want to get to the bottom of something, specific details matter an awful lot! For example, just as I wrongly assumed your case involved the same bot as jujulu's, your case may not be the same as LWN's either. It's hard to be 100% confident in conclusions when no one has shared their evidence.
You’re assuming all of this.
This brings to mind a quote “Never attribute to malice that which is adequately explained by stupidity.”
https://en.wikipedia.org/wiki/Hanlon%27s_razor
I'm very glad you're willing to offer clarifications over your original post, and I'm not questioning your assertion that such activities are being done by bots. That much seems clear. However, since you aren't able to identify the source nor the source's motive, some of the original claims don't seem to be very defensible. This one in particular bothered me: "Yes, scraping the web for AI *does* use more resources than anything else. Easily 100 if not 1000 times more than Google and Bing combined."
I get that there's a lot of anger against AI companies, but I can't help but notice that many people are painting them all with the same brush, spreading accusations across all AI companies, even those that are being run responsibly.
I have two small websites with tens of pages each. One was hit about 200 000 times in the span of two and a half days by the ChatGPT bot, the other one about 130 000 times… doesn’t seem like there was a lot of “I” in that bot.
This has indeed become a pressing issue, and something will need to be done about it to keep a lot of internet communities alive. AI bots are just hammering things like internet forums and similar. The situation is not sustainable in the long run.
I'm using Cloudflare just for this problem. It's not perfect, but they have an option to detect AI crawlers and give them a cached version of your page, or you can create a rule and send them elsewhere (that's what I'm doing).
Fabio,
It's worth mentioning that the bots you mentioned earlier ("Bing and Google Search are the worst offenders here, AI second") observe robots.txt. Anyone having a problem can stop the crawling by adding the appropriate disallow sections (a minimal example follows the links below).
https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
https://www.eff.org/deeplinks/2023/12/no-robotstxt-how-ask-chatgpt-and-google-bard-not-use-your-website-training
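For reference, a minimal robots.txt along these lines might look like the following. The user-agent tokens are the ones documented in the pages linked above, but verify them against the vendors' current documentation before relying on them:

```
# Block the classic search crawlers entirely (they honor robots.txt):
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

# AI-training user agents that document robots.txt support:
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```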