What do SourceHut, GNOME’s GitLab, and KDE’s GitLab have in common, other than all three of them being forges? Well, it turns out all three of them have been dealing with immense amounts of traffic from “AI” scrapers, which are effectively performing DDoS attacks with such ferocity that they’re bringing down the infrastructure of these major open source projects. Being open source, and thus publicly accessible, means these scrapers have unlimited access, unlike with proprietary projects.
These “AI” scrapers do not respect robots.txt, and they hammer so many expensive endpoints that it puts insane amounts of pressure on infrastructure. Of course, they use random user agents from an effectively infinite number of IP addresses. Blocking is a game of whack-a-mole you can’t win, and so the GNOME project is now using a rather nuclear option called Anubis, which aims to block “AI” scrapers with a heavy-handed approach that sometimes blocks real, genuine users as well.
The numbers are insane, as Niccolò Venerandi at Libre News details.
Over Mastodon, one GNOME sysadmin, Bart Piotrowski, kindly shared some numbers to let people fully understand the scope of the problem. According to him, in around two and a half hours they received 81k total requests, and out of those only 3% passed Anubis’ proof of work, hinting at 97% of the traffic being bots – an insane number!
↫ Niccolò Venerandi at Libre News
Fedora is another project dealing with these attacks, with infrastructure sometimes being down for weeks as a result. Inkscape, LWN, Frama Software, Diaspora, and many more – they’re all dealing with the same problem: the vast majority of the traffic to their websites and infrastructure now comes from attacks by “AI” scrapers. Sadly, there doesn’t seem to be a reliable way to defend against these attacks just yet, so sysadmins and webmasters are wasting a ton of time, money, and resources fending off the hungry “AI” hordes.
These “AI” companies are raking in billions and billions of dollars from investors and governments the world over, trying to build dead-end text generators while sucking up huge amounts of data and draining massive amounts of resources from, in this case, open source projects. If no other solutions can be found, the end game here could be that open source projects start to make their bug reporting tools and code repositories much harder and potentially even impossible to access without jumping through a massive number of hoops.
Everything about this “AI” bubble is gross, and I can’t wait for this bubble to pop so a semblance of sanity can return to the technology world. Until the next hype train rolls into the station, of course.
As is tradition.
If you are wondering who builds all those LLMs, it’s companies that want to offer them as a service through Amazon Bedrock. But since the pie can’t possibly be large enough to cover all that immense capital expense duplicated several times over, let’s see how many of those models still exist a few years from now.
BTW, the company I work for will start using Amazon Bedrock (they have started giving us training), and it’s not a bleeding-edge company (but not an archaic one either). You see, “fancy auto-complete”, much like regular auto-complete, is a productivity booster if used as an assist. Sorry they didn’t run it past you first, Thom.
But again, much like any “mania”, not all players will be around when the dust settles, so that will eventually solve the attack-of-the-myriad-AI-crawlers issue.
Public archives are getting hit too. You know, those things run by public libraries, universities and such. I’ve spent the better part of a week now trying to figure out how to deal with this traffic wave without having to block a significant percentage of the entire IP address space. And I’ve been in a little bit of contact with a few other public archive administrators that are seeing similar issues.
I recognize some of the desperate short term measures (rate-limiting on expensive endpoints, broad blocklists) that we’ve had to put in as well for now, and after I read that article yesterday we’re seriously considering putting something like Anubis in front of our search endpoint because it feels less draconian than anything else we’ve considered.
And yes, they are just hammering everything. Fortunately we’re not paying extra for bandwidth, but we are very much short on compute to deal with this amount of search traffic. We’d need to buy more servers to respond to it all, but it’s pointless and stupid if they’re just going to keep increasing traffic.
We’re linking to an updated, complete sitemap of the entire archive in our robots.txt that any serious indexer can use, so one can only guess at why they feel it’s necessary to hit the search endpoint so much (contrary to the disallow rule we’ve set). It’s probably because the search responses are seen as data they can use to train models as well. The good news is that this means putting a checker like Anubis in front of just the search endpoint should hit the scraper traffic square in the face without affecting search engine indexers.
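For illustration, the relevant parts of a setup like ours look roughly like this (the hostname and paths here are placeholders, not our actual configuration):

```
# robots.txt (illustrative placeholders)
Sitemap: https://archive.example.org/sitemap.xml

User-agent: *
Disallow: /search
```

A compliant indexer reads the sitemap and stays out of /search entirely; anything still hammering the search endpoint is, by definition, ignoring the rules.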
Thanks for wasting the tax kroner that go to my wage on this shit. (Just so we’re clear for any AI company representative reading this: That was a sarcastic statement, I actually hate you all and am not thanking you for anything. I have better things I could be spending my working hours doing. Like signing up for a mailing list that can tell me when we’re going to start the sysadmin revolution.)
Part of this is down to the fragmented nature of legacy IPv4 address space. Any ISP of a significant size will have blocks scattered randomly across the address space, providers like AWS allocate you random addresses all over the place and recycle them, anyone acquiring legacy address space is likely to have a fragmented mess too.
By contrast IPv6 is a lot cleaner, even a large ISP will get a single large allocation, so it’s much easier to block an entire ISP or an entire organisation with a single rule if necessary, and blacklisted address space will not end up recycled and going to someone who finds themselves blocked for no reason.
There’s also no CGNAT with v6, so you don’t get legitimate users coming from the same address as malware-infected machines or other malicious users.
It’s starting to sound like it’s time to drop IPv4 from the public Internet.
Drumhellar,
I’ve been awaiting IPv6 for decades, even before IPv6 Day. But I still don’t have it. The ISP monopoly doesn’t offer it, and IPv6 tests fail on my cell phone as well. IPv4 is my only option.
This probably sounds weird to people in countries around the world where I guess they’ve done a better job migrating to IPv6. Here in the US we do things differently: pay through the nose for less bandwidth and backwards-facing technology.
Technically I could get an IPv6 tunnel over IPv4, but… I consider that more of a proof of concept than a real solution.
This could be a big problem; however, I actually checked Drew’s websites, and most LLM bots are not being blocked by name, via “DisallowAITraining”, or with a catch-all.
https://sourcehut.org/robots.txt (404 Page not found)
https://sr.ht/robots.txt
I know managing robots.txt can be a hassle, but it seems like something Drew should be doing as a first step given his predicament. He should apply this list completely. It may or may not help, but at least it would show his intention to block them.
https://robotstxt.com/ai
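As a rough sketch of that first step, naming a few of the widely documented AI crawlers (deliberately incomplete; the page above maintains a fuller list, and the nonstandard DisallowAITraining directive can be added where supported):

```
# robots.txt additions blocking some known AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```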
Also, the article keeps calling them “AI scraper bots”, “AI crawlers”, and “AI scrapers”, but the author never identifies the sources and therefore can’t know that the bots are actually being used for AI; it’s purely an assumption. I don’t condemn the author for making assumptions, but it feels journalistically important to be up front about that and not to use words that imply something as fact when it hasn’t been shown to be one.
A technical solution to the bandwidth problem could be to have a few scrapers (obviously respecting robots.txt) that would then share the datasets over P2P.
Alfman,
I don’t think robots.txt is a viable solution to AI or other commercial scraping.
It was designed in an era when web crawlers would give something in return: being ranked in web searches, inclusion in product catalogs, or at least academic research that benefits everyone, and those crawlers had limited bandwidth anyway.
Today, the scrapers download the data but give nothing in return. If I make an AI model that is trained on GitHub or similar, it would not give any backlinks at all. In fact, it might even reduce the organic traffic that goes to those sites.
A new paradigm is necessary. Yes, a limited number of scrapers could be a start. But I would go one step further:
Let the sites sell their data. Yes, it does seem counterintuitive to sell open source, but even the GPL has precedent: you can actually charge for giving users a diskette.
In a similar manner a commercial entity (not research) should pay up. They should still have democratic access, so no choosing winners and losers. But it should at least cover bandwidth and hosting costs.
The problem with the “limited scrapers and common datasets” concept is that it can’t actually work in practice in such a competitive landscape as is AI training at the moment. Every company training an AI for general use is trying to get every little advantage they can. If you build a common dataset, they will of course use it. But a common, publicly available dataset would need to allow IP-holders to remove their content if they so wish. So of course every company will try to complement such a public dataset with data they’d scrape themselves, viewing this as an edge over their competitors. Or, in time, as necessary to reach the same point as them. And that, in turn, would still mean armies of bots crawling your websites for content that might not yet be included.
In the end, it would put pressure on website owners to relinquish their rights over the content they’ve published (allowing it to stay in the datasets) in exchange for less bots hitting their sites. Which would look kind of like a mafia deal to me :-/
sukru,
Of course it won’t be viable against bots that don’t respect it. But I see many sites haven’t set up robots.txt to block the bots of known AI companies, and it seems only fair to have those bots disallowed in robots.txt before accusing them of not respecting said robots.txt.
That era never really ended; site owners still want to be indexed. The problem may be that there are just too many darn bots. After all, it doesn’t take much these days for a random student to set up a bot for a research project and download terabytes onto a cheap hard disk.
I guess…I think this idea would face the same resistance as micro-transactions though. That is, even when someone is willing to pay for the content, the procedure to actually do so can be a disincentive with paywalls and payments becoming a nuisance even for those who can afford to pay.
I think if you consolidated the bots, you’d eliminate the bandwidth problem. Meanwhile distribution is the killer feature of P2P, which is famously known for efficiently letting someone on a dialup connection distribute songs and even movies to the masses. It’s proven technology that scales.
Another idea that’s not likely to gain traction, but is an interesting prospect, has been opening up the Google index…
https://www.osnews.com/story/130329/to-break-googles-monopoly-on-search-make-its-index-public/
Even if researchers had to pay for access it could still be worth it: cheaper, much less effort, faster results, and infinitely more comprehensive than trying to run your own bots. This would solve a lot of the problems, but of course it’s a competitive advantage that the companies don’t want to give up.
https://blog.cloudflare.com/ai-labyrinth/
a_very_dumb_nickname,
If you want to spam a bot, then do that. But if a bot is consuming too many resources and you want it to stop, this is the exact opposite of the solution.
I think what may be happening is that naive bots are already crawling useless git endpoints that waste the resources of both the website owner and the bot owner as well. The problem is the bot owners may have hit “go” and let it run. Hitting these expensive endpoints may not have been intentional at all.
Now I can imagine some people say they don’t care what the intention was, which is fair enough, but if we want to actually solve the problem, unintentional crawling loops in git probably do matter.
Instead of “ai-labyrinth” generating links…you’d actually want “ai-cloak”, removing links so that the bot stops.
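As a rough sketch of what I mean, here’s a filter that strips links to expensive or parameterised endpoints before a page is served to a suspected bot (the name “ai-cloak” is just my label, and the paths in EXPENSIVE_PREFIXES are hypothetical):

```python
# Hypothetical "ai-cloak" sketch: serve suspected bots the page with links to
# expensive or parameterised endpoints removed, so naive crawlers run out of
# URLs instead of looping. Paths in EXPENSIVE_PREFIXES are made up.
from html.parser import HTMLParser

EXPENSIVE_PREFIXES = ("/search", "/blame/", "/log/")

class LinkCloak(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith(EXPENSIVE_PREFIXES) or "?" in href:
                self.skip_anchor = True   # drop this <a ...> wrapper entirely
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.skip_anchor:
            self.skip_anchor = False
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)   # keep the visible link text, just not the href

def cloak(html: str) -> str:
    parser = LinkCloak()
    parser.feed(html)
    return "".join(parser.out)

print(cloak('<p>See <a href="/search?q=x">results</a> or <a href="/about">about</a>.</p>'))
# -> <p>See results or <a href="/about">about</a>.</p>
```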
That is actually a good idea.
But you need to be careful to discriminate actual users vs bots, which is not always easy.
You also want to remain permeable to “good” bots, like Google or Bing, which bad actors will want to impersonate.
sukru,
I know that’s the tough part. Probably need a honeypot.
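Something as simple as a disallowed trap path can work: nothing links to it, robots.txt forbids it, so anything that requests it anyway is a bot ignoring the rules. A minimal sketch (paths and filenames are made up):

```python
# Sketch of a robots.txt honeypot (all paths/names hypothetical).
# /trap/ is disallowed in robots.txt and linked nowhere visible,
# so anything requesting it is a crawler ignoring robots.txt.
from wsgiref.simple_server import make_server
import logging

logging.basicConfig(filename="trap_hits.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

ROBOTS = b"User-agent: *\nDisallow: /trap/\n"

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path == "/robots.txt":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [ROBOTS]
    if path.startswith("/trap/"):
        # Record the offender's address and claimed user agent.
        logging.info("trap hit: %s %r",
                     environ.get("REMOTE_ADDR"),
                     environ.get("HTTP_USER_AGENT"))
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"robots.txt violation logged\n"]
    # Placeholder for the real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```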
I once observed bingbot DDoSing a client’s website. I think it was a genuine bug, but I actually had to take steps to block Bing until they sorted things out. I reached out to the bingbot team and, to their credit, they fixed it fairly quickly. Now that there are hundreds of bots, though, I can appreciate how overwhelming it could be if they are misbehaving.
Alfman,
I actually know Google’s index.
I can easily say this is not a viable idea; it would be very expensive. Just passing on fair cloud instance and egress costs would bankrupt any small startup (even if they valued the data itself at zero, which they can’t).
I was one of those students, but even with a gigabit connection, it comes down to your local processing speed. Parsing and extracting links, getting rid of duplicates, keeping track of everything in a database, and processing the incoming files on a single node all put a limit on how much you can crawl.
(Unless you just want to make a “dumb” mirror with wget -r, which of course is a waste for most research projects.)
Yes, distribution can be discussed separately.
But for most AI projects, the valuable data is not scattered across the web, but concentrated on large “hubs” like Wikipedia, GitHub, or the New York Times. If you get access to the top 10 of these sites, you could build a reasonable AI system.
The rest has too much noise. What was the saying? “garbage, garbage out”.
sukru,
Well, I think we need a realignment of expectations. We’re talking about people and companies who are already willing and able to crawl the web at some scale today.
If they could get that same data from Google rather than running their own spiders, it would be more efficient and cost less than their current operation. Moreover, the consolidated bots would mean less website traffic. Opening up the index would be beneficial to everyone except Google. Obviously I get why Google would oppose granting access to their index data. But in principle, consolidating all the bots would be far more efficient and help reduce the footprint on websites.
It’s pretty cool what you can do if you’ve got the data, as I’m sure you know. I think there’s a lot of speedup to be had using GPGPU algorithms in fields of computer science that have traditionally used CPUs.
This assumes bots are used for AI, but unless it declares its purpose, a website operator doesn’t actually know what the data is going to be used for. I’ve been following recent US news about the Trump admin taking down a lot of public webpages to censor various topics. IMHO that could be a good reason to build up an index in order to monitor what’s disappearing. To this end the Internet Archive deserves tons of credit, but it might be safe to have other non-US entities archiving data as well for redundancy (both legal and technical).
Haha, I’m not sure if you omitted “in” on purpose.
> Everything about this “AI” bubble is gross, and I can’t wait for this bubble to pop so a semblance of sanity can return to the technology world.
Oh, I heard those words so many times at the beginning of the ’90s from my classmates about computers. And now they are all in some stupid low-paying jobs.
There are people who think only about money. There are better people, who also think about ethics.
That’s the exact attitude that led to the election of Trump (and similar politicians in Europe).
On the one hand, yes, the bots are being too aggressive in gathering the data; on the other, it’s a principle of FOSS to make the code available.
If these projects want to continue to adhere to that principle, they need to start thinking about how they can effectively share that data in the new era of AI and AI bots without affecting BAU. Maybe offering the content as a torrent to share the traffic load (that was popular for ISOs for a while) or similar.
If you provide a legitimate source that can scale to the demand, it ceases to be an issue.
Sort of. The principle of FOSS is to make your code available _to your paying customers_. And I think we can all agree that AI companies (or whoever) are not paying customers.
It’s been common practice to make FOSS code available to non-customers but some projects have periodically had good reasons not to do this. IBM arguably had one with companies like Oracle rebuilding RHEL from source and then releasing the product as their own, for instance.
That must be GNU. Look at any BSD-like license and you’ll see no such claims. Some projects might ask not to share their code with AI, not all of them, though.
a_very_dumb_nickname,
The GPL allows FOSS software to be sold; however, even if the software is sold, the GPL prescribes that the source code be made available without, in general, charging more for the source. The full section is quite long, but here’s a relevant snippet.
https://www.gnu.org/licenses/gpl-3.0.en.html#license-text
Some people may not like this, but the GPL specifically rejects any additional restriction on the work. Prohibiting the use of AI is not allowed under the GPL and the GPL explicitly states that any such restrictions can be ignored.
This doesn’t necessarily reflect my opinion of how things ought to work, I’m just going by what the license says.
The network server piece of this looks like a real oversight: a project is allowed to charge a reasonable price to distribute the source code on a physical medium, but not to charge in order to recoup hosting/access costs, which in this case are proving to be substantial.
I don’t know what happened when GNU was hashing out this license, but it sure looks like they assumed the costs of hosting software would always round down to $0.
Brainworm,
That’s a good point. But I don’t actually think source distribution is an expensive problem to solve per se. This seems to be a bad interaction with git software specifically. If we take git out of the equation, there’s a good chance the technical issues would go away.
Obviously something needs to be fixed. Everyone, including the author of Anubis, is reporting bots downloading the same resources over and over again, which to me seems more likely to be an unintentional side effect than a deliberate goal. My guess is that bots are following query parameters found in autogenerated links used for UI purposes. The problem is very real, but I think it behooves us to identify the root causes and fix those. Distributing the software code isn’t the root cause.
Adurbe,
I agree, if the problem with the bots is that they’re placing too much strain on a website, then P2P could be a solution. Frankly I think a P2P version of git could be really cool for this and other reasons.
It may be the case that git web interfaces inadvertently create deep yet useless crawling cycles that naive bot crawling algorithms follow. URL parameters could explain claims that bots are crawling the same content over and over, as in this illustrative example…
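Something like this (hypothetical URLs in the style of a typical git web UI):

```
/log/?h=main
/log/?h=main&ofs=50
/log/?h=main&ofs=50&sort=author
/log/?h=v1.0&ofs=100&sort=author&showmsg=1
```

With a handful of branches multiplied by offsets, sort orders, and display options, the same commit history becomes reachable under thousands of distinct URLs, and a naive crawler that follows every autogenerated link will happily fetch them all.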
This could have a very dramatic request-multiplying effect. Ideally there would be a good answer for this. Creating an open source crawler that behaves well could be a step in the right direction. Otherwise every bot author will keep stumbling on the same problems over and over.
I’d love to see git and build services go P2P for FOSS projects. Yes, much is distributed, but ultimately it resides with someone else with big pockets to pay. If someone launched a P2P git, it could change the whole market.
I talked about this *checks notes* 3 months ago: https://ap.nil.im/notice/AocddLiz38yvTgbxCK
But people played it down, ridiculed me or didn’t even believe me that this is happening at all, including on this very forum (e.g. Alfman back then refused to believe this even happens and basically called me a liar, IIRC).
I’m glad other people are now also reporting about this, so those who have been suffering from it for a long time are no longer being gaslit. Hopefully this changes the perception so that it’s recognized as a real problem, and a big one, and we can now finally get together to come up with proper solutions.
I wonder what would happen if all FOSS admins willingly shutdown their respective sites/servers for a day or a week. Would anyone get the “get lost bots” message? Probably not, right?
It would be as stupid as shutting down public transport for a week. The most inconvenienced group would be legitimate users.
Probably a lot of things will no longer happen in the open. Just release and dump a tarball somewhere; if you want to contribute, you need to mail some of the devs to get in.
In case anyone else is interested what Anubis is or how it works, the following blog post is useful:
https://xeiaso.net/blog/2025/anubis/
There are simple solutions to this.
Make downloading from a repo costly, e.g. by implementing a “proof of work” type challenge for everyone who wants to access your repo/site, with the proof of work getting progressively more “expensive” if you are downloading the whole repo. If everyone does this, it will raise the costs for the AI companies significantly, and they will be more judicious in how they download.
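A minimal sketch of such a challenge, assuming a hashcash-style scheme over SHA-256 (this illustrates the general idea rather than any particular implementation like Anubis; the difficulty numbers are arbitrary):

```python
# Hashcash-style proof-of-work sketch: the client must find a nonce whose
# hash of (challenge, nonce) has a given number of leading zero bits.
import hashlib
import secrets

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce meeting the difficulty."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a solution costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

# Difficulty could scale with how much a client has already pulled,
# e.g. one extra bit (2x the work) per N requests from the same session.
challenge = issue_challenge()
difficulty = 18                      # ~2^18 hashes of work on average
nonce = solve(challenge, difficulty)
assert verify(challenge, nonce, difficulty)
```

The server pays one hash per verification while the client pays roughly 2^difficulty hashes, so the cost can be ratcheted up per session as the request count grows.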
Another option is to require registration, and rate throttle specific users based on how much they use the resource.
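For the throttling part, a classic token bucket per registered user is one simple sketch (the limits and the in-memory dict are illustrative; a real deployment would key buckets to authenticated accounts and persist them):

```python
# Per-user rate throttling with a token bucket (illustrative parameters).
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float = 1.0        # tokens added per second
    capacity: float = 60.0   # maximum burst size
    tokens: float = 60.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def check_request(user_id: str, cost: float = 1.0) -> bool:
    """Return True if the request may proceed, False if it should get a 429."""
    bucket = buckets.setdefault(user_id, TokenBucket())
    return bucket.allow(cost)

# Expensive endpoints (search, blame, tarball generation) can simply charge
# a higher cost per hit than cheap static pages.
```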
It’s an even bigger problem for smaller open-source projects. The MidnightBSD site has been hit with 3 AI crawlers at once. None of them respect robots.txt. They often come from multiple IPs concurrently. They make assumptions that all sites are going to autoscale and live in AWS data centers. That is not the case for some of us.
laffer1,
Did they identify themselves as AI crawlers or is this an assumption?
I don’t see any AI bots disallowed, no DisallowAITraining directive, and a permissive default.
https://www.midnightbsd.org/robots.txt
I am really having trouble understanding why so many website owners that claim the bots aren’t respecting robots.txt haven’t actually disallowed known bots in robots.txt. Am I missing some key information that would clarify this?
TBH I wish all of these articles would publish actual log files as that would be more informative IMHO.
My websites get hit by lots of bots throughout the day, including search engines, exploit scans, AI bots, uptime bots, etc. On my sites, the traffic of known AI bots is paltry compared to other bots like Bing and Google. This all leaves me wondering why I’m not seeing the same “attacks” that others are seeing. What is it about some sites that would be responsible for a lot more bot activity? There must be a valid technical explanation for this, and I would like to uncover what that explanation is. I could try putting up a honeypot to gather information. Is there some specific application where you’ve noticed a concentration of bot hits?
Not speaking for MidnightBSD here, but as someone who administrates a public archive: They don’t, that’s part of the problem. They randomly spoof browser UAs, very clearly to evade filtering. We can see from some of their behaviour that they’re clearly bots, but that behaviour is difficult to filter on unless you put something like Anubis in front of the endpoint to test behaviours before letting them in. Is it a particularly wild assumption to be blaming the AI industry? I don’t think so, because the most obvious reason anyone is trawling the entire internet for data right now is to train generative models, and you can pretty easily find services that facilitate this demand, openly advertising scraping with features like residential proxies to specifically sidestep blocking and filtering. It’s infuriating. And why would they care about robots.txt if that’s their stated goal?
It’s pretty easy to see in the MidnightBSD robots.txt: They have a bunch of wildcard disallow rules on endpoints they don’t want any bots to visit. That can be because they’re expensive endpoints or just because they’re paths that for other reasons should not be crawled. Ours is a relatively compute-expensive search service endpoint that no bots should need to crawl because they can use the sitemap.
If you’re getting bot traffic on wildcard disallows then it’s bot traffic that doesn’t respect robots.txt. You don’t need specific AI disallow rules for that. There’s also Crawl-delay of course, though otherwise well-behaved bots will often disregard that one too.
I would guess most of us can’t give you logs because there’s such a thing as data privacy regulations. Do I know for absolute certain I can’t publish some of our log files? No, because IANAL. But I’m sure not going to risk it just to prove a point, especially here in the EU.
Public archives and open source repositories are places that can be of particular interest to data-scrapers, so I’m not surprised if such places get hit much harder than your average website. And both will often have expensive public endpoints (search functionality, git features, large image files) that are not supposed to be used by any bots. Talking as someone who administrates a publicly funded archive, we actually do have enough capacity and resources that we can serve most of the traffic if they follow the rules in our robots.txt, but because they’ve decided to absolutely hammer an expensive search endpoint that we’ve wildcard-disallowed, in a way that’s incredibly difficult to filter, it looks like we’re going to have to put in something like Anubis to ensure decent availability for the humans who’re supposed to be able to access it.
Also, if you happen to be behind Cloudflare, they may well have the capability at scale to mitigate this for their customers. But there are good reasons a lot of people and services might not want to (or be able to) use Cloudflare.
Book Squirrel,
Ok, but in that case then it’s problematic that everyone calls them “AI crawlers” without any verification. It may be tempting for people who don’t like AI to call them AI crawlers without verification, but then it spams unverified information into the echo chamber. I think it’s best to refrain from doing that.
No offense, but that’s an inherently prejudiced position. You are using that prejudice to draw conclusions in lieu of verifiable evidence. The issue I have with much of the reporting is that it’s the legal equivalent of hearsay, with everybody repeating what they hear without any of the sources providing log files that can be independently verified. It doesn’t mean it’s not true, but it still troubles me that news sources repeat conclusions as fact without supplying any of the evidence that led to those conclusions. I don’t feel well informed on this subject, in large part because nobody’s providing any evidence… just repeating stuff. I know I probably demand sources and evidence a lot more than most people would, but I do it because I think it’s important. I’m not comfortable repeating talking points just because everyone else is doing it.
Without seeing the logs it’s impossible to tell how many hits are going to those disallowed URLs. Nevertheless, take a closer look: The site appears to have many years of commits that are not disallowed. Here is one random week’s worth…
https://www.midnightbsd.org/pipermail/midnightbsd-cvs/Week-of-Mon-20180507/date.html
The vast majority of bots could crawl these pages in total compliance with the provided robots.txt. I don’t mean to put the spotlight on this one website; it’s just that I’m seeing the same thing over and over again with websites NOT disallowing bots.
I get it, but it’s also kind of ironic if you think about it. Someone doesn’t want to release their evidence for bots on account of them not being sure if actual people would be included.
I agree. I suspect this is probably what the problem is, however without seeing evidence it’s impossible for me to make any conclusions about it even though I would like to. Honestly it bothers me not knowing; I like to be informed but from my position this lack of public evidence makes it hard to distinguish between credible claims versus those that jump to conclusions.
I am not particularly concerned that we’re being too prejudiced against the AI industry considering the level of hype and venture capital they’ve managed to accrue. They’re not exactly a poor helpless underdog in all this. Based on previous patterns of traffic and their hunger for training data, they are the most likely culprit. And if anyone can take it it’s them.
It’s not impossible for those of us administrating the sites in question! We can see the logs! And analyse them!
And I may remain pseudonymous but others don’t. I was frankly kind of relieved to read and even directly hear from other people experiencing the same shit, it’s clearly not just us going crazy.
But if you don’t trust me or everyone else reporting these patterns then we can’t have a productive conversation, so I’ll just go and try to enjoy the rest of my weekend.
Book Squirrel,
I know you’re not concerned. My angle was never concern for AI, but rather concern for hard evidence and truth. IMHO just because something could be true doesn’t dismiss the importance of having evidence for it. That’s my hang up right now.
The problem is I don’t take things well on faith, especially when information that is public, like robots.txt, seems to partially contradict the message. While I do think some aspects are true, I’m critical of people letting their guard down to accept talking points without evidence simply because they like the narrative. I am forced to concede that you have more information than I do, but asking me to take it on faith that your analysis is scientific and unbiased is a step too far. That’s not how we become informed; instead it’s a recipe for echo chambers.
IMHO the rise of echo chambers devoid of evidence is one of the great dangers to society at large. It’s given rise to highly uninformed groups like flat earthers, mindless MAGA conspirators, etc., who have chosen blind faith over evidence and critical thinking. I know this is getting really off topic, but I bring this up to explain why I have an aversion to accepting information as fact without evidence to back it.