If you don’t want OpenAI’s, Apple’s, Google’s, or other companies’ crawlers sucking up the content on your website, there isn’t much you can do. They generally don’t care about the venerable robots.txt, and while people like Aaron Swartz were legally bullied into suicide for downloading scientific articles using a guest account, corporations are free to take whatever they want, permission or no. If corporations don’t respect us, why should we respect them?
There are ways to fight back against these scrapers, and the latest is especially nasty in all the right ways.
This is a tarpit intended to catch web crawlers. Specifically, it’s targeting crawlers that scrape data for LLMs – but really, like the plants it is named after, it’ll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links, that simply go back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
↫ ZADZMO.org
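To make the mechanism a little more concrete, here’s a rough sketch of the same idea in Python. This is not Nepenthes itself, just an illustration of the trick: seed a random generator with the request path so every URL deterministically produces the same page, fill that page with links deeper into the maze, and sleep before responding. Every name and parameter below is mine, chosen purely for illustration.

```python
# Minimal sketch of the tarpit idea described above; not Nepenthes itself.
# Pages are generated deterministically from the request path (so they look
# like static files that never change), each page links back into the maze,
# and every response is deliberately slowed down.
import hashlib
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LINKS_PER_PAGE = 30   # "dozens of links" back into the tarpit
DELAY_SECONDS = 2.0   # intentional delay: waste the crawler's time, spare your CPU

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Seed the RNG with a hash of the path: the same URL always yields
        # the same page, so the maze appears to be an unchanging flat-file site.
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        rng = random.Random(seed)

        links = "\n".join(
            f'<a href="/{rng.getrandbits(64):016x}/">page {i}</a><br>'
            for i in range(LINKS_PER_PAGE)
        )
        body = f"<html><body>{links}</body></html>".encode()

        time.sleep(DELAY_SECONDS)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```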
You really have to know what you’re doing when you set up this tool. It is intentionally designed to cause harm to LLM web crawlers, but it makes no distinction between LLM crawlers and, say, search engine crawlers, so it will definitely get you removed from search results. On top of that, because Nepenthes is designed to feed LLM crawlers what they’re looking for, they’re going to love your servers and thus spike your CPU load constantly. I can’t reiterate enough that you should not be using this if you don’t know what you’re doing.
Setting it all up is fairly straightforward, but of note is that if you want to use the Markov generation feature, you’ll need to provide your own corpus for it to feed from. None is included, so that every Nepenthes installation ends up different and unique, shaped by whatever corpus its operator chooses. You can use whatever texts you want, like Wikipedia articles, royalty-free books, open research corpora, and so on. Nepenthes will also provide you with statistics to see what cats you’ve dragged in.
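If you’re wondering what the Markov-babble part amounts to, it’s the classic word-level Markov chain trick: index the corpus by short runs of words, then walk the chain to emit endless, statistically plausible nonsense. Here’s a toy sketch of that technique, with a placeholder corpus.txt and made-up parameters; it is not Nepenthes’ actual generator.

```python
# Toy word-level Markov babble generator, fed from whatever corpus you supply.
# Illustrative only; file names and parameters are placeholders.
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=200, seed=None):
    """Emit `length` words of plausible-looking nonsense; seedable for determinism."""
    rng = random.Random(seed)
    key = rng.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length - len(out)):
        followers = chain.get(key)
        if not followers:                      # dead end: jump to a random state
            key = rng.choice(list(chain.keys()))
            followers = chain[key]
        out.append(rng.choice(followers))
        key = tuple(out[-len(key):])
    return " ".join(out)

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:   # your own corpus here
        chain = build_chain(f.read())
    print(babble(chain, length=100, seed="page-1"))
```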
You can use Nepenthes defensively to prevent LLM crawlers from reaching your real content, while also collecting the IP ranges of the crawlers so you can start blocking them. If you’ve got enough bandwidth and horsepower, you can also opt to use Nepenthes offensively, and you can have some real fun with this.
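The defensive side is essentially log mining: watch which addresses keep falling into the tarpit and feed them into your blocklist. Something along these lines would do, assuming a standard nginx access log, a tarpit mounted under a prefix of your choosing, and an arbitrary hit threshold; none of these specifics come from Nepenthes itself.

```python
# Harvest IPs that keep hitting the tarpit and emit nginx "deny" rules.
# The log path, log format, URL prefix, and threshold are all assumptions.
import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"   # adjust to your setup
TARPIT_PREFIX = "/nepenthes/"            # wherever you mounted the tarpit
THRESHOLD = 100                          # hits before we call it a crawler

line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')
hits = Counter()

with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = line_re.match(line)
        if m and m.group(2).startswith(TARPIT_PREFIX):
            hits[m.group(1)] += 1

for ip, count in hits.most_common():
    if count >= THRESHOLD:
        print(f"deny {ip};  # {count} tarpit hits")
```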
Let’s say you’ve got horsepower and bandwidth to burn, and just want to see these AI models burn. Nepenthes has what you need:
Don’t make any attempt to block crawlers with the IP stats. Put the delay times as low as you are comfortable with. Train a big Markov corpus and leave the Markov module enabled, set the maximum babble size to something big. In short, let them suck down as much bullshit as they have diskspace for and choke on it.
↫ ZADZMO.org
In a world where we can’t fight back against LLM crawlers in a sensible and respectful way, tools like these are exactly what we need. After all, the imbalance of power between us normal people and corporations is growing so insanely out of proportion that we don’t have much choice but to attempt to burn it all down with more… destructive methods. I doubt this will do much to stop LLM crawlers from taking whatever they want without consent – as I’ve repeatedly said, Silicon Valley does not understand consent – but at least it’s joyfully cathartic.