If you don’t want OpenAI’s, Apple’s, Google’s, or other companies’ crawlers sucking up the content on your website, there isn’t much you can do. They generally don’t care about the venerable robots.txt, and while people like Aaron Swartz were legally bullied into suicide for downloading scientific articles using a guest account, corporations are free to take whatever they want, permission or no. If corporations don’t respect us, why should we respect them?
There are ways to fight back against these scrapers, and the latest is especially nasty in all the right ways.
This is a tarpit intended to catch web crawlers. Specifically, it’s targeting crawlers that scrape data for LLMs – but really, like the plants it is named after, it’ll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply go back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
↫ ZADZMO.org
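To make the mechanism concrete, here is a minimal sketch of the idea in Python – not Nepenthes’ actual code, just an illustration of the trick described above: page content is derived deterministically from a hash of the requested path (so every URL looks like an unchanging flat file), every link leads deeper into the maze, and a configurable delay wastes the crawler’s time. The port and tuning knobs below are made up.

    import hashlib, random, time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    DELAY_SECONDS = 5     # assumed knob: how long to keep the crawler waiting
    LINKS_PER_PAGE = 20   # assumed knob: how many links lead deeper into the maze

    class Tarpit(BaseHTTPRequestHandler):
        def do_GET(self):
            # Seed the RNG from the path, so the same URL always yields the same page.
            seed = int.from_bytes(hashlib.sha256(self.path.encode()).digest()[:8], "big")
            rng = random.Random(seed)
            links = "".join(
                f'<a href="/{rng.getrandbits(64):016x}/">more</a> '
                for _ in range(LINKS_PER_PAGE)
            )
            time.sleep(DELAY_SECONDS)  # stall the crawler; the CPU sits idle meanwhile
            body = f"<html><body>{links}</body></html>".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), Tarpit).serve_forever()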
You really have to know what you’re doing when you set up this tool. It is intentionally designed to cause harm to LLM web crawlers, but it makes no distinction between LLM crawlers and, say, search engine crawlers, so it will definitely get you removed from search results. On top of that, because Nepenthes is designed to feed LLM crawlers what they’re looking for, they’re going to love your servers and thus spike your CPU load constantly. I can’t reiterate enough that you should not be using this if you don’t know what you’re doing.
Setting it all up is fairly straightforward, but of note is that if you want to use the Markov generation feature, you’ll need to provide your own corpus for it to feed from. None is included, so that every installation of Nepenthes ends up different and unique, since each user picks their own corpus. You can use whatever texts you want, like Wikipedia articles, royalty-free books, open research corpora, and so on. Nepenthes will also provide you with statistics to see what cats you’ve dragged in.
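The Markov part is conceptually simple: build a table of which words follow which word pairs in your corpus, then walk it randomly. A toy sketch, assuming a plain-text corpus file of your own choosing (the filename is made up, and Nepenthes’ own generator will of course differ):

    import random
    from collections import defaultdict

    def train(text, order=2):
        # Map each sequence of `order` words to the words seen after it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def babble(chain, length=200):
        key = random.choice(list(chain.keys()))
        out = list(key)
        for _ in range(length):
            nxt = random.choice(chain.get(key, [random.choice(out)]))
            out.append(nxt)
            key = tuple(out[-len(key):])
        return " ".join(out)

    if __name__ == "__main__":
        corpus = open("my_corpus.txt", encoding="utf-8").read()  # your own texts
        print(babble(train(corpus)))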
You can use Nepenthes defensively to prevent LLM crawlers from reaching your real content, while also collecting the IP ranges of the crawlers so you can start blocking them. If you’ve got enough bandwidth and horsepower, you can also opt to use Nepenthes offensively, and you can have some real fun with this.
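What that blocking step looks like depends on your stack. As a purely hypothetical sketch – the log format and file names below are invented, not Nepenthes’ actual statistics output – you could collapse the collected IPs into /24 networks and emit nginx deny rules:

    import ipaddress
    from collections import Counter

    THRESHOLD = 100  # assumed: only block networks that hit the tarpit this often

    hits = Counter()
    with open("tarpit_ips.log", encoding="utf-8") as f:   # one IP per line (assumed format)
        for line in f:
            ip = line.strip()
            if ip:
                hits[ipaddress.ip_network(f"{ip}/24", strict=False)] += 1

    with open("blocklist.conf", "w", encoding="utf-8") as out:
        for net, count in hits.most_common():
            if count >= THRESHOLD:
                out.write(f"deny {net};  # {count} hits\n")

The resulting blocklist.conf can be pulled into an nginx server block with an include directive; a firewall-level setup would want ipset or nftables rules instead.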
Let’s say you’ve got horsepower and bandwidth to burn, and just want to see these AI models burn. Nepenthes has what you need:
Don’t make any attempt to block crawlers with the IP stats. Put the delay times as low as you are comfortable with. Train a big Markov corpus and leave the Markov module enabled, set the maximum babble size to something big. In short, let them suck down as much bullshit as they have diskspace for and choke on it.
↫ ZADZMO.org
In a world where we can’t fight back against LLM crawlers in a sensible and respectful way, tools like these are exactly what we need. After all, the imbalance of power between us normal people and corporations is growing so insanely out of all proportion that we don’t have much choice but to attempt to burn it all down with more… destructive methods. I doubt this will do much to stop LLM crawlers from taking whatever they want without consent – as I’ve repeatedly said, Silicon Valley does not understand consent – but at least it’s joyfully cathartic.
This is quite an old (by Internet standards) idea. There was a script at least as far back as the mid-to-late 90s called “webpoison,” which would do the exact same thing – endlessly generate random links to pages with bogus content, as a trap for bots (in that case, for the email address-harvesting bots used by spammers).
What?? This is a complete waste of computational power and energy. You’re burning fossil fuels (and money) to run it and continually generate a tarpit, and also the spider is doing the same… forever
Any semi-competent admin can get the IP of the crawlers in moments and block them, or make a rule that does it dynamically.
This is wasting finite world resources to “stick it to the man” for 0 material gain.
Also in the US at least, business internet accounts pay by the amount of data uploaded. Huge waste of money to use something like this.
Just basic text and a very simple page, as far as I can tell. Shouldn’t blow anyone’s budget.
Shiunbird,
It’s dynamically generated. I’ve written a program that does the same thing they’re doing. There’s really not much to it. The power consumption per page shouldn’t be that high, but naturally it has to be measured against its usefulness.
Alas, I don’t think this is very useful. The output is so consistent that you don’t even need a NN to detect it (NNs are excellent classifiers, BTW); every generated page can be detected by a classic regex. For something like this to even have a chance, its output needs to become a lot more realistic. Hypothetically they could use a real LLM to generate more realistic output. Content that’s only subtly fake is genuinely harder to detect, but that would increase the expense and use a lot more power. You also end up chasing conflicting goals, because the better the output, the less harm the fake website can do in LLM training.
People running this may feel good about themselves, but it does not seem likely that this project will serve its purpose as it stands.
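To illustrate the kind of detection the commenter means – this is a rough, hypothetical heuristic, not anything crawler operators are known to use – a page that is little more than a pile of links to random-looking paths, with hardly any visible prose, is easy to flag:

    import re

    LINK_RE = re.compile(r'href="/[0-9a-f]{12,}/?"')  # random-looking hex paths (assumed pattern)

    def looks_like_tarpit(html: str) -> bool:
        links = LINK_RE.findall(html)
        text = re.sub(r"<[^>]+>", " ", html)           # strip tags, keep visible text
        words = re.findall(r"[A-Za-z]{3,}", text)
        # Many uniform links and little real prose is a strong tell; thresholds are invented.
        return len(links) >= 10 and len(words) < 5 * len(links)

    if __name__ == "__main__":
        sample = "<html><body>" + '<a href="/deadbeefdeadbeef/">more</a> ' * 20 + "</body></html>"
        print(looks_like_tarpit(sample))   # True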
Don’t be silly – this runs in a simple Docker container. I’m quite certain you could run one with a Raspberry Pi or equivalent. Your average smart refrigerator or smart doorbell probably wastes more computational power and energy.
Whenever someone tries to do something new, there are people who want to see it burn. It’s gonna be OK.
Whenever someone tries to steal all the intellectual property in the world and sell it for tens of trillions on the stonks markets, there are people who want to see it burn. It’s gonna be OK.
Fixed it for you.
What?
Tarpit traps, though a neat idea in theory, didn’t work out for email spam, and they’re not likely to work out here either.
1) The bandwidth costs far outweigh the costs of keeping a TCP socket open for a few seconds.
2) You’re probably paying a lot more for your CPU & bandwidth usage than they do at scale, and they have more money.
3) If lots of sites started doing this, the bot operators could ban sites that consist of tarpit traps without harming their own interests.
4) The more effectively you hide your fake content inside a real site, the more likely you are to create negative consequences for real users & search engines.
It’s fundamentally difficult to create a website that’s open for public consumption by everything except AI training. Even if something is a bot, you have no way of knowing its purpose unless it truthfully declares it. A search engine or internet archive bot downloads the same data – should those be blocked too? It’s kind of naive to think we can trick some of them but not others.
Why should anyone care about robots.txt? What legal power does robots.txt hold? If you make content available on the web, I have the right to make transient copies of it to my hard drive under fair use. You consented to your data being used that way when you made it available on the web. As long as I don’t distribute any copies, I am legally in the clear. I hate all these proposals to extend copyright in the name of fighting LLMs. Poisoning LLMs is fair game, though, since it doesn’t require any expansion of copyright.
Such duplication of content will get you downranked in search engines. Your best bet is to hide all this babble in pages that robots.txt has marked as “not suitable for crawling” (which is the real use of robots.txt, btw) and hope that AI scrapers will keep ignoring robots.txt while Google Search keeps honoring it. But that’s not guaranteed, obviously.
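For example, assuming the babble is served under a dedicated path (the path name here is purely illustrative), the robots.txt entry would be:

    User-agent: *
    Disallow: /babble/

Well-behaved crawlers skip /babble/ entirely; anything that wanders in anyway has, by definition, ignored the notice.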
kurkosdr,
I get why people dislike AI, but I agree with you that copyright protects specific expressions and not strictly the general ideas they are expressing. If an LLM can be shown to duplicate the original work, then it should be subject to copyright infringement just like any other infringement case. But if it doesn’t duplicate the original… then traditional copyright law allows this kind of generalization, and this is intentional! LLMs just happen to be able to generalize exceptionally well, but doing it is nothing new. Heck, newspapers, magazines, television shows, comedians, etc. were doing this long before AI.
I am extremely wary of expanding the scope of copyrights even further. I think this would only give corporations like Disney more ammo to claim ownership of ideas above and beyond Disney’s expressions of those ideas.
In my experience, Google is good about honoring robots.txt. The one I’ve had problems with is bingbot. Of course, it’s just a voluntary spec.
This was a long time ago and I don’t remember the source, but there was an article or paper about how robots.txt ended up giving Google unfair monopoly advantages. There are many websites that only allow Google and “disallow” everything else. It’s not that the robots.txt standard itself is biased, but it ends up amplifying existing market biases, and those in the long tail get screwed by it.
https://stackoverflow.com/questions/56049660/how-to-exclude-all-robots-except-googlebot-and-bingbot-with-both-robots-txt-and
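The pattern in question looks roughly like this (an empty Disallow means “allow everything”):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /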
It kind of sucks that those who want to be “responsible” by obeying robots.txt are rewarded with a losing hand through no fault of their own. This makes me wonder if any of them are following googlebot’s rules just to make things more fair. I’m sure this will be criticized, but robots.txt is voluntary anyway.