From Google’s open source blog:
Today, we announced that we’re spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.
We’re here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
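For anyone curious what "matching rules" means in practice: the internet draft specifies a longest-match rule, where the Allow or Disallow line whose path prefix matches the most characters of the requested path wins, and Allow wins ties. Below is a minimal, self-contained C++ sketch of that idea. It ignores wildcards like `*` and `$` and is not Google's library; the `Rule` struct and `Allowed` function are made up for illustration.

```cpp
// Sketch of the REP's longest-match rule: the Allow/Disallow rule whose path
// prefix is the longest match for the requested path wins; Allow wins ties.
// This is NOT Google's parser, just an illustration of the matching idea.
#include <iostream>
#include <string>
#include <vector>

struct Rule {
  bool allow;        // true for Allow:, false for Disallow:
  std::string path;  // path prefix, e.g. "/private/"
};

// Returns true if `path` may be crawled under the given rules.
bool Allowed(const std::vector<Rule>& rules, const std::string& path) {
  long best_len = -1;
  bool best_allow = true;  // no matching rule means crawling is allowed
  for (const auto& r : rules) {
    // A rule matches when its path is a prefix of the requested path.
    if (path.compare(0, r.path.size(), r.path) == 0) {
      long len = static_cast<long>(r.path.size());
      // Longest match wins; on a tie, Allow beats Disallow.
      if (len > best_len || (len == best_len && r.allow)) {
        best_len = len;
        best_allow = r.allow;
      }
    }
  }
  return best_allow;
}

int main() {
  // Equivalent to:
  //   User-agent: *
  //   Disallow: /private/
  //   Allow: /private/public/
  std::vector<Rule> rules = {
      {false, "/private/"},
      {true, "/private/public/"},
  };
  std::cout << std::boolalpha;
  std::cout << Allowed(rules, "/index.html") << "\n";            // true
  std::cout << Allowed(rules, "/private/a.html") << "\n";        // false
  std::cout << Allowed(rules, "/private/public/a.html") << "\n"; // true
  return 0;
}
```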
Today I learned: Robots.txt was never a formal standard????
Yeah, it’s always been a de facto standard. The Wikipedia page has a quick blurb about the history, but it doesn’t conclusively settle the question either.
avgalen,
There are no formal standards above the link layer. Everything above that layer is an RFC, or a W3C (and now WHATWG) recommendation, or a googleamazonfacebookgramapp power play.
Despite the unofficial-looking name, an RFC is absolutely a formal standard. “The Internet Engineering Task Force (IETF) publishes its specifications as Requests for Comments (RFCs).” (Source: https://www.interlinknetworks.com/what-does-it-mean-to-be-rfc-compliant/)
https://www.robotstxt.org/norobots-rfc.txt mentions several RFCs, but isn’t an RFC itself.
It seems that the RFC is actually about the content and structure of the robots.txt file. The idea of robots.txt was indeed part of the standard; it just wasn’t well defined: https://www.w3.org/TR/html401/appendix/notes.html#h-B.4