Parsing robots.txt Like Google

Simon Lesser October 7, 2019



Dragon Metrics is now using Google’s open-source robots.txt parser for all site crawls. This means that when a URL is marked as crawlable or non-crawlable in Dragon Metrics, Google interprets it the same way.

For many readers, this tl;dr version may be enough. But those interested in the nitty-gritty of crawling should read on.

How we got here

While a very basic de facto standard for robots.txt has existed for 25 years, the lack of a formally agreed-upon standard has led to a wild west of interpretations and implementations by both webmasters and search engines.

The basic rules were almost always interpreted the same way by most search engines. Webmasters could feel confident that a simple directive like “Disallow: /account” would always work as expected.
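As a rough illustration of that basic case, here is a minimal sketch using the robots.txt parser from Python's standard library (not Google's parser, and not the one Dragon Metrics uses). The example.com URLs and the is_crawlable helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt containing only the simple rule discussed above.
BASIC_ROBOTS_TXT = """\
User-agent: *
Disallow: /account
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: parse robots.txt text and check a single URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/account",           # blocked
    "https://example.com/account/settings",  # blocked: the rule is a path prefix
    "https://example.com/accounting",        # also blocked by the same prefix
    "https://example.com/about",             # not covered by the rule
):
    print(url, is_crawlable(BASIC_ROBOTS_TXT, url))
```

Because the rule is matched as a path prefix, it also catches /accounting, which is easy to overlook even in this simple case.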

But dozens of edge cases were treated differently (and opaquely) by each crawler, leading to great uncertainty among webmasters. Does Baidu support the wildcard character? Does Bing support the “noindex” directive? What about “crawl-delay”? Some search engines supported these rules, some didn’t. Worse still, most search engines did not thoroughly document their parsing rules, so in many cases the only way to learn which search engine supported which feature was painstaking trial and error.
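To make the ambiguity concrete, the sketch below feeds two of these directives to one particular parser, the one in Python's standard library. It only shows how that single implementation behaves and says nothing about how any search engine treats the same file. The file contents and the example.com URL are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that relies on two non-universal directives.
EDGE_CASE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Noindex: /private
"""

parser = RobotFileParser()
parser.parse(EDGE_CASE_ROBOTS_TXT.splitlines())

# This parser understands Crawl-delay and exposes the value.
crawl_delay = parser.crawl_delay("*")

# It does not know "Noindex", so the line is silently skipped and
# the URL is reported as crawlable, with no warning of any kind.
private_ok = parser.can_fetch("*", "https://example.com/private/page")

print("crawl delay:", crawl_delay)
print("/private/page crawlable:", private_ok)
```

A webmaster who wrote this file gets no signal that one line was honored and the other was dropped, which is the trial-and-error problem described above.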

Google’s summer surprise

Back in July, Google made a couple of major announcements. First, they proposed formalizing a robots.txt specification with the IETF. (The draft is available from the IETF.) If adopted by all major search engines and crawlers, the standard will finally remove the ambiguity about how certain rules are treated, and webmasters can rest assured that their rules will be treated the same way by all crawlers.

Second, they open-sourced their own robots.txt parser so that anyone can have access to the same code that Google uses to interpret a robots.txt file. This means any crawler using Google’s open-source parser will treat robots.txt exactly as Google does. And since the parser is based on the new standard, it will also interpret robots.txt the same way as any other search engine or crawler that adopts the standard.

What we did

Dragon Metrics built our own robots.txt parser close to 7 years ago, and it’s been working reasonably well ever since. But due to the standardization issues mentioned above, we had to make certain decisions that may or may not have matched every search engine.

Starting late last month, Dragonbot (Dragon Metrics' crawler) began using Google’s open-source parser for all site crawls.

What this means

For customers, this means that you can feel confident about Dragon Metrics' accuracy when determining the crawlability of a URL. For large sites with complex rules, you may also see your crawls complete a bit faster.

For us, this means much faster parsing, which will in turn make crawls faster and use fewer resources. Google has been using this code in production for almost 20 years. Considering how many URLs Google crawls per second, and that they need to hit this function for each one to see if it’s crawlable or not, you’d better believe Google has optimized this code to the absolute maximum.

It will also help deal with some of the facepalm-inducing edge cases we see from time to time. A couple of my personal favorites that have crashed our parser include:

A robots.txt file over 200 MB in size with more than 99% duplicate lines
A file with hundreds of inscrutable rules, including both "Disallow: /*/*/*/*/*/*/*/*/*/*/*" and "Allow: /*/*/webapp/*/*/*/*/*/*"

Now, Google’s parser handles these and other cases like a champ, and your crawls are better off for it.
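For the curious, here is a rough sketch of how a pair of rules like the wildcard example above can be resolved. It assumes the precedence described in the proposed standard: the longest matching pattern wins, and Allow wins a tie. It is a simplified illustration, not Google's code or ours, and the rule_matches and is_allowed helpers are hypothetical names.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """'*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, so an unanchored rule is a prefix match.
    return re.match(regex, path, re.DOTALL) is not None


def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs; the longest match wins."""
    best_len, allowed = -1, True  # no matching rule means the path is crawlable
    for directive, pattern in rules:
        if not pattern or not rule_matches(pattern, path):
            continue  # empty patterns are ignored in this sketch
        is_allow = directive == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("disallow", "/*/*/*/*/*/*/*/*/*/*/*"),
    ("allow", "/*/*/webapp/*/*/*/*/*/*"),
]

for path in (
    "/a/b/webapp/c/d/e/f/g/h/i/j",  # both rules match; the longer Allow wins
    "/a/b/shop/c/d/e/f/g/h/i/j",    # only the Disallow matches
    "/a/b/webapp/c",                # too shallow for either rule
):
    print(path, is_allowed(RULES, path))
```

A regex translation like this can backtrack heavily on long paths with many wildcards, so a production matcher would want a different matching strategy. That is part of why handing the job to a battle-tested parser is appealing.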

Happy crawling, and if you have any questions, let us know via live chat!