Insight: After 20 Years, Google's robots.txt Parser Is Open Source
After 20 years, Google has open-sourced its robots.txt parser. The Robots Exclusion Protocol (REP), also known as robots.txt, is widely used by web developers. REP was proposed by Dutch software engineer Martijn Koster in 1994.
The Robots Exclusion Protocol (REP) is a standard that lets website owners control which URLs crawlers may access, using a simple text file with a specific syntax.
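To make the idea concrete, here is a minimal sketch of how a crawler might consult such a file. It uses Python's standard-library urllib.robotparser, not Google's C++ library, and the robots.txt content, the crawler names and the example.com URLs are all made up for illustration. Matching details can differ between parsers, so treat this as an illustration of the protocol rather than of Google's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (not from the article).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler falls under the "*" group: public pages are allowed,
# anything under /private/ is not.
print(parser.can_fetch("SomeCrawler", "https://example.com/index.html"))
print(parser.can_fetch("SomeCrawler", "https://example.com/private/data.html"))

# The made-up ExampleBot has its own group that blocks the whole site.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))
```

A well-behaved crawler would run a check like can_fetch before requesting each URL and skip the ones the site owner has disallowed.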
Part of the statement from Google:
Today, we announced that we’re spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files. We’re here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.