Parse and Test Robots Exclusion Protocol Files and Rules
The ‘Robots Exclusion Protocol’ (https://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the ‘rep-cpp’ (https://github.com/seomoz/rep-cpp) C++ library for processing these ‘robots.txt’ files.
The following functions are implemented:
can_fetch
: Test URL paths against a robxp robots.txt object
crawl_delays
: Retrieve all agent crawl delay values in a robxp robots.txt object
print.robxp
: Custom printer for ‘robxp’ objects
robxp
: Parse a ‘robots.txt’ file & create a ‘robxp’ object
sitemaps
: Retrieve a character vector of sitemaps from a parsed robots.txt object

    install.packages("spiderbar", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
    # or
    remotes::install_git("https://git.rud.is/hrbrmstr/spiderbar.git")
    # or
    remotes::install_git("https://git.sr.ht/~hrbrmstr/spiderbar")
    # or
    remotes::install_gitlab("hrbrmstr/spiderbar")
    # or
    remotes::install_bitbucket("hrbrmstr/spiderbar")
    # or
    remotes::install_github("hrbrmstr/spiderbar")
NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.
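One way to satisfy that requirement is to check for a package before calling it. The sketch below is illustrative only: `ensure_pkg` is a hypothetical helper name, not part of spiderbar or {remotes}, and it assumes a default CRAN mirror is configured for `install.packages()`.

```r
# Hypothetical helper (not part of spiderbar or remotes): install a package
# only when it is not already available, then report whether it can be loaded.
ensure_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumes a CRAN mirror is configured via options("repos")
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage before any remotes::install_*() call:
# ensure_pkg("remotes")
```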
    library(spiderbar)
    library(robotstxt)

    # current version
    packageVersion("spiderbar")
    ## [1] '0.2.3'

    # use helpers from the robotstxt package
    rt <- robxp(get_robotstxt("https://cdc.gov"))
    print(rt)
    ## <Robots Exclusion Protocol Object>

    # or
    rt <- robxp(url("https://cdc.gov/robots.txt"))

    can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
    ## [1] TRUE
    can_fetch(rt, "/_borders", "*")
    ## [1] FALSE

    gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))
    can_fetch(gh_rt, "/humans.txt", "*") # TRUE
    ## [1] TRUE
    can_fetch(gh_rt, "/login", "*") # FALSE
    ## [1] TRUE
    can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
    ## [1] TRUE
    can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
    ## [1] TRUE TRUE TRUE

    crawl_delays(gh_rt)

    imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))
    crawl_delays(imdb_rt)

    sitemaps(imdb_rt)
    ## character(0)
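The examples above depend on live sites, so their results can change whenever those sites edit their robots.txt files. As a rough offline sketch, the same functions can be run against rules supplied as a character vector. The rules, paths, and the `example.com` sitemap URL below are made up for illustration and do not come from any real site.

```r
library(spiderbar)

# Hypothetical robots.txt content, supplied inline so no network access is needed
robots_lines <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Crawl-delay: 5",
  "",
  "Sitemap: https://example.com/sitemap.xml"
)

# robxp() accepts a character vector as well as a connection
rt <- robxp(robots_lines)

# can_fetch() takes a vector of paths, so it can be used to filter a crawl list
paths <- c("/public/index.html", "/private/data.html")
allowed <- can_fetch(rt, paths, "*")
paths[allowed]

# Inspect the remaining metadata parsed from the rules
crawl_delays(rt)
sitemaps(rt)
```

Keeping the rules inline like this may be handy for checking a crawl list against a draft robots.txt before it is published.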