Reap Information from Websites
There’s no longer any need to fear getting at the gnarly bits of web pages. For the vast majority of web scraping tasks, the ‘rvest’ package does a phenomenal job providing just enough of what you need to get by. But if you want more of the details of the site you’re scraping, some handy shortcuts to the page elements in use, and the ability to not think too hard about serialization during scraping tasks, then you may be interested in reaping more than harvesting. Tools are provided to interact with web site content and metadata at a more granular level than ‘rvest’ but at a higher level than ‘httr’/‘curl’.
This is very much a WIP but there are enough basic features to let others kick the tyres and see what’s woefully busted or in need of attention.
The following functions are implemented:
reap_url: Read HTML content from a URL
mill: Turn a ‘reapr_doc’ into plain text without cruft
reapr: Reap Information from Websites
reap_attr: Reap text, names and attributes from HTML
reap_attrs: Reap text, names and attributes from HTML
reap_children: Reap text, names and attributes from HTML
reap_name: Reap text, names and attributes from HTML
reap_node: Reap nodes from a reaped HTML document
reap_nodes: Reap nodes from a reaped HTML document
reap_table: Extract data from HTML tables
reap_text: Reap text, names and attributes from HTML
add_response_url_from: Add a ‘reapr_doc’ response prefix URL to a data frame
devtools::install_git("https://git.sr.ht/~hrbrmstr/reapr")
# or
devtools::install_git("https://gitlab.com/hrbrmstr/reapr.git")
# or
devtools::install_github("hrbrmstr/reapr")
library(reapr)
library(hrbrthemes) # sr.ht/~hrbrmstr/hrbrthemes | git[la|hu]b.com/hrbrmstr/hrbrthemes
library(tidyverse) # for some examples only
# current version
packageVersion("reapr")
## [1] '0.1.0'
x <- reap_url("http://rud.is/b")
x
## Title: rud.is | "In God we trust. All others must bring data"
## Original URL: http://rud.is/b
## Final URL: https://rud.is/b/
## Crawl-Date: 2019-03-02 11:56:19
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 44.6 kB
## IP Address: 2602:ff16:3::4dfb:9ac5
## Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
## label[1], noscript[1], section[1], title[1],
## aside[2], nav[2], ul[2], img[3], style[5],
## input[6], article[8], time[8], footer[9], h1[9],
## header[9], p[10], meta[20], li[21], div[28],
## script[39], span[49], link[51], a[85]
## # Comments: 17
## Total Request Time: 0.213s
The formatted object print-output shows much of what you get with a reaped URL.
reapr::reap_url():
- Uses httr::GET() to make web connections and retrieve content. This enables it to behave more like an actual (non-JavaScript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
- Once the httr::response object is obtained, many tasks are performed, including:
  - extraction of the parsed (xml_document) HTML
  - extraction of the page <title> (if any)
  - inclusion of the httr::response object with the returned object
Finally, it works with other package member functions to check the validity of the parsed xml_document and auto-regen the parse (since it has the full content available to it) prior to any other operations. This also makes the reapr_doc object serializable without having to spend your own cycles on that.
If you need more or need the above in different ways please file issues.
Unlike rvest with its magical and wonderful html_table(), reapr provides more raw control over the content of <table> elements. Let’s look at the “population change over time” table from the Wikipedia page on the demography of the UK (https://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom):
x <- reap_url("https://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom")
reap_node(x, ".//table[contains(., 'Intercensal')]") %>%
reap_table()
## # A tibble: 18 x 8
## V1 V2 V3 V4 V5 V6 V7 V8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Intercens… Populationat … Average annua… Average annu… Average annu… Average annu… Average annu… Populationdensityat…
## 2 Intercens… Populationat … Overallchange Births Deaths Net naturalc… Netmigration* Populationdensityat…
## 3 1851–1861 27,368,800 154,910 Unknown Unknown Unknown Unknown 87
## 4 1861–1871 28,917,900 256,680 Unknown Unknown Unknown Unknown 92
## 5 1871–1881 31,484,700 344,980 Unknown Unknown Unknown Unknown 100
## 6 1881–1891 34,934,500 286,790 Unknown Unknown Unknown Unknown 111
## 7 1891–1901 37,802,400 373,580 Unknown Unknown Unknown Unknown 120
## 8 1901–1911 38,237,000 385,000 1,091,000 624,000 467,000 −82,000 156
## 9 1911–1921 42,082,000 195,000 975,000 689,000 286,000 −92,000 172
## 10 1921–1931 44,027,000 201,000 824,000 555,000 268,000 −67,000 180
## 11 1931–1951 46,038,000 213,000 793,000 603,000 190,000 22,000 188
## 12 1951–1961 50,225,000 258,000 839,000 593,000 246,000 12,000 205
## 13 1961–1971 52,807,000 312,000 962,000 638,000 324,000 −12,000 216
## 14 1971–1981 55,928,000 42,000 736,000 666,000 69,000 −27,000 229
## 15 1981–1991 56,357,000 108,000 757,000 655,000 103,000 5,000 231
## 16 1991–2001 57,439,000 161,000 731,000 631,000 100,000 61,000 235
## 17 2001–2011 59,113,000 324,000 722,000 588,000 134,000 191,000 242
## 18 2011–2021 63,182,000 N/A N/A N/A N/A N/A 259
As you can see, it doesn’t do the cleanup work for you and has no way to even say there’s a header. That’s because you can do that with rvest::html_table(). The equivalent reapr function gives you the raw table and handles colspan and rowspan insanity by adding the missing cells and filling in the gaps. You can use docxtractr::assign_colnames() to make a given row the column titles, docxtractr::mcga() or janitor::clean_names() to turn them into proper R names, and then readr::type_convert() to finish the task.
While that may seem overkill for this example (it is), it wouldn’t be if the table were more gnarly (I’m working on an example for that which will replace this one when it’s done).
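The cleanup steps described above can be sketched roughly as follows. To keep it runnable offline, the `raw` data frame here is a small hand-built stand-in (an assumption, with simplified header text) for what reap_table() returns; it is not actual package output. It also assumes the docxtractr, janitor and readr packages are installed.

```r
library(docxtractr)
library(janitor)
library(readr)

# Hypothetical stand-in for a reaped table: all columns are character,
# and the header text sits in the first data row.
raw <- data.frame(
  V1 = c("Intercensal period", "1851–1861", "1861–1871"),
  V2 = c("Population at start", "27,368,800", "28,917,900"),
  V3 = c("Density", "87", "92"),
  stringsAsFactors = FALSE
)

# Promote row 1 to column names (the row is dropped by default)
tidy <- assign_colnames(raw, 1)

# Turn the header text into syntactic, snake_case R names
tidy <- clean_names(tidy)

# Let readr guess and convert the column types
tidy <- type_convert(tidy)

str(tidy)
```

A real table with two header rows, "Unknown" or "N/A" markers, or Unicode minus signs will likely need extra handling (for example the `na` argument of readr::type_convert()) before the numeric columns convert cleanly.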
For truly gnarly tables you can get an overview of the structure (without the data frame conversion):
reap_node(x, ".//table[contains(., 'Intercensal')]") %>%
reap_table(raw = TRUE) -> raw_tbl
raw_tbl
## <table class='wikitable...'>
## <row (noattrs)>
## <cell rowspan='2'>
## <cell rowspan='2'>
## <cell colspan='5'>
## <cell rowspan='2'>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell style='backgroun...' class='unknown t...'>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <row (noattrs)>
## <cell (noattrs)>
## <cell (noattrs)>
## <cell data-sort-value='' style='backgroun...' class='table-na'>
## <cell data-sort-value='' style='backgroun...' class='table-na'>
## <cell data-sort-value='' style='backgroun...' class='table-na'>
## <cell data-sort-value='' style='backgroun...' class='table-na'>
## <cell data-sort-value='' style='backgroun...' class='table-na'>
## <cell (noattrs)>
And work with the list it gives back (which contains all the HTML element attributes as R attributes so you can pull data stored in them if need be).
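As a rough illustration of pulling values out of R attributes, here is a minimal base R sketch. The `mock_tbl` object and the `cell_attr()` helper are hypothetical: the layout (a list of rows, each a list of cells carrying HTML attributes as R attributes) is an assumption made for the example and may not match the exact structure reap_table(raw = TRUE) returns, so inspect the real object with str() first.

```r
# Hypothetical stand-in for a raw table: a list of rows, each a list of
# cells, with HTML attributes attached as R attributes (assumed layout).
mock_tbl <- list(
  list(
    structure("Intercensal period", rowspan = "2"),
    structure("Average annual", colspan = "5")
  ),
  list("Births", "Deaths")
)

# Hypothetical helper: fetch one attribute from a cell, NA if absent
cell_attr <- function(cell, name) {
  value <- attr(cell, name, exact = TRUE)
  if (is.null(value)) NA_character_ else value
}

# Build one data frame row per cell recording its span attributes
spans <- do.call(rbind, lapply(seq_along(mock_tbl), function(i) {
  cells <- mock_tbl[[i]]
  data.frame(
    row = i,
    cell = seq_along(cells),
    rowspan = vapply(cells, cell_attr, character(1), name = "rowspan"),
    colspan = vapply(cells, cell_attr, character(1), name = "colspan"),
    stringsAsFactors = FALSE
  )
}))

spans
```

The same pattern works for any other attribute name (for instance a `style` or `data-sort-value` value), provided the real object stores it the way this mock assumes.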
Lang | # Files | (%) | LoC | (%) | Blank lines | (%) | # Lines | (%) |
---|---|---|---|---|---|---|---|---|
R | 13 | 0.81 | 417 | 0.87 | 155 | 0.73 | 250 | 0.69 |
Rmd | 1 | 0.06 | 44 | 0.09 | 53 | 0.25 | 110 | 0.30 |
C | 2 | 0.12 | 17 | 0.04 | 5 | 0.02 | 4 | 0.01 |
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.