Use a JavaScript-enabled browser context to read and render HTML from a URL.

hu_read_html(
  url,
  emulate = c("best", "chrome", "firefox", "ie"),
  ret = c("html_document", "text"),
  js_delay = 2000L,
  timeout = 30000L,
  ignore_ssl_errors = TRUE,
  enable_dnt = FALSE,
  download_images = FALSE,
  options = c("RECOVER", "NOERROR", "NOBLANKS")
)

Arguments

url

URL to retrieve

emulate

browser to emulate; one of "best", "chrome", "firefox", "ie"

ret

what to return; if "html_document" (the default), the HTML produced by the HtmlUnit emulated browser context is passed to xml2::read_html() and an xml2 html_document/xml_document is returned. Note that xml2/libxml2 processes the HTML further, so the result is not exactly what HtmlUnit generated. To get the HTML source as text without any further processing, use "text".

js_delay

time (ms) to let loaded JavaScript execute; default is 2 seconds (2000 ms)

timeout

overall timeout (ms); 0 == infinite wait (not recommended). Note: the timeout is applied twice, first when making the socket connection and again for data retrieval, so if timing is critical you must allow for twice the value specified here. Default: 30 s (30000 ms)
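Because the timeout applies to the connection and to the retrieval separately, a fixed time budget has to be halved before it is passed in. A minimal sketch of that arithmetic follows; timeout_for_budget() is a hypothetical helper, not part of the package, and the sketch does not account for js_delay.

```r
# Hypothetical helper (not part of the package): convert a total time budget
# in milliseconds into a value for the `timeout` argument, given that the
# timeout is applied once for the connection and once for data retrieval.
timeout_for_budget <- function(budget_ms) {
  stopifnot(is.numeric(budget_ms), length(budget_ms) == 1, budget_ms > 0)
  # Never return 0, because 0 means "wait forever".
  max(1L, as.integer(floor(budget_ms / 2)))
}

timeout_for_budget(20000)  # 10000 (ms)

if (FALSE) {
  # Requires the package that provides hu_read_html() to be attached.
  hu_read_html(
    "https://hrbrmstr.github.io/htmlunitjars/index.html",
    timeout = timeout_for_budget(20000)
  )
}
```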

ignore_ssl_errors

Should SSL/TLS errors be ignored? The default (TRUE) is currently a hack to work around how HtmlUnit seems to handle virtual-hosted sites with multiple vhosts and multiple certificates. You can try FALSE first and revert to TRUE if you encounter issues.
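The "try FALSE first, then fall back to TRUE" advice can be sketched as a small wrapper. This is only an illustration: read_strict_then_lenient() is a hypothetical helper, it assumes that a certificate problem surfaces as an ordinary R error, and it retries on any error, not only on SSL/TLS ones.

```r
# Hypothetical helper (not part of the package): attempt a strict read first
# and retry with SSL/TLS errors ignored only if the strict attempt fails.
# `reader` defaults to hu_read_html; it is a parameter so that the wrapper
# can be exercised with a stand-in function.
read_strict_then_lenient <- function(url, reader = hu_read_html) {
  tryCatch(
    reader(url, ignore_ssl_errors = FALSE),
    error = function(e) {
      warning(
        "Strict read failed, retrying with ignore_ssl_errors = TRUE: ",
        conditionMessage(e),
        call. = FALSE
      )
      reader(url, ignore_ssl_errors = TRUE)
    }
  )
}

if (FALSE) {
  # Requires the package that provides hu_read_html() to be attached.
  read_strict_then_lenient("https://hrbrmstr.github.io/htmlunitjars/index.html")
}
```

The warning keeps the fallback visible, so a silently ignored certificate problem does not go unnoticed.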

enable_dnt

Enable the "Do Not Track" header. Default: FALSE.

download_images

Download images as the page is loaded? Since this function is a high-level wrapper designed to read HTML, it is recommended that you leave this at the default (FALSE) to save time and bandwidth.

options

options to pass to xml2::read_html() if ret == "html_document".
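Since ret = "html_document" hands HtmlUnit's output to xml2::read_html(), a similar result can presumably be obtained by requesting ret = "text" and parsing it yourself, which leaves the untouched HtmlUnit text available alongside the parsed document. The sketch below assumes the xml2 package is installed; parse_rendered() is a hypothetical helper and not part of the package.

```r
library(xml2)

# Hypothetical helper (not part of the package): parse HTML text with
# xml2::read_html(), using the same parser options as the documented default.
parse_rendered <- function(html_text,
                           options = c("RECOVER", "NOERROR", "NOBLANKS")) {
  stopifnot(is.character(html_text))
  # Collapse to a single string in case the text arrives as several lines.
  read_html(paste(html_text, collapse = "\n"), options = options)
}

if (FALSE) {
  # Requires the package that provides hu_read_html() to be attached.
  test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
  raw_html <- hu_read_html(test_url, ret = "text")  # HtmlUnit output, unprocessed
  doc <- parse_rendered(raw_html)                   # xml2 document built from it
}
```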

Value

an xml2 html_document/xml_document if ret == "html_document"; otherwise the HTML document text generated by HtmlUnit.

Details

For the code in the examples, this is the site that is being scraped:

Figure: test-url-table.png

Note that the page contains a table of values, but the table is rendered via JavaScript.
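One way to see why the JavaScript-enabled context matters is to count the table nodes in a plain xml2::read_html() fetch and in the document returned by hu_read_html(). The sketch below assumes the xml2 package is installed; find_tables() is a hypothetical helper and not part of the package. Given the description above, one would expect the plain fetch to miss the table, but no output is shown here because it depends on the live page.

```r
library(xml2)

# Hypothetical helper (not part of the package): return every <table> node
# found in a parsed document.
find_tables <- function(doc) {
  xml_find_all(doc, ".//table")
}

if (FALSE) {
  # Requires the package that provides hu_read_html() to be attached.
  test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

  static_doc <- read_html(test_url)       # plain fetch, no JavaScript executed
  rendered_doc <- hu_read_html(test_url)  # JavaScript executed by HtmlUnit

  length(find_tables(static_doc))
  length(find_tables(rendered_doc))
}
```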

Examples

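if (FALSE) {
  # Page whose table of values is rendered by JavaScript (see Details)
  test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
  # Render the page with HtmlUnit and return an xml2 html_document
  hu_read_html(test_url)
}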