Quantify ‘URL’ Diversity and Apply Popular Biodiversity Indices to a ‘URL’ Collection

Description

Methods are provided to compute the ‘WSDL Diversity Index’ <http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html> and to apply selected biodiversity indices to a corpus (collection) of ‘URLs’.

NOTE

All credit goes to Alexander Nwala for the algorithm research and original Python implementation.

TODO

  • [ ] Handle some edge cases
  • [ ] Tests
  • [ ] Better documentation
  • [ ] Vignette with many citations from the WSDL blog post

What’s Inside The Tin

The following functions are implemented:

Core function:

  • uri_diversity: Quantify URL diversity
  • url_diversity: (an alias for uri_diversity b/c I regularly forget it’s rightfully uri)

Processing Helpers:

  • clean_index_factors: Clean up diversity and evenness names

Scraping Helpers:

  • body_anchor_urls: Extract all body anchor hypertext references
  • body_img_urls: Extract all body image URLs
  • safeGET: Safer version of ‘httr::GET()’
  • safePOST: Safer version of ‘httr::POST()’

Installation

devtools::install_github("hrbrmstr/urldiversity")

Usage

library(urldiversity)

# current version
packageVersion("urldiversity")
## [1] '0.1.0'
collection <- readLines(system.file("extdat", "corpus.txt", package = "urldiversity"))

print(collection)
##  [1] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"             
##  [2] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"             
##  [3] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"             
##  [4] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"             
##  [5] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"             
##  [6] "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf"                
##  [7] "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf"                
##  [8] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
##  [9] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [10] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [11] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [12] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [13] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [14] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html"   
## [15] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html"   
## [16] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html"
x <- uri_diversity(collection)

dplyr::glimpse(x)
## Observations: 1
## Variables: 9
## $ n_urls                  <int> 16
## $ wsdl_uri_diversity      <dbl> 0.2666667
## $ wsdl_hostname_diversity <dbl> 0.06666667
## $ wsdl_domain_diversity   <dbl> 0.06666667
## $ simpson_uri_diversity   <dbl> 0.775
## $ shannon_uri_evenness    <dbl> 0.8850561
## $ simpson_host_diversity  <dbl> 0.4583333
## $ shannon_host_evenness   <dbl> 0.8960382
## $ corpus_id               <chr> "94301ed3-2fef-43f9-8a3c-8db1341e140e"
x
## URI diversity report for 16 URIs:
## 
## WSDL URI diversity:
##   URI: 0.2666667
##   Hostname: 0.06666667
##   Domain: 0.06666667
## 
## Simpson's diversity index:
##   URI: 0.775
##   Unified (Species: URI, Individuals: Paths): 0.4583333
## 
## Shannon's evenness index:
##   URI: 0.8850561
##   Unified (Species: URI, Individuals: Paths): 0.8960382
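
For anyone curious where the numbers in the report come from, here is a rough base-R sketch that reproduces them for the sample corpus. It is not the package's implementation: the helper names (`wsdl_diversity`, `simpson_diversity`, `shannon_evenness`, `extract_host`) are made up for illustration, the hostname regex is deliberately crude, and the formulas are inferred from the output above (unique-minus-one over total-minus-one for the WSDL index, the usual Simpson and Shannon definitions for the rest). The corpus is rebuilt inline so the block runs without the package installed.

```r
# Hypothetical helpers; none of these are exported by urldiversity.

# WSDL-style diversity: (unique items - 1) / (total items - 1).
# 0 means every item is identical, 1 means every item is distinct.
wsdl_diversity <- function(x) {
  n <- length(x)
  if (n < 2) return(NA_real_) # assumption: undefined for fewer than two items
  (length(unique(x)) - 1) / (n - 1)
}

# Simpson's diversity index: 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))
simpson_diversity <- function(x) {
  counts <- as.numeric(table(x))
  n <- sum(counts)
  1 - sum(counts * (counts - 1)) / (n * (n - 1))
}

# Shannon's evenness: H / log(S), where H = -sum(p_i * log(p_i)) and
# S is the number of distinct items (NaN when S == 1).
shannon_evenness <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  -sum(p * log(p)) / log(length(p))
}

# Crude hostname extraction: everything between "scheme://" and the next
# "/", ":", "?" or "#". Good enough for this corpus, not a real URL parser.
extract_host <- function(urls) sub("^[^:]+://([^/:?#]+).*$", "\\1", urls)

# The same 16 URLs as the bundled corpus shown above
collection <- c(
  rep("http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/", 5),
  rep("http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf", 2),
  rep("http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html", 6),
  rep("http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html", 2),
  "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html"
)

hosts <- extract_host(collection)

wsdl_diversity(collection)    # 5 unique URIs of 16: (5 - 1) / 15
wsdl_diversity(hosts)         # 2 unique hostnames: (2 - 1) / 15
simpson_diversity(collection) # compare with simpson_uri_diversity
shannon_evenness(collection)  # compare with shannon_uri_evenness
simpson_diversity(hosts)      # compare with simpson_host_diversity
shannon_evenness(hosts)       # compare with shannon_host_evenness
```

The hostname-level values line up with the report when the two hostnames are weighted by how many of the 16 URLs each one contributes (5 and 11), which is why the helpers take the raw vector rather than the de-duplicated one.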

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.