Quantify ‘URL’ Diversity and Apply Popular Biodiversity Indices to a ‘URL’ Collection
Methods are provided to compute the ‘WSDL Diversity Index’ <http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html> along with selected biodiversity indices for a corpus (collection) of ‘URLs’.
All credit goes to Alexander Nwala for the algorithm research and original Python implementation.
The following functions are implemented:
library(urldiversity)
# current version
packageVersion("urldiversity")
## [1] '0.1.0'
collection <- readLines(system.file("extdat", "corpus.txt", package = "urldiversity"))
print(collection)
## [1] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"
## [2] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"
## [3] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"
## [4] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"
## [5] "http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/"
## [6] "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf"
## [7] "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf"
## [8] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [9] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [10] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [11] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [12] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [13] "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html"
## [14] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html"
## [15] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html"
## [16] "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html"
x <- uri_diversity(collection)
dplyr::glimpse(x)
## Observations: 1
## Variables: 9
## $ n_urls <int> 16
## $ wsdl_uri_diversity <dbl> 0.2666667
## $ wsdl_hostname_diversity <dbl> 0.06666667
## $ wsdl_domain_diversity <dbl> 0.06666667
## $ simpson_uri_diversity <dbl> 0.775
## $ shannon_uri_evenness <dbl> 0.8850561
## $ simpson_host_diversity <dbl> 0.4583333
## $ shannon_host_evenness <dbl> 0.8960382
## $ corpus_id <chr> "94301ed3-2fef-43f9-8a3c-8db1341e140e"
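The WSDL values above are consistent with the ratio (unique items - 1) / (total items - 1), applied to whole URLs and to hostnames. The base R sketch below recomputes them for the sample corpus under that assumption. The helper `wsdl_ratio()` and the regular expression used to pull out hostnames are illustrative and not part of the package; the sketch also skips any URL normalization the package may perform, and it does not attempt the domain-level value, which typically needs a public suffix list.

```r
# The 16 sample URLs printed above, rebuilt from their per-URL counts.
corpus <- rep(
  c("http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/",
    "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf",
    "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html",
    "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html",
    "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html"),
  times = c(5, 2, 6, 2, 1)
)

# Hypothetical helper (not a package function):
# 0 when every item is identical, 1 when every item is distinct.
wsdl_ratio <- function(x) (length(unique(x)) - 1) / (length(x) - 1)

# Crude hostname extraction: drop the scheme, keep everything up to the
# first "/", ":", "?" or "#". Assumes well-formed absolute URLs.
hosts <- sub("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$", "\\1", corpus)

wsdl_uri  <- wsdl_ratio(corpus)  # 5 distinct URLs:      (5 - 1) / 15
wsdl_host <- wsdl_ratio(hosts)   # 2 distinct hostnames: (2 - 1) / 15

print(wsdl_uri)   # 0.2666667
print(wsdl_host)  # 0.06666667
```

Both results agree with `wsdl_uri_diversity` and `wsdl_hostname_diversity` in the output above.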
x
## URI diversity report for 16 URIs:
##
## WSDL URI diversity:
## URI: 0.2666667
## Hostname: 0.06666667
## Domain: 0.06666667
##
## Simpson's diversity index:
## URI: 0.775
## Unified (Species: URI, Individuals: Paths): 0.4583333
##
## Shannon's evenness index:
## URI: 0.8850561
## Unified (Species: URI, Individuals: Paths): 0.8960382
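As a cross-check on the report, the two biodiversity indices can be recomputed in base R by treating each distinct URL (or hostname) as a species and each occurrence as an individual. The sketch below assumes the textbook definitions: Simpson's diversity as 1 - Σ n(n - 1) / (N(N - 1)), and Shannon's evenness as H / ln(S). The function names `simpson_diversity()` and `shannon_evenness()` are illustrative, not the package's API, and the hostname regex is a simplification.

```r
# The 16 sample URLs, rebuilt from their per-URL counts.
corpus <- rep(
  c("http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/",
    "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf",
    "http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html",
    "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html",
    "http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html"),
  times = c(5, 2, 6, 2, 1)
)
hosts <- sub("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$", "\\1", corpus)

# Simpson's diversity index: 1 - sum(n * (n - 1)) / (N * (N - 1)),
# where n holds the count of each species and N is the total count.
simpson_diversity <- function(x) {
  n <- as.numeric(table(x))
  N <- sum(n)
  1 - sum(n * (n - 1)) / (N * (N - 1))
}

# Shannon's evenness index: H / log(S), with H = -sum(p * log(p)) and
# S the number of species. Returns NaN when only one species is present.
shannon_evenness <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  -sum(p * log(p)) / log(length(p))
}

simpson_uri  <- simpson_diversity(corpus)
shannon_uri  <- shannon_evenness(corpus)
simpson_host <- simpson_diversity(hosts)
shannon_host <- shannon_evenness(hosts)

print(c(simpson_uri, shannon_uri, simpson_host, shannon_host))
```

With this corpus, the URL-level results reproduce the `URI` rows of the report (0.775 and 0.8850561), and the hostname-level results reproduce the values reported as `simpson_host_diversity` and `shannon_host_evenness` (0.4583333 and 0.8960382).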
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.