Compute WSDL Diversity Index, Shannon's evenness index, and Simpson's diversity index for a corpus (collection) of URLs.

Usage

uri_diversity(corpus, corpus_id = uuid::UUIDgenerate(),
  exception_domains = NULL)

url_diversity(corpus, corpus_id = uuid::UUIDgenerate(),
  exception_domains = NULL)

Arguments

corpus

a collection (character vector) of URLs

corpus_id

an identifier (ideally unique) for the collection; if not provided, one is generated with uuid::UUIDgenerate().

exception_domains

a character vector of domains for which the query string is significant. Normally the query string is excluded from the canonicalized URI, but in some cases (e.g. youtube.com) it is desirable to have the query string influence the diversity computations.
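
To illustrate the effect described for exception_domains, here is a minimal base R sketch. The helper strip_query_sketch() is hypothetical and not part of urldiversity; the package's actual canonicalization rules may well differ (for example in how hosts are matched). It only shows the idea that the query string is dropped unless the host is listed as an exception.

```r
# Hypothetical helper (NOT part of urldiversity): drop the query string
# unless the URL's host is listed in `exception_domains`.
strip_query_sketch <- function(urls, exception_domains = NULL) {
  # Host = text between "scheme://" and the next "/", "?" or "#".
  hosts <- tolower(sub("^[a-zA-Z][a-zA-Z0-9+.-]*://([^/?#]+).*$", "\\1", urls))
  # Assumption: a leading "www." is ignored when matching domains.
  hosts <- sub("^www\\.", "", hosts)
  keep_query <- hosts %in% tolower(exception_domains)
  # Fragments are always removed; queries only for non-exception hosts.
  ifelse(keep_query, sub("#.*$", "", urls), sub("[?#].*$", "", urls))
}

urls <- c("https://www.youtube.com/watch?v=abc",
          "https://example.com/page?utm_source=x")

default_form   <- strip_query_sketch(urls)
exception_form <- strip_query_sketch(urls, exception_domains = "youtube.com")
```

With the package itself, the corresponding call would follow the signature above, e.g. uri_diversity(corpus, exception_domains = "youtube.com").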

Value

a data frame (tibble) with WSDL, Shannon and Simpson diversity indices for canonical URIs and hostnames.
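
For intuition about the reported indices, the following base R sketch computes textbook versions of them over a character vector. It is not the package's code: diversity_sketch() is a hypothetical name, Simpson's index is taken in its Gini-Simpson form, and the unique-to-total ratio is only an assumption about the general shape of the WSDL index (see the reference below for the authoritative definition).

```r
# Textbook diversity measures over a character vector of labels.
# Hypothetical sketch; the formulas used by urldiversity may differ.
diversity_sketch <- function(x) {
  counts <- as.numeric(table(x))  # occurrences of each distinct value
  n <- sum(counts)                # total number of items
  s <- length(counts)             # number of distinct values
  p <- counts / n                 # relative frequencies

  shannon_h <- -sum(p * log(p))
  # Evenness is undefined for one distinct value (log(1) = 0); report 0.
  evenness <- if (s > 1) shannon_h / log(s) else 0
  # Gini-Simpson: probability that two draws with replacement differ.
  simpson <- 1 - sum(p^2)
  # Assumed form: distinct-to-total ratio rescaled to the range [0, 1].
  unique_ratio <- if (n > 1) (s - 1) / (n - 1) else 0

  c(unique_ratio = unique_ratio,
    shannon_evenness = evenness,
    simpson = simpson)
}

hosts <- c("a.com", "a.com", "b.com", "c.com")
res <- diversity_sketch(hosts)
```

All three values are 0 when every item is identical and approach 1 as the items become more varied, which is why they are reported separately for canonical URIs and for hostnames.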

Note

Algorithm creator: Alexander C. Nwala

References

http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html

Examples

# NOT RUN {
collection <- readLines(system.file("extdat", "corpus.txt", package = "urldiversity"))
uri_diversity(collection)
# }