Compute WSDL Diversity Index, Shannon's evenness index, and Simpson's diversity index for a corpus (collection) of URLs.
uri_diversity(corpus, corpus_id = uuid::UUIDgenerate(), exception_domains = NULL)
url_diversity(corpus, corpus_id = uuid::UUIDgenerate(), exception_domains = NULL)
corpus | a collection (character vector) of URLs |
---|---|
corpus_id | an identifier (ideally unique) for the collection; will be generated if not provided. |
exception_domains | a character vector of domains; use this to specify domains where the query string is important. Normally, the query string is excluded from the canonicalized URI but in some cases (e.g. |
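
To make the role of exception_domains concrete, here is a minimal base R sketch of a canonicalization step that drops the query string unless the host belongs to an exception domain. This page does not spell out the package's actual canonicalization rules, so the helper name canonicalize_sketch, the decision to strip the scheme and fragment, and the domain video.example.org are all assumptions made for illustration, not the package's implementation.

```r
# Hypothetical helper, NOT part of urldiversity: a rough illustration of
# query-string handling with exception domains.
canonicalize_sketch <- function(urls, exception_domains = NULL) {
  no_frag   <- sub("#.*$", "", urls)                            # drop fragment
  no_scheme <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", no_frag)  # drop scheme (assumed)
  host      <- tolower(sub("[/?].*$", "", no_scheme))           # host, lowercased
  rest      <- substring(no_scheme, nchar(host) + 1L)           # path and query

  # TRUE when the host equals, or is a subdomain of, an exception domain
  is_exception <- function(h) {
    for (d in exception_domains) {
      if (h == d || endsWith(h, paste0(".", d))) return(TRUE)
    }
    FALSE
  }
  keep_query <- vapply(host, is_exception, logical(1), USE.NAMES = FALSE)

  # Strip the query string everywhere except on exception domains
  rest[!keep_query] <- sub("\\?.*$", "", rest[!keep_query])
  paste0(host, rest)
}

urls <- c("https://www.Example.com/a?utm=1#top",
          "http://video.example.org/watch?v=abc")
canonicalize_sketch(urls, exception_domains = "video.example.org")
```

With the exception domain supplied, the second URL keeps its query string; without it, both URLs are reduced to host and path.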
a data frame (tibble) with WSDL, Shannon and Simpson diversity indices for canonical URIs and hostnames.
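
For orientation, the two standard indices can be sketched in base R from item counts. This uses the textbook definitions (Shannon evenness as H / log(S), Simpson diversity in the 1 - D form with the n(n - 1) estimator); the package may use a different variant, and the WSDL Diversity Index is not reproduced here (see the linked post for its definition). The function name diversity_sketch and the column names are hypothetical.

```r
# Illustrative only, NOT the package's implementation.
# `items` is a character vector, e.g. canonical URIs or hostnames.
diversity_sketch <- function(items) {
  counts <- as.vector(table(items))  # occurrences of each distinct item
  n <- sum(counts)                   # total number of items
  s <- length(counts)                # number of distinct items
  p <- counts / n                    # relative frequencies

  # Shannon entropy, normalised by its maximum log(s) to give evenness in [0, 1]
  shannon_h <- -sum(p * log(p))
  shannon_evenness <- if (s > 1) shannon_h / log(s) else 0

  # Simpson diversity: 1 minus the probability that two items drawn
  # without replacement are the same
  simpson <- if (n > 1) 1 - sum(counts * (counts - 1)) / (n * (n - 1)) else 0

  data.frame(n_items = n, n_unique = s,
             shannon_evenness = shannon_evenness,
             simpson = simpson)
}

diversity_sketch(c("a.example", "a.example", "b.example", "c.example"))
```

Under these definitions both indices are 0 when every item is identical and 1 when every item is distinct.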
Algorithm creator: Alexander C. Nwala
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
# NOT RUN {
collection <- readLines(system.file("extdat", "corpus.txt", package = "urldiversity"))
uri_diversity(collection)
# }