Wicked Fast, Accurate Quantiles Using ‘t-Digests’
The t-Digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-Digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-Digest over previous digests for this purpose is that the t-Digest handles data with full floating point resolution. The accuracy of quantile estimates produced by t-Digests can be orders of magnitude more accurate than those produced by previous digest algorithms. Methods are provided to create and update t-Digests and retreive quantiles from the accumulated distributions.
See the original paper by Ted Dunning & Otmar Ertl for more details on t-Digests.
The following functions are implemented:
as.list.tdigest
: Serialize a tdigest object to an R list or unserialize a serialized tdigest list back into a tdigest objecttd_add
: Add a value to the t-Digest with the specified counttd_create
: Allocate a new histogramtd_merge
: Merge one t-Digest into anothertd_quantile_of
: Return the quantile of the valuetd_total_count
: Total items contained in the t-Digesttd_value_at
: Return the value at the specified quantiletquantile
: Calculate sample quantiles from a t-Digestinstall.packages("tdigest", repos = "https://cinc.rud.is")
# or
remotes::install_git("https://git.rud.is/hrbrmstr/tdigest.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/tdigest")
# or
remotes::install_gitlab("hrbrmstr/tdigest")
# or
remotes::install_bitbucket("hrbrmstr/tdigest")
# or
remotes::install_github("hrbrmstr/tdigest")
NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.
library(tdigest)
# current version
packageVersion("tdigest")
## [1] '0.4.0'
td <- td_create(10)
td
## <tdigest; size=0>
td_total_count(td)
## [1] 0
td_add(td, 0, 1) %>%
td_add(10, 1)
## <tdigest; size=2>
td_total_count(td)
## [1] 2
td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE
quantile(td)
## [1] 0 0 5 10 10
td <- tdigest(c(0, 10), 10)
is_tdigest(td)
## [1] TRUE
td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
td_total_count(td)
## [1] 1e+06
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 0.0000000 0.5940803 9.9130132 19.7148329 29.7725116 39.9715426 50.0034984 60.0860567 70.1951621
## [10] 80.2785864 90.0849039 99.4739015 100.0000000
quantile(td)
## [1] 0.00000 24.74052 50.00350 75.23076 100.00000
These [de]serialization functions make it possible to create & populate a tdigest, serialize it out, read it in at a later time and continue populating it enabling compact distribution accumulation & storage for large, “continuous” datasets.
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 0.0000000 0.5940803 9.9130132 19.7148329 29.7725116 39.9715426 50.0034984 60.0860567 70.1951621
## [10] 80.2785864 90.0849039 99.4739015 100.0000000
str(in_r <- as.list(td), 1)
## List of 7
## $ compression : num 1000
## $ cap : int 6010
## $ merged_nodes : int 403
## $ unmerged_nodes: int 0
## $ merged_count : num 1e+06
## $ unmerged_count: num 0
## $ nodes :List of 2
## - attr(*, "class")= chr [1:2] "tdigest_list" "list"
td2 <- as_tdigest(in_r)
tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 0.0000000 0.5940803 9.9130132 19.7148329 29.7725116 39.9715426 50.0034984 60.0860567 70.1951621
## [10] 80.2785864 90.0849039 99.4739015 100.0000000
identical(in_r, as.list(td2))
## [1] TRUE
microbenchmark::microbenchmark(
tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
## expr min lq mean median uq max neval
## tdigest 29.635 37.1035 83.20912 107.219 112.09 167.888 100
## r_quantile 59713.834 61003.0275 63461.35011 63031.413 65732.04 70920.833 100