Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-digest over previous digests for this purpose is that the t-digest handles data with full floating point resolution. With small changes, the t-digest can handle values from any ordered set for which we can compute something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by previous digest algorithms.

See the original paper by Ted Dunning for more details on t-Digests.

What’s Inside The Tin

The following functions are implemented:

  • is_tdigest: Test to see if an object is classed as tdigest
  • tdigest: Create a new t-digest histogram from a vector
  • td_add: Add a value to the t-digest with the specified count
  • td_create: Allocate a new histogram
  • td_merge: Merge one t-digest into another
  • td_quantile_of: Return the quantile of the value
  • td_total_count: Total items contained in the t-digest
  • td_value_at: Return the value at the specified quantile
  • tquantile: Calcuate sample quantiles from a t-digest

Installation

install.packages("tdigest", repos="https://cinc.rud.is/")
# or 
devtools::install_git("https://sr.ht.com/~hrbrmstr/tdigest.git")
# or
devtools::install_gitlab("hrbrmstr/tdigest")
# or (if you must)
devtools::install_github("hrbrmstr/tdigest")

Usage

library(tdigest)

# current version
packageVersion("tdigest")
## [1] '0.2.0'

Basic (Low-level interface)

td <- td_create(10)

td
## <tdigest; size=0>

td_total_count(td)
## [1] 0

td_add(td, 0, 1) %>% 
  td_add(10, 1)
## <tdigest; size=2>

td_total_count(td)
## [1] 2

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

quantile(td)
## [1]  0  0  5 10 10

Bigger (and Vectorised)

td <- tdigest(c(0, 10), 10)

is_tdigest(td)
## [1] TRUE

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

td_total_count(td)
## [1] 1e+06

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.5940803   9.9130132  19.7148329  29.7725116  39.9715426  50.0034984  60.0860567  70.1951621
## [10]  80.2785864  90.0849039  99.4739015 100.0000000

quantile(td)
## [1]   0.00000  24.74052  50.00350  75.23076 100.00000

Proof it’s faster

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq       mean    median         uq      max neval
##     tdigest    35.00    41.2275    73.3681    87.056    91.8895   136.26   100
##  r_quantile 59278.32 59915.1860 61293.3386 60656.448 62530.5600 65922.69   100

tdigest Metrics

Lang # Files (%) LoC (%) Blank lines (%) # Lines (%)
C 3 0.27 337 0.68 45 0.38 26 0.11
R 6 0.55 120 0.24 25 0.21 133 0.56
Rmd 1 0.09 32 0.06 37 0.32 51 0.21
C/C++ Header 1 0.09 10 0.02 10 0.09 28 0.12

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.