tesser.math

Folds over numbers! Calculate sums, means, variance, standard deviation, covariance and linear correlations, and matrices thereof, plus quantile and histogram estimates backed by probabilistic QDigests.

correlation

(correlation & args)

Like correlation+count, but only returns the correlation.

correlation+count

(correlation+count fx fy)
(correlation+count fx fy fold)

Given two functions: (fx input) and (fy input), each of which returns a number, estimates the unbiased linear correlation coefficient between fx and fy over inputs. Ignores any records where fx or fy are nil. If there are no records with values for fx and fy, the correlation is nil. See http://mathworld.wolfram.com/CorrelationCoefficient.html.

This function returns a map of correlation and count, like

{:correlation 0.34 :count 142}

which is useful for significance testing.
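As a rough usage sketch, assuming tesser is on the classpath with tesser.core aliased as t and tesser.math as m (the aliases used by the other examples here). The records dataset is made up for illustration; its last record has no :y value, so it should be skipped.

```clojure
(require '[tesser.core :as t]
         '[tesser.math :as m])

;; Hypothetical input: two chunks of maps, with y = 2x on every
;; complete record. The last record's :y is nil.
(def records
  [[{:x 1 :y 2} {:x 2 :y 4}]
   [{:x 3 :y 6} {:x 4 :y nil}]])

;; Keywords act as the fx and fy extractor functions.
(def result
  (->> (m/correlation+count :x :y)
       (t/tesser records)))

;; result is a map with :correlation and :count keys.
(:count result)
(:correlation result)
```

Because the three complete records are perfectly linear, :correlation should be very close to 1.0 (allow for floating-point error), and :count should reflect only the records that had both values.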

correlation+count-matrix

(correlation+count-matrix & args)

Given a map of key names to functions that extract values for those keys from an input, computes the correlations for each of the n^2 key pairs, returning a map of name pairs to their correlations and counts. See correlation+count. For example:

(m/correlation+count-matrix {:name-length #(.length (:name %))
                             :age         :age
                             :num-cats    (comp count :cats)})

will, when executed, return a map like

{[:name-length :age]      {:count 150 :correlation 0.56}
 [:name-length :num-cats] {:count 150 :correlation 0.95}
 ...}

correlation-matrix

(correlation-matrix & args)

Like correlation+count-matrix, but returns just correlation coefficients instead of maps of :correlation and :count.

covariance

(covariance fx fy)
(covariance fx fy fold)

Given two functions of an input (fx input) and (fy input), each of which returns a number, estimates the unbiased covariance of those functions over inputs.

Ignores any inputs where (fx input) or (fy input) are nil. If no inputs have both x and y, returns nil.
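To make the estimator concrete, here is a plain-Clojure reference sketch of the unbiased sample covariance described above. It is not tesser's implementation, which folds over chunks in parallel. reference-covariance is a hypothetical helper, useful only for checking expectations on small inputs.

```clojure
(defn reference-covariance
  "Sample covariance of (fx input) and (fy input) over inputs, dividing by
  n - 1. Inputs where either value is nil are skipped; returns nil when no
  input has both values."
  [fx fy inputs]
  (let [pairs (for [input inputs
                    :let  [x (fx input)
                           y (fy input)]
                    :when (and (some? x) (some? y))]
                [x y])
        n     (count pairs)]
    (when (pos? n)
      (let [mean-x (/ (reduce + (map first pairs)) n)
            mean-y (/ (reduce + (map second pairs)) n)
            sum-xy (reduce + (map (fn [[x y]]
                                    (* (- x mean-x) (- y mean-y)))
                                  pairs))]
        ;; Assumption of this sketch: with a single pair the n - 1 divisor
        ;; would be zero, so clamp it to 1.
        (double (/ sum-xy (max 1 (dec n))))))))

;; The last record has no :y and is ignored.
(reference-covariance :x :y [{:x 1 :y 2} {:x 2 :y 4} {:x 3 :y 6} {:x 4}])
; => 2.0
```

The same extractor functions could be passed to covariance itself; on small inputs the two results should agree up to floating-point error.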

covariance-matrix

(covariance-matrix & args)

Given a map of key names to functions that extract values for those keys from an input, computes the covariance for each of the n^2 key pairs, returning a map of name pairs to their covariance. For example:

(m/covariance-matrix {:name-length #(.length (:name %))
                      :age         :age
                      :num-cats    (comp count :cats)})

digest

(digest digest-generator)
(digest digest-generator fold)

You’ve got a set of numeric inputs and want to know their quantiles, distribution, histogram, etc. This fold takes numeric inputs and produces a statistical estimate of their distribution.

digest takes a function that returns a tesser.quantiles/Digest. The fold returns an instance of that digest.

For example, to compute an HDRHistogram digest over a small sample of doubles (or longs, rationals, etc.):

(def digest (->> (m/digest q/hdr-histogram)
                 (t/tesser [[1 1 1 2 3]])))
; => #<DoubleHistogram ...>

To specify options for the digest, just use partial or (fn [] …):

(m/digest (partial q/hdr-histogram {:significant-value-digits 4
                                    :highest-to-lowest-value-ratio 1e6}))

DoubleHistogram, like many quantile estimators, only works over positive values. To cover positives and negatives together, use tesser.quantiles/dual:

(m/digest #(q/dual q/hdr-histogram {:significant-value-digits 2}))

Once you’ve computed a digest, you can find a particular quantile using tesser.quantiles/quantile:

(q/quantile digest 0)   ; => 1.0
(q/quantile digest 0.5) ; => 1.0
(q/quantile digest 4/5) ; => 2.0009765625
(q/quantile digest 1)   ; => 3.0009765625

The total number of points in the sample:

(q/point-count digest) ; => 5

Minima and maxima:

(q/min digest) ; => 1.0
(q/max digest) ; => 3.0009765625

Or find the distribution of values, and the cumulative count of values less than or equal to each point, with resolution given by the internal granularity of the digest:

(q/distribution digest)
; => ([1.0 3] [2.0009765625 1] [3.0009765625 1])

(q/cumulative-distribution digest)
; => ([1.0 3] [2.0009765625 4] [3.0009765625 5])

You don’t have to return the whole digest; any of these derivative operations can be merged directly into the fold via tesser.core/post-combine.

(->> (m/digest q/hdr-histogram)
     (t/post-combine #(q/quantile % 1/2))
     (t/tesser [[1 2 2 3 3 3 3 3 3 3 3]]))
; => 3.0009765625

You may also use tesser.cardinality/hll to estimate the cardinality of a set. HLL+ is a probabilistic data structure that estimates set cardinality using very little memory, at some cost in accuracy.

The HLL digest can be used like the histograms above:

(def digest (->> (m/digest cardinality/hll)
                 (t/tesser [[1 1 1 1 1 1 2 2 2 3 3 4 5]])))
; => #<HyperLogLogPlus...>

Getting the cardinality out through a post-combine step:

(->> (m/digest cardinality/hll)
     (t/post-combine #(q/point-count %))
     (t/tesser [[1 2 2 3 3 3 3 3 3 3 3]]))
; => 3

I want to emphasize that depending on the size of your data, its distribution, and the number of digests you want to compute, you may need different digest algorithms and widely varying tuning parameters. Until we have a better grasp of the space/error tradeoffs here, I won’t choose defaults for you.

fuse-matrix

(fuse-matrix fold keymap & [downstream])

Given:

  1. A function like covariance that takes two functions of an input and yields a fold, and
  2. A map of key names to functions that extract values for those keys from an input,

fuse-matrix computes that fold over each pair of keys, returning a map of name pairs to the result of that pairwise fold over the inputs. You can think of this like an N^2 version of fuse.
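A minimal sketch of the two ingredients listed above, assuming tesser.core is aliased as t and tesser.math as m. The records and key names are invented for illustration. Here covariance is passed as the pairwise fold constructor and keywords serve as the extractor functions.

```clojure
(require '[tesser.core :as t]
         '[tesser.math :as m])

;; Hypothetical records with y = 2x, split across two chunks.
(def records
  [[{:x 1 :y 2} {:x 2 :y 4}]
   [{:x 3 :y 6}]])

;; 1. m/covariance takes two functions of an input and yields a fold.
;; 2. The keymap names each extractor function.
(def covariances
  (->> (m/fuse-matrix m/covariance {:x :x
                                    :y :y})
       (t/tesser records)))

;; The result is keyed by name pairs.
(get covariances [:x :y])
```

For this data the [:x :y] entry should be close to 2.0, the sample covariance of x and 2x over three points.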

mean

(mean)
(mean fold)

Finds the arithmetic mean of numeric inputs.

standard-deviation

(standard-deviation & [f])

Estimates the standard deviation of numeric inputs.

sum

(sum)
(sum fold)

Finds the sum of numeric elements.

variance

(variance)
(variance fold)

Unbiased variance estimation. Given numeric inputs, returns their variance.
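The scalar folds in this namespace (sum, mean, variance, and standard-deviation) can be run in a single pass. The sketch below assumes tesser.core is aliased as t and tesser.math as m, and uses tesser.core/fuse to combine them. The input chunks are arbitrary sample data.

```clojure
(require '[tesser.core :as t]
         '[tesser.math :as m])

;; Run four folds over the same two chunks of numbers in one pass.
(def summary
  (->> (t/fuse {:sum      (m/sum)
                :mean     (m/mean)
                :variance (m/variance)
                :sd       (m/standard-deviation)})
       (t/tesser [[1 2 3] [4 5 6]])))

;; summary is a map with one entry per fused fold.
(:sum summary)
(:mean summary)
(:variance summary)
(:sd summary)
```

For these six inputs the sum is 21 and the mean is 3.5. The unbiased variance is 17.5 / 5 = 3.5, so the standard deviation should be about 1.87. Expect small floating-point differences in the variance-based results.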