Day 26: Statistics

JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. It’s well worth a look if you plan on working in this domain.

Julia already has some built-in support for statistical operations, so additional packages are not strictly necessary.

x = rand(10);
mean(x)
0.5287191472784906
std(x)
0.2885446536178459

However, packages do increase the scope and ease of possible operations (as we’ll see below). Let’s kick off by loading all the packages that we’ll be looking at today.

using StatsBase, StatsFuns, StreamStats

StatsBase

As the package name implies, StatsBase provides support for basic statistical operations in Julia. The StatsBase documentation is available online.

High-level summary statistics are generated by summarystats().

summarystats(x)
Summary Stats:
Mean: 0.528719
Minimum: 0.064803
1st Quartile: 0.317819
Median: 0.529662
3rd Quartile: 0.649787
Maximum: 0.974760

Weighted versions of the mean, variance and standard deviation are implemented, along with weighted skewness and kurtosis. Geometric and harmonic means are available too.

w = WeightVec(rand(1:10, 10));    # A weight vector.
mean(x, w)                        # Weighted mean.
0.48819933297961043
var(x, w)                         # Weighted variance.
0.08303843715334995
std(x, w)                         # Weighted standard deviation.
0.2881639067498738
skewness(x, w)                    # Weighted skewness.
0.11688162715805048
kurtosis(x, w)                    # Weighted (excess) kurtosis.
-0.9210456851144664
mean_and_std(x, w)                # Weighted mean and standard deviation together.
(0.48819933297961043,0.2881639067498738)
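To make the weighted statistics above a little more concrete, here’s a minimal plain-Julia sketch of a weighted mean and an uncorrected weighted variance. The helper names wmean_sketch and wvar_sketch are hypothetical, and dividing by the total weight (no bias correction) is an assumed convention, not a statement of how StatsBase arrives at its result.

```julia
# Hypothetical helpers for illustration; they are not part of StatsBase.

# Weighted mean: each observation contributes in proportion to its weight.
wmean_sketch(x, w) = sum(w .* x) / sum(w)

# Uncorrected weighted variance: the weighted average of squared deviations
# from the weighted mean (assumed convention, no bias correction).
function wvar_sketch(x, w)
    m = wmean_sketch(x, w)
    return sum(w .* (x .- m) .^ 2) / sum(w)
end

xs = [1.0, 2.0, 3.0]   # toy data
ws = [1, 1, 2]         # toy weights: the last observation counts double
println(wmean_sketch(xs, ws))
println(wvar_sketch(xs, ws))
```

Doubling the weight of the last observation pulls the mean towards 3, exactly as if that observation had been recorded twice.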

There’s a weighted median as well as functions for calculating quantiles.

median(x)                         # Median.
0.5296622773635412
median(x, w)                      # Weighted median.
0.5729104703595038
quantile(x)
5-element Array{Float64,1}:
 0.0648032
 0.317819
 0.529662
 0.649787
 0.97476
nquantile(x, 8)
9-element Array{Float64,1}:
 0.0648032
 0.256172
 0.317819
 0.465001
 0.529662
 0.60472
 0.649787
 0.893513
 0.97476
iqr(x) # Inter-quartile range.
0.3319677541313941
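For a feel of what the quantile functions above are doing, here’s a rough sketch which interpolates linearly between sorted values. The name quantile_sketch is hypothetical and the interpolation rule is an assumption (it’s one common definition of a sample quantile); the library functions remain the ones to use in practice.

```julia
# Hypothetical helper for illustration, assuming linear interpolation
# between order statistics.
function quantile_sketch(v, p)
    s = sort(v)
    n = length(s)
    h = (n - 1) * p + 1          # fractional position in the sorted data
    lo = floor(Int, h)           # index just below (or at) that position
    hi = min(lo + 1, n)          # index just above, clamped at the end
    return s[lo] + (h - lo) * (s[hi] - s[lo])
end

v = [4.0, 1.0, 3.0, 2.0]
q1 = quantile_sketch(v, 0.25)
q2 = quantile_sketch(v, 0.50)
q3 = quantile_sketch(v, 0.75)
println((q1, q2, q3))
println(q3 - q1)                 # inter-quartile range of the toy data
```

Under this definition, calling the sketch with p = 0, 1/8, 2/8, ..., 1 would produce the nine cut points of an eight-way split, which is the idea behind nquantile(x, 8).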

Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

sample(collect('a':'z'), 5)       # Sampling (with replacement).
5-element Array{Char,1}:
 'w'
 'x'
 'e'
 'e'
 'o'
wsample(['T', 'F'], [5, 1], 10)   # Weighted sampling (with replacement).
10-element Array{Char,1}:
 'F'
 'T'
 'T'
 'T'
 'F'
 'T'
 'T'
 'T'
 'T'
 'T'
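Weighted sampling with replacement can be sketched in a few lines of plain Julia using cumulative weights. This is just one simple approach (an assumption for illustration, not necessarily any of the algorithms StatsBase applies), and wsample_sketch is a hypothetical name.

```julia
# Hypothetical helper for illustration: draw n items with replacement,
# each with probability proportional to its weight.
function wsample_sketch(items, wts, n)
    cum = cumsum(wts)            # running totals of the weights
    total = cum[end]
    # Draw u uniformly in [0, total) and pick the first item whose
    # cumulative weight is at least u.
    return [items[searchsortedfirst(cum, rand() * total)] for _ in 1:n]
end

draws = wsample_sketch(['T', 'F'], [5, 1], 10)
println(draws)                   # mostly 'T', since it carries 5/6 of the weight
```

The output changes from run to run, but over many draws 'T' should turn up roughly five times as often as 'F'.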

There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.

StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, twoπ and sqrt2.

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

logistic(-5)
0.0066928509242848554
logistic(5)
0.9933071490757153
logit(0.25)
-1.0986122886681098
logit(0.75)
1.0986122886681096
softmax([1, 3, 2, 5, 3])
5-element Array{Float64,1}:
 0.0136809
 0.101089
 0.0371886
 0.746952
 0.101089
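These three functions have compact textbook definitions, sketched below in plain Julia. The _sketch names are hypothetical (chosen to avoid clashing with the StatsFuns exports), and the library versions may well handle edge cases more carefully.

```julia
# Hypothetical re-implementations for illustration only.

# Logistic function: maps any real number into the interval (0, 1).
logistic_sketch(x) = 1 / (1 + exp(-x))

# Logit: the inverse of the logistic function (the log-odds of p).
logit_sketch(p) = log(p / (1 - p))

# Softmax: exponentiate and normalise so the results sum to 1.
# Subtracting the maximum first avoids overflow for large inputs.
function softmax_sketch(v)
    e = exp.(v .- maximum(v))
    return e ./ sum(e)
end

println(logistic_sketch(0))                       # midpoint of the curve
println(logit_sketch(logistic_sketch(2.0)))       # round trip recovers roughly 2.0
println(sum(softmax_sketch([1, 3, 2, 5, 3])))     # components sum to roughly 1
```

Because logit undoes logistic, the results above for logit(0.25) and logit(0.75) are mirror images of each other around zero.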

Finally, there is a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are also functions for the Beta, Binomial, Gamma and Hypergeometric distributions, among many others. The function naming convention is consistent across all distributions.

normpdf(0);                            # PDF
normlogpdf(0);                         # log PDF
normcdf(0);                            # CDF
normccdf(0);                           # Complementary CDF
normlogcdf(0);                         # log CDF
normlogccdf(0);                        # log Complementary CDF
norminvcdf(0.5);                       # inverse-CDF
norminvccdf(0.99);                     # inverse-Complementary CDF
norminvlogcdf(-0.693147180559945);     # inverse-log CDF
norminvlogccdf(-0.693147180559945);    # inverse-log Complementary CDF
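The output is suppressed above, so here’s a small plain-Julia sketch of what the first two functions compute for the standard Normal distribution, plus a note on where the argument -0.693147180559945 comes from. The _sketch names are hypothetical and only cover the standard (mean 0, standard deviation 1) case.

```julia
# Hypothetical helpers for illustration: standard Normal density and its log.
normpdf_sketch(z) = exp(-z^2 / 2) / sqrt(2π)
normlogpdf_sketch(z) = -z^2 / 2 - log(2π) / 2

println(normpdf_sketch(0))        # the peak of the bell curve, 1 / sqrt(2π)
println(normlogpdf_sketch(0))     # the log of that peak

# The value passed to the inverse-log functions above is log(0.5).
# Half of the probability lies on either side of zero, so inverting
# a log-CDF of log(0.5) should lead back to zero.
println(log(0.5))
```

Working on the log scale is handy far out in the tails, where the density itself becomes too small to represent accurately.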

StreamStats

Finally, the StreamStats package supports calculating online statistics for a stream of data which is being continuously updated.

average = StreamStats.Mean()
Online Mean
 * Mean: 0.000000
 * N:    0
variance = StreamStats.Var()
Online Variance
 * Variance: NaN
 * N:        0
for x in rand(10)
    update!(average, x)
    update!(variance, x)
    @printf("x = %f: mean = %.3f | variance = %.3f\n", x, state(average), state(variance))
end
x = 0.928564: mean = 0.929 | variance = NaN
x = 0.087779: mean = 0.508 | variance = 0.353
x = 0.253300: mean = 0.423 | variance = 0.198
x = 0.778306: mean = 0.512 | variance = 0.164
x = 0.566764: mean = 0.523 | variance = 0.123
x = 0.812629: mean = 0.571 | variance = 0.113
x = 0.760074: mean = 0.598 | variance = 0.099
x = 0.328495: mean = 0.564 | variance = 0.094
x = 0.303542: mean = 0.535 | variance = 0.090
x = 0.492716: mean = 0.531 | variance = 0.080
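An online mean and variance can be maintained without storing the data at all. The sketch below uses Welford’s algorithm, which is a standard technique for this; it’s an assumption for illustration and not a claim about how StreamStats is implemented. The type and function names are hypothetical, and the syntax assumes Julia 1.x.

```julia
# Hypothetical type for illustration: running count, mean and sum of
# squared deviations, updated one observation at a time.
mutable struct OnlineMoments
    n::Int
    mean::Float64
    m2::Float64
end
OnlineMoments() = OnlineMoments(0, 0.0, 0.0)

# Welford's update: adjust the mean, then accumulate the squared deviation
# using the distance to both the old and the new mean.
function update_sketch!(s::OnlineMoments, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Sample variance; undefined (NaN) until there are two observations.
variance_sketch(s::OnlineMoments) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

moments = OnlineMoments()
for x in [1.0, 2.0, 3.0, 4.0]
    update_sketch!(moments, x)
    println((x, moments.mean, variance_sketch(moments)))
end
```

As in the output above, the variance is NaN after the first observation, since a single point says nothing about spread.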

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.

That’s it for today. Check out the full code on GitHub.