# Day 26: Statistics

JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. Well worth taking a look if you plan on working in this domain.

x = rand(10);
mean(x)


0.5287191472784906

std(x)

0.2885446536178459


Julia already has some built-in support for statistical operations, so additional packages are not strictly necessary. However, they do increase the scope and ease of possible operations (as we’ll see below). Let’s kick off by loading all the packages that we’ll be looking at today.

using StatsBase, StatsFuns, StreamStats


## StatsBase

As the package name implies, StatsBase provides support for basic statistical operations in Julia. The package documentation covers the details.

High-level summary statistics are generated by summarystats().

summarystats(x)

Summary Stats:
Mean: 0.528719
Minimum: 0.064803
1st Quartile: 0.317819
Median: 0.529662
3rd Quartile: 0.649787
Maximum: 0.974760


Weighted versions of the mean, variance and standard deviation are implemented. There are also geometric and harmonic means (geomean() and harmmean()).

w = WeightVec(rand(1:10, 10));          # A weight vector.
mean(x, w)                              # Weighted mean.

0.48819933297961043

var(x, w)                               # Weighted variance.

0.08303843715334995

std(x, w)                               # Weighted standard deviation.

0.2881639067498738

skewness(x, w)                          # Weighted skewness.

0.11688162715805048

kurtosis(x, w)                          # Weighted excess kurtosis.

-0.9210456851144664

mean_and_std(x, w)

(0.48819933297961043,0.2881639067498738)
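
To make the weighted statistics less of a black box, here is a minimal sketch of the underlying formulas using only the standard library. The helper names (`weighted_mean`, `weighted_var`, `geometric_mean`, `harmonic_mean`) are hypothetical and are not part of StatsBase. The variance is assumed to be the uncorrected form, normalised by the total weight, so StatsBase may apply a different correction.

```julia
using Statistics  # standard library; provides mean()

# Hypothetical helpers for illustration only (not the StatsBase API).

# Weighted mean: each value contributes in proportion to its weight.
weighted_mean(x, w) = sum(w .* x) / sum(w)

# Weighted variance, uncorrected form (an assumption): the weighted average
# of squared deviations from the weighted mean.
function weighted_var(x, w)
    μ = weighted_mean(x, w)
    return sum(w .* (x .- μ) .^ 2) / sum(w)
end

# Geometric and harmonic means of strictly positive values.
geometric_mean(x) = exp(mean(log.(x)))
harmonic_mean(x) = 1 / mean(1 ./ x)

vals = [1.0, 2.0, 4.0]
wts = [1, 1, 2]

weighted_mean(vals, wts)          # (1 + 2 + 8) / 4 = 2.75
sqrt(weighted_var(vals, wts))     # Weighted standard deviation.
geometric_mean(vals)              # Cube root of 1 * 2 * 4.
harmonic_mean(vals)               # 3 / (1 + 1/2 + 1/4)
```

The weighted standard deviation is just the square root of the weighted variance, which is why mean_and_std() can return both from a single pass over the data.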


There’s a weighted median as well as functions for calculating quantiles.

median(x) # Median.

0.5296622773635412

median(x, w)                            # Weighted median.

0.5729104703595038

quantile(x)                             # Quartiles (0%, 25%, 50%, 75%, 100%).

5-element Array{Float64,1}:
0.0648032
0.317819
0.529662
0.649787
0.97476

nquantile(x, 8)                         # Octiles (9 cut points).

9-element Array{Float64,1}:
0.0648032
0.256172
0.317819
0.465001
0.529662
0.60472
0.649787
0.893513
0.97476

iqr(x) # Inter-quartile range.

0.3319677541313941
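
The quantile helpers above can be related to one another with the standard library's quantile() function. This is only a sketch: it assumes that the StatsBase conveniences correspond to evenly spaced probabilities, and the data values are made up for illustration.

```julia
using Statistics  # standard library; provides quantile() and median()

# Made-up data for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]

# Quartiles: the 0%, 25%, 50%, 75% and 100% points.
quartiles = quantile(data, 0:0.25:1)

# n-quantiles split the data into n equal-probability groups,
# which takes n + 1 cut points.
n = 8
octiles = quantile(data, (0:n) ./ n)

# Inter-quartile range: 3rd quartile minus 1st quartile.
interquartile = quartiles[4] - quartiles[2]
```

Note that the first and last cut points are simply the minimum and maximum, and the middle one is the median.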


Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

sample('a':'z', 5)                      # Sampling (with replacement).

5-element Array{Char,1}:
'w'
'x'
'e'
'e'
'o'

wsample(['T', 'F'], [5, 1], 10)         # Weighted sampling (with replacement).

10-element Array{Char,1}:
'F'
'T'
'T'
'T'
'F'
'T'
'T'
'T'
'T'
'T'
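
One common way to implement weighted sampling with replacement is to invert the cumulative weights. The sketch below illustrates that idea with a hypothetical helper, `weighted_sample_sketch`. It is not how StatsBase implements wsample(), which offers several algorithms.

```julia
# Hypothetical helper for illustration only (not the StatsBase implementation).
# Draws n items with replacement, choosing item i with probability
# weights[i] / sum(weights).
function weighted_sample_sketch(items, weights, n)
    cumulative = cumsum(weights)      # Running totals of the weights.
    total = cumulative[end]
    # (1 - rand()) lies in (0, 1], so each draw u lies in (0, total].
    # The chosen item is the first one whose running total reaches u,
    # so items with zero weight are never selected.
    return [items[searchsortedfirst(cumulative, (1 - rand()) * total)] for _ in 1:n]
end

weighted_sample_sketch(['T', 'F'], [5, 1], 10)   # Roughly five 'T' for every 'F'.
```

With weights of 5 and 1, about one draw in six should come out as 'F', which is consistent with the sample shown above.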


There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.
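
To give a flavour of what empirical estimation involves, here is a small sketch of an empirical CDF and an equal-width histogram. Both helpers (`ecdf_sketch` and `histogram_counts`) are hypothetical names written for this example, and StatsBase's own tools are more general. The histogram helper assumes the data are not all identical, otherwise the bin width would be zero.

```julia
# Hypothetical helpers for illustration only (not the StatsBase API).

# Empirical CDF at t: the fraction of observations less than or equal to t.
ecdf_sketch(data, t) = count(v -> v <= t, data) / length(data)

# Counts for nbins equal-width bins spanning [minimum, maximum].
# Assumes the data contain at least two distinct values.
function histogram_counts(data, nbins)
    lo, hi = extrema(data)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in data
        # Map the value to a bin index; the maximum falls into the last bin.
        i = min(floor(Int, (v - lo) / width) + 1, nbins)
        counts[i] += 1
    end
    return counts
end

obs = [0.0, 0.1, 0.5, 0.9, 1.0]
histogram_counts(obs, 2)     # Two bins: [0, 0.5) and [0.5, 1].
ecdf_sketch(obs, 0.5)        # Three of the five values are <= 0.5.
```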

## StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, twoπ and sqrt2.

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

logistic(-5)

0.0066928509242848554

logistic(5)

0.9933071490757153

logit(0.25)

-1.0986122886681098

logit(0.75)

1.0986122886681096

softmax([1, 3, 2, 5, 3])

5-element Array{Float64,1}:
0.0136809
0.101089
0.0371886
0.746952
0.101089
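
For reference, these three functions have simple textbook definitions, sketched below. The `_sketch` suffix marks them as hypothetical stand-ins so that they don't clash with the StatsFuns versions, whose implementations may differ in the details.

```julia
# Hypothetical stand-ins for illustration only (not the StatsFuns code).

# Logistic function: maps the real line onto (0, 1).
logistic_sketch(x) = 1 / (1 + exp(-x))

# Logit: the log-odds, which is the inverse of the logistic function.
logit_sketch(p) = log(p / (1 - p))

# Softmax: exponentiate and normalise so that the entries sum to 1.
function softmax_sketch(v)
    shifted = exp.(v .- maximum(v))   # Subtract the maximum for numerical stability.
    return shifted ./ sum(shifted)
end

logistic_sketch(0)                    # 0.5, the midpoint of the curve.
logit_sketch(logistic_sketch(2.0))    # Round trip, approximately 2.0.
softmax_sketch([1, 3, 2, 5, 3])       # Largest input gets the largest share.
```

Since logit is the inverse of logistic, logit(0.25) and logit(0.75) above are mirror images of each other around zero.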


Finally, there is a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are also functions for the Beta, Binomial, Gamma and Hypergeometric distributions and many others. The function naming convention is consistent across all distributions.

normpdf(0);                             # PDF
normlogpdf(0);                          # log PDF
normcdf(0);                             # CDF
normccdf(0);                            # Complementary CDF
normlogcdf(0);                          # log CDF
normlogccdf(0);                         # log Complementary CDF
norminvcdf(0.5);                        # inverse-CDF
norminvccdf(0.99);                      # inverse-Complementary CDF
norminvlogcdf(-0.693147180559945);      # inverse-log CDF
norminvlogccdf(-0.693147180559945);     # inverse-log Complementary CDF
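
To show how these quantities relate to one another, here is a rough sketch for the standard Normal distribution (mean 0, standard deviation 1). The `_sketch` functions are hypothetical. In particular, the CDF is approximated by trapezoidal integration of the PDF, which is a stand-in chosen for simplicity and not the method used by StatsFuns. The integration cutoff and step count are arbitrary assumptions.

```julia
# Hypothetical helpers for illustration only (not the StatsFuns code).

# PDF of the standard Normal distribution and its logarithm.
normpdf_sketch(x) = exp(-x^2 / 2) / sqrt(2π)
normlogpdf_sketch(x) = -x^2 / 2 - log(2π) / 2

# CDF by trapezoidal integration of the PDF from a far-left cutoff up to x.
# The cutoff and the number of steps are arbitrary choices.
function normcdf_sketch(x; lower = -10.0, steps = 10_000)
    x <= lower && return 0.0
    h = (x - lower) / steps
    interior = sum(normpdf_sketch(lower + k * h) for k in 1:steps-1)
    return h * ((normpdf_sketch(lower) + normpdf_sketch(x)) / 2 + interior)
end

# The complementary CDF is the upper tail probability.
normccdf_sketch(x) = 1 - normcdf_sketch(x)

normpdf_sketch(0)      # 1 / sqrt(2π)
normcdf_sketch(0.0)    # Approximately 0.5, by symmetry.
```

The argument -0.693147180559945 used in the inverse-log examples is log(0.5), so both calls ask for the point with half of the probability on either side, which is the median.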


## StreamStats

The StreamStats package supports calculating online statistics for a stream of data that is being continuously updated.

average = StreamStats.Mean()

Online Mean
* Mean: 0.000000
* N:    0

variance = StreamStats.Var()

Online Variance
* Variance: NaN
* N:        0

for x in rand(10)
    update!(average, x)
    update!(variance, x)
    @printf("x = %.6f: mean = %.3f | variance = %.3f\n", x, state(average), state(variance))
end

x = 0.928564: mean = 0.929 | variance = NaN
x = 0.087779: mean = 0.508 | variance = 0.353
x = 0.253300: mean = 0.423 | variance = 0.198
x = 0.778306: mean = 0.512 | variance = 0.164
x = 0.566764: mean = 0.523 | variance = 0.123
x = 0.812629: mean = 0.571 | variance = 0.113
x = 0.760074: mean = 0.598 | variance = 0.099
x = 0.328495: mean = 0.564 | variance = 0.094
x = 0.303542: mean = 0.535 | variance = 0.090
x = 0.492716: mean = 0.531 | variance = 0.080

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.
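
To see how a mean and variance can be updated one observation at a time without storing the stream, here is a minimal sketch based on Welford's algorithm. The `OnlineMoments` type and its functions are hypothetical names invented for this example. Whether StreamStats uses the same update internally is an assumption that has not been checked.

```julia
# Hypothetical type for illustration only (not the StreamStats implementation).
mutable struct OnlineMoments
    n::Int          # Number of observations seen so far.
    mean::Float64   # Running mean.
    m2::Float64     # Running sum of squared deviations from the mean.
end

OnlineMoments() = OnlineMoments(0, 0.0, 0.0)

# Fold one observation into the running state (Welford's update).
function observe!(s::OnlineMoments, x)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

online_mean(s::OnlineMoments) = s.mean

# Sample variance; NaN until at least two observations have arrived.
online_var(s::OnlineMoments) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Feed in the first three values from the output shown earlier.
stats = OnlineMoments()
for x in [0.928564, 0.087779, 0.253300]
    observe!(stats, x)
end

online_mean(stats)   # Approximately 0.423.
online_var(stats)    # Approximately 0.198.
```

The state is just three numbers no matter how long the stream runs. The variance is undefined after a single observation, which would explain the NaN on the first line of the earlier output.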

That’s it for today. Check out the full code on GitHub and watch the video below.