Using sums and counts for statistics
The definitions of the arithmetic mean have an appealingly trivial definition based on sum() and len(), which is as follows:
def mean(items): return sum(items)/len(items)
While elegant, this doesn't actually work for iterables. This definition only works for collections that support the len() function. This is easy to discover when trying to write proper type annotations. The definition of mean(items: Iterable)->float won't work because Iterable types don't support len().
Indeed, we have a hard time performing a simple computation of mean or standard deviation based on iterables. In Python, we must either materialize a sequence object or resort to somewhat more complex operations.
The definition needs to look like this:
from collections import Sequence
def mean(items: Sequence) -> float:
return sum(items)/len(items)
This includes the appropriate type hints to assure that sum() and len() will both work.
We have some alternative and elegant expressions for mean and standard deviation in the following definitions:
import math s0 = len(data) # sum(x**0 for x in data) s1 = sum(data) # sum(x**1 for x in data) s2 = sum(x**2 for x in data)
mean = s1/s0 stdev = math.sqrt(s2/s0 - (s1/s0)**2)
These three sums, s0, s1, and s2, have a tidy, parallel structure. We can easily compute the mean from two of the sums. The standard deviation is a bit more complex, but it's based on the three available sums.
This kind of pleasant symmetry also works for more complex statistical functions, such as correlation and even least-squares linear regression.
The moment of correlation between two sets of samples can be computed from their standardized value. The following is a function to compute the standardized value:
def z(x: float, m_x: float, s_x: float) -> float:
return (x-m_x)/s_x
The calculation is simply to subtract the mean, μ_x, from each sample, x, and divide by the standard deviation, σ_x. This gives us a value measured in units of sigma, σ. A value ±1 σ is expected about two-thirds of the time. Larger values should be less common. A value outside ±3 σ should happen less than one percent of the time.
We can use this scalar function as follows:
>>> d = [2, 4, 4, 4, 5, 5, 7, 9] >>> list(z(x, mean(d), stdev(d)) for x in d) [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
We've materialized a list that consists of normalized scores based on some raw data in the variable, d. We used a generator expression to apply the scalar function, z(), to the sequence object.
The mean() and stdev() functions are simply based on the examples shown previously:
def mean(samples: Sequence) -> float:
return s1(samples)/s0(samples)
def stdev(samples: Sequence) -> float:
N= s0(samples)
return sqrt((s2(samples)/N)-(s1(samples)/N)**2)
The three sum functions, similarly, can be defined, instead of the lambda forms, as shown in the following code:
def s0(samples: Sequence) -> float:
return sum(1 for x in samples) # or len(data) def s1(samples: Sequence) -> float:
return sum(x for x in samples) # or sum(data) def s2(samples: Sequence) -> float:
return sum(x*x for x in samples)
While this is very expressive and succinct, it's a little frustrating because we can't simply use an iterable here. When computing a mean, both a sum of the iterable and a count of the iterable are required. For the standard deviation, two sums and a count of the iterable are all required. For this kind of statistical processing, we must materialize a sequence object so that we can examine the data multiple times.
The following code shows how we can compute the correlation between two sets of samples:
def corr(samples1: Sequence, samples2: Sequence) -> float:
m_1, s_1 = mean(samples1), stdev(samples1)
m_2, s_2 = mean(samples2), stdev(samples2)
z_1 = (z( x, m_1, s_1 ) for x in samples1)
z_2 = (z( x, m_2, s_2 ) for x in samples2)
r = (sum(zx1*zx2 for zx1, zx2 in zip(z_1, z_2))
/ len(samples1))
return r
This correlation function gathers basic statistical summaries of the two sets of samples: the mean and standard deviation. Given these summaries, we define two generator functions that will create normalized values for each set of samples. We can then use the zip() function (see the next example) to pair up items from the two sequences of normalized values and compute the product of those two normalized values. The average of the product of the normalized scores is the correlation.
The following code is an example of gathering the correlation between two sets of samples:
>>> # Height (m)
>>> xi= [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
... 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83,] >>> # Mass (kg)
>>> yi= [52.21,53.12,54.48,55.84,57.20,58.57,59.93,61.29,
... 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46,] >>> round(corr( xi, yi ), 5) 0.99458
We've shown you two sequences of data points, xi and yi. The correlation is over .99, which shows a very strong relationship between the two sequences.
This shows us one of the strengths of functional programming. We've created a handy statistical module using a half-dozen functions with definitions that are single expressions. The counterexample is the corr() function that can be reduced to a single very long expression. Each internal variable in this function is used just once; a local variable can be replaced with a copy and paste of the expression that created it. This shows us that the corr() function has a functional design, even though it's written out in six separate lines of Python.