Covariance matrices

Square matrices capturing covariance among random vectors or data.

Covariance matrices capture the covariance among variables in data sets, and they are among the most basic tools used in data analysis.

Stats preparations: sample mean

In our context, a sample is simply a list of numbers. In statistics, "samples" are observations collected from "random variables".

E.g., the list of numbers \[ 8, 9, 11, 13, 9 \] may represent samples of the ages of trees in a forest (which may have more trees). We may write these numbers as a column (or row) vector.

The sample mean in this example is \[ \frac{1}{5} ( 8 + 9 + 11 + 13 + 9 ) = 10 \]

In general, the sample mean of a list of numbers $x_1,\ldots,x_n$ is \[ \frac{1}{n} (x_1 + \cdots + x_n) = \frac{1}{n} \sum_{i=1}^n x_i. \]

Very often, we use the notation $\overline{x}$ for the sample mean of $x_1,\ldots,x_n$.
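As a quick check of the formula, here is a minimal sketch in plain Python that computes the sample mean of the tree-age example. The variable names (`ages`, `sample_mean`) are chosen here for illustration only.

```python
# Sample mean of the tree-age example: (x_1 + ... + x_n) / n
ages = [8, 9, 11, 13, 9]

n = len(ages)                # number of samples
sample_mean = sum(ages) / n  # add everything up, divide by n

print(sample_mean)  # 10.0
```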

Stats preparations: sample variance

In statistics and data science, variance measures how far a set of values is spread out from their mean value.

Samples 8, 9, 11, 13, 9: mean value 10, small variance.

Samples 1, 2, 0, 20, 27: mean value 10, larger variance.

The (unbiased) sample variance of $x_1,\dots,x_n$ is \[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2. \]

The divisor $(n-1)$ may look very strange. It corrects for the bias that arises when the sample mean $\overline{x}$ is used in place of the (unknown) population mean to estimate the population variance. (Long story!)
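To make the formula concrete, the following sketch computes the unbiased sample variance of the two lists 8, 9, 11, 13, 9 and 1, 2, 0, 20, 27, which share the mean value 10. The helper name `sample_variance` is made up for this illustration. Python's standard `statistics.variance` uses the same $(n-1)$ divisor and serves as a cross-check.

```python
import statistics

def sample_variance(xs):
    """Unbiased sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

tight = [8, 9, 11, 13, 9]    # values close to the mean 10
spread = [1, 2, 0, 20, 27]   # same mean 10, values far from it

print(sample_variance(tight))   # 4.0
print(sample_variance(spread))  # 158.5

# Cross-check against the standard library (also divides by n - 1).
print(statistics.variance(tight), statistics.variance(spread))
```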

Covariance

In statistics, covariance is a generalization of variance to a pair of random variables. It measures their joint variability.

Here, we focus on the "sample" covariance, i.e., the covariance we can compute from a finite number of samples.

Suppose that, for a pair of variables $X$ and $Y$, we have $n$ samples \[ (x_1,y_1), \ldots, (x_n,y_n). \] Then the (unbiased) sample covariance is \[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y}) \] where $\overline{x}$ and $\overline{y}$ are the sample means of the $x$'s and $y$'s.
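The sketch below evaluates this formula directly in plain Python. Pairing the two lists from the variance section as $(x_i, y_i)$ is an assumption made purely for illustration, and the helper name `sample_covariance` is hypothetical.

```python
def sample_covariance(xs, ys):
    """Unbiased sample covariance of paired samples (x_i, y_i)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, divided by n - 1.
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [8, 9, 11, 13, 9]
y = [1, 2, 0, 20, 27]

print(sample_covariance(x, y))  # 7.25
print(sample_covariance(x, x))  # 4.0, the sample variance of x
```

The second call shows why covariance generalizes variance: the covariance of a variable with itself is its variance.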

Covariance matrix (column samples)

Suppose $X$ is a matrix that captures some sample data, in which rows represent different variables, and columns represent different samples.

The sample covariance matrix is \[ Q_{XX} = \frac{1}{n-1} (X - \overline{X}) (X-\overline{X})^\top \] where $n$ is the number of columns in $X$, and $\overline{X}$ is a matrix of the same size as $X$ in which every entry of a row equals the mean of the corresponding row of $X$.
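A minimal NumPy sketch of this formula follows, assuming a small made-up data matrix with 2 variables (rows) and 5 samples (columns). NumPy broadcasting plays the role of $\overline{X}$: subtracting the column vector of row means subtracts each row's mean from every entry of that row. `np.cov`, which by default also treats rows as variables and divides by $n-1$, is used as a cross-check.

```python
import numpy as np

# Rows are variables, columns are samples (illustrative data).
X = np.array([[8, 9, 11, 13, 9],
              [1, 2, 0, 20, 27]], dtype=float)

n = X.shape[1]                            # number of samples = number of columns
row_means = X.mean(axis=1, keepdims=True) # shape (2, 1): one mean per variable
centered = X - row_means                  # X minus X-bar, via broadcasting

Q = centered @ centered.T / (n - 1)       # 2 x 2 sample covariance matrix
print(Q)

# Diagonal entries are the variances (4 and 158.5);
# off-diagonal entries are the covariance (7.25).
print(np.allclose(Q, np.cov(X)))
```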

Covariance matrix (row samples)

Similarly, suppose $X$ is a matrix that captures some sample data, in which columns represent different variables, and rows represent different samples. (This is the opposite arrangement.)

The sample covariance matrix is \[ Q_{XX} = \frac{1}{n-1} (X - \overline{X})^\top (X-\overline{X}) \] where $n$ is the number of rows in $X$, and $\overline{X}$ is a matrix of the same size as $X$ in which every entry of a column equals the mean of the corresponding column of $X$.
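The row-sample version can be sketched the same way in NumPy. The data matrix here is made up for illustration, with 5 samples (rows) of 2 variables (columns). Note that `np.cov` needs `rowvar=False` for this layout, since its default assumes that rows are variables.

```python
import numpy as np

# Rows are samples, columns are variables (illustrative data).
X = np.array([[8, 1],
              [9, 2],
              [11, 0],
              [13, 20],
              [9, 27]], dtype=float)

n = X.shape[0]                            # number of samples = number of rows
col_means = X.mean(axis=0, keepdims=True) # shape (1, 2): one mean per variable
centered = X - col_means                  # X minus X-bar, via broadcasting

Q = centered.T @ centered / (n - 1)       # 2 x 2 sample covariance matrix
print(Q)

# Same result from NumPy, told that columns are the variables.
print(np.allclose(Q, np.cov(X, rowvar=False)))

# Transposing the data switches to the column-sample layout
# and yields the same covariance matrix.
print(np.allclose(Q, np.cov(X.T)))
```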

Red wine quality data

Fixed acidity   Volatile acidity   Citric acid   Residual sugar   Chlorides
7.4             0.7                0.0           1.9              0.076
7.8             0.88               0.0           2.6              0.098
7.8             0.76               0.04          2.3              0.092
11.2            0.28               0.56          1.9              0.075
7.4             0.7                0.0           1.9              0.076
7.4             0.66               0.0           1.8              0.075
7.9             0.6                0.06          1.6              0.069
7.3             0.65               0.0           1.2              0.065

This is a very large data set from the UCI Machine Learning Repository. We show only 5 variables and a few rows as examples, so a covariance calculation on this excerpt alone is rather meaningless.
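Purely to show the mechanics, the sketch below applies the row-sample formula to the eight rows listed above, using NumPy. The numbers it produces describe only this tiny excerpt and should not be read as properties of the full data set. The names `wine` and `columns` are chosen here for illustration.

```python
import numpy as np

columns = ["Fixed acidity", "Volatile acidity", "Citric acid",
           "Residual sugar", "Chlorides"]

# The eight rows shown above: rows are samples, columns are variables.
wine = np.array([
    [7.4,  0.70, 0.00, 1.9, 0.076],
    [7.8,  0.88, 0.00, 2.6, 0.098],
    [7.8,  0.76, 0.04, 2.3, 0.092],
    [11.2, 0.28, 0.56, 1.9, 0.075],
    [7.4,  0.70, 0.00, 1.9, 0.076],
    [7.4,  0.66, 0.00, 1.8, 0.075],
    [7.9,  0.60, 0.06, 1.6, 0.069],
    [7.3,  0.65, 0.00, 1.2, 0.065],
])

n = wine.shape[0]
centered = wine - wine.mean(axis=0, keepdims=True)  # subtract column means
Q = centered.T @ centered / (n - 1)                 # 5 x 5 covariance matrix

# Print one labelled row of the covariance matrix per variable.
for name, row in zip(columns, Q):
    print(f"{name:>16}: {np.round(row, 4)}")
```

The result is a symmetric $5 \times 5$ matrix: entry $(i, j)$ is the sample covariance between variables $i$ and $j$, and the diagonal holds the sample variances.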