Square matrices capturing covariance among random vectors or data.
Covariance matrices capture the covariance among the variables in a data set, and they are among the most basic tools used in data analysis.
In our context, a sample is simply a list of numbers. In statistics, "samples" are observed values collected from "random variables".
For example, the list of numbers \[ 8, 9, 11, 13, 9 \] may represent samples of the ages of trees in a forest (which may contain many more trees). We may write these numbers as a column (or row) vector.
The sample mean in this example is \[ \frac{1}{5} ( 8 + 9 + 11 + 13 + 9 ) = 10 \]
In general, the sample mean of a list of numbers $x_1,\ldots,x_n$ is \[ \frac{1}{n} (x_1 + \cdots + x_n) = \frac{1}{n} \sum_{i=1}^n x_i. \]
Very often, we use the notation $\overline{x}$ for the sample mean of $x_1,\ldots,x_n$.
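As a quick check of the arithmetic, the sample mean can be computed in a few lines of Python. This is only a minimal sketch; the names `sample_mean` and `ages` are illustrative and do not come from the original notes.

```python
def sample_mean(xs):
    """Return the sample mean (1/n) * (x_1 + ... + x_n)."""
    if not xs:
        raise ValueError("sample mean of an empty list is undefined")
    return sum(xs) / len(xs)


# The tree-age example from the text.
ages = [8, 9, 11, 13, 9]
mean_age = sample_mean(ages)
print(mean_age)  # 10.0
```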
In statistics and data science, variance measures how far a set of values is spread out from its mean.
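For example, the two lists below both have sample mean 10, but the second is spread out much more widely.

First list | Second list |
---|---|
8 | 1 |
9 | 2 |
11 | 0 |
13 | 20 |
9 | 27 |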
The (unbiased) sample variance of $x_1,\dots,x_n$ is \[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2. \]
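The divisor $(n-1)$ may look very strange. It corrects a bias: because the deviations are measured from the sample mean $\overline{x}$ rather than from the true population mean, dividing by $n$ would systematically underestimate the population variance. (Long story!)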
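The formula can be sketched directly in Python. The example below applies it to the two five-number lists shown above, which share the mean 10 but differ greatly in spread. The names `sample_variance`, `list_a`, and `list_b` are illustrative choices, not part of the original notes.

```python
def sample_variance(xs):
    """Unbiased sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


list_a = [8, 9, 11, 13, 9]   # tightly clustered around 10
list_b = [1, 2, 0, 20, 27]   # same mean, much wider spread

var_a = sample_variance(list_a)
var_b = sample_variance(list_b)
print(var_a)  # 4.0
print(var_b)  # 158.5
```

Python's standard library function `statistics.variance` uses the same $(n-1)$ divisor, so it can serve as a cross-check.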
In statistics, covariance is a generalization of variance to a pair of random variables. It measures their joint variability.
Here, we focus on the "sample" covariance, i.e., the covariance we can compute from a finite number of samples.
Suppose that, for a pair of variables $X$ and $Y$, we have $n$ samples \[ (x_1,y_1), \ldots, (x_n,y_n). \] Then the (unbiased) sample covariance is \[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y}), \] where $\overline{x}$ and $\overline{y}$ are the sample means of the $x_i$'s and the $y_i$'s.
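A minimal sketch of this formula in Python follows. Pairing the two five-number lists from earlier as $(x_i, y_i)$ is purely an illustrative assumption made here; the notes do not say these lists were measured together. The function name `sample_covariance` is likewise hypothetical.

```python
def sample_covariance(xs, ys):
    """Unbiased sample covariance of paired samples (x_i, y_i)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two paired samples")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Illustrative pairing of the two lists used earlier (an assumption).
xs = [8, 9, 11, 13, 9]
ys = [1, 2, 0, 20, 27]

cov_xy = sample_covariance(xs, ys)
print(cov_xy)                     # 7.25
print(sample_covariance(xs, xs))  # 4.0, the sample variance of xs
```

The second call shows that the covariance of a variable with itself reduces to its sample variance, which is the sense in which covariance generalizes variance.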
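Suppose $X$ is a matrix that captures some sample data, in which rows represent different variables and columns represent different samples. If $n$ is the number of samples and $\widetilde{X}$ denotes $X$ with the mean of each row subtracted from that row, the (unbiased) sample covariance matrix is \[ \frac{1}{n-1} \widetilde{X} \widetilde{X}^\top. \] Its $(i,j)$ entry is the sample covariance between variables $i$ and $j$, and its diagonal entries are the sample variances.
Similarly, suppose $X$ is a matrix that captures some sample data, in which columns represent different variables and rows represent different samples (the opposite arrangement). With $\widetilde{X}$ now denoting $X$ with the mean of each column subtracted from that column, the sample covariance matrix is \[ \frac{1}{n-1} \widetilde{X}^\top \widetilde{X}. \]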
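The two arrangements can be sketched in plain Python, assuming the usual definition in which entry $(i,j)$ of the covariance matrix is the sample covariance between variables $i$ and $j$. The function name `covariance_matrix` and the small data matrix (built from the two five-number lists used earlier) are illustrative assumptions.

```python
def covariance_matrix(rows):
    """Sample covariance matrix when rows are variables and columns are samples."""
    n = len(rows[0])
    if n < 2 or any(len(row) != n for row in rows):
        raise ValueError("every variable needs the same number (>= 2) of samples")
    # Subtract each variable's mean from its own row.
    centered = []
    for row in rows:
        mean = sum(row) / n
        centered.append([value - mean for value in row])
    # Entry (i, j) is the dot product of centered rows i and j, divided by n - 1.
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centered]
        for ri in centered
    ]


# Rows are variables, columns are samples.
X = [
    [8, 9, 11, 13, 9],
    [1, 2, 0, 20, 27],
]
C = covariance_matrix(X)
print(C)  # [[4.0, 7.25], [7.25, 158.5]]

# Opposite arrangement: rows are samples, columns are variables.
# Transposing first lets the same function be reused.
X_samples = [list(sample) for sample in zip(*X)]
C_from_samples = covariance_matrix([list(col) for col in zip(*X_samples)])
print(C_from_samples == C)  # True
```

If NumPy is available, `numpy.cov(X)` treats rows as variables by default and `numpy.cov(X, rowvar=False)` handles the opposite arrangement; both use the $(n-1)$ divisor by default.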
Fixed acidity | Volatile acidity | Citric acid | Residual sugar | Chlorides |
---|---|---|---|---|
7.4 | 0.7 | 0.0 | 1.9 | 0.076 |
7.8 | 0.88 | 0.0 | 2.6 | 0.098 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 |
7.4 | 0.7 | 0.0 | 1.9 | 0.076 |
7.4 | 0.66 | 0.0 | 1.8 | 0.075 |
7.9 | 0.6 | 0.06 | 1.6 | 0.069 |
7.3 | 0.65 | 0.0 | 1.2 | 0.065 |
This is a very large data set from the UCI Machine Learning Repository. We are showing only 5 variables and a few rows as examples; consequently, a covariance calculation on this excerpt is rather meaningless.
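Purely to show the mechanics on a table laid out with rows as samples and columns as variables, the sketch below computes the $5 \times 5$ sample covariance matrix of the eight rows shown above. As noted, numbers computed from such a small excerpt carry little statistical meaning. The function name `covariance_matrix_from_samples` is a hypothetical choice.

```python
names = ["Fixed acidity", "Volatile acidity", "Citric acid",
         "Residual sugar", "Chlorides"]

# The eight rows shown in the table: each row is one sample.
samples = [
    [7.4, 0.70, 0.00, 1.9, 0.076],
    [7.8, 0.88, 0.00, 2.6, 0.098],
    [7.8, 0.76, 0.04, 2.3, 0.092],
    [11.2, 0.28, 0.56, 1.9, 0.075],
    [7.4, 0.70, 0.00, 1.9, 0.076],
    [7.4, 0.66, 0.00, 1.8, 0.075],
    [7.9, 0.60, 0.06, 1.6, 0.069],
    [7.3, 0.65, 0.00, 1.2, 0.065],
]


def covariance_matrix_from_samples(rows):
    """Sample covariance matrix when rows are samples and columns are variables."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two samples")
    # Regroup the data by variable, then center each variable on its mean.
    columns = [list(col) for col in zip(*rows)]
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([value - mean for value in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


C = covariance_matrix_from_samples(samples)
for name, row in zip(names, C):
    print(f"{name:>16}", " ".join(f"{value:9.5f}" for value in row))
```

The diagonal of the printed matrix holds the sample variance of each column, and the matrix is symmetric because the covariance of variables $i$ and $j$ does not depend on their order.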