
One Gaussian per cluster

Suppose I have a cluster of points in ${\fam11\tenbbb R}^n$, say the heights and weights of a number of adult males, as in chapter one, Fig. 1.2, looking at only the Xs. In low dimensions such as this we can look at them and say to ourselves `yes, that can be modelled by a gaussian.' Well of course it can. Any set of points can be. Let us remind ourselves of the elementary methods of doing so.

Let

\begin{displaymath}
{\cal X} = \left\{ \left(\begin{array}{c} x^1_1 \\ x^2_1 \\ \vdots \\ x^n_1 \end{array}\right),
\left(\begin{array}{c} x^1_2 \\ x^2_2 \\ \vdots \\ x^n_2 \end{array}\right),
\ldots,
\left(\begin{array}{c} x^1_M \\ x^2_M \\ \vdots \\ x^n_M \end{array}\right) \right\} \end{displaymath}

be a set of M points in ${\fam11\tenbbb R}^n$. The centre of the M points (or centroid, or centre of gravity) is the point

\begin{displaymath}
{\bf m} = \left(\begin{array}{c} \bar{x}^1 \\ \bar{x}^2 \\ \vdots \\ \bar{x}^n \end{array}\right) \end{displaymath}

where $\bar{x}^i = \frac{1}{M} \sum_{j=1}^{M} x^i_j$, i.e. each component is the mean or average of the M values of that component obtained from the M points.

In vector notation we write simply:

\begin{displaymath}
{\bf m}= \frac{1}{M} \sum_{{\bf x}\in {\cal X}} {\bf x}\end{displaymath}
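
As a purely illustrative aside (not part of the original argument), the centroid is a one-line computation in practice. The sketch below assumes Python with NumPy and a hypothetical data set X whose rows are the M points of ${\cal X}$; none of these names are fixed by the text.

\begin{verbatim}
import numpy as np

# Hypothetical data set: M = 5 points in R^2 (say heights and weights),
# one point per row, so X has shape (M, n).
X = np.array([[1.80, 82.0],
              [1.72, 70.0],
              [1.91, 95.0],
              [1.65, 62.0],
              [1.78, 77.0]])

# The centroid m: each component is the mean of that component over the M points.
m = X.mean(axis=0)
\end{verbatim}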

We can take it that the centroid of a set of points gives us a first order description of the set in terms of a single vector. To say more about the set we proceed to the second order moments. These are the average values of the terms $(x^i- \bar{x}^i)(x^j - \bar{x}^j)$ over the given vectors, and there are $n^2$ of them, so we arrange them in an n by n matrix called the covariance matrix. Formally, the covariance matrix ${\bf V}$ will have the entry in row i and column j given by

\begin{displaymath}
{\bf V}_j^i = \frac{1}{(M-1)} \sum_{r=1}^M 
(x_r^i - \bar{x}^i)(x_r^j - \bar{x}^j) \end{displaymath}

In vector notation this may be compactly written as

\begin{displaymath}
{\bf V} = \frac{1}{(M-1)} \sum_{{\bf x}\in {\cal 
X}} ({\bf x}- {\bf m}) ({\bf x}- {\bf m})^T \end{displaymath}

The use of M-1, where the innocent might have expected M, occurs for reasons into which I shall not go; intuitively one can argue that in subtracting off the mean, ${ {\bf m}}$, from every point, we have thrown away one point's worth of information. It is not unlike subtracting off a point from a set of M points in linear algebra in order to see if the original points lie on a line or plane; one looks at the M-1 non-zero points to see if they are linearly dependent. Here we have not subtracted off one of the points, we have subtracted off $(\frac{1}{M})^{th}$ of each of them.
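
To make the definition concrete, here is a minimal continuation of the sketch above (still assuming the NumPy arrays X and m); it accumulates the outer products $({\bf x}-{\bf m})({\bf x}-{\bf m})^T$ and divides by M-1, then checks the result against NumPy's own estimator, which uses the same M-1 convention.

\begin{verbatim}
M, n = X.shape

# Sum the outer products (x - m)(x - m)^T over the data and divide by M - 1.
V = np.zeros((n, n))
for x in X:
    d = (x - m).reshape(n, 1)      # the column vector x - m
    V += d @ d.T
V /= (M - 1)

# Cross-check: np.cov with rowvar=False uses the same M - 1 normalisation.
assert np.allclose(V, np.cov(X, rowvar=False))
\end{verbatim}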

The matrix ${\bf V}$ is of course symmetric by construction, and (although this is not so immediately evident) positive semi-definite. Since it is symmetric, elementary linear algebra assures us that it may be diagonalised by a rotation matrix, i.e. there is an orthogonal matrix ${\bf Q}$ with determinant 1, and a diagonal matrix ${\bf \Lambda}$, so that ${\bf Q} {\bf \Lambda} {\bf Q}^{-1} = {\bf V}$. The diagonal elements of ${\bf \Lambda}$ are called the eigenvalues of ${\bf V}$, the image of the standard basis in ${\fam11\tenbbb R}^n$ under ${\bf Q}$ is called the eigenbasis for ${\bf V}$, and the images of the basis vectors are called the eigenvectors. Traditionally, any multiple of an eigenvector is also an eigenvector, so when the books refer to an eigenvector they often mean a one dimensional eigenspace, and sometimes they just mean an eigenspace.

${\bf Q}$ is unique provided that the eigenvalues are all different, whereupon the eigenspaces are all one dimensional. Each eigenvalue is non-negative (an immediate consequence of the positive semi-definiteness). If the eigenvalues are all positive, then the matrix ${\bf V}$ is non-singular and is positive definite. It is again a trivial consequence of elementary linear algebra (and immediately apparent to the meanest intellect) that in this case ${\bf V}^{-1}$ is diagonalisable by the same matrix ${\bf Q}$ and has diagonal matrix the inverse of ${\bf \Lambda}$, which simply has the reciprocals of the eigenvalues in the corresponding places. Moreover, there is a symmetric square root of ${\bf V}$, conveniently called ${\bf V}^{\frac{1}{2}}$, which can be diagonalised by the same matrix ${\bf Q}$ and has as diagonal terms the square roots of the eigenvalues in the corresponding places. By a square root, I mean simply that ${\bf V}^{\frac{1}{2}} {\bf V}^{\frac{1}{2}} = {\bf V}$.
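
The diagonalisation, the inverse and the symmetric square root can all be read off from a single eigendecomposition. The sketch below continues the earlier NumPy example and assumes that V is non-singular (all eigenvalues strictly positive).

\begin{verbatim}
# Eigendecomposition of the symmetric matrix V: the columns of Q are
# (orthonormal) eigenvectors and lam holds the corresponding eigenvalues.
lam, Q = np.linalg.eigh(V)
Lambda = np.diag(lam)

assert np.allclose(Q @ Lambda @ Q.T, V)        # Q Lambda Q^{-1} = V

# With all eigenvalues positive, the same Q diagonalises the inverse and a
# symmetric square root, with reciprocal and square-rooted eigenvalues.
V_inv  = Q @ np.diag(1.0 / lam)    @ Q.T
V_sqrt = Q @ np.diag(np.sqrt(lam)) @ Q.T

assert np.allclose(V_inv @ V, np.eye(len(lam)))
assert np.allclose(V_sqrt @ V_sqrt, V)
\end{verbatim}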

Having obtained from the original data the centre m and this matrix ${\bf V}$, I shall now define the function

\begin{displaymath}
g_{[{\bf m},{\bf V}]}: {\fam11\tenbbb R}^n \longrightarrow {\fam11\tenbbb R}\end{displaymath}

by

\begin{displaymath}
g_{[{\bf m},{\bf V}]} ({\bf x}) =
 \frac{1}{(\sqrt{2\pi})^n \sqrt{\det {\bf V}}} \,
 e^{- \frac{({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) }{2}} \end{displaymath}

and assert that this function is going to be my probabilistic model for the process which produced the data set, or alternatively my preferred device for representing the data compactly.
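
The function itself is straightforward to evaluate from the formula. The sketch below (again an illustration, not part of the text) uses the m and V computed earlier; the comparison against scipy.stats.multivariate_normal is only a cross-check and assumes SciPy is available.

\begin{verbatim}
from scipy.stats import multivariate_normal

def gaussian_pdf(x, m, V):
    """Evaluate g_[m,V](x) exactly as in the displayed formula."""
    n = len(m)
    d = x - m
    norm = np.sqrt(2.0 * np.pi) ** n * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * d @ np.linalg.solve(V, d)) / norm

x0 = np.array([1.75, 75.0])        # a hypothetical query point
print(gaussian_pdf(x0, m, V))
print(multivariate_normal(mean=m, cov=V).pdf(x0))   # should agree
\end{verbatim}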

If a rationale for this choice rather than some other is required, and the enquiring mind might well ask for one, I offer several:

First, it is not too hard to prove that of all choices of ${ {\bf m}}$ and ${\bf V}$, this choice gives the maximum likelihood for the original data. So given that I have a preference for gaussians, this particular gaussian would seem to be the best choice.

Second, I can argue that just as the centre ${ {\bf m}}$ contains first order information telling us about the data, so ${\bf V}$ gives us second order information: the central moments of second order are being computed, up to a scalar multiple.

Third, it is easy to compute the vector ${ {\bf m}}$ and the matrix ${\bf V}$.

Fourth, I may rationalise my preference for gaussians via the central limit theorem; this allows me to observe that quite a lot of data is produced by some process which aims at a single value and which is perturbed by `noise' so that a large number of small independent factors influence the final value by each adding some disturbance to the target value. In the case where the number of factors gets larger and larger and the additive influence of each one becomes smaller and smaller, we get a gaussian distribution. Of course, it is hard to see how such an assumption about the data could be verified directly, but many people find this argument comforting.
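
For readers who prefer to see this rather than take it on trust, the following sketch (my own illustration, with arbitrary choices of target value and noise) adds up a large number of small independent uniform disturbances; a histogram of the result is very close to a gaussian centred on the target.

\begin{verbatim}
rng = np.random.default_rng(0)

target = 10.0        # the value the process 'aims at' (hypothetical)
k = 200              # number of small independent disturbing factors
samples = target + rng.uniform(-0.1, 0.1, size=(100_000, k)).sum(axis=1)

# The sample mean sits near the target, and a histogram of `samples`
# is, to the eye, indistinguishable from a gaussian.
print(samples.mean(), samples.std())
\end{verbatim}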

And finally, if it doesn't look to be doing a good job, it is possible to discover this fact and abandon the model class, which is comforting. In that event, I have other recourses of which you will learn more ere long.

These justifications are not likely to satisfy the committed Platonist philosopher, who will want to be persuaded that this choice is transcendentally right, or at least as close as can be got. But then, keeping Platonist philosophers happy is not my job.

In dimension one we can draw the graph of the function and the data set from which it was obtained by the above process. The covariance matrix ${\bf V}$ reduces to a single positive number, the variance, and its square root is usually called the standard deviation and written $\sigma$.

The points satisfying $({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) = 1$ are therefore the numbers ${\bf m}\pm \sigma$.

In dimension two, it is possible to draw the graph of the function, but also possible to sketch the sections, the curves of constant height. These are ellipses. In particular, the points ${\bf x}$ satisfying

\begin{displaymath}
({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) = 1\end{displaymath}

fall along an ellipse in the plane. A simple computation shows that the integral of the function over the interior of this ellipse has the value $1-e^{- \frac{1}{2}}$, which is about 0.39. Thus almost 40% of the points distributed according to a gaussian lie within one `standard deviation' of the centre in two dimensions. This contrasts with the one dimensional case of course, where the figure is about 68%.

In dimension n, the set

\begin{displaymath}
\{ {\bf x}\in {\fam11\tenbbb R}^n : ({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) 
= 1 \}\end{displaymath}

which might reasonably be called the set of points at one standard deviation, is a hyperellipsoid. This follows of course from ${\bf V}$, and hence ${\bf V}^{-1}$, being positive definite. If any of the eigenvalues of ${\bf V}$ are zero, ${\bf V}$ is only positive semi-definite, its inverse does not exist, and the hyperellipsoid is said to be degenerate. The proportion of the distribution within one standard deviation goes down as the dimension goes up.
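
This last claim is easy to check numerically: for a gaussian, the quadratic form $({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m})$ has a chi-squared distribution with n degrees of freedom, so the proportion inside the hyperellipsoid is the chi-squared distribution function evaluated at 1. A sketch, assuming SciPy:

\begin{verbatim}
from scipy.stats import chi2

# Probability mass of an n-dimensional gaussian inside the `one standard
# deviation' hyperellipsoid (x - m)^T V^{-1} (x - m) <= 1.
for n in (1, 2, 3, 5, 10):
    print(n, chi2.cdf(1.0, df=n))
# n = 1 gives about 0.683, n = 2 gives 1 - exp(-1/2) = 0.393..., and the
# proportion keeps shrinking as n grows.
\end{verbatim}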

It is easy enough for even the least imaginative of computer programmers to visualise a cluster of points sitting in three dimensions and a sort of squashed football sitting around them so as to enclose a reasonable percentage. Much past three, only the brave and the insane dare venture; into which category we fall I leave to you to determine.



 