Suppose I have a cluster of points in ${\mathbb R}^2$,
say the heights and weights of a number of adult
males, as in chapter one, Fig. 1.2, looking
at only the Xs. In low dimensions such as this
we can look at them and say to ourselves `yes,
that can be modelled by a gaussian.' Well
of course it can. Any set of points can
be. Let us remind ourselves of the elementary
methods
of doing so.
Let
\begin{displaymath}
\{ {\bf x}^1, {\bf x}^2, \ldots, {\bf x}^M \}
\end{displaymath}
be a set of $M$ points in ${\mathbb R}^n$. The centre ${\bf m}$
of the $M$ points (or centroid, or
centre of gravity) is the point
\begin{displaymath}
{\bf m} = (m_1, m_2, \ldots, m_n)^T
\end{displaymath}
where $m_j = \frac{1}{M} \sum_{i=1}^{M} x^i_j$, i.e. each component is the mean or
average of the $M$ values of that component obtained
from the $M$ points.
In vector notation we write simply:
\begin{displaymath}
{\bf m} = \frac{1}{M} \sum_{i=1}^{M} {\bf x}^i .
\end{displaymath}
The covariance matrix ${\bf V}$ of the data has for its $jk$th entry the number
\begin{displaymath}
v_{jk} = \frac{1}{M} \sum_{i=1}^{M} (x^i_j - m_j)(x^i_k - m_k) .
\end{displaymath}
In vector notation this may be compactly written as
\begin{displaymath}
{\bf V} = \frac{1}{M} \sum_{i=1}^{M} ({\bf x}^i - {\bf m})({\bf x}^i - {\bf m})^T .
\end{displaymath}
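For the computationally inclined reader, here is a minimal numpy sketch of these two calculations. It is my own illustration rather than anything from the text: the toy height and weight numbers, and the names X, m and V, are invented for the purpose.
\begin{verbatim}
import numpy as np

# Toy data: M points in R^n, one row per point (here n = 2,
# standing in for the height/weight example).
X = np.array([[1.80, 75.0],
              [1.72, 80.0],
              [1.91, 95.0],
              [1.65, 61.0]])
M, n = X.shape

# Centre (centroid): the componentwise mean over the M points.
m = X.mean(axis=0)

# Covariance matrix V = (1/M) * sum_i (x^i - m)(x^i - m)^T.
# Note the 1/M normalisation used here, not the 1/(M-1) that
# np.cov would use by default.
D = X - m                 # row i is x^i - m
V = (D.T @ D) / M

print(m)
print(V)
\end{verbatim}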
The matrix ${\bf V}$ is of course symmetric by
construction, and (although this is not so immediately
evident) positive semi-definite. Since it is symmetric,
elementary linear algebra assures us that it
may be diagonalised by a rotation matrix, i.e.
there is an orthogonal matrix ${\bf Q}$ with determinant
1, and a diagonal matrix ${\bf \Lambda}$,
so that ${\bf V} = {\bf Q} {\bf \Lambda} {\bf Q}^T$. The diagonal elements of
${\bf \Lambda}$ are called the eigenvalues of ${\bf V}$,
the image of
the standard basis in ${\mathbb R}^n$
by ${\bf Q}$ is called
the eigenbasis for ${\bf V}$, and the image
of the basis vectors is called the set of
eigenvectors. Traditionally, any multiple of
an eigenvector is also an eigenvector, so when
the books refer to an eigenvector they often
mean a one dimensional eigenspace, and sometimes they
just mean eigenspace.
${\bf Q}$ is unique provided that the eigenvalues
are all different, whereupon the eigenspaces
are all
one dimensional. Each eigenvalue is non-negative
(an immediate consequence of the
positive semi-definiteness). If the eigenvalues
are all positive, then the matrix ${\bf V}$ is
non-singular and is positive definite. It is
again a trivial consequence of elementary linear
algebra (and immediately apparent to the meanest
intellect) that in this case ${\bf V}^{-1}$ is
diagonalisable by the same matrix ${\bf Q}$ and
has diagonal matrix the inverse of
${\bf \Lambda}$, which simply has the reciprocals of
the eigenvalues in the corresponding places.
Moreover, there is a symmetric square root of
${\bf V}$, written ${\bf V}^{\frac{1}{2}}$, which can be
diagonalised by the same matrix ${\bf Q}$ and
has diagonal terms the square roots of the eigenvalues
in the corresponding places. By a square root, I mean
simply that
${\bf V}^{\frac{1}{2}} {\bf V}^{\frac{1}{2}} = {\bf V}$.
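To see the diagonalisation at work, here is a short numpy sketch, again mine rather than the author's, which assumes the matrix V computed in the previous sketch and that all its eigenvalues are strictly positive.
\begin{verbatim}
import numpy as np

# Assumes the symmetric, positive definite matrix V from the
# previous sketch. Diagonalise it: V = Q @ diag(lam) @ Q.T.
# np.linalg.eigh is the routine for symmetric matrices; it returns
# the eigenvalues in ascending order and orthonormal eigenvectors.
lam, Q = np.linalg.eigh(V)
assert np.all(lam > 0), "V must be positive definite here"

# Inverse: same Q, reciprocals of the eigenvalues on the diagonal.
V_inv = Q @ np.diag(1.0 / lam) @ Q.T

# Symmetric square root: same Q, square roots of the eigenvalues.
V_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

# Sanity checks: the square root squares back to V, and V_inv
# really is the inverse.
print(np.allclose(V_half @ V_half, V))
print(np.allclose(V @ V_inv, np.eye(V.shape[0])))
\end{verbatim}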
Having obtained from the original data the centre
${\bf m}$ and this matrix ${\bf V}$, I shall now
define the function
\begin{displaymath}
g_{[{\bf m},{\bf V}]} ({\bf x}) =
\frac{1}{(\sqrt{2\pi})^n \sqrt{\det {\bf V}}} \;
e^{- \frac{({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) }{2}}
\end{displaymath}
and assert that this function is going to be my probabilistic model for the process which produced the data set, or alternatively my preferred device for representing the data compactly.
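A direct numpy transcription of this formula is given below. It is only my sketch of the obvious implementation; the function name and the example centre and covariance matrix are invented for illustration.
\begin{verbatim}
import numpy as np

def gaussian_density(x, m, V):
    """Evaluate g_[m,V](x) for a point x in R^n."""
    n = m.shape[0]
    diff = x - m
    # Quadratic form (x - m)^T V^{-1} (x - m); solving the linear
    # system avoids forming V^{-1} explicitly.
    quad = diff @ np.linalg.solve(V, diff)
    norm = 1.0 / (np.sqrt(2.0 * np.pi) ** n * np.sqrt(np.linalg.det(V)))
    return norm * np.exp(-0.5 * quad)

# Invented two dimensional example: the density is largest at the
# centre m itself.
m_ex = np.array([1.77, 77.0])
V_ex = np.array([[0.01, 1.0],
                 [1.0, 150.0]])
print(gaussian_density(m_ex, m_ex, V_ex))
\end{verbatim}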
If a rationale for this choice rather than some other is required, and the enquiring mind might well ask for one, I offer several:
First, it is not too hard to prove that of all
choices of ${\bf m}$ and ${\bf V}$, this choice gives the
maximum likelihood for the original data. So given
that I have a preference for gaussians, this
particular gaussian would seem to be the best
choice.
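For the reader who wants the proof sketched (this working is mine, not the author's), write down the log-likelihood of the data under $g_{[{\bf m},{\bf V}]}$ and set its derivatives with respect to ${\bf m}$ and ${\bf V}$ to zero:
\begin{displaymath}
\ln L({\bf m},{\bf V}) = \sum_{i=1}^{M} \ln g_{[{\bf m},{\bf V}]}({\bf x}^i)
= -\frac{Mn}{2}\ln(2\pi) - \frac{M}{2}\ln \det {\bf V}
- \frac{1}{2}\sum_{i=1}^{M} ({\bf x}^i-{\bf m})^T {\bf V}^{-1} ({\bf x}^i-{\bf m}) .
\end{displaymath}
The stationarity condition in ${\bf m}$ gives ${\bf m} = \frac{1}{M}\sum_{i=1}^{M} {\bf x}^i$, and the one in ${\bf V}$ gives ${\bf V} = \frac{1}{M}\sum_{i=1}^{M} ({\bf x}^i - {\bf m})({\bf x}^i - {\bf m})^T$: precisely the centre and covariance matrix computed above.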
Second, I can argue that just as the centre ${\bf m}$
contains first order information telling us about
the data, so ${\bf V}$ gives us second order information:
the central moments of second order are being
computed, up to a scalar multiple.

Third, it is easy to compute the vector ${\bf m}$ and
the matrix ${\bf V}$.
Fourth, I may rationalise my preference for gaussians via the central limit theorem; this allows me to observe that quite a lot of data is produced by some process which aims at a single value and which is perturbed by `noise' so that a large number of small independent factors influence the final value by each adding some disturbance to the target value. In the case where the number of factors gets larger and larger and the additive influence of each one becomes smaller and smaller, we get a gaussian distribution. Of course, it is hard to see how such an assumption about the data could be verified directly, but many people find this argument comforting.
And finally, if it doesn't look to be doing a good job, it is possible to discover this fact and abandon the model class, which is comforting. In that event, I have other recourses of which you will learn more ere long.
These justifications are not likely to satisfy the committed Platonist philosopher, who will want to be persuaded that this choice is transcendentally right, or at least as close as can be got. But then, keeping Platonist philosophers happy is not my job.
In dimension one we can draw the graph of the
function and the data set from which it was obtained
by the above process. The covariance matrix ${\bf V}$
reduces to a single positive number, the
variance, and its square root is usually called
the standard deviation and written $\sigma$.
The points satisfying
\begin{displaymath}
\frac{(x - m)^2}{\sigma^2} = 1
\end{displaymath}
are therefore the numbers $m - \sigma$ and $m + \sigma$.
In dimension two, it is possible to draw the graph
of the function, but also possible to sketch
the sections, the curves of constant height. These
are ellipses. In particular, the points
satisfying
\begin{displaymath}
({\bf x} - {\bf m})^T {\bf V}^{-1} ({\bf x} - {\bf m}) = 1
\end{displaymath}
form such an ellipse, centred on ${\bf m}$ with its axes along the eigenvectors of ${\bf V}$.

In dimension $n$, the set
\begin{displaymath}
\{ {\bf x} \in {\mathbb R}^n \; : \; ({\bf x} - {\bf m})^T {\bf V}^{-1} ({\bf x} - {\bf m}) = 1 \}
\end{displaymath}
is the corresponding ellipsoid centred on ${\bf m}$.
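To make the two dimensional picture concrete, here is a small matplotlib sketch of my own (not part of the text) which draws this curve for the toy data used earlier, by mapping the unit circle through the symmetric square root ${\bf V}^{\frac{1}{2}}$ and translating by ${\bf m}$.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

# The same toy 2-d data standing in for the height/weight example.
X = np.array([[1.80, 75.0],
              [1.72, 80.0],
              [1.91, 95.0],
              [1.65, 61.0]])
M = X.shape[0]
m = X.mean(axis=0)
D = X - m
V = (D.T @ D) / M

# Symmetric square root of V via its eigendecomposition.
lam, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

# The curve (x - m)^T V^{-1} (x - m) = 1 is the image of the
# unit circle under V^{1/2}, translated to the centre m.
t = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.stack([np.cos(t), np.sin(t)])     # 2 x 200 points
ellipse = (V_half @ circle).T + m             # 200 x 2 points

plt.scatter(X[:, 0], X[:, 1], marker='x')
plt.plot(ellipse[:, 0], ellipse[:, 1])
plt.xlabel('height')
plt.ylabel('weight')
plt.show()
\end{verbatim}
The only point being used is that if ${\bf x} = {\bf m} + {\bf V}^{\frac{1}{2}}{\bf u}$ with $\|{\bf u}\| = 1$, then $({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) = {\bf u}^T {\bf u} = 1$.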
It is easy enough for even the least imaginative of computer programmers to visualise a cluster of points sitting in three dimensions and a sort of squashed football sitting around them so as to enclose a reasonable percentage. Much past three, only the brave and the insane dare venture; into which category we fall I leave to you to determine.