Returning to the data set of the guys and the gals, you will, if you have had any amount of statistical education (and if you haven't, go to the Information Theory Notes to acquire some), have immediately thought that the cluster of men looked very like what would be described by a bivariate normal or gaussian distribution, and that the cluster of women looked very like another. In elementary books introducing the one dimensional normal distribution, it is quite common to picture the distribution by getting people to stand with their backs to a wall, with people of the same height standing in front of each other. Then the curve passing through the people furthest from the wall is the familiar bell shaped one of Fig.1.8., with its largest value at the average height of the sample.
The function family for the one dimensional (univariate) gaussian distribution has two parameters, the centre, $\mu$, and the standard deviation, $\sigma$. Once these are assigned values, then the function is specified (so long as $\sigma$ is positive!) and of course we all know well the expression

\begin{displaymath}
g_{[\mu,\sigma]}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{displaymath}
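For the computationally minded, the formula goes straight into a couple of lines of Python; the centre of 180 cm and standard deviation of 10 cm below are invented purely for illustration.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """The univariate gaussian g_[mu,sigma](x)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

# Invented values: centre 180 cm, standard deviation 10 cm.
print(gaussian(180.0, 180.0, 10.0))   # the peak, about 0.0399
print(gaussian(190.0, 180.0, 10.0))   # one standard deviation out, about 0.0242
```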
The distribution of heights of a sample of men
may be modelled approximately
by the gaussian function in dimension 1 for
suitably chosen values of $\mu$ and $\sigma$. The modelling process means that
if you want an estimate of
the proportion of the sample between, say, 170
and 190 cm. tall, it
can be found by integrating the function between
those values.
The gaussian
takes only positive
values, and the
integral from $-\infty$ to $+\infty$ is 1,
so we are simply measuring
the area under the curve between two vertical
lines, one at 170 and the
other at 190. It also follows that there is
some fraction of the
sample having heights between -50 and -12
cm. This
should convince you of the risk of using models
without due thought.
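To make the integration concrete: the sketch below, again in Python with invented parameter values, computes the proportion of the modelled sample between 170 and 190 cm, and also looks at the mass the model assigns to the impossible negative heights.

```python
import math

def normal_cdf(x, mu, sigma):
    """Integral of the gaussian from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 180.0, 10.0   # invented values for illustration

# Proportion of the modelled sample between 170 and 190 cm (about 0.68 here).
print(normal_cdf(190.0, mu, sigma) - normal_cdf(170.0, mu, sigma))

# In principle the model also assigns mass to impossible negative heights,
# though here it is so small it underflows to 0.0 in double precision.
print(normal_cdf(-12.0, mu, sigma) - normal_cdf(-50.0, mu, sigma))
```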
In low dimensions the thought is easy; in higher dimensions it may not be.
To the philosopher, using a model
known to be `wrong' is a kind of sin, but in statistics
and probability modelling,
we do not have the luxury of being given models
which are `true', except possibly
in very simple cases.
To visualise the data of men's heights and weights as modelled by a gaussian function in two dimensions, we need to imagine a `gaussian hill' sitting over the data, as sketched rather amateurishly in Fig.1.9. Don't shoot the author, he's doing his best.
This time the gaussian function is of two variables, say $x$ and $y$, and its parameters are now more complicated. The centre, ${\bf m}$, is now a point in the space ${\bf R}^2$, while the standard deviation has changed rather more radically. Casting your mind back to your elementary linear algebra education, you will recall that quadratic functions of two variables may be conveniently represented by symmetric matrices: in general we can write

\begin{displaymath}
\left[ \begin{array}{cc} x & y \end{array} \right]
\left[ \begin{array}{cc} a & b \\ b & c \end{array} \right]
\left[ \begin{array}{c} x \\ y \end{array} \right]
\end{displaymath}

for the function usually written $ax^2 + 2bxy + cy^2$. Multiplying out the matrices gives the correct result.
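If your linear algebra is rusty, a throwaway numerical check (coefficients and the point both chosen arbitrarily) confirms that the matrix product gives the same number as the multiplied-out quadratic.

```python
import numpy as np

a, b, c = 2.0, 1.0, 3.0              # arbitrary coefficients of a*x^2 + 2*b*x*y + c*y^2
A = np.array([[a, b], [b, c]])       # the symmetric matrix representing the quadratic form
x, y = 1.5, -0.5                     # an arbitrary point
v = np.array([x, y])

print(v @ A @ v)                             # the quadratic form, evaluated via the matrices
print(a * x**2 + 2 * b * x * y + c * y**2)   # multiplied out by hand: same number
```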
Since in one dimension the gaussian function
exponentiates a quadratic form, it
is no surprise that it does the same in two or
more dimensions.
The n-dimensional gaussian family is parametrised by a centre, ${\bf m}$, which is a point in ${\bf R}^n$, and ${\bf V}$, which is an n by n invertible positive definite symmetric matrix representing the quadratic map which takes ${\bf x}$ to $({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m})$. The symbol $^T$ denotes the transpose of the column matrix to a row matrix. The formula for a gaussian function is therefore

\begin{displaymath}
g_{[{\bf m},{\bf V}]} ({\bf x}) =
\frac{1}{(\sqrt{2\pi})^n \sqrt{\det {\bf V}}}\;
e^{-\frac{({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m})}{2}}
\end{displaymath}
and we shall refer to ${\bf m}$ as the centre of the gaussian and ${\bf V}$ as its covariance matrix. The normal or gaussian function with centre ${\bf m}$ and covariance matrix ${\bf V}$ is often written $N({\bf m},{\bf V})$ for short. All this may be found explained and justified, to some extent, in the undergraduate textbooks on statistics.
See Feller, An Introduction to Probability Theory and Its Applications, Volume 2, John Wiley, 1971, for a rather old fashioned treatment. Go to
Information Theory
for a more modern explanation.
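The formula itself is easily evaluated in code; the sketch below simply computes $g_{[{\bf m},{\bf V}]}$ at a point, with a made-up two dimensional centre and covariance matrix standing in for anything fitted to real data.

```python
import numpy as np

def gaussian_nd(x, m, V):
    """Evaluate the n-dimensional gaussian with centre m and covariance matrix V at x."""
    n = len(m)
    d = x - m
    quad = d @ np.linalg.inv(V) @ d
    norm = np.sqrt(2.0 * np.pi) ** n * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * quad) / norm

# Made-up illustrative values: centre at (180 cm, 75 kg), some spread, a little correlation.
m = np.array([180.0, 75.0])
V = np.array([[100.0, 30.0],
              [ 30.0, 64.0]])
print(gaussian_nd(np.array([175.0, 70.0]), m, V))
```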
The parameters ${\bf m}$ and ${\bf V}$, when given actual numerical values, determine just one gaussian hill, but we have the problem of working out which of the numerical values to select. The parameter space of possible values allows an infinite family of possible gaussian hills. If we believe that there is some suitable choice of ${\bf m}$ and ${\bf V}$ which will give, of all possible choices, the best fit gaussian hill to the data of Fig.1.9., then we can rely on the statisticians to have found a way of calculating it from the data. We shall go into this matter in more depth later, but indeed the statisticians have been diligent and algorithms exist for computing a suitable ${\bf m}$ and ${\bf V}$.
These will, in effect, give a function the graph
of which is
a gaussian hill sitting over the points. And the
same algorithms applied to the
female data points of Fig.1.2. will give
a second gaussian hill
sitting over the female points. The two
hills will intersect in some curve, but we shall
imagine each of them
sitting in place over their
respective data points, and also over each other's. Let us call them $g_m$ and $g_f$, for the male and female gaussian functions respectively.
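One standard recipe, the maximum likelihood estimate, takes the centre to be the sample mean and the covariance matrix to be the sample covariance. The sketch below runs it on invented data standing in for the measurements of Fig.1.2.; the names and numbers are all hypothetical.

```python
import numpy as np

def fit_gaussian(points):
    """Estimate a centre m and covariance matrix V from an array of data points."""
    m = points.mean(axis=0)
    d = points - m
    V = d.T @ d / len(points)        # maximum likelihood estimate of the covariance
    return m, V

# Invented (height, weight) samples, standing in for the real data.
rng = np.random.default_rng(0)
male_points = rng.normal([180.0, 80.0], [8.0, 10.0], size=(100, 2))
female_points = rng.normal([167.0, 62.0], [7.0, 9.0], size=(100, 2))

m_male, V_male = fit_gaussian(male_points)         # parameters of g_m
m_female, V_female = fit_gaussian(female_points)   # parameters of g_f
print(m_male, V_male)
```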
If a new data point ${\bf x}$ is provided, we can calculate the height of the two hills at that point, $g_m({\bf x})$ and $g_f({\bf x})$ respectively.
It is intuitively appealing to argue that if the
male hill
is higher than the female hill at the new point,
then it is more likely
that the new point is male than female. Indeed,
we can say how much more
likely by looking at the ratio of the two numbers, the so called likelihood ratio

\begin{displaymath}
\frac{g_m({\bf x})}{g_f({\bf x})}
\end{displaymath}
Moreover, we can fairly easily tell if a point
is a long way from any data we
have seen before because both the likelihoods $g_m({\bf x})$ and $g_f({\bf x})$ will be small. What `small' means is going to depend
on the dimension, but not on the data.
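Putting the two ideas together takes only a few lines. The sketch below uses scipy's multivariate normal density for the two hills; the parameters are invented rather than fitted, and the threshold for `small' is plucked out of the air.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented parameters for the two hills g_m and g_f (illustration only).
g_m = multivariate_normal(mean=[180.0, 80.0], cov=[[64.0, 40.0], [40.0, 100.0]])
g_f = multivariate_normal(mean=[167.0, 62.0], cov=[[49.0, 30.0], [30.0, 81.0]])

def classify(x, novelty_threshold=1e-6):
    """Compare the heights of the two hills at x; warn if both are tiny."""
    lm, lf = g_m.pdf(x), g_f.pdf(x)
    if max(lm, lf) < novelty_threshold:              # threshold chosen arbitrarily
        return "a long way from any data seen before"
    label = "male" if lm > lf else "female"
    return f"{label} (likelihood ratio g_m/g_f = {lm / lf:.3g})"

print(classify([176.0, 74.0]))
print(classify([300.0, 20.0]))   # far from both clusters: both likelihoods are tiny
```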
It is somewhat easier to visualise this in the one dimensional case: Fig.1.10. shows a new point, and the two gaussian functions sitting over it; the argument that says it is more likely to belong to the function giving the greater height may be quantified and made more respectable, but is intuitively appealing. The (relatively) respectable version of this is called Bayesian Decision Theory, and will be described properly later.
The advantage of the parametric statistical approach is that we have an explicit (statistical) model of a process by which the data was generated. In this case, we imagine that the data points were generated by a process which can keep producing new points. In Fig.1.10. one can imagine that two darts players of different degrees of inebriation are throwing darts at a line. One is aiming at the centre a and the other, somewhat drunker, at the centre b. The two distributions tell you something about the way the players are likely to place the darts; then we ask, for the new point, x, what is the probability that it was thrown by each of the two players? If the b curve is twice the height of the a curve over x, then if all other things were equal, we should be inclined to think it twice as likely that it was thrown by the b player as by the a.
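For anyone who wants the bookkeeping spelled out, the `twice as likely' claim is just Bayes' rule with equal priors; the two curve heights below are invented numbers.

```python
# Invented heights of the a and b curves over the new point x.
lik_a, lik_b = 0.02, 0.04
prior_a, prior_b = 0.5, 0.5      # "all other things being equal"

evidence = lik_a * prior_a + lik_b * prior_b
print(lik_a * prior_a / evidence)   # probability the a player threw it: 1/3
print(lik_b * prior_b / evidence)   # probability the b player threw it: 2/3
```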
We do not usually believe in the existence of inebriated darts players as the source of the data, but we do suppose that the data is generated in much the same way; there is an ideal centre which is, so to speak, aimed at, and in various directions, different amounts of scatter can be expected. In the case of height and weight, we imagine that when mother nature, god, allah or the blind forces of evolution designed human beings, there is some height and weight and shape for each sex which is most likely to occur, and lots of factors of a genetic and environmental sort which militate in one direction or another for a particular individual. Seeing mother nature as throwing a drunken dart instead of casting some genetic dice is, after all, merely a more geometric metaphor.
Whenever we make a stab at guessing which is the more likely source of a given data point coding an object, or alternatively make a decision as to which category an object belongs to, we have some kind of tacit model of the production process, or at least of some of its properties. In the metric method, we postulate that the metric on the space is a measure of similarity of the objects; in the neural net method, we postulate that at least some sort of convexity property holds for the generating process.
Note that in the case of the statistical model, something like the relevant metric to use is generated automatically, so the problem of Fig.1.4. is solved by the calculation of the two gaussians (and the X-axis gets shrunk, in effect). The rationale is rather dependent on the choice of gaussians to model the data. In the case discussed, of heights and weights of human beings, it looks fairly plausible, up to a point, but it may be rather difficult to tell if it is reasonable in higher dimensions. Also, it is not altogether clear what to do when the data does not look as if a gaussian model is appropriate. Parametric models have been used, subject to these reservations, for some centuries, and undoubtedly have their uses. There are techniques in existence for coping with the problems of non-gaussian distributions of data, and some will be discussed later. The (Bayesian) use of the likelihood ratio to select the best bet has its own rationale, which can extend to the case where we have some prior expectations about which category is most likely. Again, we shall return to this in more detail later.