Suppose we have two models for some set of data; we may think of the data as coming in sequentially for our purposes, in which case we can imagine the two models, $q_1$ and $q_2$, as predictors. The amount of information which the data supplies to each model, which I called the surprise, is simply the sum over the data set of the logarithm of the improbability of each datum as assessed by that model. Then, all other things being equal, the less surprise or information supplied by the data to the model, the better the model. In the extreme case where one model is quite certain of its predictions and always correct, the data supplies no information at all. In the other extreme case, the information supplied on each occasion is the logarithm of the number of possible states the datum can occupy.
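As a concrete illustration, here is a minimal sketch of the surprise computation for two such predictors; the sequence of outcomes and the two sets of predictive probabilities are invented purely for illustration.

```python
from math import log2

# A minimal sketch of the surprise measure: the total information (in bits)
# supplied to each of two predictive models by a short sequence of observations.
# Outcomes and model probabilities below are invented for illustration.
outcomes = ["a", "b", "b", "c", "a", "b"]       # the data, in sequence
q1 = {"a": 0.4, "b": 0.45, "c": 0.15}           # model 1's predictive probabilities
q2 = {"a": 1/3, "b": 1/3, "c": 1/3}             # model 2: maximally uncertain

def surprise(model, data):
    """Sum over the data set of -log2 of the probability assigned to each datum."""
    return sum(-log2(model[x]) for x in data)

print(surprise(q1, outcomes))   # lower: q1 matches the relative frequencies better
print(surprise(q2, outcomes))   # log2(3) per datum, the worst case mentioned above
```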
We suppose that there is some genuine random process, specified by a pdf $p$, operating to produce the data, and that the model we have is represented by $q$. The sum
\[
-\sum_{x} p(x)\log q(x)
\]
gives the idealised value of the information supplied to the model $q$ by data generated in fact by the process $p$. This is larger than the idealised value of the information supplied to the model $p$ by data generated from $p$ itself, namely the entropy of $p$, except in the case when $p = q$. The difference,
\[
D(p\,\|\,q) \;=\; \sum_{x} p(x)\log\frac{p(x)}{q(x)},
\]
known as the Kullback-Leibler distance, is therefore always non-negative, and is used as a natural measure of the goodness of fit of one pdf to another. It is not symmetric. In our case we are obliged to estimate it, since we do not know $p$, and we can estimate it only up to the unknown entropy of the actual distribution $p$. At least this allows us to compare two pdfs $q_1$ and $q_2$ on a given data set to see which is better.
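A small numerical sketch may make the last point plain. Assuming, purely for illustration, that the true process $p$ and the two candidate models are univariate gaussians, the average information supplied per datum to each model can be estimated from a sample drawn from $p$; each estimate is the entropy of $p$ plus the corresponding Kullback-Leibler distance, so the difference between the two estimates ranks the models even though the entropy itself stays unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true process p is N(0, 1); the two candidate models
# are also gaussian.  None of these choices come from the text.
def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x = rng.normal(0.0, 1.0, size=100_000)          # data generated in fact by p

# Average information supplied to each model, an estimate of -E_p[log q].
cross_ent_q1 = -np.mean(log_gauss(x, 0.1, 1.0))  # q1 close to p
cross_ent_q2 = -np.mean(log_gauss(x, 1.0, 2.0))  # q2 further from p

# Each figure equals H(p) + D(p||q); H(p) is unknown in practice but common to
# both, so the difference ranks the models just as the KL distances would.
print(cross_ent_q1, cross_ent_q2, cross_ent_q2 - cross_ent_q1)
```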
It is fairly easy to see that if the two models $q_1$ and $q_2$ were predicting repeatedly from a finite list of events, and if they were to gamble with each other by making bets based on their separate predictions, using their probabilities to establish the odds and the stake size in any one of a number of sensible ways, then the better predictor, in the information sense, would make money from the worse one. The result in practical terms is simply that we maximise the log-likelihood of the data for each model. If instead of two models we have a manifold of models, we simply maximise the log-likelihood over the manifold.
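The sketch below simulates one such betting scheme, chosen for simplicity rather than taken from the text: on each event, model 2 pays model 1 the logarithm of the ratio of their assigned probabilities, a negative payment when model 1 assigned the lower probability. The event probabilities are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# One simple (assumed) betting scheme: on each event x, model 1 receives
# log(q1(x)/q2(x)) units from model 2.  Events, p, q1 and q2 are invented.
events = np.array([0, 1, 2])
p  = np.array([0.6, 0.3, 0.1])    # the true process
q1 = np.array([0.5, 0.4, 0.1])    # the better-informed predictor
q2 = np.array([0.2, 0.4, 0.4])    # the worse predictor

x = rng.choice(events, size=10_000, p=p)
winnings = np.sum(np.log(q1[x]) - np.log(q2[x]))   # total paid to model 1

# The mean payment per round estimates D(p||q2) - D(p||q1), which is positive
# whenever q1 is the better predictor in the information sense.
print(winnings / len(x))
```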
If other things are not equal, we may choose to penalise a model which has more parameters than another, by some rate of exchange between the mean information supplied to the model by the data per data point and the number of parameters in the model. The essential reason for this is that we wish to use the model to predict what is going to happen next, and what is at issue is our confidence that the model extracts the maximum information from the data set while not being unduly affected by noise. Occam's Razor is the doctrine `Entities are not to be multiplied unnecessarily', or more popularly: `Given two theories, both of which account for the data, choose the simpler'. Many centuries after William of Occam, we may be in a position to say it in algebra, with some precision, instead of in Latin.
Why should we penalise a model with more parameters? One's intuitions, if one has messed around with models and data to any serious extent, should tell one that estimating too many parameters from too little data is unlikely to give anything very effective at predicting what is going to happen next. It seems reasonable that this state of affairs should be formulated as a penalty on the number of parameters; the only question that remains to be solved is the rate of exchange between the log-likelihood (which will almost always improve with more parameters) and the number of parameters themselves.
The ideal rate of exchange would be determined by testing the model on later data to see how well it performs. After all, that is what we want models for. The measure of better performance ought to be the same as the measure above: the amount of information supplied to the model by later data, generated by the same process that gave the original data, will be less for the better model. What grounds have we for believing that the model with fewer parameters will do rather better than expected, while the model with more parameters will do rather worse? The model with too many parameters will be trying to model the randomness of any particular data set as if it were significant, when it isn't. It has actually extracted too much information from the data, finding pattern and structure when there isn't any. This is easily done; brains do it when they see butterflies or body parts in Rorschach blots, faces in flames and so on. This line of thought can lead to the Akaike Information Criterion.
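Here is a small sketch of this `later data' test, using an invented linear-plus-noise process and least-squares polynomial fits standing in for whatever model family is at hand.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fit polynomials of increasing degree to noisy data from a simple underlying
# rule, then score them on fresh data from the same process.  The generating
# rule, noise level and degrees are all invented for illustration.
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, 0.5 * x + rng.normal(0, 0.2, n)    # true process: linear plus noise

x_old, y_old = make_data(30)     # the data we model
x_new, y_new = make_data(30)     # later data from the same process

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_old, y_old, degree)
    fit_err  = np.mean((np.polyval(coeffs, x_old) - y_old) ** 2)
    pred_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    # The high-degree fit hugs the old data (small fit_err) but, having modelled
    # the noise as if it were structure, tends to do worse on the new data.
    print(degree, fit_err, pred_err)
```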
For the AIC we ask the question: if the data is a sample $D$ generated by a pdf $p$, and if we model it by a pdf $q$, what is the expected log-likelihood of the data relative to $q$? It turns out that in many cases the maximum log-likelihood choice of $q$ is a biased estimate of the mean expected log-likelihood of $p$ itself, and subtracting off the number of parameters removes the bias. See Akaike, and also Sakamoto et al., for an elucidation of these somewhat technical points. See also Shibata.
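In practical terms, the criterion amounts to penalising each model's maximised log-likelihood by its number of free parameters (conventionally scaled by $-2$, so that smaller is better). A sketch, using two nested gaussian models invented for the purpose:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, 200)       # invented data, truly N(0.3, 1)

# Maximum log-likelihood of a gaussian model with the stated free parameters.
def max_loglik(x, free_sigma):
    mu = x.mean()
    sigma = x.std() if free_sigma else 1.0
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Model A: mean free, variance fixed at 1 (k = 1 parameter).
# Model B: mean and variance both free (k = 2 parameters).
for name, free_sigma, k in (("A", False, 1), ("B", True, 2)):
    L = max_loglik(x, free_sigma)
    aic = -2 * (L - k)              # the conventional AIC: smaller is better
    print(name, L, aic)
```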
Alternatively, we can argue that since we know nothing of the model class which actually generated the data, we should abandon this attempt and turn to more robust considerations; in particular, we can argue that compression is pertinent.
A pdf over $\mathbb{R}^n$ can be regarded as a method of compressing a set of points in $\mathbb{R}^n$. In doing this we simply think of the pdf as telling us how much to stretch bits of the space, or alternatively to re-measure it, so that we can code points efficiently, just as with Shannon coding. The amount of compression is greatest when the pdf is a good model of the distribution of the set. We can turn this on its head and define a `good model' as one which gives good compression.

Certainly for two different models of the same family, the better model is the one with greater log-likelihood. The `same family' of models means that there is some finite set of parameters which specify the model within the family, which is to say that the `same family' of models comprises a manifold. So far, then, this is merely a new way of writing Maximum Likelihood models. But in the case where the class is larger than this and comprises, say, several manifolds of different dimension, we can also code the model parameters and store the extra overhead. This allows us to compute the total length of the data as coded using a model, plus the overhead of specifying the model. It is not particularly difficult to compute, for such things as gaussian mixture models with different numbers of gaussians, what the net saving is. There is an advantage in having a better model, offset by the drawback of having to count extra parameters. Penalising the number of parameters is an obvious thing to do, but Rissanen offers a coherent explanation of precisely how to do this. The Rissanen approach, under certain circumstances, agrees with the Akaike approach asymptotically. Some might find this comforting, others mysterious.
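As a rough illustration of the bookkeeping involved (a sketch only, using the common convention of charging about $\tfrac{1}{2}\log n$ per parameter rather than Rissanen's exact prescription, and with invented log-likelihood figures):

```python
import numpy as np

# A two-part code-length sketch: the data cost is the negative log-likelihood
# under the fitted model, and each parameter is charged (1/2) log n nats --
# one common convention, not necessarily Rissanen's exact prescription.
def description_length(neg_loglik, n_params, n_data):
    return neg_loglik + 0.5 * n_params * np.log(n_data)

# Hypothetical figures for two competing models of the same 500 data points:
# model 2 fits slightly better but costs far more to state.
print(description_length(neg_loglik=820.0, n_params=5,  n_data=500))
print(description_length(neg_loglik=805.0, n_params=20, n_data=500))
```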
Other work has been done from a Bayesian point of view on attempting to put in an Occam's Razor, that is to say a device for penalising a large number of parameters or entities. The idea here is that we can penalise a model family of higher complexity, that is to say with more parameters, by observing that the normalising constant
\[
\int p(x\,|\,m)\,p(m)\,dm
\]
will be bigger when the integration is over a higher dimensional manifold of models $m$, thus reducing $p(m\,|\,x)$ in comparison with some alternative and `smaller' such manifold. The fact that the integral depends on the parametrisation of the manifold of models is grounds for some degree of nervousness among the innocent, but one can be persuaded this is not fatal. The result is that we favour a model from a class when the particular model accounts for the data well, when we have no strong prior prejudice against the model, and when on average the other members of the model class do fairly badly, possibly because there are so many of them. This has a certain charm. See MacKay's Doctoral Thesis in the bibliography for a well written propaganda job and some good references.
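A toy example in this spirit, not taken from the text: deciding whether a coin is fair (no free parameters) or has an unknown bias with a uniform prior. The biased family can fit any proportion of heads, but its evidence is an average over all the biases it did not need, and for mildly uneven counts the simpler model keeps the larger evidence.

```python
from math import lgamma, log

# Bayesian Occam's razor on an invented coin: 57 heads in 100 tosses.
n, h = 100, 57

log_ev_fair   = n * log(0.5)                                        # p(D | fair)
log_ev_biased = lgamma(h + 1) + lgamma(n - h + 1) - lgamma(n + 2)   # integral of theta^h (1-theta)^(n-h) dtheta

# The biased model pays for spreading its prior over all the biases it did not
# observe; here the fair-coin evidence comes out larger.
print(log_ev_fair, log_ev_biased)
```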
In the case of gaussian mixture modelling, the method implicitly recommended in the next chapter for dealing with pattern recognition problems if we opt for statistical methods, the choice of how many gaussians to use can be settled using the Akaike Information Criterion (AIC), the MDL or stochastic complexity criterion of Rissanen, or others, notably the BIC. Experiments conducted on data sets actually generated by a known number of gaussians suggest that the Rissanen criterion works tolerably well over a range of dimensions and numbers of gaussians, and performs better than the AIC. A rough guess of a sensible number may often be made by eyeballing the data under projection, provided the dimension is not too high, and more sophisticated methods of automating the eyeballing to some extent will occur to the reasonably ingenious. Then some experiments with different numbers of gaussians, and some arithmetic along the lines indicated above, can yield a sensible number of gaussians to use in the model.
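Such an experiment is easy to set up nowadays; the sketch below assumes scikit-learn is available and uses its built-in AIC and BIC scores for gaussian mixtures fitted to data generated, for illustration, by three well-separated gaussians in two dimensions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # assumes scikit-learn is installed

rng = np.random.default_rng(4)

# Data actually generated by a known number of gaussians (three, here), then
# scored for various candidate numbers of components.  Settings are illustrative.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(200, 2)),
    rng.normal([4, 0], 0.5, size=(200, 2)),
    rng.normal([2, 3], 0.5, size=(200, 2)),
])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # Lower is better for both criteria; BIC penalises parameters more heavily
    # and tends to settle on the generating number in an easy case like this.
    print(k, gmm.aic(X), gmm.bic(X))
```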
I shall return to the matter of compression and optimal modelling when we come to Neural Net models, which are notorious for having enormous dimensions for representing the data and consequently large numbers of parameters.