Simple models which give reasonably close results are more attractive than very complicated models which give exactly the right answer. In fact we may decline to believe a model which fits the data too exactly, and we always feel uncomfortable with very complicated ones. So model complexity is a consideration, and the question of how we measure it arises naturally. This is where Rissanen comes in.
Rissanen has argued roughly as follows: a model of a data set allows you to compress the data for purposes of transmission or storage. If, for example, you had a data set consisting of a sequence of 100 digits which went 314159... and agreed with π until the end of the sequence, sending the information coded as `first hundred digits of π, forget the decimal point' would compress the sequence rather considerably, as long as the guy at the other end knew what π meant. If you don't feel safe about this, you may feel obliged to tell him, by, for example, also sending the algorithm for generating π. This extra I shall call the `overhead'. Now one model is better than another in Rissanen's sense if it does more compression, taking into account the overhead. If you reason that the more you compress the data, the more information you are extracting from it, this sounds very plausible. This principle allows us to reject the maximum likelihood `raw data' model in favour of a gaussian model when the number of data points exceeds some quite small number. It also allows us to make sensible decisions on how many gaussians one ought to use in modelling a point set as a mixture of gaussians, a matter of practical importance when we get to syntactic pattern recognition. The ideas are related to Chaitin's attempts to characterise randomness. See Chaitin (1975) and Chaitin (1987) for an antidote to some of the cynicism of the present and an earlier section.
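To make the bookkeeping concrete, here is a minimal sketch in Python of the two-part accounting just described: the cost of sending the digits literally against the cost of sending a π-generating program. The 120-byte figure assumed for that program, the `overhead', is purely illustrative.

```python
from math import log2

# A minimal sketch of the two-part code idea, using the pi example.
# The figure for the 'overhead' is an illustrative assumption, not a measurement.

n_digits = 100                      # length of the digit sequence to transmit
raw_bits = n_digits * log2(10)      # sending each decimal digit literally: ~3.32 bits per digit

# Suppose the receiver does NOT know what 'pi' means, so we must also send a
# short digit-generating program.  Assume (hypothetically) it takes 120 bytes.
overhead_bits = 120 * 8
model_bits = overhead_bits          # once the program is sent, the digits themselves cost nothing extra

print(f"raw encoding          : {raw_bits:.0f} bits")
print(f"'pi' model + overhead : {model_bits} bits")
# For 100 digits the overhead may not pay for itself; for 10,000 digits it certainly would.
```

On this accounting the π model only wins once the sequence is long enough to amortise the overhead, which is exactly the trade-off Rissanen formalises.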
In this section I sketch the recently developed, and indeed still developing, theory of Rissanen on Stochastic Complexity, as described in his book Rissanen (1989) and papers (see the bibliography). It offers a philosophically quite different approach to finding the `best' model to account for the data, one which avoids some of the objections I have pointed out to the classical theory. In particular it appears to solve elegantly the question of how many gaussians one ought to have in a gaussian mixture model for a particular data set. (The answer may be `none', if the data set is small enough!) This will concern us in the next chapter, where we meet such beasts.
The general question as to the order of a model occurs in other settings also. At the simple level of fitting a polynomial to a set of data points, the question arises of what degree of polynomial is appropriate. At one extreme one can choose the degree so high that every point is fitted exactly; at the other we can assume the polynomial is constant. The same problem arises in the case of Neural Nets, as we shall see in a later chapter.
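As a sketch of how such a trade-off might be scored in the polynomial case, the following Python fragment fits every degree from 0 to 8 to noisy samples of a cubic and penalises each fit with the familiar (k/2) log n term. The data, the noise level and the penalty form are illustrative assumptions, not anything prescribed here.

```python
import numpy as np

# Fit degrees 0..8 to noisy samples of a cubic and score each fit with
#   (n/2) log(RSS/n) + (k/2) log(n),   k = number of coefficients,
# a standard large-sample penalised-likelihood criterion, used here for illustration.

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=n)   # 'true' model is cubic

for degree in range(9):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1
    score = 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)
    print(f"degree {degree}: RSS = {rss:8.4f}, penalised score = {score:8.2f}")

# The raw RSS keeps falling as the degree rises; the penalised score typically
# bottoms out near the true degree and then climbs again.
```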
The gaussian mixture modelling problem, how many gaussians should we use, poses the question quite starkly. The log-likelihood of the data with respect to the model will increase whenever we increase the number of gaussians, until the degenerate case of non-invertible covariance matrices makes it infinite. But there is a penalty to be paid in terms of the number of parameters needed to specify the increasing number of gaussians. It is sensible to penalise the model by offsetting the increase in log-likelihood against the corresponding increase in the number of parameters, and seeking to minimise the combination. But then the question arises: what should be the rate of exchange between log-likelihood and the parameter count? Rissanen gives us a natural and plausible rate of exchange. Because the ideas have not yet percolated through to the Pattern Recognition community in its entirety, I shall explain them in elementary terms.
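By way of illustration, the sketch below scores mixtures of one to five gaussians on a synthetic two-cluster data set using scikit-learn's BIC, whose (parameters/2) log n penalty is the usual large-sample stand-in for such a rate of exchange. Rissanen's stochastic complexity refines this, so treat the sketch as indicative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# How many gaussians?  Log-likelihood always improves with more components,
# so we compare a penalised score (BIC: lower is better) instead.

rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(100, 1)),
    rng.normal(loc=+3.0, scale=1.0, size=(100, 1)),
])   # two well-separated clusters in one dimension

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    loglik = gm.score(data) * len(data)      # total log-likelihood of the data
    print(f"{k} component(s): log-likelihood = {loglik:9.2f}, BIC = {gm.bic(data):9.2f}")

# The log-likelihood rises with every extra gaussian; the penalised score is
# typically minimised at two, the number actually used to generate the data.
```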
I would like to be able to assume a moderate familiarity with Shannon's Information Theory, which is by now an essential part of the mental furniture of anybody with pretensions to an education, but it seems unsafe to do so. Typically, an engineer's mathematical education seems to consist increasingly of having memorised some recipes and then forgotten them, and computer scientists, so called, have left out most mathematics except a little finitary material such as logic and search algorithms. So I shall explain the basics as I proceed. If I insult your intelligence and knowledge, then I put my trust in your tolerance.