
Minimum Description Length Models

So far we have considered the issue of selecting, from some family of models, the model which is `best' by asking which model assigns the data the highest probability, possibly bearing in mind prior prejudices about reasonable results. But in comparing two models there is another issue besides the likelihood of the data having arisen from each: the complexity of the model itself.

Simple models which give reasonably close results are more attractive than very complicated models which give exactly the right answer. In fact we may decline to believe a model which fits the data too exactly, and we always feel uncomfortable with very complicated ones. So model complexity is a consideration, and the question of how we measure it arises naturally. This is where Rissanen comes in.

Rissanen has argued roughly as follows: a model of a data set allows you to compress the data for purposes of transmission or storage. If, for example, you had a data set consisting of a sequence of 100 digits which went 314159... and agreed with $\pi$ until the end of the sequence, sending the information coded as `first hundred digits of $\pi$, forget the decimal point' would compress the sequence rather considerably, as long as the guy at the other end knew what $\pi$ meant. If you don't feel safe about this, you may feel obliged to tell him, by, for example, also sending an algorithm for generating $\pi$. This extra I shall call the `overhead'. Now one model is better than another in Rissanen's sense if it does more compression, taking the overhead into account. If you reason that the more you compress the data, the more information you are extracting from it, this sounds very plausible. This principle allows us to reject the maximum likelihood `raw data' model in favour of a gaussian model when the number of data points exceeds some quite small number. It also allows us to make sensible decisions on how many gaussians one ought to use in modelling a point set as a mixture of gaussians, a matter of practical importance when we get to syntactic pattern recognition. The ideas are related to Chaitin's attempts to characterise randomness; see Chaitin (1975) and Chaitin (1987) for an antidote to some of the cynicism of the present and an earlier section.
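
To make the bookkeeping concrete, here is a toy sketch in Python. Every figure in it (32 bits for naming $\pi$, 2000 bits for a $\pi$-generating algorithm) is an illustrative assumption of mine, not a number from Rissanen, but the shape of the comparison is the point:

import math

def literal_bits(n_digits):
    # Sending the digits as they stand: log2(10), about 3.32 bits per decimal digit.
    return n_digits * math.log2(10)

def model_bits(n_digits, receiver_knows_pi=True, algorithm_bits=2000):
    # Cost of the message "first n digits of pi, forget the decimal point",
    # assuming sender and receiver share a vocabulary in which naming pi is
    # cheap: say 32 bits for the instruction plus log2(n) bits for the count.
    # These figures are illustrative guesses.
    instruction = 32 + math.log2(n_digits)
    # If the receiver has never heard of pi, we must also ship an algorithm
    # for generating it; 2000 bits is again a pure guess at its length.
    overhead = 0 if receiver_knows_pi else algorithm_bits
    return instruction + overhead

for n in (100, 1000):
    print(n, round(literal_bits(n)),
          round(model_bits(n)),
          round(model_bits(n, receiver_knows_pi=False)))
# At 100 digits the pi `model' already compresses handsomely when the
# receiver knows pi, but not when the algorithm (the overhead) must be
# sent as well; by 1000 digits it wins even with the overhead included.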

In this section I sketch the recently developed, and indeed still developing, theory of Rissanen on Stochastic Complexity, as described in his book, Rissanen (1989), and his papers (see the bibliography). It offers a philosophically quite different approach to finding the `best' model to account for the data, one which avoids some of the objections to the classical theory that I have pointed out. In particular it appears to solve elegantly the issue of how many gaussians one ought to have in a gaussian mixture model for a particular data set. (The answer may be `none', if the data set is small enough!) This will concern us in the next chapter, where we meet such beasts.

The general question of the order of a model occurs in other settings too. At the simple level of fitting a polynomial to a set of data points, the question arises as to what degree of polynomial is appropriate. At one extreme one can have the degree so high that every point is fitted exactly; at the other one can assume the polynomial is constant. The same problem arises in the case of Neural Nets, as we shall see in a later chapter.
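
By way of a preview of how such a choice of degree might be made, here is a toy sketch in Python. It charges each coefficient roughly $\frac{1}{2}\log n$ nats, which is the commonly quoted asymptotic form of the penalty whose justification is the business of the following sections; the sketch is an illustration of mine, not Rissanen's own procedure:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: a cubic polynomial plus gaussian noise (purely illustrative).
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=n)

def description_length(degree):
    # Crude two-part code length, in nats, for a degree-d polynomial fit.
    # First part: roughly (k/2) log n to state the k = degree + 1 coefficients
    # (the noise variance is not counted, for simplicity).
    # Second part: the negative log-likelihood of the residuals under a
    # gaussian noise model with the fitted variance.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = residuals.var()
    k = degree + 1
    neg_log_lik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return neg_log_lik + 0.5 * k * np.log(n)

for d in range(0, 10):
    print(d, round(description_length(d), 1))
# The printed lengths typically reach their minimum near the true degree
# and then creep up again: extra coefficients cost more than they save.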

The gaussian mixture modelling problem, of deciding how many gaussians we should use, makes the issue quite stark. There will be an increase in the log-likelihood of the data with respect to the model whenever we increase the number of gaussians, until the degenerate case of non-invertible covariance matrices makes it infinite. But there is a penalty to be paid in terms of the number of parameters needed to specify the increasing number of gaussians. It is sensible to penalise the model in some way by offsetting the increase in log-likelihood against the corresponding increase in the number of parameters, and seeking to minimise some combination of the two. But then the question arises: what should be the rate of exchange between log-likelihood and the parameter count? Rissanen gives us a natural and plausible rate of exchange. Because the ideas have not yet percolated through the whole of the Pattern Recognition community, I shall explain them in elementary terms.
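
Here, as a minimal sketch of the bookkeeping rather than Rissanen's own derivation, is the mixture case in Python, with the exchange rate approximated by the rough $\frac{k}{2}\log n$ nats for $k$ real parameters and scikit-learn's EM fitter used as a convenient stand-in for any mixture estimator:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy one-dimensional data drawn from two gaussians (illustrative only).
data = np.concatenate([rng.normal(-2.0, 1.0, 150),
                       rng.normal(3.0, 0.5, 150)]).reshape(-1, 1)
n = len(data)

def two_part_length(k):
    # Negative log-likelihood plus (number of parameters / 2) * log n, in nats.
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    neg_log_lik = -gm.score(data) * n   # score() is the mean log-likelihood per point
    n_params = 2 * k + (k - 1)          # means, variances, mixing weights (1-D case)
    return neg_log_lik + 0.5 * n_params * np.log(n)

for k in range(1, 7):
    print(k, round(two_part_length(k), 1))
# The criterion usually bottoms out at k = 2 here: further gaussians buy
# too little extra log-likelihood to pay for their own description.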

I would like to be able to assume a moderate familiarity with Shannon's Information Theory, which is by now an essential part of the mental furniture of anybody with pretensions to an education, but it seems unsafe to do so. Typically, an engineer's mathematical education seems to consist increasingly of having memorised some recipes and then forgotten them, and computer scientists, so called, have left out most mathematics except a little finitary material such as logic and search algorithms. So I shall explain the basics as I proceed. If I insult your intelligence and knowledge, then I put my trust in your tolerance.



 