
Summary of Rissanen Complexity

Suppose we have two models for some set of data; we may think of the data as coming in sequentially for our purposes, in which case we can imagine the two models, q1 and q2, as predictors. The amount of information which the data supplies to each model, which I called the surprise, is simply the sum over the data set of the logarithm of the improbability of each datum, as assessed by that model. Then, all other things being equal, the less surprise or information supplied by the data to the model, the better the model. In the extreme case where one model is quite certain of its predictions and always correct, the data supplies no information at all. In the other extreme case, where the model regards every possibility as equally likely, the information supplied on each occasion is the logarithm of the number of possible states the datum can occupy.
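As a concrete illustration, here is a minimal sketch in Python of the surprise computation just described; the two probability tables and the data stream are invented purely for illustration.

    import math

    # Two hypothetical predictors over a three-symbol alphabet {'a', 'b', 'c'}.
    q1 = {'a': 0.7, 'b': 0.2, 'c': 0.1}
    q2 = {'a': 0.2, 'b': 0.4, 'c': 0.4}

    # An invented data stream, imagined as arriving sequentially.
    data = ['a', 'a', 'b', 'a', 'c', 'a', 'b', 'a', 'a', 'c']

    def surprise(model, data):
        """Total information (in bits) the data supplies to the model:
        the sum over the data of log2(1 / probability of each datum)."""
        return sum(math.log2(1.0 / model[x]) for x in data)

    print("surprise for q1:", surprise(q1, data))
    print("surprise for q2:", surprise(q2, data))
    # Other things being equal, the model with the smaller surprise is the
    # better one; here q1, which expects 'a' to be common, wins.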

We suppose that there is some genuine random process specified by a pdf $p$ operating to produce the data, and that the model we have is represented by $q$. The sum $\sum_{{\bf x}\in D} p({\bf x}) \log_2\left({1 \over q({\bf x})}\right)$ gives the idealised value of the information supplied to the model $q$ by data generated in fact by the process $p$. This is larger than $\sum_{{\bf x}\in D} p({\bf x}) \log_2\left({1 \over p({\bf x})}\right)$, the idealised value of the information supplied to the model by data generated from the model itself, that is, the entropy of $p$, except in the case when $p = q$. The difference,

\begin{displaymath}
\sum_{{\bf x}\in D} p({\bf x}) \log_2\left({1 \over q({\bf x})}\right)
 - \sum_{{\bf x}\in D} p({\bf x}) \log_2\left({1 \over p({\bf x})}\right)
 = \sum_{{\bf x}\in D} p({\bf x}) \log_2\left( \frac{p({\bf x})}{q({\bf x})}\right)
\end{displaymath}

known as the Kullback-Leibler distance, is therefore always non-negative, and is used as a natural measure of the goodness of fit of one pdf to another. It is not symmetric. In our case we are obliged to estimate it since we do not know p, and we can estimate it only up to the unknown entropy of the actual distribution p. At least this allows us to compare two pdfs q1 and q2 on a given data set to see which is better.
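For discrete distributions the Kullback-Leibler distance is easy to compute directly; the following sketch in Python, with invented probability tables, exhibits the two properties just mentioned, non-negativity and asymmetry.

    import math

    def kl(p, q):
        """Kullback-Leibler distance D(p||q) in bits, for two distributions
        given as dictionaries over the same finite set of outcomes."""
        return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

    # Invented distributions over three outcomes, for illustration only.
    p = {'a': 0.5, 'b': 0.3, 'c': 0.2}
    q = {'a': 0.3, 'b': 0.3, 'c': 0.4}

    print(kl(p, q))   # non-negative; zero exactly when p equals q
    print(kl(q, p))   # generally differs from kl(p, q): not symmetric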

It is fairly easy to see that if the two models q1 and q2 were predicting repeatedly from a finite list of events, and if they were to gamble with each other by making bets based on their separate predictions, using their probabilities to establish the odds and the stake size in any one of a number of sensible ways, then the better predictor, in the information sense, would make money from the worse one. The practical upshot is simply that we maximise the log-likelihood of the data for each model. If instead of two models we have a manifold of models, we simply maximise the log-likelihood over the manifold.
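One of the `sensible ways' just alluded to, chosen here for its transparency and certainly not the only one, is for q2 to pay q1 on each event the logarithm of the ratio of the probabilities they assigned to it; q1's cumulative winnings are then exactly the difference between the surprise the data supplies to q2 and the surprise it supplies to q1. A short simulation with invented distributions shows the better predictor steadily accumulating winnings:

    import math
    import random

    random.seed(0)

    outcomes = ['a', 'b', 'c']
    p_true = [0.6, 0.3, 0.1]                 # the process actually generating events
    q1 = {'a': 0.5, 'b': 0.3, 'c': 0.2}      # the better predictor
    q2 = {'a': 0.2, 'b': 0.3, 'c': 0.5}      # the worse predictor

    winnings = 0.0   # q1's cumulative winnings from q2, in bits
    for _ in range(10000):
        x = random.choices(outcomes, weights=p_true)[0]
        # q2 pays q1 log2(q1(x)/q2(x)); negative on rounds where q2
        # assigned the higher probability to the event that occurred.
        winnings += math.log2(q1[x] / q2[x])

    print(winnings)   # drifts upward: the better predictor makes money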

If other things are not equal, we may choose to penalise a model which has more parameters than another by some rate of exchange between the mean information supplied to the model by the data per data point, and the number of parameters in the model. The essential reason for this is that we want to use the model to predict what is going to happen next, and what is at issue is our confidence in the ability of the model to extract information from the data set maximally while not being unduly affected by noise. Occam's Razor is the doctrine `Entities are not to be multiplied unnecessarily', or more popularly: `Given two theories, both of which account for the data, choose the simpler'. Many centuries after William of Occam, we may be in a position to say it in algebra with some precision instead of in Latin.

Why should we penalise a model with more parameters? One's intuitions, if one has messed around with models and data to any serious extent, should tell one that estimating too many parameters from too little data is unlikely to give anything very effective at predicting what is going to happen next. It seems reasonable that this state of affairs should be formulated as a penalty on the number of parameters; the only question that remains to be settled is the rate of exchange between the log-likelihood (which will almost always improve with more parameters) and the number of parameters themselves.

The ideal rate of exchange would be determined by testing the model on later data to see how well it performs. After all, that is what we want models for. The measure of better performance ought to be the same as the measure above: the amount of information supplied to the model by later data, generated by the same process that gave the original data, will be less for the better model. What grounds have we for believing that the model with fewer parameters will do rather better than expected, while the model with more parameters will do rather worse? The model with too many parameters will be trying to model the randomness of any particular data set as if it were significant, when it isn't. It has actually extracted too much information from the data, finding pattern and structure where there isn't any. This is easily done; brains do it when they see butterflies or body parts in Rorschach blots, faces in flames, and so on. This line of thought can lead to the Akaike Information Criterion. Here we ask: if the data is a sample $D$ generated by a pdf $p$, and if we model it by a pdf $q$, what is the expected log-likelihood of the data relative to $q$? It turns out that in many cases the maximum log-likelihood choice of $q$ is a biased estimate of the mean expected log-likelihood of $p$ itself, and subtracting off the number of parameters removes the bias. See Akaike and also Sakamoto et al. for an elucidation of these somewhat technical points. See also Shibata.
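In its usual operational form the criterion is AIC $= -2\log \hat{L} + 2k$, where $\hat{L}$ is the maximised likelihood and $k$ the number of estimated parameters, the model with the smallest value being preferred. Here is a minimal worked sketch in Python; the data is invented, drawn from a zero-mean gaussian, and we compare a one-parameter gaussian model (mean fixed at zero) with a two-parameter one (free mean).

    import math
    import random

    random.seed(1)
    # Invented data: the true process is a zero-mean gaussian.
    data = [random.gauss(0.0, 1.0) for _ in range(200)]
    n = len(data)

    def gauss_loglik(data, mu, var):
        """Log-likelihood of the data under a gaussian N(mu, var)."""
        return sum(-0.5 * math.log(2 * math.pi * var)
                   - (x - mu) ** 2 / (2 * var) for x in data)

    # Model A: mean fixed at zero, one free parameter (the variance).
    var_a = sum(x ** 2 for x in data) / n
    ll_a, k_a = gauss_loglik(data, 0.0, var_a), 1

    # Model B: free mean and variance, two parameters.
    mu_b = sum(data) / n
    var_b = sum((x - mu_b) ** 2 for x in data) / n
    ll_b, k_b = gauss_loglik(data, mu_b, var_b), 2

    # AIC = -2 log-likelihood + 2 (number of parameters); smaller is better.
    print("AIC, fixed mean:", -2 * ll_a + 2 * k_a)
    print("AIC, free mean: ", -2 * ll_b + 2 * k_b)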

Alternatively, we can argue that since we know nothing of the model class which actually generated the data, we should abandon this attempt and turn to more robust considerations: in particular, that compression is pertinent.

A pdf over ${\mathbb R}^n$ can be regarded as a method of compressing a set of points in ${\mathbb R}^n$. In doing this we simply think of the pdf as telling us how much to stretch bits of the space, or alternatively how to re-measure it, so that we can code points efficiently, just as with Shannon coding. The amount of compression is greatest when the pdf is a good model of the distribution of the set. We can turn this on its head and define a `good model' as one which gives good compression. Certainly for two different models of the same family, the better model is the one with greater log-likelihood. The `same family' of models means that there is some finite set of parameters which specify the model within the family, which is to say the `same family' of models comprises a manifold. So far, then, this is merely a new way of writing Maximum Likelihood models. But in the case where the class is larger than this and comprises, say, several manifolds of different dimension, we can also code the model parameters and store the extra overhead. This allows us to compute the total length of the data as coded using a model, plus the overhead of specifying the model. It is not particularly difficult to compute, for such things as gaussian mixture models with different numbers of gaussians, what the net saving is. There is an advantage in having a better model, offset by the drawback of having to count extra parameters. Penalising the number of parameters is an obvious thing to do, but Rissanen offers a coherent explanation of precisely how to do this. The Rissanen approach, under certain circumstances, agrees with the Akaike approach asymptotically. Some might find this comforting, others mysterious.
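The usual asymptotic form of the resulting criterion, stated here without derivation, charges $-\log_2 \hat{L}$ bits for the data coded with the fitted model, plus roughly ${k \over 2}\log_2 N$ bits of overhead for stating $k$ real parameters each to a precision of about $1/\sqrt{N}$, where $N$ is the number of data points. A minimal sketch of the bookkeeping, with invented log-likelihoods for mixtures of one, two and three gaussians in the plane:

    import math

    def description_length(log_likelihood, k, n):
        """Two-part code length in bits: the data coded with the fitted model,
        plus (k/2) log2(n) bits of overhead for stating the k parameters.
        Smaller is better."""
        data_bits = -log_likelihood / math.log(2)     # convert nats to bits
        model_bits = 0.5 * k * math.log2(n)
        return data_bits + model_bits

    # Invented numbers: maximised log-likelihoods (in nats) for gaussian
    # mixture models with 1, 2 and 3 components fitted to n = 500 points in
    # the plane.  A 2-d gaussian has 5 parameters; a mixture of m of them
    # has 5m parameters plus m - 1 free mixing weights.
    n = 500
    for m, loglik in [(1, -2300.0), (2, -2150.0), (3, -2140.0)]:
        k = 6 * m - 1
        print(m, "gaussians:", round(description_length(loglik, k, n), 1), "bits")

With these invented numbers the two-gaussian model wins: the small improvement in likelihood from a third gaussian does not pay for the extra bits needed to state its parameters.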

Other work has been done from a Bayesian point of view on attempting to put in an Occam's Razor, that is to say a device for penalising a large number of parameters or entities. The idea here is that we can penalise a model family of higher complexity, that is to say with more parameters, by observing that the normalising constant $\int p(x\vert m)p(m)\,dm$ will be bigger when the integration is over a higher dimensional manifold of models $m$, thus reducing $p(m\vert x)$ in comparison with some alternative and `smaller' such manifold. The fact that the integral depends on the parametrisation of the manifold of models is grounds for some degree of nervousness among the innocent, but one can be persuaded that this is not fatal. The result is that we favour a model from a class when the particular model accounts for the data well, when we have no strong prior prejudice against the model, and when on average the other members of the model class do fairly badly, possibly because there are so many of them. This has a certain charm. See MacKay's Doctoral Thesis in the bibliography for a well written propaganda job and some good references.
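A toy illustration of this averaging effect (my own example, not one from the references): compare the zero-parameter model class `the coin is fair' with the one-parameter class `the coin has an unknown bias $\theta$, uniformly distributed a priori'. The evidence for the larger class is the average of $p(\mbox{data}\vert\theta)$ over all its members, and since most of those members fit fair-looking data badly, the simpler class wins unless the data is conspicuously biased.

    import math

    def log_evidence_fair(heads, tails):
        """Evidence for the zero-parameter model: every toss has probability 1/2."""
        return (heads + tails) * math.log(0.5)

    def log_evidence_biased(heads, tails):
        """Evidence for the one-parameter model, theta ~ Uniform(0, 1):
        the integral of theta^h (1 - theta)^t over theta is the Beta
        function B(h + 1, t + 1), computed here via log-gamma."""
        return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
                - math.lgamma(heads + tails + 2))

    for heads, tails in [(52, 48), (70, 30)]:
        fair = log_evidence_fair(heads, tails)
        biased = log_evidence_biased(heads, tails)
        print(heads, "heads,", tails, "tails ->",
              "prefer fair" if fair > biased else "prefer biased")

With 52 heads in 100 tosses the fair-coin class is preferred; with 70 heads the evidence for the biased class overcomes its built-in handicap.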

In the case of gaussian mixture modelling, the method implicitly recommended in the next chapter for dealing with pattern recognition problems if we opt for statistical methods, the choice of how many gaussians to use can be settled using the Akaike Information Criterion (AIC), the MDL or stochastic complexity criterion of Rissanen, or others, notably the BIC. Experiments conducted on data sets actually generated by a known number of gaussians suggest that the Rissanen criterion works tolerably well over a range of dimensions and numbers of gaussians, and performs better than the AIC. A rough guess at a sensible number may often be made by eyeballing the data under projection, provided the dimension is not too high, and more sophisticated methods of automating the eyeballing to some extent will occur to the reasonably ingenious. Then some experiments with different numbers of gaussians and some arithmetic along the lines indicated above can yield a sensible number of gaussians to use in the model.
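For anyone who wants to try the arithmetic without writing the estimation code from scratch, here is a minimal sketch assuming the scikit-learn library is available. The data is synthetic, drawn from three gaussians in the plane; the library reports AIC and BIC scores directly, and the BIC is, up to a constant factor, the asymptotic description length computed in the earlier sketch.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Synthetic data: three well-separated gaussian clusters in the plane.
    data = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=1.0, size=(200, 2)),
        rng.normal(loc=(6.0, 0.0), scale=1.0, size=(200, 2)),
        rng.normal(loc=(3.0, 5.0), scale=1.0, size=(200, 2)),
    ])

    # Fit mixtures with 1 to 6 components and compare the criteria;
    # smaller is better for both.
    for m in range(1, 7):
        gmm = GaussianMixture(n_components=m, random_state=0).fit(data)
        print(m, "gaussians   AIC:", round(gmm.aic(data)),
              "  BIC:", round(gmm.bic(data)))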

I shall return to the matter of compression and optimal modelling when we come to Neural Net models, which are notorious for having enormous dimensions for representing the data and consequently large numbers of parameters.


Mike Alder
9/19/1997