
Summary of the chapter

The basic assumption behind the probabilistic analysis of data is that there is some process, or processes, operating with the property that even when the input to the process is more or less known, the outputs differ. Rather like the drunk dart throwers of chapter one, the process ought, in a neat Newtonian, causal, clockwork universe, to replicate its last output this time, but it generally does not. The classic examples, and the origins of probability theory, are in playing cards, throwing dice and tossing coins, where, although we have a convincing causal model (Newtonian Dynamics) for the process, our ignorance of the initial states leaves us only `average' behaviour as effective data in deciding what to expect next.

The assumptions of probability theory are that when you replicate an experiment or repeat something like tossing a coin or throwing a die, what you actually do is to reapply a map with some generally different and unknown and indeed unknowable initial state. All we know about the initial states is the measure of those subsets of them that lead to different sets of outcomes.

We can simplify then to having a measure on the space of outcomes. From this model of random processes as non-random processes but with ignorance of the precise initial state, all of probability theory derives, and on this basis, much modern statistics is founded. We rapidly[*] deduce the existence of probability density functions over the space of outcomes as measuring the results of repeating the application of the random variable an infinite (usually uncountably infinite) number of times. When there are only finitely many possible outcomes, we assign a probability distribution to the elementary outcomes. The practicalities of trying to decide whether one state of affairs is a repetition of another state of affairs are ignored on the grounds that it is too difficult to lay down rules. So what a replication is, nobody knows, or them as knows ain't saying. To highlight this point, consider a robot, programmed with classical mechanics, required to throw or catch a cricket ball or to perform some similar mechanical task.[*] It might be a programming problem, but the principles are clear enough. Now contrast this with the case of a robot which is intended to play a reasonable game of poker. To what extent is the information acquired in one game transferable to another? This is the problem of replication; it involves what psychologists call `transfer of learning', and we do not understand it. When we do, you will be able to play a poker game with two human beings and a robot, and the robot will catch you if you cheat. Don't hold your breath until it happens.

Model families, often parametrised as manifolds, spring out of the collective unconscious whenever a suitable data set is presented, and there are no algorithmic procedures for obtaining them. Again, this is bad news for the innocent, such as myself, who want to build a robot to do it. There are procedures for taking the manifold of models and choosing the point of it which best fits or describes the data. If there were only one we might feel some faith in it, but there are several. There is no consensus as to which procedure gives the best model, because there are competing ideas of what `best' means. There are three well known, and occasionally competing, procedures for picking, from some set of models, the best model for a given data set, and they all demand that someone, somehow, has already worked out what the choice of models has to be. (Take a model, any model...) These three procedures, which we have discussed at some length, give rise to the Maximum Likelihood model, the Bayesian optimal model and the Minimum Description Length model. Sometimes these are all different, sometimes not. Arguments to persuade you that one is better than another are crude, heuristic and possibly unconvincing. Tough luck, that's the way the subject is.
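Stated compactly (the symbols here are my own shorthand, not notation fixed earlier in the chapter, and I take the Bayesian `best' model to mean the one maximising the posterior density), the three recipes, for a data set $D$ and a model with parameter $\theta$ on the manifold of models, are

\[
\hat\theta_{\rm ML} = \mathop{\rm arg\,max}_{\theta}\; p(D \mid \theta), \qquad
\hat\theta_{\rm Bayes} = \mathop{\rm arg\,max}_{\theta}\; p(D \mid \theta)\, p(\theta), \qquad
\hat\theta_{\rm MDL} = \mathop{\rm arg\,min}_{\theta}\; \bigl[\, L(\theta) + L(D \mid \theta) \,\bigr],
\]

where $p(\theta)$ is a prior density on the manifold and $L(\cdot)$ denotes a description length in bits. When the prior is flat the first two coincide, and the third differs from the first, roughly, by the cost of writing the model itself down.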

Thus we face some rather hairy problems. The state of the art is something like this:

We cannot in general evolve a model of any useful sort from the data as yet; we have to rely on people looking at the data, and then inspecting the entrails of chickens or gallivanting about in their subconsciouses or wherever, and bringing back a model family.

When we have a set of models described by finitely many parameters so that the models comprise a smooth manifold, there are several different ways of picking the best one from the set. We can compute a figure of merit for the pair consisting of a model and a set of data which the model might or might not have generated. This figure of merit is a real number called the Likelihood, and hence for a fixed data set there is a function defined from the manifold of models into $\mathbb{R}$. This function has a unique maximum often enough to make it tempting to use Maximum Likelihood as the natural meaning of the `best' model to account for the data. However, this pays no attention to the possible existence of other information which might predispose us to some other model; in particular, there might be a (prior) probability density function associated with the manifold of models. This, many people might feel in their bones, ought to be used in choosing the `best' model.[*] Where the prior pdf comes from is a matter seldom discussed, but presumably it comes from some other source of data about the system under investigation: if we are into coin tossing then presumably it derives from having tossed other, different but similar, coins in the past.
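To make the contrast concrete, here is a minimal sketch in Python, assuming the model family is the one-parameter family of coins of bias $p$ and assuming, purely for illustration, a Beta prior on the bias; the counts and the prior parameters are invented, not taken from the chapter.

# Sketch: Maximum Likelihood versus the posterior-maximising (Bayesian) choice
# of a coin's bias.  The Beta(a, b) prior and the counts are illustrative assumptions.

heads, tails = 7, 3                     # invented data: 7 heads, 3 tails

# Maximum Likelihood: the bias maximising p(data | p) is the observed frequency.
p_ml = heads / (heads + tails)

# Bayesian choice: with a Beta(a, b) prior the posterior is Beta(a + heads, b + tails),
# and its maximum (the posterior mode) is given by the usual formula below.
a, b = 5.0, 5.0                         # a prior mildly favouring a fair coin
p_map = (heads + a - 1) / (heads + tails + a + b - 2)

print("ML estimate of the bias:    %.3f" % p_ml)    # 0.700
print("posterior mode of the bias: %.3f" % p_map)   # 0.611

The prior drags the answer back towards a fair coin; with a flat prior (a = b = 1) the two estimates coincide, which is the sense in which Maximum Likelihood is the Bayesian answer for someone who claims to know nothing in advance.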

Finally, if we have feelings in our bones about information theory as the right place to found statistical reasoning, and if we also feel in our bones a preference for simple models rather than complicated ones, we may be able to fall back on Rissanen-style arguments if we are lucky, but many statisticians don't accept Rissanen's ideas. Rissanen gives us a chance to reject models which give a high likelihood but seem to be too complex, and to prefer simpler models which assign a lower likelihood to the data. I discussed philosophical issues, I drew morals and extracted, some would say extorted, principles. I then went on to pdfs, starting with a coin model and showing how the ML model for the coin compressed the results of tossing it. Then I compressed a set of points in the unit interval, using a pdf over [0,1]. Various useful and intuitive results were proved in a manner that no modern analyst would tolerate but that was good enough for Gauss.
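As a rough illustration of the Rissanen-style trade-off, the following Python sketch compares the total number of bits needed to describe a string of tosses under a fair coin with the number needed under an ML-fitted biased coin, once the biased coin is made to pay for stating its parameter. The data are invented, and the charge of $\frac{1}{2}\log_2 n$ bits per parameter is the standard asymptotic approximation rather than anything derived in this chapter.

from math import log2

# Two-part (Rissanen-style) code lengths for a string of coin tosses.
# The data are invented; the 0.5 * log2(n) parameter cost is the usual approximation.

tosses = "HHTHHHTHHH"                   # invented data: 8 heads, 2 tails
heads = tosses.count("H")
tails = len(tosses) - heads
n = len(tosses)

def data_bits(p):
    # Bits needed to encode the tosses under a coin with P(head) = p.
    return -(heads * log2(p) + tails * log2(1 - p))

# Model 1: the fair coin.  Nothing to state, so no model cost.
fair_total = data_bits(0.5)

# Model 2: a biased coin at the ML bias, paying 0.5 * log2(n) bits to state it.
p_ml = heads / n
biased_total = 0.5 * log2(n) + data_bits(p_ml)

print("fair coin:   %.2f bits" % fair_total)
print("biased coin: %.2f bits (parameter cost included)" % biased_total)
# MDL prefers whichever total is smaller: a better fit must pay for its extra complexity.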

The subjectivity which allows Statisticians to spend uncommon amounts of time in disputation also means that, in practice, what is commonly done is to choose the answer that pleases you most.

One should not be deterred from considering a theory just because of the air of dottiness about it which appears once one has stripped away the technicalities and described it in plain English. Theories are to be judged by their consequences, and on that criterion Probability Theory has been extremely successful. There are, nevertheless, quite serious problems with the semantics of the theory. For a start, the repeated applications of the random variable mean that the different initial states have to be equally likely, but this is part of what we are trying to define by the apparatus of random variables. In applications, Rissanen has pointed out that there is no operational procedure for deciding if something is `random', and experts argue over the legitimacy of various techniques while the trusting just use them. Formalisation of these ideas via the Kolmogorov axioms has meant that while the Mathematics may be impeccable, it isn't at all clear how reality gets into the picture. The innocent engineer who wants recipes, not rationales, can be a party to his own deception and often has been.

You can see why so many people dislike Stats. Just as you've finished learning some complicated methods of coping with uncertainty and doubt and finally come to believe you've got a stranglehold on the stuff, it leers at you and tells you it's all a matter of opinion as to whether it's OK to use them. For the innocent wanting certainty, it isn't as good a buy as religion.


Mike Alder
9/19/1997