
Models and Probabilistic Models

Of course, nobody comes along and gives you a random variable. What they usually do is give you either a description of a physical situation or some data, and leave it to you to model the description by means of an rv. The cases of the cards and the coins are examples of this. With the coins, for example, you may take the 12-dimensional state space of physics if you wish, but it suffices to have a two point space with measure 0.5 on each point, the map sending one to Heads and the other to Tails. For two-up, you can make do with a two point space, one labelled `same', the other labelled `different', or you can have two copies of the space of Heads and Tails, giving four points, a pair of them labelled `same', the other pair `different', or a twenty-four dimensional state space with half the space black and the other half white; it doesn't much matter.

The histograms over the space of outcomes are the same.

Statisticians sometimes say that the rv, or the histogram or pdf, is a model for the actual data. In using this terminology they are appealing to classical experience of mathematical models such as Newtonian Dynamics. There are some differences: in a classical mathematical model the crucial symbols have an interpretation in terms of measurables and there are well defined operations, such as weighing, which make the interpretation precise. In the case of probabilistic models, we can measure values of a physical variable, but the underlying mechanism of production is permanently hidden from us. Statistical or probabilistic models are not actually much like classical mathematical models at all, and a certain kind of confidence trick is being perpetrated by using the same terminology. Let us look at the differences.

Newton modelled the solar system as a collection of point masses called planets revolving around another point mass called the sun, which was fixed in space. The reduction of a thing as big as a planet to a single point had to be carefully proved to be defensible, because we live on one of them and have prejudices about the matter. The Earth doesn't look much like a point to us. It can be shown that the considerable simplification will have no effect on the motion provided the planets are small enough not to bump into each other, and are spheres having a density function which is radial. Now this is not in fact true. Planets and suns are oblate, and the Earth is not precisely symmetric. But the complications this makes are not large, so we usually forget about them and get on with predicting where the planets will be at some time in the future. The results are incredibly good, good enough to send spacecraft roaring off to their destiny a thousand million miles away and have them arrive on schedule at the right location. Anybody who doesn't thrill to this demonstration of the power of the human mind has rocks in his head. Note that the model is not believed to be true, it is a symbolic description of part of the universe, and it has had simplifications introduced. Also, we don't have infinite precision in our knowledge of the initial state. So if the spacecraft is out by a few metres when it gets to Saturn, we don't feel too bad about it. Anybody who could throw darts with that precision could put one up a gnat's bottom from ten kilometres.

If we run any mathematical model (these days often a computer simulation) we can look at the measurements the model predicts. Then we go off and make those measurements, and look to see if the numbers agree. If they do, we congratulate ourselves: we have a good model. If they disagree but only by very small amounts that we cannot predict but can assign to sloppy measurement, we feel moderately pleased with ourselves. If they differ in significant ways, we go into deep depression and brood about the discrepancies until we improve the model. We don't have to believe that the model is true in order to use it, but it has to agree with what we measure when we measure it, or the thing is useless. Well, it might give us spiritual solace or an aesthetic buzz, but it doesn't actually do what models are supposed to do.

Probabilistic models do not generate numbers as a rule, and when they do they are usually the wrong ones. That is to say, the average behaviour may be the same as our data set, but the actual values are unlikely to be the same; even the model will predict that. Probabilistic models are models, not of the measured values, but of their distribution and density. It follows that if we have very few data, it is difficult to reject any model on the grounds that it doesn't fit the data. Indeed, the term `data' is misleading. There are two levels of `data': the first is the set of points in $\mathbb{R}^n$, and the second is the distribution and density of this set, which may be described by a probabilistic model. Since we get the second from the first by counting, and counting occurrences in little cells to get a histogram as often as not, if we have too few points to make a respectable histogram we can't really be said to have any data at the level where the model is trying to do its job. And if you don't have a measurement, how can you test a theory? This view of things hasn't stopped people cheerfully doing hypothesis testing with the calm sense of moral superiority of the man who has passed all his exams without thinking and doesn't propose to start now.

Deciding whether a particular data set is plausibly accounted for by a particular probabilistic model is not a trivial matter therefore, and there are, as you will see later, several ways of justifying models. At this point, there is an element of subjectivity which makes the purist and the thoughtful person uneasy. Ultimately, any human decision or choice may have to be subjective; the choice of whether to use one method or another comes easily to the engineer, who sees life as being full of such decisions. But there is still an ultimate validation of his choice: if his boiler blows up or his bridge falls down, he goofed. If his program hangs on some kinds of input, he stuffed up. But the decision as to whether or not the probabilistic advice you got was sound or bad is not easily taken, and if you have to ask the probabilist how to measure his success, you should do it before you take his advice, not after.

A probabilistic model then does not generate data in the same way that a causal model does; what it can do, for any given measurement it claims to model, is to produce a number saying how likely such a measurement is. In the case of a discrete model, with finitely many outcomes say, the number is called the probability of the observation. In the case of a continuous pdf it is a little more complicated. The continuous pdf is, recall, a limit of histograms, and the probability of getting any specified value is zero. If we want to know the probability of getting values of the continuous variable within some prescribed interval, as when we want to know the probability of getting a dart within two centimetres of the target's centre, we have to take the limit of adding up the appropriate rectangular areas: in other words we have to integrate the pdf over the interval. For any outcome in the continuum, the pdf takes, however, some real value. I shall call this value the likelihood of the outcome or event, according to the model defined by the pdf. [*]
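
The distinction can be made concrete with a minimal sketch in Python, under an assumption I am inventing purely for illustration: that the horizontal error of a dart throw is normally distributed with mean 0 and standard deviation 3 centimetres. The value of the pdf at a point is a likelihood; the probability of landing within 2 centimetres of the centre is the pdf integrated over the interval.

  import math

  # Invented model, for illustration only: a normal pdf for the horizontal
  # error of a dart throw, in centimetres.
  MU, SIGMA = 0.0, 3.0

  def pdf(x):
      """Value of the normal density at x: a likelihood, not a probability."""
      return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))

  def prob_between(a, b):
      """Probability of an outcome in [a, b]: the pdf integrated over the
      interval, obtained here from the normal cumulative distribution (erf)."""
      def cdf(t):
          return 0.5 * (1.0 + math.erf((t - MU) / (SIGMA * math.sqrt(2.0))))
      return cdf(b) - cdf(a)

  print(pdf(0.0))              # likelihood of dead centre, about 0.133 per centimetre
  print(prob_between(-2, 2))   # probability of landing within 2 cm, about 0.495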

If we have two data, then each may be assessed by the model and two probabilities or likelihoods output (depending on whether the model is discrete or continuous); multiplying these numbers together gives the probability or likelihood of getting the pair on independent runs of the model. It is important to distinguish between a model of events in a space of observables applied twice, and a model where the observables are pairs. The probability of a pair will not in general be the product of the probabilities of the separate events. When it is, we say the events are independent.
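
The point can be sketched with invented numbers: the single-toss coin model applied on two independent runs assigns to each pair of outcomes the product of the single-toss probabilities, whereas a model whose observables are pairs assigns probabilities to the pairs directly and need not factorise.

  # Invented numbers, for illustration only.
  p_single = {'H': 0.5, 'T': 0.5}

  # Model A: the single-toss model applied on two independent runs; the
  # probability of a pair is the product of the single-toss probabilities.
  p_pair_independent = {(a, b): p_single[a] * p_single[b]
                        for a in 'HT' for b in 'HT'}

  # Model B: a hypothetical model whose observables are pairs and which
  # favours alternation; its pair probabilities are assigned directly and do
  # not factorise into a product of marginals.
  p_pair_alternating = {('H', 'T'): 0.4, ('T', 'H'): 0.4,
                        ('H', 'H'): 0.1, ('T', 'T'): 0.1}

  print(p_pair_independent[('H', 'T')])  # 0.25: the product, tosses independent
  print(p_pair_alternating[('H', 'T')])  # 0.4: not the product of its marginals

In model B the marginal probability of Heads on either toss is still 0.5, but the pair probabilities are not the products of the marginals, so under that model the two tosses are not independent.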

For example, I might assert that any toss of a coin is an atomic phenomenon, in which case I am asserting that the probability of any one toss producing heads is the same as any other. I am telling you how to judge my model: if you found a strict alternation of heads and tails in a sequence of tosses, you might reasonably have some doubts about this model. Conversely, if I were to model the production of a sequence of letters of the alphabet by asserting that there is some probability of getting a letter `u' which depends upon what has occurred in the preceding two letters, the analysis of the sequence of letters, and the inferences which might be drawn from some data as to the plausibility of the model, would be a lot more complicated than for the model where each letter is produced independently, as though a die were being thrown to generate each letter. Note that this way of looking at things supposes that probabilities are things that get assigned by a model to events, things that happen. Some have taken the view that a model is a collection of sentences about the world, each of which may contain a number or numbers between 0 and 1; others, notably John Maynard Keynes, have taken the view that the sentence doesn't contain a number, but that its truth value lies between 0 and 1. The scope for muddle when trying to be lucid about the semantics of a subject is enormous.
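
The two kinds of letter model can be sketched as follows, with probabilities invented purely for illustration: under the first, the likelihood of a sequence is a product of fixed per-letter probabilities; under the second, each factor is a conditional probability depending on the preceding two letters.

  # Invented probabilities, for illustration only.
  p_letter = {'q': 0.01, 'u': 0.03, 'e': 0.12}     # independent-letter model

  def p_given_context(letter, context):
      """Hypothetical conditional probability of a letter given the preceding
      two letters; here a `u' is made very likely immediately after a `q'."""
      if letter == 'u' and context.endswith('q'):
          return 0.95
      return p_letter.get(letter, 0.02)

  def likelihood_independent(text):
      prob = 1.0
      for ch in text:
          prob *= p_letter.get(ch, 0.02)
      return prob

  def likelihood_second_order(text):
      prob = 1.0
      for i, ch in enumerate(text):
          prob *= p_given_context(ch, text[max(0, i - 2):i])
      return prob

  print(likelihood_independent('que'))   # 0.01 * 0.03 * 0.12
  print(likelihood_second_order('que'))  # 0.01 * 0.95 * 0.12, much larger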

Another difference between probabilistic models and causal models is the initial conditions. Most models are of the form: if condition A is observed or imposed, then state B will be observed. Here condition A and state B are specified by a set of measurements, i.e. by the values of vectors obtained by prescribed methods of measurement. Now it is not at all uncommon for probabilistic models to assume that a system is `random'. In the poker calculation, for example, all bets are off if the cards were dealt from a pack which had all the hearts at the top, for my having three hearts means that so do two of the other players and the last has four. So the probability of a flush being filled is zero. Now if you watched the dealer shuffle the cards, you may believe that such a contingency is unlikely, but it was the observation of shuffling that induced you to feel that way. If you'd seen the dealer carefully arranging all the hearts first, then the spades and clubs in a disorganised mess, and then the diamonds at the end, you might have complained. Why? Because he would have invalidated your model of the game. Now you probably have some loose notion of when a pack of cards has been randomised, but would you care to specify it in such a way that a robot could decide whether or not to use the model? If you can't, the inherent subjectivity of the process is grounds for being extremely unhappy, particularly for those of us in the automation business.

The terminology I have used, far from uncommon, rather suggests that some orders of cards in a pack are `random' while others are not, and shuffling is a procedure for obtaining one of the random orders. There are people who really believe this. Gregory Chaitin is perhaps the best known, but Solomonoff and Kolmogorov also take seriously the idea that some orders are random and others are not. The catch is that they define `random' for sequences of cards as, in effect, `impossible to describe briefly, or more briefly than by giving a list of all the cards in order'. This makes the sequence of digits consisting of the decimal expansion of $\pi$ from the ten thousandth place after the decimal point to the forty thousandth place very much non-random. But if you got them printed out on a piece of paper it would not be very practical to see the pattern. They would almost certainly pass all the standard tests that statisticians use, for what that is worth.
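
The force of the definition can be seen from a few lines of Python; a pseudo-random generator with a fixed seed is used rather than a spigot algorithm for the digits of $\pi$, only because it is shorter. The lines are a complete and very brief description of the ten thousand digits they print, so in the Chaitin-Kolmogorov sense the sequence is highly non-random, although on paper the digits would look patternless and might well pass the standard tests.

  import random

  # These few lines are a complete and very brief description of the ten
  # thousand digits they print; in the algorithmic sense the sequence is
  # therefore highly non-random, however patternless it looks on paper.
  random.seed(1729)        # fixed seed: the printed digits never change
  digits = ''.join(str(random.randrange(10)) for _ in range(10000))
  print(digits)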

There was a book published by the Rand Corporation once which was called `One Million Random Digits', leading the thoughtful person to enquire whether they were truly random or only appeared to be. How can you tell? The idea that some orders are more random than others is distinctly peculiar, and yet the authors of `One Million Random Digits' had no hesitation in rejecting some of the sequences on the grounds that they failed tests of randomness[*]. Would the observation that they can't be random because my copy of the book has exactly the same digits as yours, allowing me to predict the contents of yours with complete accuracy, be regarded as reasonable? Probably not, but what if an allegedly random set of points in the plane turned out to be a star map of some region of the night sky? All these matters make rather problematic the business of deciding if something is random or not. Nor does the matter of deciding whether something is realio-trulio random allow of testing by applying a fixed number of procedures. And yet randomness appears to be a necessary `condition A' in lots of probabilistic models, from making decisions in a card game to the sampling theory used by psephologists second-guessing the electorate.

These points have been made by Rissanen, but should trouble anybody who has been obliged to use probabilistic methods. Working probabilists and statisticians can usually give good value for money, and make sensible judgements in these cases. Hiring one is relatively safe and quite cheap. But if one were to contemplate automating one, in even a limited domain, these issues arise.

The notion of repeatability of an experiment is crucial to classical, causal models of the world, as it is to probabilistic models. There is a problem with both, which is: how do you know that all the other things which have changed in the interval between your so-called replications are indeed irrelevant? You never step into the same river twice, and indeed these days most people don't step into rivers at all, much preferring to drive over them; but you never throw the same coin twice either: it got bashed when it hit the ground the first time, and besides, last time was Tuesday and the planet Venus was in Caries, the sign of the dentist, and how do you know that doesn't matter? If I collect some statistics on the result of some measurements of coin tossing, common sense suggests that if the coin was run over by a train half way through the series and severely mangled, then this is enough to make me disinclined to regard the series before and after as referring to the same thing. Conversely, my common sense assures me that if another series of measurements on a different coin was made, and the first half were done on Wednesdays in Lent and the last half on Friday the thirteenth, then this is not grounds for discounting half the data as measuring the wrong thing. But there are plenty of people who would earnestly and sincerely assure me that my common sense is in error. Of course, they probably vote Green and believe in Fairies at the bottom of the garden, but the dependence on subjectivity is disconcerting.

In assigning some meaning to the term `probability of an event', we have to have a clear notion of the event being, at least in principle, repeatable with (so far as we can determine) the same initial conditions. But this notion is again hopelessly metaphysical. It entails at the very least an appeal to a principle asserting the irrelevance of just about everything, since a great deal has changed by the time we come to replicate any experiment. If I throw a coin twice, and use the data to test a probabilistic model for coins, then I am asserting, as a necessary part of the argument, that hitting the ground the first time didn't change the model for the coin, and that the fact that the moons of Jupiter are now in a different position is irrelevant. These are propositions most of us are much inclined to accept without serious dispute, but they have to be made explicit when we are trying to automate the business of applying probabilistic ideas.

If we try to apply the conventional notions to the case of a horse race, for instance, we run into conceptual difficulties: the race has never been run before, and will never be run again. In what sense then can we assign a probability to a horse winning? People do in fact do this, or they say they do and behave as if they do, to some extent at least, so what is going on here? The problem is not restricted to horse races, and indeed applies to every alleged replication. Anyone who would like to build a robot which could read in the results of all the horse races in history, inspect each horse in a race, examine the course carefully, and, taking into account whatever data were relevant, produce estimates of the probabilities of each horse winning, will see the nature of the difficulties. There is no universally accepted procedure which could do this even for the much simpler case of coin tossing.

So the question of what is an allowable measurement gives us a problem, since specifying what is irrelevant looks to be a time-consuming job. If the cards are marked, all bets are off on the matter of whether it's better to draw two to fill the flush or one for the straight. And how do you list all the other considerations which would make you believe that the argument given above was rendered inapposite? The classical case is much simpler, because we can tell whether or not something is being measured properly if we have been trained in the art of recognising a proper measurement, and we can build machines which can implement the measurements with relatively little difficulty. But how do we get a machine to tell if the other guys are cheating in a game of poker? How does it decide if the shuffling process has produced enough randomness? And when does it give up on the model because the data is not in accord with it? Subjectivity may be something engineers are used to, but they haven't had to build it into robots before. Yet whenever you test a model you have to ensure that the input conditions are satisfied and the measuring process properly conducted.

A fundamental assumption in classical systems, seldom stated,[*] is the stability of the model under variations in the data and the model parameters. If you cast your mind back to the curious discussions of weights joined by weightless threads of zero thickness, hanging over pulleys, which took place in applied mathematics classes in olden times, you may recall having been troubled by the thought that if the threads were of zero width then the pressure underneath them must have been infinite, and they would have cut through the pulley like a knife through butter. And they would have to be infinitely strong not to snap, because the force per unit cross-sectional area would also have been infinite. What was being glossed over, as you eventually came to realise, was that although the idealised model made no physical sense, it is possible to approximate it with bits of string sufficiently well to allow the results of calculations to be useful. Idealisations are never satisfied in the real world, but sometimes it doesn't much matter.

What it boils down to then is this: you postulate a model, and you simultaneously postulate that if the input data and model parameters are varied just a little bit it won't make a great deal of difference to the output of the model. If you are right about this last assumption, you have a classical model, while if not you have a probabilistic model and have to be less ambitious and settle for knowing only the relative probabilities of the outcomes.

This is complicated by the platonic, metaphysical component of probabilistic modelling: you may be given a set of conditions under which you are told it is safe to use the model, but the conditions are not such as can be confirmed by any operational procedure. For example, you are allowed to make inferences about populations from samples providing your sample is `random', but there is no procedure for ensuring that a given selection scheme is random. Or you are assured that using a normal or gaussian distribution is warranted when the process generating the data has a `target' value and is influenced by a very large number of independent factors, all of which may add a small disturbing value and satisfy some conditions. But you are not told, for a particular data set, how to set about investigating these factors and conditions. This situation, in which models are `justified' provided they satisfy certain conditions, but there is no operational way of deciding whether the conditions are satisfied or not, is common. The extent to which statistical methodology is a matter of training people to tell the same story, as in theological college, as opposed to giving the only possible story, as in science, is far from clear, but the assurances one gets from working probabilists are not particularly comforting.
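
The kind of justification being appealed to can be sketched with a simulation, using numbers I have invented: each measurement is a target value plus the sum of many small independent disturbances, and the simulated measurements come out distributed very much like a gaussian about the target. The sketch shows why the story is plausible; it does not tell you how to establish, for a real data set, that such factors exist or satisfy the conditions.

  import random

  # Invented numbers: a `target' value disturbed by many small independent
  # additive factors, as in the central limit theorem argument.
  TARGET = 10.0
  N_FACTORS = 200          # number of small independent disturbing factors
  N_SAMPLES = 20000

  def one_measurement():
      disturbance = sum(random.uniform(-0.05, 0.05) for _ in range(N_FACTORS))
      return TARGET + disturbance

  samples = [one_measurement() for _ in range(N_SAMPLES)]
  mean = sum(samples) / N_SAMPLES
  variance = sum((s - mean) ** 2 for s in samples) / N_SAMPLES
  print(mean, variance)    # close to 10.0 and 200 * 0.1**2 / 12, about 0.167

The histogram of such samples will look gaussian whatever the distribution of the individual small disturbances, provided there are enough of them and they are independent, which is precisely the condition one cannot operationally verify for a real data set.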


Mike Alder
9/19/1997