The histograms over the space of outcomes are the same.
Statisticians sometimes say that the rv, or the histogram or pdf, is a model for the actual data. In using this terminology they are appealing to the classical experience of mathematical models such as Newtonian Dynamics. There are some differences: in a classical mathematical model the crucial symbols have an interpretation in terms of measurables, and there are well-defined operations, such as weighing, which make the interpretation precise. In the case of probabilistic models, we can measure values of a physical variable, but the underlying mechanism of production is permanently hidden from us. Statistical or probabilistic models are not actually much like classical mathematical models at all, and a certain kind of confidence trick is being perpetrated by using the same terminology. Let us look at the differences.
Newton modelled the solar system as a collection of point masses called planets revolving around another point mass called the sun, which was fixed in space. The reduction of a thing as big as a planet to a single point had to be carefully proved to be defensible, because we live on one of them and have prejudices about the matter. The Earth doesn't look much like a point to us. It can be shown that the considerable simplification will have no effect on the motion provided the planets are small enough not to bump into each other, and are spheres having a density function which is radial. Now this is not in fact true. Planets and suns are oblate, and the Earth is not precisely symmetric. But the complications this makes are not large, so we usually forget about them and get on with predicting where the planets will be at some time in the future. The results are incredibly good, good enough to send spacecraft roaring off to their destiny a thousand million miles away and have them arrive on schedule at the right location. Anybody who doesn't thrill to this demonstration of the power of the human mind has rocks in his head. Note that the model is not believed to be true: it is a symbolic description of part of the universe, and it has had simplifications introduced. Also, we don't have infinite precision in our knowledge of the initial state. So if the spacecraft is out by a few metres when it gets to Saturn, we don't feel too bad about it. Anybody who could throw darts with that precision could put one up a gnat's bottom from ten kilometres.
If we run any mathematical model (these days often a computer simulation) we can look at the measurements the model predicts. Then we go off and make those measurements, and look to see if the numbers agree. If they do, we congratulate ourselves: we have a good model. If they disagree but only by very small amounts that we cannot predict but can assign to sloppy measurement, we feel moderately pleased with ourselves. If they differ in significant ways, we go into deep depression and brood about the discrepancies until we improve the model. We don't have to believe that the model is true in order to use it, but it has to agree with what we measure when we measure it, or the thing is useless. Well, it might give us spiritual solace or an aesthetic buzz, but it doesn't actually do what models are supposed to do.
Probabilistic models do not generate numbers as a rule, and when they do they are usually the wrong ones. That is to say, the average behaviour may be the same as our data set, but the actual values are unlikely to be the same; even the model will predict that. Probabilistic models are models, not of the measured values, but of their distribution and density. It follows that if we have very few data, it is difficult to reject any model on the grounds that it doesn't fit the data. Indeed, the term `data' is misleading. There are two levels of `data': the first is the set of points in $\mathbb{R}^n$, and the second is the distribution and density of this set, which may be described by a probabilistic model. Since we get the second from the first by counting, and counting occurrences in little cells to get a histogram as often as not, if we have too few to make a respectable histogram we can't really be said to have any data at the level where the model is trying to do its job. And if you don't have a measurement, how can you test a theory? This view of things hasn't stopped people cheerfully doing hypothesis testing with the calm sense of moral superiority of the man who has passed all his exams without thinking and doesn't propose to start now.
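To make the two levels concrete, here is a minimal sketch (the numbers are my own, purely for illustration): counting a handful of points into little cells gives a histogram, but with so few data the histogram is far too ragged to test any model against.

\begin{verbatim}
# A minimal sketch (assumed numbers): the two levels of `data'.  The first
# level is the raw set of points; the second is the histogram obtained by
# counting occurrences in little cells.
few_points = [2.3, 2.9, 3.1, 4.2]                         # first level: raw data
cells = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]  # the little cells

histogram = [sum(1 for x in few_points if lo <= x < hi) for lo, hi in cells]
print(histogram)   # second level: counts per cell, here [0, 2, 1, 1]
\end{verbatim}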
Deciding whether a particular data set is plausibly accounted for by a particular probabilistic model is therefore not a trivial matter, and there are, as you will see later, several ways of justifying models. At this point, there is an element of subjectivity which makes the purist and the thoughtful person uneasy. Ultimately, any human decision or choice may have to be subjective; the choice of whether to use one method or another comes easily to the engineer, who sees life as being full of such decisions. But there is still an ultimate validation of his choice: if his boiler blows up or his bridge falls down, he goofed. If his program hangs on some kinds of input, he stuffed up. But the decision as to whether or not the probabilistic advice you got was sound or bad is not easily taken, and if you have to ask the probabilist how to measure his success, you should do it before you take his advice, not after.
A probabilistic model then does not generate data in the same way that a causal model does; what it can do, for any given measurement it claims to model, is to produce a number saying how likely such a measurement is. In the case of a discrete model, with finitely many outcomes, say, the number is called the probability of the observation. In the case of a continuous pdf it is a little more complicated. The continuous pdf is, recall, a limit of histograms, and the probability of getting any specified value is zero. If we want to know the probability of getting values of the continuous variable within some prescribed interval, as when we want to know the probability of getting a dart within two centimetres of the target's centre, we have to take the limit of adding up the appropriate rectangular areas: in other words we have to integrate the pdf over the interval. For any outcome in the continuum, the pdf takes, however, some real value. I shall call this value the likelihood of the outcome or event, according to the model defined by the pdf.
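To make the integration concrete, here is a minimal sketch; the choice of a Gaussian pdf and a spread of three centimetres are assumptions of mine, not taken from the text. It approximates the probability of the dart landing within two centimetres of the centre, along one axis, by summing thin rectangles under the pdf, and compares the result with the closed form in terms of the error function.

\begin{verbatim}
import math

SIGMA = 3.0  # cm; assumed spread of the thrower's error along one axis

def gaussian_pdf(x, sigma=SIGMA):
    """Zero-mean Gaussian probability density."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def prob_in_interval(a, b, n=10000, sigma=SIGMA):
    """Approximate P(a <= X <= b) by summing thin rectangles under the pdf."""
    width = (b - a) / n
    return sum(gaussian_pdf(a + (i + 0.5) * width, sigma) * width for i in range(n))

# Probability of a miss within two centimetres of the centre, two ways.
approx = prob_in_interval(-2.0, 2.0)
exact = math.erf(2.0 / (SIGMA * math.sqrt(2)))
print(f"rectangle sum: {approx:.6f}   closed form: {exact:.6f}")
\end{verbatim}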
If we have two data, then each may be assessed by the model and two probabilities or likelihoods output (depending on whether the model is discrete or continuous); multiplying these numbers together gives the probability or likelihood of getting the pair on independent runs of the model. It is important to distinguish between a model of events in a space of observables applied twice, and a model where the observables are pairs. The probability of a pair will not in general be the product of the probabilities of the separate events. When it is, we say the events are independent.
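A small sketch may help here; the coin probabilities are assumed for illustration. Applying a model of single tosses twice, under independence, gives the pair (heads, heads) the product of the individual probabilities, whereas a model whose observables are pairs, such as strict alternation, can give each face probability one half and still make that pair impossible.

\begin{verbatim}
p_heads = 0.5                      # assumed probability of heads on one toss

# Independence model applied twice: P(H, H) is the product of the marginals.
p_hh_independent = p_heads * p_heads                       # 0.25

# A model whose observables are pairs: strict alternation.  Each marginal
# still gives heads probability one half, yet the pair (H, H) is impossible.
alternation = {("H", "T"): 0.5, ("T", "H"): 0.5, ("H", "H"): 0.0, ("T", "T"): 0.0}

print(p_hh_independent, alternation[("H", "H")])           # 0.25 versus 0.0
\end{verbatim}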
For example, I might assert that any toss of a coin is an atomic phenomenon, in which case I am asserting that the probability of any one toss producing heads is the same as any other. I am telling you how to judge my model: if you found a strict alternation of heads and tails in a sequence of tosses, you might reasonably have some doubts about this model. Conversely, if I were to model the production of a sequence of letters of the alphabet by asserting that there is some probability of getting a letter `u' which depends upon what has occurred in the preceding two letters, the analysis of the sequence of letters and the inferences which might be drawn from some data as to the plausibility of the model would be a lot more complicated than the model where each letter is produced independently, as though a die were being thrown to generate each letter. Note that this way of looking at things supposes that probabilities are things that get assigned to events, things that happen, by a model. Some have taken the view that a model is a collection of sentences about the world, each of which may often contain a number or numbers between 0 and 1; others, notably John Maynard Keynes, have taken the view that the sentence doesn't contain a number, but its truth value lies between 0 and 1. The scope for muddle when trying to be lucid about the semantics of a subject is enormous.
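The two kinds of letter model can be sketched as follows; the toy corpus is my own and merely stands in for real text. The independence model gives `u' its overall relative frequency, while the conditional model estimates the probability of `u' given the preceding two letters by simple counting.

\begin{verbatim}
from collections import Counter, defaultdict

text = "the queen quietly queued for a unique quince"   # toy corpus (assumed)
stream = "".join(c for c in text if c.isalpha())

# Independence model: P(u) is simply the overall relative frequency of `u'.
p_u = Counter(stream)["u"] / len(stream)

# Conditional model: P(u | previous two letters), estimated by counting.
counts = defaultdict(Counter)
for i in range(2, len(stream)):
    counts[stream[i - 2:i]][stream[i]] += 1

ctx = "eq"   # a context after which `u' always appears in this corpus
p_u_given_ctx = counts[ctx]["u"] / sum(counts[ctx].values())

print(f"P(u) = {p_u:.2f},  P(u | {ctx!r}) = {p_u_given_ctx:.2f}")
\end{verbatim}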
Another difference between probabilistic models and causal models is the initial conditions. Most models are of the form: if condition A is observed or imposed, then state B will be observed. Here condition A and state B are specified by a set of measurements, i.e. by the values of vectors obtained by prescribed methods of measurement. Now it is not at all uncommon for probabilistic models to assume that a system is `random'. In the poker calculation, for example, all bets are off if the cards were dealt from a pack which had all the hearts at the top, for my having three hearts means that so do two of the other players and the last has four. So the probability of a flush being filled is zero. Now if you watched the dealer shuffle the cards, you may believe that such a contingency is unlikely, but it was the observation of shuffling that induced you to feel that way. If you'd seen the dealer carefully arranging all the hearts first, then the spades and clubs in a disorganised mess, and then the diamonds at the end, you might have complained. Why? Because he would have invalidated your model of the game. Now you probably have some loose notion of when a pack of cards has been randomised, but would you care to specify this in such a way that a robot could decide whether or not to use the model? If you can't, the inherent subjectivity of the process is grounds for being extremely unhappy, particularly to those of us in the automation business.
The terminology I have used, far from uncommon, rather suggests that some orders of cards in a pack are `random' while others are not, and shuffling is a procedure for obtaining one of the random orders. There are people who really believe this. Gregory Chaitin is perhaps the best known, but Solomonoff and Kolmogorov also take seriously the idea that some orders are random and others are not. The catch is that they define `random' for sequences of cards as, in effect, `impossible to describe briefly, or more briefly than by giving a list of all the cards in order'. This makes the sequence of digits consisting of the decimal expansion of $\pi$ from the ten thousandth place after the decimal point to the forty thousandth place very much non-random. But if you got them printed out on a piece of paper it would not be very practical to see the pattern. They would almost certainly pass all the standard tests that statisticians use, for what that is worth.
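The descriptive-complexity idea can be illustrated crudely, though this is my own illustration rather than Chaitin's definition, and an off-the-shelf compressor is only a stand-in for `length of the shortest description': a highly patterned sequence compresses to almost nothing, while a typical random-looking one hardly compresses at all.

\begin{verbatim}
import random
import zlib

patterned = ("0123456789" * 1000).encode()                   # briefly describable
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(10000))   # random-looking

for name, data in [("patterned", patterned), ("noisy", noisy)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: compressed to {ratio:.2%} of original size")
\end{verbatim}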
There was a book published by the Rand Corporation once which was called `A Million Random Digits', leading the thoughtful person to enquire whether they were truly random or only appeared to be. How can you tell? The idea that some orders are more random than others is distinctly peculiar, and yet the authors of `A Million Random Digits' had no hesitation in rejecting some of the sequences on the grounds that they failed tests of randomness. Would the observation that they can't be random because my copy of the book has exactly the same digits as yours, allowing me to predict the contents of yours with complete accuracy, be regarded as reasonable? Probably not, but what if an allegedly random set of points in the plane turned out to be a star map of some region of the night sky? All these matters make rather problematic the business of deciding if something is random or not. Nor does the matter of deciding whether something is realio-trulio random allow of testing by applying a fixed number of procedures. And yet randomness appears to be a necessary `condition A' in lots of probabilistic models, from making decisions in a card game to sampling theory applied to psephologists second-guessing the electorate.
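For what it is worth, the sort of `standard test' alluded to above might look like the following sketch, which is my own illustration rather than anything the Rand authors published: a chi-square test of whether the ten digits occur equally often.

\begin{verbatim}
import random

random.seed(1)
digits = [random.randrange(10) for _ in range(100000)]   # stand-in sequence

expected = len(digits) / 10
counts = [digits.count(d) for d in range(10)]
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# The 5% critical value of chi-square with 9 degrees of freedom is about 16.92.
verdict = "consistent with" if chi2 < 16.92 else "rejects"
print(f"chi-square = {chi2:.2f}; {verdict} equal digit frequencies")
\end{verbatim}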
These points have been made by Rissanen, but should trouble anybody who has been obliged to use probabilistic methods. Working probabilists and statisticians can usually give good value for money, and make sensible judgements in these cases. Hiring one is relatively safe and quite cheap. But if one were to contemplate automating one, in even a limited domain, these issues arise.
The notion of repeatability of an experiment is crucial to classical, causal models of the world, as it is to probabilistic models. There is a problem with both, which is: how do you know that all the other things which have changed in the interval between your so-called replications are indeed irrelevant? You never step into the same river twice, and indeed these days most people don't step into rivers at all, much preferring to drive over them, but you never throw the same coin twice either; it got bashed when it hit the ground the first time; also, last time was Tuesday and the planet Venus was in Caries, the sign of the dentist, and how do you know it doesn't matter? If I collect some statistics on the result of some measurements of coin tossing, common sense suggests that if the coin was run over by a train half way through the series and severely mangled, then this is enough to be disinclined to regard the series before and after as referring to the same thing. Conversely, my common sense assures me that if another series of measurements on a different coin was made, and the first half were done on Wednesdays in Lent and the last half on Friday the thirteenth, then this is not grounds for discounting half the data as measuring the wrong thing. But there are plenty of people who would earnestly and sincerely assure me that my common sense is in error. Of course, they probably vote Green and believe in Fairies at the bottom of the garden, but the dependence on subjectivity is disconcerting. In assigning some meaning to the term `probability of an event', we have to have a clear notion of the event being, at least in principle, repeatable with (so far as we can determine) the same initial conditions. But this notion is again hopelessly metaphysical. It entails at the very least appeal to a principle asserting the irrelevance of just about everything, since a great deal has changed by the time we come to replicate any experiment. If I throw a coin twice, and use the data to test a probabilistic model for coins, then I am asserting, as a necessary part of the argument, that hitting the ground the first time didn't change the model for the coin, and that the fact that the moons of Jupiter are now in a different position is irrelevant. These are propositions most of us are much inclined to accept without serious dispute, but they arise when we are trying to automate the business of applying probabilistic ideas.

If we try to apply the conventional notions to the case of a horse race, for instance, we run into conceptual difficulties: the race has never been run before, and will never be run again. In what sense then can we assign a probability to a horse winning? People do in fact do this, or they say they do and behave as if they do, to some extent at least, so what is going on here? The problem is not restricted to horse races, and indeed applies to every alleged replication. Anyone who would like to build a robot which could read in the results of all the horse races in history, inspect each horse in a race, examine the course carefully, and, taking into account whatever data was relevant, produce estimates of the probabilities of each horse winning, will see the nature of the difficulties. There is no universally accepted procedure which could do this for the much simpler case of coin tossing.
So the question of what is an allowable measurement gives us a problem, since specifying what is irrelevant looks to be a time-consuming job. If the cards are marked, all bets are off on the matter of whether it's better to draw two to fill the flush or one for the straight. And how do you list all the other considerations which would make you believe that the argument given above was rendered inapposite? The classical case is much simpler because we can tell whether or not something is being measured properly if we have been trained in the art of recognising a proper measurement, and we can build machines which can implement the measurements with relatively little difficulty. But how do we get a machine to tell if the other guys are cheating in a game of poker? How does it decide if the shuffling process has produced enough randomness? And when does it give up on the model because the data is not in accord with it? Subjectivity may be something engineers are used to, but they haven't had to build it into robots before. But whenever you test a model you have to ensure that the input conditions are satisfied and the measuring process properly conducted.
A fundamental assumption in classical systems, seldom stated, is the stability of the model under variations in the data and the model parameters. If you cast your mind back to the curious discussions of weights joined by zero-thickness, weightless threads, hanging over pulleys, which took place in applied mathematics classes in olden times, you may recall having been troubled by the thought that if the threads were of zero width then the pressure underneath them must have been infinite and they would have cut through the pulley like a knife through butter. And they would have to be infinitely strong not to snap, because the force per unit cross-sectional area would also have been infinite. What was being glossed over, as you eventually came to realise, was that although the idealised model made no physical sense, it is possible to approximate it with bits of string sufficiently well to allow the results of calculations to be useful. Idealisations are never satisfied in the real world, but sometimes it doesn't much matter.
What it boils down to then is this: you postulate a model and you simultaneously postulate that if the input data and model parameters are varied just a little bit it won't make any great difference to the output of the model. If you are right about this last assumption, you have a classical model, while if not you have a probabilistic model and have to be less ambitious and settle for knowing only the relative probabilities of the outcomes.
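The stability postulate can be illustrated with a sketch that is entirely my own; the projectile formula and the toy `coin' are assumptions chosen for the purpose. Perturbing the inputs of the classical model moves the output only slightly, while a tiny change in the toy coin's spin flips the outcome, so that only the frequency of heads is predictable.

\begin{verbatim}
import math
import random

def projectile_range(v, angle_deg, g=9.81):
    """Classical model: range of a projectile, smooth in its inputs."""
    a = math.radians(angle_deg)
    return v * v * math.sin(2 * a) / g

# A small perturbation of the inputs barely changes the output.
print(projectile_range(20.0, 45.0), projectile_range(20.001, 45.01))

def coin_from_spin(spin):
    """Toy coin: the outcome depends on a fine detail of the spin (the parity
    of the integer part of spin * 1000), so a tiny change flips it."""
    return "H" if int(spin * 1000) % 2 == 0 else "T"

random.seed(2)
spins = [10.0 + random.uniform(-0.01, 0.01) for _ in range(10000)]
outcomes = [coin_from_spin(s) for s in spins]
# Individual outcomes are unpredictable; the frequency of heads is stable.
print(outcomes[:8], outcomes.count("H") / len(outcomes))
\end{verbatim}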
This is complicated by the platonic, metaphysical component of probabilistic modelling: you may be given a set of conditions under which you are told it is safe to use the model, but the conditions are not such as can be confirmed by any operational procedure. For example, you are allowed to make inferences about populations from samples providing your sample is `random', but there is no procedure for ensuring that a given selection scheme is random. Or you are assured that using a normal or gaussian distribution is warranted when the process generating the data has a `target' value and is influenced by a very large number of independent factors, all of which may add a small disturbing value and satisfy some conditions. But you are not told, for a particular data set, how to set about investigating these factors and conditions. This situation, in which models are `justified' provided they satisfy certain conditions, but there is no operational way of deciding whether the conditions are satisfied or not, is common. The extent to which statistical methodology is a matter of training people to tell the same story, as in theological college, as opposed to giving the only possible story, as in science, is far from clear, but the assurances one gets from working probabilists are not particularly comforting.
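The informal justification for the gaussian can at least be simulated, under assumptions of my own about the number and size of the disturbing factors: adding many small independent disturbances to a target value produces measurements whose histogram looks roughly bell-shaped. Of course, nothing in such a simulation tells you whether the factors at work in a particular real data set behave this way.

\begin{verbatim}
import random

random.seed(3)
TARGET = 100.0      # the `target' value (assumed)
N_FACTORS = 50      # number of small independent disturbances (assumed)

def measurement():
    return TARGET + sum(random.uniform(-0.5, 0.5) for _ in range(N_FACTORS))

data = [measurement() for _ in range(20000)]

# A crude text histogram of the measurements.
lo, hi, bins = 94.0, 106.0, 24
counts = [0] * bins
for x in data:
    if lo <= x < hi:
        counts[int((x - lo) / (hi - lo) * bins)] += 1
for i, c in enumerate(counts):
    print(f"{lo + i * (hi - lo) / bins:6.1f} {'#' * (c // 100)}")
\end{verbatim}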