Next: Bayesian Statistics Up: Bayesian Methods Previous: Bayesian Methods

Bayes' Theorem

Suppose we are given the codomain of a discrete random variable, that is to say, the set of k possible outcomes. Each such outcome will be referred to as an atomic event. I shall use the labels 1 to k to specify each of the outcomes.

If we imagine that the rv is used to generate data sequentially, as with succeeding throws of a coin, then for a sequence of n such repetitions we obtain a space of $k^n$ distinct outcomes. The fact that these are repetitions of the same rv means that each of the n repetitions is a separate event, in no way contingent upon the results of earlier or later repetitions; the same histogram should be applied to each of them separately when it comes to computing probabilities. This defines a particular kind of new rv, which goes from the cartesian product of n copies of the domain of the basic rv to the cartesian product of n copies of the codomain, by the obvious product map. It follows that the probability of an atomic event in the product rv is the product of the probabilities of its components.
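This product rule can be illustrated with a minimal sketch in Python (not from the text; the fair coin and the sequence HHT are invented for illustration):

```python
# Sketch: the probability of an atomic event in the product rv is the
# product of the probabilities of its components. Fair coin assumed.
from functools import reduce

p = {"H": 0.5, "T": 0.5}  # histogram for the basic rv

def sequence_probability(outcomes, hist):
    """Probability of a sequence of independent repetitions:
    apply the same histogram to each component and multiply."""
    return reduce(lambda acc, o: acc * hist[o], outcomes, 1.0)

print(sequence_probability(["H", "H", "T"], p))  # 0.125
```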

There may be a different rv on the same domain and into the same codomain(kn) in the case where what happens at the jth stage in the sequence depends upon what happens at earlier (or later) stages in the sequence. In such a case the probability of an event in the codomain will not usually be expressible as such a product. In order to deal with such cases, and for other situations also, the idea of conditional probability is useful.

For any rv, an event will be used to mean a non-empty element of the power set of the space of atomic events, i.e. a non-empty subset of the codomain of the rv. For example, if the rv is the obvious model for the throwing of a six sided cubic die, the atomic events can be labelled by the integers 1 to 6, and the set of such points $ \{1,3,5 \}$ might be described as the event where the outcome is an odd number. Then events inherit probabilities from the atomic events comprising them: we simply take the measure of the inverse image of the set by the map which is the rv. Alternatively, we take the probability of each of the atomic events in the subset, and add up the numbers.
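The inheritance of probabilities by events can be sketched directly, assuming the fair six-sided die of the example (the code is illustrative, not from the text):

```python
# Sketch: an event's probability is the sum of the probabilities of
# the atomic events comprising it. Fair six-sided die assumed.
hist = {i: 1 / 6 for i in range(1, 7)}  # atomic event probabilities

def p(event, hist):
    """Probability of an event: add up the atomic probabilities."""
    return sum(hist[a] for a in event)

odd = {1, 3, 5}  # the event where the outcome is an odd number
print(p(odd, hist))  # ~0.5
```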

Suppose A and B are two events for a discrete random variable. Then p(A) makes sense, and p(B) makes sense, and so do $p(A \cup B)$ and $p(A \cap B)$. It is clear that the formula

\begin{displaymath}
p(A \cup B) = p(A) + p(B) - p(A \cap B) \end{displaymath}

simply follows from the fact that p is a measure, i.e. behaves like an area or volume. In addition, I can make up the usual definitions:

Definition: A and B are independent iff $p(A \cap B) = p(A)p(B) $

It might not be altogether trivial to decide if two events which are observed are produced by a random variable for which they are independent. Usually one says to oneself `I can't think of any reason why these events might be causally related, so I shall assume they aren't.' As for example when A is the event that a coin falls Heads up and B is the event that it is Tuesday. It seems a hit or miss process, and sometimes it misses.
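The definition can at least be checked numerically when the rv is known. A small sketch, assuming a fair die (the particular events A and B are invented for illustration):

```python
# Sketch: checking independence on a fair six-sided die.
# A = even outcome, B = {1, 2}; then A ∩ B = {2}.
hist = {i: 1 / 6 for i in range(1, 7)}
p = lambda event: sum(hist[a] for a in event)

A = {2, 4, 6}
B = {1, 2}
# p(A ∩ B) = 1/6 and p(A)p(B) = (1/2)(1/3) = 1/6, so A and B
# are independent by the definition above.
print(abs(p(A & B) - p(A) * p(B)) < 1e-12)  # True
```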

Definition: If A and B are any two events for a discrete rv, the conditional probability of A given B, p(A|B), is defined when $p(B) \neq 0$ by

\begin{displaymath}
p(A\vert B) = \frac{p(A \cap B)}{p(B)} \end{displaymath}

There is a clear enough intuitive meaning to this: Given that event B has occurred, we really have to look at the measure of those points in the domain of the rv which can give rise to an outcome in the event A from among those which can give rise to the event B. In short we have what a category theorist would call a subrandom variable, were category theorists to be let loose in probability theory. A straightforward example would be when B is the event that a die is thrown and results in a value less than four, and A is the event that the die shows an odd number. Then p(A|B) is the probability of getting one or three divided by the probability of getting one or two or three, in other words it is $\frac{2}{3}$. We can also calculate p(B|A), which is the probability of getting one or three divided by the probability of getting one or three or five, which is also $\frac{2}{3}$. In general the two probabilities, p(A|B) and p(B|A), will be different. In fact we obviously have the simple result:
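The die example above can be reproduced in a few lines of Python (a sketch, not from the text):

```python
# Sketch of the die example: B = outcome less than four, A = odd outcome.
hist = {i: 1 / 6 for i in range(1, 7)}  # fair six-sided die
p = lambda event: sum(hist[a] for a in event)

def cond(A, B):
    """p(A|B) = p(A ∩ B) / p(B), defined when p(B) != 0."""
    return p(A & B) / p(B)

A, B = {1, 3, 5}, {1, 2, 3}
print(cond(A, B))  # ~2/3: p({1,3}) divided by p({1,2,3})
print(cond(B, A))  # ~2/3: p({1,3}) divided by p({1,3,5})
```

That the two answers agree here is an accident of the example; in general p(A|B) and p(B|A) differ.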

Bayes' Theorem $\ \ \ p(B\vert A) = \frac{p(A\vert B) p(B)}{p(A)} $

This follows immediately from the rearrangements:

\begin{displaymath}
p(A\vert B) p(B) = p(A \cap B) = p(B\vert A) p(A) \end{displaymath}

It would be a mistake to imagine that Bayes' theorem is profound; it is simply linguistic. For some reason, Bayes got his name on a somewhat distinctive approach to deciding, inter alia, which model is responsible for a given datum. For now let us suppose that this extends to the case of continua, and we have the situation of Fig. 1.10 of chapter one. Here, two models, $g_m$ and $g_f$ perhaps, are candidates for having been responsible for the point x. We can calculate without difficulty the numbers $g_m(x)$ and $g_f(x)$. We decide to regard these two numbers as something like the conditional probability of having got x given model m, and the conditional probability of having got x given model f, respectively. We may write this, with extreme sloppiness, as p(x|m) and p(x|f) respectively. These would be interpreted by the sanguine as measuring the probability that the datum x will be observed given that model m (respectively, f) is true. Now what we want is p(m|x) and p(f|x) respectively, the probabilities that the models are true given the observation x. By applying Bayes' Theorem in a spirit of untrammelled optimism, we deduce that

\begin{displaymath}
p(m\vert x) = \frac{p(x\vert m) p(m)}{p(x)} \end{displaymath}

with a similar expression for p(f|x).
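The whole calculation fits in a few lines, hedged heavily: the Gaussian models standing in for $g_m$ and $g_f$, the point x, and the priors below are all invented for illustration:

```python
# Hedged sketch: posteriors for two candidate models m and f given a
# datum x, via Bayes' theorem. All parameters here are illustrative.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """One-dimensional gaussian density."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

x = 1.0
lik_m = gaussian(x, 0.0, 1.0)  # p(x|m), read off the density g_m
lik_f = gaussian(x, 3.0, 1.0)  # p(x|f), read off the density g_f
prior_m, prior_f = 0.5, 0.5    # assumed prior probabilities

evidence = lik_m * prior_m + lik_f * prior_f  # p(x)
post_m = lik_m * prior_m / evidence           # p(m|x)
post_f = lik_f * prior_f / evidence           # p(f|x)
print(post_m, post_f)  # the two posteriors sum to 1
```

Note that p(x) is computed here by summing over both models, which is exactly the number the text says is "a little hard to assign" in general.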

Suppose we are in the situation of the first problem at the end of chapter one; you will doubtless recall the half naked ladies and possibly the trees from which they were to be distinguished. Now it could be argued that in the absence of any data from an actual image, we would rate the probability of the image being of a tree as 0.9 and of it being a naked lady as 0.1, on the grounds that these are the ratios of the numbers of images. Bayesians refer to these as the prior probabilities of the events. So in the above formulae we could put p(m) and p(f) in as numbers if we had some similar sort of information about the prior probabilities of the two models. This leaves only p(x) as a number which it is a little hard to assign. Happily, it occurs in both expressions, and if we take the ratio of the two, it cancels out. So we have:

\begin{displaymath}
\frac{p(m\vert x)}{p(f\vert x)} = \frac{p(x\vert m)p(m)}{p(x\vert f)p(f)} \end{displaymath}

and the right hand side, known as the likelihood ratio, is computable. Well, sort of.

If you are prepared to buy this, you have a justification for the rule of thumb of always choosing the bigger value, at least in the case where p(m) and p(f) are equal. In this case, p(m|x) is proportional to p(x|m) and p(f|x) is proportional to p(x|f), with the same constant of proportionality, so choosing whichever model gives the bigger answer is the Bayes Optimal Solution. More generally than the crude rule of thumb: if, in a state of ignorance of the actual location of the datum, it is ten times as likely to be m as f which is responsible for it, then we can build in a bias of ten to one in favour of m by demanding that the ratio of the likelihoods be greater than ten to one before we opt for f as the more likely solution.
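The biased rule reduces to comparing prior-weighted likelihoods, as this sketch shows (the prior odds of ten to one come from the text; the likelihood values are invented):

```python
# Hedged sketch of the biased decision rule: with prior odds of ten
# to one in favour of m, we opt for f only when the likelihood ratio
# p(x|f)/p(x|m) exceeds ten. Likelihood values are illustrative.
def choose(lik_m, lik_f, prior_m=10.0, prior_f=1.0):
    """Pick the model with the larger posterior, i.e. compare
    p(x|m)p(m) against p(x|f)p(f); p(x) cancels from both sides."""
    return "m" if lik_m * prior_m >= lik_f * prior_f else "f"

print(choose(0.2, 1.5))  # m: likelihood ratio 7.5 < 10, the prior wins
print(choose(0.1, 1.5))  # f: likelihood ratio 15 > 10
```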

The blood runs cold at the thought of what must be done to make the above argument rigorous; the step from probabilities, as used in the formulae, to likelihoods (since things like p(x|m) interpreted as $g_m(x)$ are distinctly shonky) requires some thought, but reasonable arguments about pdfs being really just the limits of histograms can carry you through this. What is more disturbing is the question of what the random variable is that has events which are made up out of points in $\mathbb{R}^n$ together with gaussian (or other) models for the distribution of those points. Arguments can be produced. A Bayesian is one for whom these (largely philosophical) arguments are convincing. The decision arrived at by this method is known as the Bayes Optimal decision. Sticking the word `optimal' in usually deceives the gullible into the conviction that no critical thought is necessary; it is the best possible decision according to Bayes, who was a clergyman and possibly divinely inspired, so to quarrel with the Bayes optimal decision would be chancing divine retribution as well as the certainty of being sneered at by an influential school of Statisticians. Rather like the calculation of `99% confidence intervals', which suggests to the credulous that you can be 99% confident that the results will always be inside the intervals. This is true only if you are 100% confident in the model you are using. And if you are, you shouldn't be.

Using power words as incantations is an infallible sign of ignorance, superstition and credulity on the part of the user. It should be left to politicians and other con-men.


Mike Alder
9/19/1997