If we imagine that the rv is used to generate data sequentially, as with succeeding throws of a coin, then for a sequence of n such repetitions we obtain a space of $k^n$ distinct outcomes. The fact that these are repetitions of the same rv means that each of the n repetitions is a separate event, in no way contingent upon the results of earlier or later repetitions, and the same histogram should be applied to each of them separately when it comes to computing probabilities. This defines a particular kind of new rv, which goes from the cartesian product of n copies of the domain of the basic rv to the cartesian product of n copies of the codomain, by the obvious product map. It follows that the probability of an atomic event in the product rv is the product of the probabilities of the components.
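A minimal sketch of the product rv in code (the biased-coin histogram below is invented purely for illustration):

```python
from functools import reduce
from itertools import product

# Histogram (probability of each atomic outcome) for a single throw of a
# hypothetical biased coin; the numbers here are purely illustrative.
coin = {"H": 0.6, "T": 0.4}

def sequence_probability(histogram, outcomes):
    """Probability of a particular sequence of independent repetitions:
    the product of the probabilities of the individual outcomes."""
    return reduce(lambda acc, o: acc * histogram[o], outcomes, 1.0)

# An atomic event of the product rv on three repetitions, e.g. (H, T, H):
print(sequence_probability(coin, ["H", "T", "H"]))  # 0.6 * 0.4 * 0.6 = 0.144

# The product space has k^n atomic outcomes, here 2^3 = 8, and their
# probabilities sum to 1, as a quick check confirms.
total = sum(sequence_probability(coin, seq) for seq in product(coin, repeat=3))
print(total)  # 1.0 (up to floating point rounding)
```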
There may be a different rv on the same domain and into the same codomain (of $k^n$ atomic outcomes) in the case where what happens at the jth stage in the sequence depends upon what happens at earlier (or later) stages in the sequence. In such a case the probability of an event in the codomain will not usually be expressible as such a product. In order to deal with such cases, and for other situations also, the idea of conditional probability is useful.
For any rv, an event will be used to mean a non-empty element of the power set of the space of atomic events, i.e. a non-empty subset of the codomain of the rv. For example, if the rv is the obvious model for the throwing of a six sided cubic die, the atomic events can be labelled by the integers 1 to 6, and the set of points $\{1,3,5\}$ might be described as the event where the outcome is an odd number. Events inherit probabilities from the atomic events comprising them: we simply take the measure of the inverse image of the set by the map which is the rv. Alternatively, we take the probability of each of the atomic events in the subset, and add up the numbers.
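A small sketch of the second recipe; the uniform histogram for a fair die is an assumption of the example, not anything forced by the definition:

```python
from fractions import Fraction

# Histogram of atomic events for a fair six sided die (assumed uniform).
die = {face: Fraction(1, 6) for face in range(1, 7)}

def p(event, histogram=die):
    """Probability of an event: add up the probabilities of the atomic
    events (points of the codomain) which comprise it."""
    return sum(histogram[x] for x in event)

odd = {1, 3, 5}          # the event `the outcome is an odd number'
print(p(odd))            # 1/2
```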
Suppose A and B are two events for a discrete random variable. Then p(A) makes sense, and p(B) makes sense, and so does $p(A \cup B)$ and $p(A \cap B)$. It is clear that the formula

$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$

holds.
Definition: A and B are independent iff $p(A \cap B) = p(A)\,p(B)$.
It might not be altogether trivial to decide if two events which are observed are produced by a random variable for which they are independent. Usually one says to oneself `I can't think of any reason why these events might be causally related, so I shall assume they aren't.' As for example when A is the event that a coin falls Heads up and B is the event that it is Tuesday. It seems a hit or miss process, and sometimes it misses.
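Within a fully specified model such as the die, both the addition formula and the definition of independence can be checked by direct calculation; a small sketch, with the two events chosen purely for illustration:

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

def p(event):
    return sum(die[x] for x in event)

A = {1, 3, 5}   # outcome is odd
B = {1, 2}      # outcome is less than three

# The addition formula p(A u B) = p(A) + p(B) - p(A n B):
assert p(A | B) == p(A) + p(B) - p(A & B)

# Independence: is p(A n B) = p(A) p(B)?
print(p(A & B), p(A) * p(B))   # 1/6 and 1/6, so A and B are independent here
```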
Definition: If A and B are any two events for a discrete rv, the conditional probability of A given B, p(A|B), is defined when $p(B) \neq 0$ by

$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
There is a clear enough intuitive meaning to this: Given that event B has occurred, we really have to look at the measure of those points in the domain of the rv which can give rise to an outcome in the event A from among those which can give rise to the event B. In short we have what a category theorist would call a subrandom variable, were category theorists to be let loose in probability theory. A straightforward example would be when B is the event that a die is thrown and results in a value less than four, and A is the event that the die shows an odd number. Then p(A|B) is the probability of getting one or three divided by the probability of getting one or two or three, in other words it is $\frac{2}{3}$. We can also calculate p(B|A), which is the probability of getting one or three divided by the probability of getting one or three or five, which is also $\frac{2}{3}$. In general the two probabilities, p(A|B) and p(B|A), will be different. In fact we obviously have the simple result:
Bayes' Theorem: $$p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}$$
This follows immediately from the rearrangements:

$$p(A|B)\,p(B) = p(A \cap B) = p(B|A)\,p(A)$$
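Continuing the die example as a check (the fair-die histogram and the two events are those of the worked example above), the two conditional probabilities and the theorem itself can be verified directly:

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

def p(event):
    return sum(die[x] for x in event)

def p_given(a, b):
    """Conditional probability p(A|B) = p(A n B) / p(B), defined when p(B) != 0."""
    return p(a & b) / p(b)

A = {1, 3, 5}      # die shows an odd number
B = {1, 2, 3}      # die shows a value less than four

print(p_given(A, B), p_given(B, A))   # 2/3 and 2/3, equal here by coincidence

# Bayes' Theorem: p(A|B) = p(B|A) p(A) / p(B)
assert p_given(A, B) == p_given(B, A) * p(A) / p(B)
```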
It would be a mistake to imagine that Bayes' theorem is profound; it is simply linguistic. For some reason, Bayes got his name on a somewhat distinctive approach to deciding, inter alia, which model is responsible for a given datum. For now let us suppose that this extends to the case of continua, and we have the situation of Fig. 1.10 of chapter one. Here, two models, $g_m$ and $g_f$ perhaps, are candidates for having been responsible for the point x. We can calculate without difficulty the numbers $g_m(x)$ and $g_f(x)$. We decide to regard these two numbers as something like the conditional probability of having got x given model m and the conditional probability of having got x given model f, respectively. We may write this, with extreme sloppiness, as p(x|m) and p(x|f) respectively. These would be interpreted by the sanguine as measuring the probability that the datum x will be observed given that model m (respectively, f) is true. Now what we want is p(m|x) and p(f|x) respectively, the probabilities that the models are true given the observation x. By applying Bayes' Theorem in a spirit of untrammelled optimism, we deduce that
$$p(m|x) = \frac{p(x|m)\,p(m)}{p(x)} \qquad \text{and} \qquad p(f|x) = \frac{p(x|f)\,p(f)}{p(x)}$$
Suppose we are in the situation of the first problem at the end of chapter one; you will doubtless recall the half naked ladies and possibly the trees from which they were to be distinguished. Now it could be argued that in the absence of any data from an actual image, we would rate the probability of the image being of a tree as 0.9 and of it being a naked lady as 0.1, on the grounds that these are the relative proportions of the two kinds of image. Bayesians refer to these as the prior probabilities of the events. So in the above formulae we could put p(m) and p(f) in as numbers if we had some similar sort of information about the likelihoods of the two models. This leaves only p(x) as a number which it is a little hard to assign. Happily, it occurs in both expressions, and if we look at the likelihood ratio, it cancels out. So we have:
$$\frac{p(m|x)}{p(f|x)} = \frac{p(x|m)\,p(m)}{p(x|f)\,p(f)}$$
If you are prepared to buy this, you have a justification for the rule of thumb of always choosing the bigger value, at least in the case where p(m) and p(f) are equal. In this case, p(m|x) is proportional to p(x|m) and p(f|x) is proportional to p(x|f), with the same constant of proportionality, so choosing whichever model gives the bigger answer is the Bayes Optimal Solution. More generally than the crude rule of thumb, if, in a state of ignorance of the actual location of the datum, it is ten times as likely to be m as f which is responsible for it, then we can use this to build in a bias of ten to one in favour of m, by demanding that the likelihood ratio in favour of f be greater than ten to one before we opt for f as the more likely solution.
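A sketch of the resulting decision rule; the one dimensional gaussian models, their parameters, and the prior values below are invented for illustration (the priors echo the sort of 0.9 to 0.1 information described above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a one dimensional gaussian model at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical models for where the datum came from; the parameters and
# the prior probabilities are made up for this example.
def g_m(x): return gaussian_pdf(x, mu=0.0, sigma=1.0)
def g_f(x): return gaussian_pdf(x, mu=3.0, sigma=1.0)
prior_m, prior_f = 0.9, 0.1

def decide(x):
    """Choose the model with the larger p(x|model) * prior; equivalently,
    opt for f only when the likelihood ratio g_f(x)/g_m(x) exceeds the
    prior ratio p(m)/p(f), here nine to one."""
    return "m" if g_m(x) * prior_m >= g_f(x) * prior_f else "f"

for x in (0.5, 1.5, 2.5, 3.5):
    print(x, decide(x), round(g_f(x) / g_m(x), 3))
```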
The blood runs cold at the thought of what must be done to make the above argument rigorous; the step from probabilities, as used in the formulae, to likelihoods (since things like p(x|m) interpreted as $g_m(x)$ are distinctly shonky) requires some thought, but reasonable arguments about pdfs being really just the limits of histograms can carry you through this. What is more disturbing is the question of what the random variable is, that has events which are made up out of points in $\mathbb{R}^n$ together with gaussian (or other) models for the distribution of those points. Arguments can be produced. A Bayesian is one for whom these (largely philosophical) arguments are convincing. The decision arrived at by this method is known as the Bayes Optimal decision. Sticking the word `optimal' in usually deceives the gullible into the conviction that no critical thought is necessary; it is the best possible decision according to Bayes, who was a clergyman and possibly divinely inspired, so to quarrel with the Bayes optimal decision would be chancing divine retribution as well as the certainty of being sneered at by an influential school of Statisticians. Rather like the calculation of `99% confidence intervals', which suggests to the credulous that you can be 99% confident that the results will always be inside the intervals. This is true only if you are 100% confident in the model you are using. And if you are, you shouldn't be.
Using power words as incantations is an infallible sign of ignorance, superstition and credulity on the part of the user. It should be left to politicians and other con-men.