Let $m$ be the model that asserts $p(H) = m$; then there is a model for each $m \in [0,1]$.
Let $x$ be the observation consisting of a particular sequence of 8 Heads and 2 Tails. Then

$$p(x|m) = m^8(1-m)^2$$

for any such $m$. If we maximise $p(x|m)$ over all values of $m$, we get the maximum likelihood model, which by a bit of easy calculus occurs when

$$8m^7(1-m)^2 - 2m^8(1-m) = 0,$$
i.e. for $m = 0.8$, as claimed in 3.1.5.

Now let us suppose that you have a prejudice about the different values for $m$. What I shall do is to assume that there is a probability distribution over the different values which $m$ can have. If I try the distribution $6m(1-m)$, then I am showing a preference for the value of $m$ at 0.5, where this pdf has a maximum, but it is not a very powerful commitment.
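(A quick numerical aside, not part of the original argument: the little Python sketch below evaluates the likelihood and the prior on a grid, confirming that the likelihood peaks at $m = 0.8$ and that $6m(1-m)$ really is a pdf on $[0,1]$ with its mode at 0.5. The function names and the grid are my own devices.)

```python
import numpy as np

# Likelihood of the particular sequence of 8 heads and 2 tails, as a function of m
def likelihood(m):
    return m**8 * (1 - m)**2

# The mild prior prejudice in favour of m = 0.5
def prior(m):
    return 6 * m * (1 - m)

m = np.linspace(0.0, 1.0, 100_001)                   # a fine grid over [0, 1]
print("ML estimate:", m[np.argmax(likelihood(m))])   # -> 0.8
print("prior mode: ", m[np.argmax(prior(m))])        # -> 0.5
print("prior mass: ", np.mean(prior(m)))             # ~ 1.0 (mean ~ integral on [0, 1])
```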
Now looking at the Bayes formula

$$p(m|x) = \frac{p(x|m)\,p(m)}{p(x)}$$

we obtain immediately that

$$p(m|x) = \frac{6\,m^9(1-m)^3}{p(x)}.$$
Since $p(x)$ is some prior probability of $x$ which, I shall suppose, does not depend upon the model $m$, I can use calculus to obtain the maximum of this expression. A little elementary algebra (set $9m^8(1-m)^3 - 3m^9(1-m)^2 = 0$, i.e. $9(1-m) = 3m$) gives a value of $m = 0.75$. This
is between the maximum likelihood estimate and
the prior prejudice of 0.5. Since my commitment
to 0.5 was not very strong, as indicated by the
pdf I chose, I have gone quite a long way
toward the ML model. Had I used $30m^2(1-m)^2$ as my measure of prejudice, I should have been closer to 0.5.
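Again as a numerical aside of my own, the same grid trick locates the posterior maximum under either prior. Note that the unnormalised product $p(x|m)\,p(m)$ suffices, since dividing by $p(x)$ cannot move the maximum.

```python
import numpy as np

m = np.linspace(0.0, 1.0, 100_001)
likelihood = m**8 * (1 - m)**2

for label, prior in [("6m(1-m)",      6 * m * (1 - m)),
                     ("30m^2(1-m)^2", 30 * m**2 * (1 - m)**2)]:
    posterior = likelihood * prior          # unnormalised: p(x) only rescales it
    print(label, "-> MAP at m =", round(m[np.argmax(posterior)], 4))
# 6m(1-m)      -> MAP at m = 0.75
# 30m^2(1-m)^2 -> MAP at m = 0.7143   (i.e. 10/14, closer to 0.5)
```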
The above argument assumes that $p(x)$ does not depend on $m$. But what is $p(x)$? If $x$ has just been observed, then it ought to be 1; so it doesn't mean that. The most sensible answer is that it is the so-called marginal probability, which is obtained by adding up all the $p(x|m)$s for all the different $m$, weighted by the probability of $m$. Since $m$ ranges over all the values between 0 and 1, we have
$$p(x) = \int_0^1 p(x|m)\,p(m)\,dm = \int_0^1 6\,m^9(1-m)^3\,dm,$$
which means that dividing by $p(x)$ simply amounts to normalising the numerator in the definition of $p(m|x)$ above, thus ensuring that it is a bona fide pdf. So we have not just a most probable value for the estimated parameter $m$ given the observation data $x$; we have a whole pdf for $m$. This pdf is known, not unreasonably, as the posterior pdf, and is written, as above, as $p(m|x)$; it may be compared with $p(m)$, the prior pdf. It is easy to compute it explicitly in the case of the coin tossing experiment, and the reader is urged to do this, as it is soothing, easy, and helps burn the idea into the brain in a relatively painless manner. The maximum of the posterior pdf is in some places in the literature referred to as the MAP estimate, because acronyms are much less trouble to memorise than learning the necessary Latin, and Maximum A Posteriori takes longer to say than `MAP', even though it can be difficult for the untrained ear to hear the capital letters.
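For readers who would like to check their hand computation, here is one way to do it numerically (my own grid-based sketch, not the author's; the exact answer, via the Beta integral, is $p(x) = 6 \cdot 9!\,3!/13! = 3/1430$).

```python
import numpy as np

m = np.linspace(0.0, 1.0, 100_001)
numerator = m**8 * (1 - m)**2 * 6 * m * (1 - m)    # p(x|m) p(m) = 6 m^9 (1-m)^3

p_x = np.mean(numerator)            # mean ~ integral over [0, 1] on a uniform grid
posterior = numerator / p_x         # now a bona fide pdf

print("p(x)              ~", p_x)                      # ~ 3/1430 ~ 0.0021
print("posterior mass    ~", np.mean(posterior))       # ~ 1.0
print("posterior maximum @", m[np.argmax(posterior)])  # 0.75
```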
In the general case where data is coming in sequentially, we can start off with some prior probability distribution, and for each new datum use this method to predict what the datum ought to be, and when we find out what it actually was, we can update the prior. This is clearly a kind of learning, and it is not beyond the range of belief that something similar may occur in brains. We talk, at each stage of new data acquisition, of the a priori and a posteriori pdfs for the data, so we obtain an updated estimate of the `best' pdf all the time. This will not in general be the same as the Maximum Likelihood estimate, as in the example.
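To see the updating in action, here is a sketch (again my own; the flip order is one arbitrary arrangement of the 8 Heads and 2 Tails): each observation multiplies the current pdf by $p(\text{datum}|m)$, and the renormalised result becomes the prior for the next datum. After all ten flips the posterior mode sits at 0.75, exactly as in the one-shot calculation.

```python
import numpy as np

m = np.linspace(0.0, 1.0, 100_001)
pdf = 6 * m * (1 - m)                    # start from the mild prior towards 0.5

for flip in "HHTHHHHTHH":                # 8 heads and 2 tails, one datum at a time
    pdf = pdf * (m if flip == "H" else (1 - m))   # multiply by p(datum | m)
    pdf = pdf / np.mean(pdf)             # renormalise: the posterior is the new prior

print("MAP after all ten flips:", m[np.argmax(pdf)])  # -> 0.75, as in the one-shot case
```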