
Bayesian Statistics

In the decision case of telling the guys from the gals, we assumed that the models m and f were given and non-negotiable and that the problem was to choose between them as the most probable explanation for the occurrence of the datum x. In many problems, we have the data and the issue is to find the best or most probable pdf to account for the whole lot. Point or parameter estimation is the business of finding the `best' model for a given data set. The question, of course, is, what does `best' mean? We have discussed the idea of the Maximum Likelihood estimator, but there are cases when it is obvious even to the most partisan that it is not a particularly good choice. In the case of the coin which was thrown ten times and came down Heads eight times, the Maximum Likelihood model leads you to believe it will come down Heads 800 times, give or take a few, if it is tossed a thousand times. This might surprise someone who looked hard at the coin and concluded that it looked like any other coin. Such a person might be prejudiced in favour of something closer to 0.5 and unwilling to go all the way to 0.8 on the strength of just ten throws.

Let m be the model that asserts p(H) = m; then there is a model for each $m \in [0,1]$. Let x be the observation consisting of a particular sequence of 8 Heads and 2 Tails. Then $p(x\vert m) = m^8(1-m)^2$ for any such m. If we maximise p(x|m) over all values of m, we get the maximum likelihood model, which by a bit of easy calculus occurs when

\begin{displaymath}
8m^7(1-m)^2 - 2m^8(1-m) = 0 \end{displaymath}

i.e. for m = 0.8, as claimed in 3.1.5.
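For readers who prefer to check such things numerically, here is a small sketch in Python: a grid search stands in for the calculus (the grid resolution is my own choice), maximising the likelihood written down above.

```python
# Likelihood of the observed sequence: p(x|m) = m^8 (1-m)^2.
# Maximise it on a grid over [0, 1] as a numerical check on the calculus.

def likelihood(m):
    return m**8 * (1 - m)**2

grid = [i / 10000 for i in range(10001)]
m_ml = max(grid, key=likelihood)
print(m_ml)  # 0.8
```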

Now let us suppose that you have a prejudice about the different values for m. What I shall do is to assume that there is a probability distribution over the different values which m can have. If I try the distribution 6m(1-m) then I am showing a preference for the value of m at 0.5 where this pdf has a maximum, but it is not a very powerful commitment.

Now looking at the Bayes formula

\begin{displaymath}
p(m\vert x) = \frac{p(x\vert m) p(m)}{p(x)} \end{displaymath}

we obtain immediately that

\begin{displaymath}
p(m\vert x) = \frac{m^8(1-m)^2 \cdot 6m(1-m)}{p(x)} = 
\frac{6m^9(1-m)^3}{p(x)} \end{displaymath}

Since p(x) is some prior probability of x which, I shall suppose, does not depend upon the model, m, I can use calculus to obtain the maximum of this expression. A little elementary algebra gives a value of $m = \frac{3}{4} $. This is between the maximum likelihood estimate of 0.8 and the prior prejudice of 0.5. Since my commitment to 0.5 was not very strong, as indicated by the pdf I chose, I have gone quite a long way toward the ML model. Had I used $30m^2(1-m)^2$ as my measure of prejudice, the posterior maximum would have been at $m = \frac{5}{7}$, closer to 0.5.
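The same grid-search sketch will locate the posterior maxima under both priors; the code below is illustrative only, and simply maximises the unnormalised numerators $p(x|m)p(m)$ from the formula above (the normalising p(x) cannot affect where the maximum falls).

```python
# MAP estimates under the two priors, by maximising the unnormalised
# posterior numerators p(x|m) p(m) on a grid over [0, 1].

def numerator_weak(m):
    # prior 6m(1-m): numerator 6 m^9 (1-m)^3
    return m**8 * (1 - m)**2 * 6 * m * (1 - m)

def numerator_strong(m):
    # prior 30m^2(1-m)^2: numerator 30 m^10 (1-m)^4
    return m**8 * (1 - m)**2 * 30 * m**2 * (1 - m)**2

grid = [i / 100000 for i in range(100001)]
m_map_weak = max(grid, key=numerator_weak)      # 3/4
m_map_strong = max(grid, key=numerator_strong)  # 5/7, closer to 0.5
print(m_map_weak, m_map_strong)
```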

The above argument assumes that p(x) does not depend on m. But what is p(x)? If x has just been observed, then it ought to be 1; so it doesn't mean that. The most sensible answer is that it is the so-called marginal probability, which is obtained by adding up all the p(x|m)s for all the different m, weighted by the probability of m. Since the values of m range over the interval between 0 and 1, we have

\begin{displaymath}
p(x) = \int_{m=0}^{m=1} p(x\vert m)p(m) dm \end{displaymath}

which means that dividing by p(x) simply amounts to normalising the numerator in the definition of p(m|x) above, thus ensuring that it is a bona fide pdf. So we have not just a most probable value for the estimated value of the parameter m given the observation data x, we have a pdf for m. This pdf is known, not unreasonably, as the posterior pdf, and is written, as above, as p(m|x); it may be compared with p(m), the prior pdf. It is easy to compute it explicitly in the case of the coin tossing experiment, and the reader is urged to do this as it is soothing, easy and helps burn the idea into the brain in a relatively painless manner. The maximum of the posterior pdf is in some places in the literature referred to as the MAP estimate, because acronyms are much less trouble to memorise than learning the necessary Latin, and Maximum A Posteriori takes longer to say than `MAP', even though it can be difficult for the untrained ear to hear the capital letters.
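For the reader doing the suggested computation, here is one way to check it numerically: the marginal p(x) is approximated by a Riemann sum (doing the integral exactly, via integration by parts, gives 3/1430), and the maximum of the resulting normalised posterior lands at the MAP value. The grid resolution is my own choice.

```python
# Explicit posterior for the coin: numerator 6 m^9 (1-m)^3, normalised by
# the marginal p(x), here approximated by a midpoint Riemann sum.

N = 100000
dm = 1.0 / N
mids = [(i + 0.5) * dm for i in range(N)]

def numerator(m):
    # p(x|m) p(m) with the prior 6m(1-m): 6 m^9 (1-m)^3
    return 6 * m**9 * (1 - m)**3

p_x = sum(numerator(m) for m in mids) * dm  # approximates 3/1430

def posterior(m):
    return numerator(m) / p_x

m_map = max(mids, key=posterior)            # maximum of the posterior pdf
print(p_x, m_map)
```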

In the general case where data is coming in sequentially, we can start off with some prior probability distribution and, for each new datum, use this method to predict what the datum ought to be; when we find out what it actually was, we can update the prior. This is clearly a kind of learning, and it is not beyond the range of belief that something similar may occur in brains. We talk, at each stage of new data acquisition, of the a priori and a posteriori pdfs for the model, so we obtain an updated estimate of the `best' pdf all the time. This will not in general be the same as the Maximum Likelihood estimate, as in the example.
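As a sketch of this sequential updating (the toss order and grid resolution are invented for illustration), one can carry the current pdf for m along on a grid, multiplying in the likelihood of each new toss and renormalising; after all ten tosses the posterior maximum agrees with the batch calculation above.

```python
# Sequential updating: keep the current pdf for m on a grid, and for each
# new toss multiply in the likelihood (m for Heads, 1-m for Tails) and
# renormalise. The posterior after one toss becomes the prior for the next.

N = 10000
grid = [(i + 0.5) / N for i in range(N)]
pdf = [6 * m * (1 - m) for m in grid]        # start from the prior 6m(1-m)

def update(pdf, heads):
    new = [p * (m if heads else (1 - m)) for p, m in zip(pdf, grid)]
    z = sum(new) / N                          # approximate normalising integral
    return [p / z for p in new]

tosses = [True] * 8 + [False] * 2             # 8 Heads, 2 Tails, as before
for t in tosses:
    pdf = update(pdf, t)

m_map = grid[pdf.index(max(pdf))]             # posterior maximum after all tosses
print(m_map)
```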


Mike Alder
9/19/1997