Suppose we conduct the experiment of tossing a coin ten times and get the sequence of 8 Heads and 2 Tails, as described earlier. Again contemplate two models, the p(H) = 0.8 model and the p(H) = 0.5 model. I want to compute the compression I can obtain by using these models to describe the actual data.
Using model m = 0.5, because the arithmetic is easier, on the first throw the probability of Heads (according to the model; models are things that have probabilities, not coins) is 0.5. The improbability is the reciprocal of this, which is 2, and the information I get when a Head actually turns up is $\log_2 2 = 1$ bit. This happens for the first 8 throws of the coin, and I get 1 bit of information or surprise each time. On the last two throws, I get a Tail, which I also expect with probability 0.5, so I get another 1 bit of information for each of these. This adds up to a total of ten bits, and that is the most uninformative model I could have for the process.
Using model m = 0.8, I get a total amount of surprise of

$$8 \log_2\tfrac{1}{0.8} + 2 \log_2\tfrac{1}{0.2} \approx 2.575 + 4.644 \approx 7.219 \text{ bits}.$$
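To make the arithmetic concrete, here is a small sketch (my own illustration in Python, not anything from the text) that adds up the surprise $\log_2(1/\text{probability})$ over the ten throws for any value of p(H); it reproduces the 10 bits for the m = 0.5 model and the 7.219 bits for the m = 0.8 model.

```python
# Total surprise, in bits, of 8 Heads followed by 2 Tails under a model
# that assigns probability p_heads to Heads. (Names are my own.)
from math import log2

def total_bits(p_heads, n_heads=8, n_tails=2):
    """Sum of log2(1/probability) over all ten throws."""
    return n_heads * log2(1 / p_heads) + n_tails * log2(1 / (1 - p_heads))

print(total_bits(0.5))  # 10.0 bits for the m = 0.5 model
print(total_bits(0.8))  # about 7.219 bits for the m = 0.8 model
```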
In writing a number between 0 and 1 in binary,
I imagine representing my number as a point on
the
line. I divide the line into two equal bits, with
in the middle. If the number is
in
the left half of this partition, I write 0, if
in the right half I write 1. Now I take the half
containing the point and blow it up so that it
looks the same as the original interval, and
again
I subdivide it in the middle. If my point is in
the left half I write 0, if in the right half
I
write 1. So far I have two digits, .00 if the
point is in the left quarter, .01 if it lies
between
and
, and so on. I
continue in this way until I have specified the
position of the point to a sufficient precision,
unless I am a pure mathematician in which case
I
continue indefinitely. This gives pure mathematicians
something to do.
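The halving procedure is easy to mechanise. The following sketch (my own naming, and it stops after a fixed number of digits rather than carrying on like a pure mathematician) produces the binary expansion of a point in [0, 1) by repeatedly recording which half of the current interval contains it.

```python
def binary_digits(x, n_digits=8):
    """Return the first n_digits of the binary expansion of x in [0, 1)."""
    digits = []
    lo, hi = 0.0, 1.0
    for _ in range(n_digits):
        mid = (lo + hi) / 2
        if x < mid:          # left half: write 0 and zoom in on it
            digits.append('0')
            hi = mid
        else:                # right half: write 1 and zoom in on it
            digits.append('1')
            lo = mid
    return '.' + ''.join(digits)

print(binary_digits(0.1640625, 7))  # .0010101
```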
Now there was no necessity to chop up the interval into two subintervals; had I used ten and labelled them from 0 to 9, I should have obtained the decimal expansion of the number by the same process. Nor is there any necessity to chop the interval up into equal intervals. I shall now chop it up into a left hand portion between 0 and 0.8, and a right hand portion from 0.8 to 1.0. If the first coin is Heads, I put it in the left portion, if Tails in the right portion. If the second coin is Heads, I do the same on the expanded portion which got used last time. I continue until all ten coin throws have been described. The result is an interval. For the case of ten throws with the first 8 being Heads, I actually find that the interval is from 0.96 × (0.8)^8 to (0.8)^8. In decimal this comes out as between 0.16106127 and 0.16777216 approximately. Now to transmit the result in arithmetic code, I simply choose a number in binary which is in this interval. If I receive such a number and know that there are only ten results and that the 0.8/0.2 partition has been used, I can deduce all the results. In fact the binary number 0.0010101, with only seven bits, comes out as 0.1640625 exactly, which is inside the interval. So I can actually send seven bits to code the sequence, using the model m = 0.8 to do it. This involves a small amount of cheating which I leave you to ponder. Beating the ideal of 7.219 bits looks suspicious and is.
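If you want to check the numbers, here is a sketch (my own construction, not the author's program or any standard arithmetic coder) of the unequal 0.8/0.2 partitioning, followed by a greedy hunt for a short binary fraction inside the final interval.

```python
# Heads takes the left 0.8 of the current interval, Tails the right 0.2.
def encode_interval(sequence, p_heads=0.8):
    lo, hi = 0.0, 1.0
    for toss in sequence:
        split = lo + p_heads * (hi - lo)
        if toss == 'H':
            hi = split          # Heads: keep the left portion
        else:
            lo = split          # Tails: keep the right portion
    return lo, hi

lo, hi = encode_interval('HHHHHHHHTT')
print(lo, hi)                   # roughly 0.16106127 and 0.16777216

# Build the shortest binary fraction .b1 b2 ... that lands in [lo, hi):
# add each power of 1/2 only if doing so stays below hi.
bits, value, step = '', 0.0, 0.5
while not (lo <= value < hi):
    if value + step < hi:
        value += step
        bits += '1'
    else:
        bits += '0'
    step /= 2
print('.' + bits, value)        # .0010101  0.1640625
```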
Some innocent experimenting with other models and a calculator may lead you to enlightenment faster than any amount of prohibited substances.
It is easy to see that the m = 0.8 model does better than any other. The reason is obvious: when we add up the logarithms of the reciprocals of the probabilities we are multiplying the improbabilities, and minimising this product of improbabilities is the same as maximising the likelihood. So the maximal compression model in this case is the good old Maximum Likelihood model. It is not too hard to believe that in the case where a Bayesian convinces you that the MAP model is the best one for fitting the data that hasn't yet come in, then your best bet for compressing that data, if he is right, is the MAP model.
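A quick numerical experiment supports the claim: scanning candidate values of p(H) for the 8 Heads and 2 Tails data, the code length is smallest at p(H) = 0.8, the maximum likelihood value. A sketch, with my own helper name:

```python
# Code length, in bits, of 8 Heads and 2 Tails under probability p of Heads.
from math import log2

def code_length(p):
    return 8 * log2(1 / p) + 2 * log2(1 / (1 - p))

# Scan p = 0.01, 0.02, ..., 0.99 and keep the value giving the shortest code.
best = min((p / 100 for p in range(1, 100)), key=code_length)
print(best, code_length(best))   # 0.8 and about 7.219 bits
```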
The overhead of sending the compression scheme, i.e. the model and the procedure for Arithmetic coding, is presumably the same in both cases and may be ignored.