
Compression for coin models

Suppose we conduct the experiment of tossing a coin ten times and get the sequence of 8 Heads and 2 Tails, as described earlier. Again contemplate two models, the p(H) = 0.8 model and the p(H) = 0.5 model. I want to compute the compression I can obtain by using these models to describe the actual data.

Using model m = 0.5, because the arithmetic is easier: on the first throw, the probability of Heads (according to the model; models are things that have probabilities, not coins) is 0.5. The improbability is the reciprocal of this, which is 2, and the information I get when a Head actually turns up is $\log_2(2) = 1$. This happens for the first 8 throws of the coin, and I get 1 bit of information or surprise each time. On the last two throws, I get a Tail, which the model also expects with probability 0.5, so I get another 1 bit of information for each of these. This adds up to a total of ten bits, and m = 0.5 is the most uninformative model I could have for the process.

Using model m = 0.8, I get a total amount of surprise of

\begin{displaymath}
8 \log_2\left(\frac{1}{0.8}\right) + 2 \log_2\left(\frac{1}{0.2}\right)
= 8(0.3219) + 2(2.3219) = 7.219
\end{displaymath}

which is less. So describing this sequence using the second model could reduce the cost from 10 bits to just over 7 bits. Huffman coding won't help, since there are too few symbols, but Arithmetic coding can, as I describe below.
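If you want to check the arithmetic by machine, here is a minimal sketch in Python; the sequence, and the function name, are mine, chosen purely for illustration.

\begin{verbatim}
from math import log2

# The observed sequence: 8 Heads followed by 2 Tails.
sequence = "HHHHHHHHTT"

def code_length(seq, p_heads):
    # Total surprise in bits: sum of log2(1/p) over the outcomes,
    # where the model gives Heads probability p_heads.
    return sum(log2(1.0 / (p_heads if s == "H" else 1.0 - p_heads))
               for s in seq)

print(code_length(sequence, 0.5))   # 10.0 bits for the m = 0.5 model
print(code_length(sequence, 0.8))   # about 7.219 bits for the m = 0.8 model
\end{verbatim}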

In writing a number between 0 and 1 in binary, I imagine representing my number as a point on the line. I divide the line into two equal halves, with $\frac{1}{2}$ in the middle. If the number is in the left half of this partition, I write 0; if in the right half, I write 1. Now I take the half containing the point and blow it up so that it looks the same as the original interval, and again I subdivide it in the middle. If my point is in the left half I write 0, if in the right half I write 1. So far I have two digits: .00 if the point is in the left quarter, .01 if it lies between $\frac{1}{4}$ and $\frac{1}{2}$, and so on. I continue in this way until I have specified the position of the point to a sufficient precision, unless I am a pure mathematician, in which case I continue indefinitely. This gives pure mathematicians something to do.
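The procedure is mechanical enough to hand to a machine. Here is a sketch of it in Python; the point 0.1640625, chosen for illustration, will turn up again below.

\begin{verbatim}
def binary_digits(x, n_digits):
    # First n_digits binary digits of x in [0, 1), found by repeatedly
    # halving the interval containing the point.
    digits = []
    lo, hi = 0.0, 1.0
    for _ in range(n_digits):
        mid = (lo + hi) / 2.0
        if x < mid:               # point in the left half: write 0
            digits.append("0")
            hi = mid
        else:                     # point in the right half: write 1
            digits.append("1")
            lo = mid
    return "0." + "".join(digits)

print(binary_digits(0.1640625, 7))   # prints 0.0010101
\end{verbatim}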

Now there was no necessity to chop up the interval into two subintervals; had I used ten and labelled them from 0 to 9, I should have obtained the decimal expansion of the number by the same process. Nor is there any necessity to chop the interval up into equal intervals. I shall now chop it up into a left hand portion between 0 and 0.8, and a right hand portion from 0.8 to 1.0. If the first throw is Heads, I put the point in the left portion; if Tails, in the right portion. If the second throw is Heads, I do the same on the expanded portion which got used last time. I continue until all ten coin throws have been described. The result is an interval. For the case of ten throws with the first 8 being Heads, I actually find that the interval is from $0.96 \times (0.8)^8$ to $(0.8)^8$. In decimal this comes out as between 0.16106127 and 0.16777216 approximately. Now to transmit the result in arithmetic code, I simply choose a number in binary which is in this interval. If I receive such a number and know that there are only ten results and that the 0.8/0.2 partition has been used, I can deduce all the results. In fact the binary number 0.0010101, with only seven bits, comes out as 0.1640625 exactly, which is inside the interval. So I can actually send seven bits to code the sequence, using the model m = 0.8 to do it. This involves a small amount of cheating which I leave you to ponder. Beating the ideal of 7.219 looks suspicious, and is.
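If you would rather let a machine do the interval-chasing, here is a sketch in Python of the same procedure, assuming the 8-Heads-then-2-Tails sequence and the 0.8/0.2 partition; the function names are mine.

\begin{verbatim}
from math import ceil

def arithmetic_interval(seq, p_heads):
    # Narrow [0, 1) one symbol at a time: Heads takes the left p_heads
    # of the current interval, Tails takes the remaining right portion.
    lo, hi = 0.0, 1.0
    for s in seq:
        split = lo + p_heads * (hi - lo)
        if s == "H":
            hi = split
        else:
            lo = split
    return lo, hi

def shortest_codeword(lo, hi):
    # The shortest binary fraction k / 2**n_bits lying inside [lo, hi).
    for n_bits in range(1, 64):
        k = ceil(lo * 2 ** n_bits)    # smallest k with k / 2**n_bits >= lo
        x = k / 2 ** n_bits
        if x < hi:
            return n_bits, x
    return None

lo, hi = arithmetic_interval("HHHHHHHHTT", 0.8)
print(lo, hi)                        # about 0.16106127 and 0.16777216
print(shortest_codeword(lo, hi))     # (7, 0.1640625), i.e. 0.0010101 in binary
\end{verbatim}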

Some innocent experimenting with other models and a calculator may lead you to enlightenment faster than any amount of prohibited substances.

It is easy to see that the m = 0.8 model does better than any other. The reason is obvious: when we add up the logarithms of the reciprocals of the probabilities we are multiplying the improbabilities, and minimising this product is the same as maximising the product of the probabilities, which is the likelihood. So the maximal compression model in this case is the good old Maximum Likelihood model. It is not too hard to believe that if a Bayesian convinces you that the MAP model is the best one for fitting the data that hasn't yet come in, then your best bet for compressing that data, if he is right, is the MAP model.
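In symbols: for data $x_1, \dots, x_n$ and a model assigning probability $p(x_i)$ to each outcome, the total code length is

\begin{displaymath}
\sum_{i=1}^{n} \log_2 \frac{1}{p(x_i)} \;=\; \log_2 \frac{1}{\prod_{i=1}^{n} p(x_i)}
\end{displaymath}

so minimising the code length over models is the same as maximising the likelihood $\prod_{i=1}^{n} p(x_i)$.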

The overhead of sending the compression scheme, i.e. the model and the procedure for Arithmetic coding, is presumably the same in both cases and may be ignored.

