
Codes: Information theoretic preliminaries

Suppose I have a piece of English text to transmit down a line or store on disk. I shall suppose that the conventional ASCII character set and its usual representation by seven bits is well known to you and everybody who might take an interest in computers and text. Let us assume that the text consists of the sentence

the cat sat on the mat

which has just 22 characters. It therefore takes up 154 bits in the usual ASCII coding. If we use 8 bits per character, with, say, a parity check bit, then it goes up to 176 bits.

Let us examine the best compression we can think of, and ignore the noise and corruption which the parity check bit is designed to detect.

Most letters do not occur at all, and of those that do, the frequency varies. The frequencies of occurrence and the corresponding probabilities (divide by 22) rounded are:

t 5  0.22727
^ 5  0.22727
a 3  0.13636 
h 2  0.09091
e 2  0.09091 
c 1  0.04545
s 1  0.04545 
o 1  0.04545 
n 1  0.04545 
m 1  0.04545


 
[Figure 3.3: A Huffman tree to code `the cat sat on the mat'.]

I have put the characters in decreasing order of frequency, with a space denoted by ^. Now I shall recode the characters by choosing a new dictionary so that the symbols which occur most often get the shortest binary codes. There are several well-known ways of doing this, Shannon-Fano and Huffman coding being the best known.

In Huffman coding (Huffman, 1952) we imagine that we have the letters and their probabilities at the leaves of a tree and proceed to join them together pairwise until we have a single node, the root of the tree. The simplest rule for joining nodes, and the one I shall use in this example, is to pick the two nodes whose combined probability is smallest, choosing arbitrarily when there is a tie.

In the above case we join together m and n, as these have the smallest probabilities, and call the resulting node /mn/. Next we join o and s to get /os/. Each of /mn/ and /os/ has summed probability 0.09091 (after allowing for rounding). Now we join /os/ to c to get /osc/ with probability 0.13636. We could equally have joined /mn/ to c; it makes no difference to the final compression. We continue in this way until we have the whole tree, as indicated in Fig. 3.3.

Now we obtain a code for each symbol by starting off from the ROOT and proceeding to the symbol. Every time we turn left at a fork we write `0' and every time we turn right we write `1'. Thus the symbol t gets coded as 00, while a space is 01. `a' becomes 111, and the rest are as shown:

 
t 5  0.22727   00
^ 5  0.22727   01
a 3  0.13636   111
h 2  0.09091   1000
e 2  0.09091   1001
c 1  0.04545   1010
s 1  0.04545   10110
o 1  0.04545   10111 
n 1  0.04545   1100 
m 1  0.04545   1101
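
For readers who would like to see the whole procedure mechanised, here is a minimal sketch in Python; the function name huffman_code, and the use of a heap to find the two smallest nodes, are my own choices rather than anything in the text. Ties are broken arbitrarily, so the individual bit patterns may differ from those in the table, but the codeword lengths give the same total of 67 bits for the sentence.

import heapq
from collections import Counter

def huffman_code(text):
    # Build a Huffman code for the symbols in `text` and return a dict
    # mapping each symbol to its codeword (a string of '0's and '1's).
    freq = Counter(text)
    # Heap entries are (count, tiebreak, tree); a tree is either a bare
    # symbol or a pair (left, right) of subtrees.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # the two nodes of smallest count
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (t1, t2)))
        next_id += 1
    code = {}
    def walk(tree, prefix):
        # Left branch contributes a '0', right branch a '1'.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            code[tree] = prefix or '0'    # guard for a one-symbol alphabet
    walk(heap[0][2], '')
    return code

code = huffman_code("the cat sat on the mat")
total = sum(len(code[ch]) for ch in "the cat sat on the mat")
print(total)                              # 67 bits for the whole sentence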

Now suppose we code using this system. We get:

00 1000 1001 01 1010 111 00 01 10110 111 00 01 

t   h    e    ^   c   a  t  ^    s    a   t ^ 
 

10111 1100 01 00  1000 1001 01  1101 111 00 
  o     n  ^   t    h   e   ^     m    a  t

Several points merit attention. First, the total number of bits has dropped from 154 to 67, a considerable saving. Second, the code is a prefix code, which is to say it can be decoded left to right without inserting separators, because no codeword is a left segment of another codeword. Third, the saving is illusory, since I also have to send the alphabet of ten symbols (in ASCII, 7 bits per symbol) followed by the codeword for each, a total of 70 + 37 = 107 bits of overhead, which together with the 67 bits of message gives 174 bits. On the other hand, we didn't count any overhead for telling the receiver that the uncompressed text was in ASCII in the first place!
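
To watch the prefix property doing its work, here is a small sketch of my own, with the codewords from the table typed in by hand, which encodes the sentence and then decodes the unbroken bitstream left to right:

# The codewords from the table above; '^' stands for a space.
code = {'t': '00', '^': '01', 'a': '111', 'h': '1000', 'e': '1001',
        'c': '1010', 's': '10110', 'o': '10111', 'n': '1100', 'm': '1101'}

text = "the cat sat on the mat".replace(' ', '^')
bits = ''.join(code[ch] for ch in text)
print(len(bits))            # 67 bits, against 22 * 7 = 154 for ASCII

# Decode left to right with no separators: because no codeword is a
# prefix of another, the first codeword that matches is the right one.
decode = {word: sym for sym, word in code.items()}
out, buf = [], ''
for b in bits:
    buf += b
    if buf in decode:
        out.append(decode[buf])
        buf = ''
print(''.join(out).replace('^', ' '))   # the cat sat on the mat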

The need to send the coding scheme I shall refer to as the overhead associated with the scheme. Of course, the coding scheme need be sent only once, and it is unreasonable to expect much compression in such a short string anyway. The actual example is fairly silly of course, but using the same idea, it is easy to get respectable levels of string compression for natural language text.

A little thought suggests some improvements. The `model' we are tacitly applying here treats English text as a random sequence of letters, each with a predetermined frequency. English is pretty bad, but not as bad as that. If we were to store the frequencies of consecutive pairs of symbols, the `bigrams', to use Shannon's terminology, we would get a more accurate model of English. For large enough samples of English text, an accurate bigram count would surely do a better job than the singleton count used in the example. After all, most bigrams don't occur at all, and some redundancies in English (such as u after q, except in QANTAS) can be exploited. And why not move on to trigrams and tetragrams and so on? Given large enough samples of text from which to extract the n-grams, we could extract, you might think, every scrap of redundancy in English.[*] It is true, as one's intuitions suggest, that there are improvements of coding which, by treating blocks of text, get closer to the ideal coding implied by the model, and also that one cannot do better without a better model.
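
By way of a toy illustration of the counting involved (nothing here is a serious model of English; the sample is just the running example), one might tabulate n-grams like this:

from collections import Counter

def ngram_counts(text, n):
    # Count the n-grams (length-n blocks of consecutive symbols) in `text`.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

sample = "the cat sat on the mat"
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(5))
# Most of the 27 * 27 possible bigrams over letters-plus-space never occur
# at all, which is exactly the redundancy a bigram model can exploit.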

There are several ways of doing better than Huffman coding, but the first point to which I wish to draw your attention is that there are two components to the compression of symbols: the first is a probabilistic model which gives the expected frequencies of occurrence of symbols or blocks of symbols; the second is the exploitation of this model by assigning strings of bits to the symbols or blocks so that the high-frequency symbols or blocks get the shortest strings.

The second point to observe is that if x is a symbol in an alphabet ${\cal X}$ of cardinality M, and if p(x) denotes the probability of getting the symbol x in the stream we wish to transmit (or store) according to a model, then Huffman coding gets us reasonably close to being able to code x with a string of bits of length l(x), where $ l(x) \approx \log_2(\frac{1}{p(x)}) $. It is easy to see that if we deal with n-grams, the increase in the alphabet size allows us to get closer to this ideal coding length. And given any coding in bits of length l(x), the code gives a `probability' to x of $(\frac{1}{2})^{l(x)}$. This is shown for the silly example above, to indicate how one gets reasonably close despite the smallness of the data set.

t 5  0.22727   00     0.25
^ 5  0.22727   01     0.25 
a 3  0.13636   111    0.125 
h 2  0.09091   1000   0.0625 
e 2  0.09091   1001   0.0625 
c 1  0.04545   1010   0.0625 
s 1  0.04545   10110  0.03125 
o 1  0.04545   10111  0.03125 
n 1  0.04545   1100   0.0625 
m 1  0.04545   1101   0.0625
If we code ${\cal X}$ in the obvious, `blind' way, we need $\log_2(\vert{\cal X}\vert)$ bits per symbol and if the message is of length L, it takes up $L \log_2(\vert{\cal X}\vert)$ bits to send it, neglecting overhead. If we code it using the model, we have, assuming we can get l(x) close to ideal,

\begin{displaymath}
L \sum_{x \in {\cal X}} p(x) \log_2(\frac{1}{p(x)}) \end{displaymath}

which is usually less. In fact it cannot be more, and the two are equal only in the `maximum entropy' case where $p(x) = \frac{1}{\vert{\cal X}\vert} $. Proving this is a simple bit of calculus. Note that we take the value of $ p(x) \log_2(\frac{1}{p(x)})$ to be 0 when p(x) is zero, which makes sense if you draw the graph of $ p \log_2(\frac{1}{p})$ for p between 0 and 1.
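
For the running example the numbers come out roughly as follows; this is a sketch using the exact frequencies rather than the rounded probabilities in the table:

from collections import Counter
from math import log2

text = "the cat sat on the mat"
L = len(text)                      # 22 symbols
freq = Counter(text)
M = len(freq)                      # alphabet of 10 symbols

blind = L * log2(M)                # about 73.1 bits at log2(10) per symbol
entropy = sum((f / L) * log2(L / f) for f in freq.values())
ideal = L * entropy                # about 66.1 bits

print(blind, ideal)
# The Huffman code above used 67 bits, so it sits just above the ideal
# figure and well below the blind one (overhead ignored in all three).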

The third point to observe is that if you poured water at 1 litre per second into the root of the tree in Fig. 3.3, and the water split up at the first node so that half went each way, and similarly at every subsequent place where the pipe bifurcates, then the water coming out at the leaf labelled t would emerge at $\frac{1}{4}$ of a litre per second; in general, if x is a leaf, the amount of water coming out at x is $\frac{1}{2^{l(x)}}$ litres per second. Now if you add up the amount of water coming out of all the leaves, equivalent to adding up all the numbers in the last column of the above table, you get the amount of water going in, which is 1. It is clear that you couldn't have more water coming out than going in, but the above counting neglects the possibility that we might have some codewords left over with no symbols assigned to them. We therefore deduce the rather trivial result:

\begin{displaymath}
\sum_{x \in {\cal X}} \frac{1}{2^{l(x)}} \leq 1 \end{displaymath}

It is easy to see that there is a generalisation to the case where we use more than two symbols in place of our binary code. We leave it to the ingenuity of the reader to write it down. This result is the well known Kraft Inequality. [*] It is plain that any code for ${\cal X}$ must satisfy the Kraft inequality, and this sometimes comes in useful.
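
Checking the inequality for the codewords of the example is a one-liner; for this particular code the sum comes out as exactly 1, so there are no unused codewords:

lengths = [2, 2, 3, 4, 4, 4, 5, 5, 4, 4]    # codeword lengths from the table
print(sum(2 ** -l for l in lengths))         # 1.0, so the Kraft inequality holds with equality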

A fourth point to note is that if the model, which asserts something about the probability of getting the symbol x, is wrong, then the coding based upon the model will be less effective at compressing the data stream than if the model were right. This follows because we can compare the lengths of the coded data stream under the two codings using the actual frequencies, say q(x): we get

\begin{displaymath}
L \sum_{x \in {\cal X}} q(x) \log_2(\frac{1}{p(x)}) \end{displaymath}

for the length of the bitstream coded on model p and

\begin{displaymath}
L \sum_{x \in {\cal X}} q(x) \log_2(\frac{1}{q(x)}) \end{displaymath}

for the length of the bitstream coded on model q, and the second is never greater than the first, by an easy little exercise in constrained optimisation. It is obviously a good idea to get as good a model as possible of what actually happens if you want the maximum compression.
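
A quick numerical check of the claim, with a made-up pair of distributions over a three-symbol alphabet (the particular numbers are arbitrary):

from math import log2

q = [0.5, 0.3, 0.2]      # the actual frequencies
p = [0.2, 0.3, 0.5]      # the (wrong) model used to build the code

cross = sum(qi * log2(1 / pi) for qi, pi in zip(q, p))   # bits per symbol on model p
best  = sum(qi * log2(1 / qi) for qi in q)               # bits per symbol on model q

print(cross, best)   # cross is the larger: roughly 1.88 against 1.49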

A fifth point to make is the rather obvious one that if we can predict the rest of the string past some point with absolute certainty, then the rest of the string is not informative to anyone who knows our prediction system. All we need to do is to send our prediction system, and that suffices to allow the receiver to reconstruct the rest of the sequence. This, of course, is the idea behind sending the algorithm for $\pi$ instead of the first ten thousand digits. Our probabilistic model assigns probability 1 to each successive symbol, and the length of the compressed string falls to zero. We have to send only the overhead.

Summarising, compression arises from redundancy in the string, which allows us to model the string and the differing frequencies of its components. The more we know about the string, the shorter the resulting coded version. The belief that there is some intrinsic amount of redundancy in the string is simple-minded; what we have is the possibility of better and better models. The better your model, the more compression. We can extract redundancy from the data, relative to the model, by a coding scheme. The difference

\begin{displaymath}
L \left(\log_2(\vert{\cal X}\vert) - \sum_{x \in {\cal X}} p(x) \log_2(\frac{1}{p(x)}) \right) \end{displaymath}

measures the amount of information extracted from a stream of symbols from the alphabet ${\cal X}$ of length L by the model which gives the probabilities over the alphabet. In the case where the model fits the data perfectly with probabilities being 1 or 0, this is all of the information there is.

In practice, in order to use any such scheme, we have to transmit the model itself, and this overhead may make it simpler to transmit the data uncompressed instead. For large amounts of data, however, compression would seem to be a good investment.

This gives us an unambiguous sense of when one model is better than another: we can measure the amount of compression each gives on the data which the model is supposed to be modelling. The one which gives higher compression has been extracting more information (that being the only sensible way to measure information) from the data. In making this comparison, we may have to consider the overhead of the models. If the models have the same number of parameters, then the overheads are the same and may be ignored; but if a model is complex, the overhead may be unacceptable, and this may be determined unambiguously by measuring the total compression, taking the overhead into account.

You will be, I hope, familiar with the practicalities of Information Theory, or The Mathematical Theory of Communication, as Shannon prefers to call it. In order to think about what is going on, as opposed to simply knowing how to do the sums, it is useful to think of the basics as follows.

First one argues that information is passed in messages which consist of temporal sequences of symbols. We may as well take it that these symbols come from some finite alphabet, because if they came from an infinite alphabet, either we should have to wait an infinite length of time to get just one of them or we should have some practical limit on our ability to distinguish when two of them are different. So we model the whole business of communication as an exchange of symbol strings from some finite alphabet. Now the sender and receiver of these strings, not necessarily different people, will have, at any point in the string, some kind of expectation about the possibilities of what symbol can come next, which we propose to model as a probability distribution over the alphabet. This probability distribution will, of course, change as we go down the string of symbols.

Now at any point, for any symbol in the alphabet, I shall take the probability of the symbol and take its reciprocal and call it the improbability. Then I take the logarithm of this improbability and call it the surprise of the symbol turning up at this location. This seems a reasonable name, since if the probability is 1, so it is certain, then the improbability is also 1 and the logarithm is zero, so I shall get zero surprise. If the probability is 0 so that I firmly believe it can't possibly happen, then I get an infinite amount of surprise, which serves me right. Not so much a surprise as a paralysing shock. And if the probability is 0.5, then the improbability is 2, and if I take logarithms to the base 2, I get one unit of surprise. Such a unit is called a bit. Since probabilities for independent events multiply, so do improbabilities, and surprises simply sum. So if I get one bit of surprise from one throw of a coin (as I will if it is a fair coin, or if I believe it is) and then I throw it again and get another bit of surprise, I get altogether two bits of surprise.

Note that this is the surprise produced by the data relative to the model. It doesn't make sense to talk of the amount of surprise in the data per se. If you and I have different models, we might experience quite different amounts of surprise at different parts of the string of symbols.

The average amount of surprise I expect is easily calculated: the symbol x will occur with frequency p(x) (which will be conditional on all sorts of things) and each time it occurs I get $\log_2(\frac{1}{p(x)})$ bits of surprise, so the total is the sum of this over all symbols in the alphabet, weighted by the relative frequency of occurrence. This gives the well known formula for the entropy of the probability distribution:

\begin{displaymath}
\sum_{x \in {\cal X}} p(x) \log_2(\frac{1}{p(x)}) \end{displaymath}

This is clearly a property of the model and has nothing at all to do with the data. It is, of course, known as the entropy of the model. I can also compute, by summing the amount of surprise as I go along, the total amount of surprise I get from any given message; the average surprise per symbol requires that I divide this by the number of symbols in the message. If the model is a good model of the message, then these two numbers ought to be fairly close. Given two models, I can measure the extent to which they expect, on average, the same symbols with similar probabilities, and thus get a measure of how different the models are: the Kullback-Leibler measure. What I call the `surprise' is more commonly called the information associated with the symbol. I guess it sounds classier.
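
The bookkeeping described here is easy to mechanise. The sketch below (function names my own) accumulates the surprise of a message under a fixed model, compares the average per symbol with the model's entropy, and computes the Kullback-Leibler measure between the fitted model and a uniform one:

from collections import Counter
from math import log2

def surprise(message, model):
    # Total surprise (in bits) of `message` under `model`, a dict of probabilities.
    return sum(log2(1 / model[sym]) for sym in message)

def entropy(model):
    return sum(p * log2(1 / p) for p in model.values() if p > 0)

def kl(p, q):
    # Kullback-Leibler measure of model p against model q
    # (assumes q[x] > 0 wherever p[x] > 0).
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

text = "the cat sat on the mat"
model = {sym: n / len(text) for sym, n in Counter(text).items()}

print(surprise(text, model) / len(text))   # average surprise per symbol, about 3.01 bits
print(entropy(model))                      # the same here, since the model was fitted to this very message

uniform = {sym: 1 / len(model) for sym in model}
print(kl(model, uniform))                  # about 0.32 bits: how far the fitted model is from a uniform one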

Most of this may be found, in improved and extended form, in the book by Shannon and Weaver, which should be on everybody's bookcase. What isn't may be found in Elements of Information Theory by Cover and Thomas, listed in the bibliography.

In conclusion, a thought for the day.

Brains may be regarded as devices for figuring out what is likely to happen next, so as to enable the owner of the brain to improve his chances of survival. A brain has to compress the experience of the owner's life so far, and possibly also some of the experience of his ancestors. The amount of data is huge, and compressing it is essential. Happily, the better the brain is at extracting information from that experience, the more compression it achieves. So a device for figuring out what is likely to happen next will have the best chance of surviving in a hostile world when it is best at extracting information from, and hence compressing, its experience.

