the cat sat on the mat
which has just 22 characters. At the usual 7 bits per character of ascii coding, it therefore takes up 154 bits. If we use 8 bits to a character, with say a parity check bit, then it goes up to 176 bits.
Let us examine the best compression we can think of, and ignore the noise and corruption which the parity check bit is designed to detect.
Most letters do not occur at all, and of those that do, the frequency varies. The frequencies of occurrence and the corresponding probabilities (divide by 22) rounded are:
  symbol   count   probability
  t        5       0.22727
  ^        5       0.22727
  a        3       0.13636
  h        2       0.09091
  e        2       0.09091
  c        1       0.04545
  s        1       0.04545
  o        1       0.04545
  n        1       0.04545
  m        1       0.04545
I have put the characters in decreasing order of frequency, with a space denoted by ^ . Now I shall recode the characters by giving a new dictionary chosen so that the symbols which occur most often get the shortest binary coding. There are several well known ways of doing this, Shannon-Fano and Huffman being the best known.
In Huffman coding (Huffman, 1952) we imagine that we have the letters and probabilities at the leaves of a tree and proceed to join them together pairwise until we have a single node, the root of the tree. The simplest rule for joining nodes, and the one I shall use in this example, is to pick the two nodes of lowest probability, so that the probability of the joined node is as small as possible, choosing arbitrarily when there are ties.
In the above case we join together m and n, as these have the smallest probabilities, and call the resulting node /mn/. Next we join o and s to get /os/. Each of /mn/ and /os/ has a summed probability of 0.09091 (after allowing for rounding). Now we join /os/ to c to get /osc/ with probability 0.13636. We could have chosen to join /mn/ to c instead; it makes no difference to the final compression. We continue in this way until we have the whole tree, as indicated in Fig.3.3.
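If you want to see this construction mechanised, here is a minimal sketch in Python (my own illustration, not code from the text) which does the pairwise joining with a heap, using raw counts rather than probabilities since only their ordering matters. Tie-breaking is arbitrary, so the particular codewords it prints need not match the ones derived below, but the total coded length of the example string comes out at the same 67 bits.

```python
# A minimal sketch of the Huffman construction described above.
import heapq
from collections import Counter

def huffman_code(text):
    counts = Counter(text)
    # Each heap entry is (count, tie_breaker, tree); a tree is either a symbol
    # or a pair (left, right). The tie_breaker keeps tuple comparison happy.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # the two nodes of lowest count
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, i, (t1, t2)))
        i += 1
    _, _, tree = heap[0]
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: left is 0, right is 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"
    walk(tree, "")
    return code

if __name__ == "__main__":
    for sym, word in sorted(huffman_code("the cat sat on the mat").items()):
        print(repr(sym), word)
```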
Now we obtain a code for each symbol by starting off from the ROOT and proceeding to the symbol. Every time we turn left at a fork we write `0' and every time we turn right we write `1'. Thus the symbol t gets coded as 00, while a space is 01. `a' becomes 111, and the rest are as shown:
  symbol   count   probability   code
  t        5       0.22727       00
  ^        5       0.22727       01
  a        3       0.13636       111
  h        2       0.09091       1000
  e        2       0.09091       1001
  c        1       0.04545       1010
  s        1       0.04545       10110
  o        1       0.04545       10111
  n        1       0.04545       1100
  m        1       0.04545       1101
Now suppose we code using this system. We get:
  00  1000  1001  01  1010  111  00  01  10110  111  00  01
  t   h     e     ^   c     a    t   ^   s      a    t   ^

  10111  1100  01  00  1000  1001  01  1101  111  00
  o      n     ^   t   h     e     ^   m     a    t
Several points merit attention. First, the total number of bits has dropped from 154 to 67, a considerable saving. Second, the code is a prefix code: no codeword is a left segment of another codeword, so the string can be decoded left to right without inserting separators. Third, the saving is illusory, since I also have to send the alphabet of ten symbols (in ascii, 7 bits per symbol) followed by the codeword for each, a total of 70 + 37 = 107 bits, which together with the 67-bit message gives 174 bits. On the other hand, we didn't consider the overhead involved in sending the ascii code in the first place!
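The numbers quoted here are easy to check mechanically. The following sketch (again my own illustration, using the codeword table above) encodes the string, decodes it greedily from left to right, and confirms the 67-bit total and the 174-bit figure once the overhead is added.

```python
# Encoding is table lookup; decoding walks the bit string left to right, which
# works precisely because no codeword is a prefix of another.
CODE = {'t': '00', ' ': '01', 'a': '111', 'h': '1000', 'e': '1001',
        'c': '1010', 's': '10110', 'o': '10111', 'n': '1100', 'm': '1101'}

def encode(text):
    return ''.join(CODE[ch] for ch in text)

def decode(bits):
    inverse = {word: sym for sym, word in CODE.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:          # a complete codeword has been read
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

msg = "the cat sat on the mat"
bits = encode(msg)
print(len(bits))                    # 67, against 154 bits of 7-bit ascii
print(decode(bits) == msg)          # True
# Overhead of sending the table: 10 symbols at 7 bits plus the codewords,
# 70 + 37 = 107 bits, so 107 + 67 = 174 bits in all.
```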
The need to send the coding scheme I shall refer to as the overhead associated with the scheme. Of course, the coding scheme need be sent only once, and it is unreasonable to expect much compression in such a short string anyway. The actual example is fairly silly of course, but using the same idea, it is easy to get respectable levels of string compression for natural language text.
A little thought suggests some improvements. The `model' we are tacitly applying here is of English text being a random sequence of letters, each with a predetermined frequency. English is pretty bad, but not as bad as that. If we were to store the frequency of consecutive pairs of symbols, the `bigrams', to use Shannon's terminology, we would get a more accurate model of English. For large enough samples of English text, an accurate bigram count would surely do a better job than the singleton count used in the example. After all, most bigrams don't occur at all, and some redundancies in English (such as u after q except in QANTAS) can be exploited. And why not move on to trigrams and tetragrams and so on? Given large enough samples of text from which to extract the ngrams, we could extract, you might think, every scrap of redundancy in English.

It is true, as one's intuitions suggest, that there are improvements to the coding which, by treating blocks of text, get closer to the ideal coding implied by the model, and also that one cannot do better without a better model.
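Here is a sketch of the sort of bigram counting this suggests, run on the toy string for want of anything better; a real model would of course be estimated from a large sample of English. The code is my own illustration, not anything from the text.

```python
# Estimate p(next | previous) from a sample of text. With a large enough
# sample these conditional probabilities give a sharper model of English
# than single-letter frequencies do.
from collections import Counter, defaultdict

def bigram_model(text):
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: n / total for nxt, n in counter.items()}
    return model

model = bigram_model("the cat sat on the mat")
print(model['t'])   # in this tiny sample, 't' is followed by 'h' or space, half and half
```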
There are several ways of doing better than Huffman coding, but the first point to which I wish to draw your attention is that there are two components to the compression of symbols: the first is a probabilistic model which gives the frequencies of occurrence to be expected of symbols or blocks of symbols; the second is the exploitation of this model by assigning strings of bits to the symbols or blocks so that the high frequency symbols or blocks get the shortest strings.
The second point to observe is that if x is a symbol in an alphabet of cardinality M, and if p(x) denotes the probability of getting the symbol x in the stream we wish to transmit (or store) according to a model, then Huffman coding gets us reasonably close to being able to code x with a string of bits of length l(x), where $l(x) = \log_2(1/p(x))$. It is easy to see that if we deal with ngrams, the increase in the alphabet size allows us to get closer to this ideal coding length. And given any coding in bits of length l(x), the code gives a `probability' to x of $2^{-l(x)}$. This is shown for the silly example, above, to indicate how one gets reasonably close despite the smallness of the data set:
  symbol   count   probability   code    2^{-l(x)}
  t        5       0.22727       00      0.25
  ^        5       0.22727       01      0.25
  a        3       0.13636       111     0.125
  h        2       0.09091       1000    0.0625
  e        2       0.09091       1001    0.0625
  c        1       0.04545       1010    0.0625
  s        1       0.04545       10110   0.03125
  o        1       0.04545       10111   0.03125
  n        1       0.04545       1100    0.0625
  m        1       0.04545       1101    0.0625
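The last columns of the table can be recomputed directly. The sketch below (my own illustration; the counts and codeword lengths are copied from the table, everything else is bookkeeping) prints, for each symbol, the model probability, the ideal length $\log_2(1/p(x))$, the length actually assigned by the code, and the implied `probability' $2^{-l(x)}$.

```python
# Compare ideal code lengths with the lengths assigned in the example.
from math import log2

counts = {'t': 5, ' ': 5, 'a': 3, 'h': 2, 'e': 2,
          'c': 1, 's': 1, 'o': 1, 'n': 1, 'm': 1}
lengths = {'t': 2, ' ': 2, 'a': 3, 'h': 4, 'e': 4,
           'c': 4, 's': 5, 'o': 5, 'n': 4, 'm': 4}
total = sum(counts.values())   # 22 symbols in all

for sym in counts:
    p = counts[sym] / total
    # probability, ideal length, assigned length, implied probability 2^(-l)
    print(repr(sym), round(p, 5), round(log2(1 / p), 3),
          lengths[sym], 2.0 ** -lengths[sym])
```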

The third point to observe is that if you poured water at 1 litre per second into the root of the tree in Fig.3.6, and the water splits up at the first node so that half goes each way, and similarly at every subsequent place where the pipe bifurcates, then the water coming out at the leaf labelled t emerges at $2^{-2} = 1/4$ litres per second, and if x is a leaf, then the amount of water coming out at x is $2^{-l(x)}$ litres per second. Now if you add up the amount of water coming out of all the leaves, equivalent to adding up all the numbers in the last column of the above table, you get the amount of water going in, which is 1. It is clear that you couldn't have more water coming out than going in, but the above counting neglects the possibility that we might have some code words left over with no symbols assigned to them. We therefore deduce the rather trivial result:

$$\sum_{x} 2^{-l(x)} \leq 1$$
It is easy to see that there is a generalisation to the case where we use more than two symbols in place of our binary code. We leave it to the ingenuity of the reader to write it down. This result is the well known Kraft Inequality. It is plain that any such code for the alphabet must satisfy the Kraft Inequality, and this sometimes comes in useful.
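For the code in the example the Kraft sum is easily checked; a sketch, with the codewords copied from the table above:

```python
# The Kraft sum for the example code: it comes out at exactly 1, so this
# particular code uses up all of the available codeword space and the
# inequality holds with equality.
CODEWORDS = ['00', '01', '111', '1000', '1001',
             '1010', '10110', '10111', '1100', '1101']
kraft_sum = sum(2.0 ** -len(w) for w in CODEWORDS)
print(kraft_sum)   # 1.0, satisfying sum over x of 2^(-l(x)) <= 1
```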
A fourth point to note is that if the model, which asserts something about the probability of getting the symbol x, is wrong, then it is easy to see that the coding based upon the model will be less effective at compressing the data stream than if the model were right. This follows because we can compare the length of the coded data stream using the actual frequencies, say q(x), when we get an average coded length per symbol of

$$\sum_{x} q(x) \log_2 \frac{1}{p(x)} \;\geq\; \sum_{x} q(x) \log_2 \frac{1}{q(x)},$$

the right hand side being the average length we would have obtained had we coded with the true frequencies, with equality precisely when p and q agree.
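A small numerical illustration, with two made-up three-symbol distributions invented purely for the purpose:

```python
# Coding to lengths log2(1/p(x)) while the symbols actually arrive with
# frequencies q(x) costs more per symbol than coding with the right model.
from math import log2

q = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # actual frequencies (assumed for the example)
p = {'a': 0.25, 'b': 0.25, 'c': 0.5}   # the (wrong) model

cost_with_wrong_model = sum(q[x] * log2(1 / p[x]) for x in q)
cost_with_right_model = sum(q[x] * log2(1 / q[x]) for x in q)
print(cost_with_wrong_model)   # 1.75 bits per symbol
print(cost_with_right_model)   # 1.5 bits per symbol
```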
A fifth point to make is the rather obvious one that if we can predict the rest of the string past some point with absolute certainty, then the rest of the string is not informative to anyone who knows our prediction system. All we need to do is to send our prediction system and that suffices to allow the receiver to reconstruct the rest of the sequence. This, of course, is the idea behind sending an algorithm for computing $\pi$ instead of the first ten thousand digits. Our probabilistic model assigns the rest of the string probability 1, and the length of the compressed string falls to zero. We have to send only the overhead.
Summarising, compression arises from redundancy in the string, which allows us to model the string and the different frequencies of its components. The more we know about the string, the shorter the resulting coded version. The belief that there is some intrinsic amount of redundancy in the string is simple minded; what we have is the possibility of better and better models. The better your model, the more compression. We can extract redundancy from the data, relative to the model, by a coding scheme. The difference

$$L \log_2 M \;-\; \sum_{i=1}^{L} \log_2 \frac{1}{p(x_i)}$$

measures the amount of information extracted from a stream of symbols $x_1 x_2 \ldots x_L$ of length L from the alphabet of cardinality M by the model which gives the probabilities over the alphabet. In the case where the model fits the data perfectly, with probabilities being 1 or 0, this is all of the information there is.
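For the running example, taking the model to be the single-letter frequencies of the string itself (my choice, purely for illustration), this difference is easily computed:

```python
# Raw cost of the stream at log2(M) bits per symbol, minus the total
# surprise of the stream under the single-letter model.
from math import log2
from collections import Counter

msg = "the cat sat on the mat"
counts = Counter(msg)
M, L = len(counts), len(msg)                     # M = 10 symbols, L = 22

p = {x: n / L for x, n in counts.items()}
raw_bits = L * log2(M)                           # about 73.1 bits
model_bits = sum(log2(1 / p[x]) for x in msg)    # about 66.1 bits
print(raw_bits - model_bits)                     # roughly 7 bits extracted
```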
In practice, in order to use any such scheme, we have to transmit the model itself, and this overhead may make it simpler to transmit the data uncompressed instead. For large amounts of data, however, compression would seem to be a good investment.
This gives us an unambiguous sense of when one model is better than another: we can measure the amounts of compression we get on the data which the model is supposed to be modelling. The one which gives higher compression has been extracting more information (that being the only sensible way to measure information) from the data. In making this comparison, we may have to consider the overhead of the models. If the models have the same number of parameters, then the overheads are the same and may be ignored; but if the model is complex, the overhead may be unacceptable, and this may be determined unambiguously by measuring the total compression, taking the overhead into account.
You will be, I hope, familiar with the practicalities of Information Theory, or The Mathematical Theory of Communication, as Shannon prefers to call it. In order to think about what is going on, as opposed to simply knowing how to do the sums, it is useful to think of the basics as follows.
First one argues that information is passed in messages which consist of temporal sequences of symbols. We may as well take it that these symbols come from some finite alphabet, because if they came from an infinite alphabet, either we should have to wait an infinite length of time to get just one of them or we should have some practical limit on our ability to distinguish when two of them are different. So we model the whole business of communication as an exchange of symbol strings from some finite alphabet. Now the sender and receiver of these strings, not necessarily different people, will have, at any point in the string, some kind of expectation about the possibilities of what symbol can come next, which we propose to model as a probability distribution over the alphabet. This probability distribution will, of course, change as we go down the string of symbols.
Now at any point, for any symbol in the alphabet, I shall take the probability of the symbol and take its reciprocal and call it the improbability. Then I take the logarithm of this improbability and call it the surprise of the symbol turning up at this location. This seems a reasonable name, since if the probability is 1, so it is certain, then the improbability is also 1 and the logarithm is zero, so I shall get zero surprise. If the probability is 0 so that I firmly believe it can't possibly happen, then I get an infinite amount of surprise, which serves me right. Not so much a surprise as a paralysing shock. And if the probability is 0.5, then the improbability is 2, and if I take logarithms to the base 2, I get one unit of surprise. Such a unit is called a bit. Since probabilities for independent events multiply, so do improbabilities, and surprises simply sum. So if I get one bit of surprise from one throw of a coin (as I will if it is a fair coin, or if I believe it is) and then I throw it again and get another bit of surprise, I get altogether two bits of surprise.
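In code, the bookkeeping of the last paragraph is trivial; a sketch:

```python
# Surprise is log2(1/p), and surprises from independent events add.
from math import log2

def surprise_bits(p):
    return log2(1 / p)

print(surprise_bits(1.0))    # 0.0 (certainty carries no surprise)
print(surprise_bits(0.5))    # 1.0 (one bit: a fair coin toss)
print(surprise_bits(0.5) + surprise_bits(0.5))   # 2.0 (two tosses, two bits)
```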
Note that this is the surprise produced by the data relative to the model. It doesn't make sense to talk of the amount of surprise in the data per se. If you and I have different models, we might experience quite different amounts of surprise at different parts of the string of symbols.
The average amount of surprise I expect is easily calculated: the symbol x will occur with frequency p(x) (which will be conditional on all sorts of things) and each time it occurs I get $\log_2(1/p(x))$ bits of surprise, so the total is the sum of this over all symbols in the alphabet, weighted by the relative frequency of occurrence. This gives the well known formula for the entropy of the probability distribution:

$$H(p) = \sum_{x} p(x) \log_2 \frac{1}{p(x)}$$
This is clearly a property of the model and has nothing at all to do with the data. It is, of course, known as the entropy of the model. I can also compute, by summing the amount of surprise as I go along, the total amount of surprise I get from any given message, and the average surprise per symbol requires that I divide this by the number of symbols in the message. If the model is a good model of the message, then these two numbers ought to be fairly close. Given two models, I can measure the extent to which they expect, on average, the same symbols with similar probabilities, and thus get a measure of how different the models are, the Kullback-Leibler measure. What I call the `surprise' is more commonly called the information associated with the symbol. I guess it sounds classier.
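Both quantities are easily computed for the running example; a sketch, noting that because the single-letter model was fitted to this very message, the two numbers coincide here, at about 3.0 bits per symbol, whereas for a fresh message they would generally differ.

```python
# Entropy of the single-letter model versus average surprise per symbol
# of the message under that same model.
from math import log2
from collections import Counter

msg = "the cat sat on the mat"
counts = Counter(msg)
L = len(msg)
p = {x: n / L for x, n in counts.items()}

entropy = sum(p[x] * log2(1 / p[x]) for x in p)       # property of the model
avg_surprise = sum(log2(1 / p[x]) for x in msg) / L   # property of the message
print(round(entropy, 3), round(avg_surprise, 3))      # both about 3.006
```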
Most of this may be found, in improved and extended form, in the book by Shannon and Weaver, which should be on everybody's bookcase. What isn't may be found in Elements of Information Theory by Cover and Thomas, listed in the bibliography.
In conclusion, a thought for the day.
Brains may be regarded as devices for figuring out what is likely to happen next, so as to enable the owner of the brain to improve his chances of survival. A brain has to compress the experience of the owner's life so far, and possibly also some of the experience of his ancestors. The amount of data is huge, and compressing it is essential. Happily, the better the brain is at extracting information from that experience, the more compression it achieves. So a device for figuring out what is likely to happen next will have the best chance of surviving in a hostile world when it is best at extracting information from, and hence at compressing, its experience.