Suppose we run a local predictor of order k over a stream sample and consider at each point in the stream the probability distribution, or more generally the pdf, over the alphabet. One important thing we can say about the predictor is what its entropy is at each point of the stream. Another thing we can do is to compute the amount of `surprise' which the actual symbol produces at each point of the stream. We can also compute the mean entropy of the stream sample relative to the predictor, by summing the surprise over the entire stream sample and dividing by the number of symbols, and we can likewise compute the mean entropy of the predictor. If the predictor is a good model for the stream, these will be about the same. If one predictor is better than another, we can tell by comparing the mean information (surprise) supplied by the stream to each predictor and picking the smaller number. God, who is omniscient, is never surprised by what happens, which must make for unspeakable boredom.
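To make this concrete, here is a minimal sketch in Python, assuming the predictor is obtained simply by counting k-grams in the stream sample itself; the function names, the toy stream and the choice k = 3 are illustrative only, not part of any particular implementation.

    from collections import Counter, defaultdict
    from math import log2

    def order_k_predictor(stream, k):
        """P(next symbol | preceding (k-1)-gram), estimated by counting k-grams."""
        counts = defaultdict(Counter)
        for i in range(k - 1, len(stream)):
            counts[stream[i - (k - 1):i]][stream[i]] += 1
        dists = {}
        for ctx, c in counts.items():
            total = sum(c.values())
            dists[ctx] = {s: n / total for s, n in c.items()}
        return dists

    def entropy(dist):
        """Entropy in bits of the predictor's distribution at one point of the stream."""
        return -sum(p * log2(p) for p in dist.values())

    def surprise(dist, symbol):
        """Information in bits supplied by the symbol that actually occurs."""
        return -log2(dist[symbol])

    stream = "the cat sat on the mat and the rat sat on the cat "
    k = 3
    pred = order_k_predictor(stream, k)

    entropies, surprises = [], []
    for i in range(k - 1, len(stream)):
        dist = pred[stream[i - (k - 1):i]]
        entropies.append(entropy(dist))              # uncertainty before seeing the symbol
        surprises.append(surprise(dist, stream[i]))  # information the symbol supplies

    mean_entropy = sum(entropies) / len(entropies)
    mean_surprise = sum(surprises) / len(surprises)
    # Essentially equal here (up to rounding), because the predictor was counted
    # from this same stream sample; a worse predictor for the stream would show
    # a larger mean surprise.
    print(mean_entropy, mean_surprise)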
It was observed by Shannon that as we run along English text, the entropy of a local predictor of order k, for k > 1, decreases inside a word until it comes to the white space. If I give you the start of a word as a (k-1)gram, the number of possible next symbols is large and the probability distribution is broad. If you get a (k-1)gram a letter or two further into the word, the number of next symbols is somewhat less and their probabilities somewhat higher, so the entropy is less. Further in again there are very few possible next symbols in English, and m, p are much more likely than the others. At comp the next symbol is a vowel and the entropy is very low. At ring the next symbol is, with high probability, either white space, comma, s or e. And once past the white space it is virtually impossible to say what the next symbol will be, and the entropy has shot up again. So even using 5grams to obtain a local predictor, it is easy to find separators without prior knowledge of which symbols have been selected for that purpose. Alternatively, if someone were to go through text and remove all the white spaces, it would be possible to go through and do a reasonably good job of putting them back in again. Or, to put it in a third form, we can use a predictor to go through text and chunk it into words.
Definition 14492
Using a predictor to separate a stream of symbols into consecutive strings as described above is called entropic chunking, and the strings so found are known as chunks.
Once chunks have been found by entropic chunking, the stream of symbols may be reconstructed to become a stream of chunks.
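Continuing the sketch above, entropic chunking might be realised as follows; taking a local maximum of the running entropy as the mark of a separator is one plausible criterion, the text leaving the exact test open.

    def entropic_chunks(stream, pred, k):
        """Cut the stream wherever the predictor's entropy has a local maximum."""
        ctx = k - 1
        # h[j] is the entropy of the distribution over the symbol at position j + ctx
        h = [entropy(pred[stream[i - ctx:i]]) for i in range(ctx, len(stream))]
        chunks, start = [], 0
        for j in range(1, len(h) - 1):
            if h[j - 1] < h[j] >= h[j + 1]:   # entropy has shot up: likely separator
                cut = j + ctx                 # boundary position in the original stream
                chunks.append(stream[start:cut])
                start = cut
        chunks.append(stream[start:])
        return chunks

    print(entropic_chunks(stream, pred, k))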
Definition 14493
The process of mapping a stream of symbols into a stream of chunks is called UpWriting; we say that the stream of chunks is at a higher level than the original stream of symbols. Recovering the stream of symbols from the stream of chunks is called DownWriting. UpWriting entails compiling a dictionary of chunks and assigning a new symbol to each chunk. DownWriting entails looking up the dictionary and replacing each chunk with the corresponding string of symbols.
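Continuing the same sketch, UpWriting and DownWriting could look like this; assigning consecutive integers as the new chunk symbols is an assumption made purely for concreteness.

    def up_write(chunks):
        """Map a stream of chunks to a higher-level stream, building the dictionary."""
        dictionary = {}                           # chunk -> new symbol
        higher = []
        for c in chunks:
            if c not in dictionary:
                dictionary[c] = len(dictionary)   # assign a fresh symbol to each new chunk
            higher.append(dictionary[c])
        return higher, dictionary

    def down_write(higher, dictionary):
        """Recover the stream of symbols by looking each chunk symbol up in the dictionary."""
        inverse = {sym: chunk for chunk, sym in dictionary.items()}
        return "".join(inverse[sym] for sym in higher)

    chunks = entropic_chunks(stream, pred, k)
    higher, dictionary = up_write(chunks)
    assert down_write(higher, dictionary) == stream   # DownWriting undoes UpWriting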
It must be conceded that the practicalities may be considerable; finding maxima in the entropy may give not words of English but stems and inflections. Thus the s at the end of a word used to denote plurality may be a separate chunk, and similarly with ed to denote tense. Inflected languages simply haven't evolved to the consistency of having a white space used to denote a chunk end.
Experiments make it convincing that the method of entropic chunking by a local predictor works sufficiently well to be applied, inter alia, to machine language programs, and that it discovers the obvious structure of operators and addresses.
As well as chunking by entropy, we can chunk by the amount of information supplied by a character, or the surprise, as I have called it earlier. This will, by definition, give the same average result as chunking by entropy if our predictor has been obtained by counting occurrences in the text itself, but it can give different results on different texts.
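The corresponding variant of the chunking sketch above, with the surprise of the actual symbol in place of the entropy of the distribution; the local-maximum test is again only one plausible choice.

    def surprise_chunks(stream, pred, k):
        """Cut the stream wherever the surprise of the actual symbol has a local maximum."""
        ctx = k - 1
        s = [surprise(pred[stream[i - ctx:i]], stream[i]) for i in range(ctx, len(stream))]
        chunks, start = [], 0
        for j in range(1, len(s) - 1):
            if s[j - 1] < s[j] >= s[j + 1]:   # the symbol that occurred was most surprising here
                chunks.append(stream[start:j + ctx])
                start = j + ctx
        chunks.append(stream[start:])
        return chunks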
The process of UpWriting yields compression; it is easy to see that knowing the language in which something is written gives a large amount of redundancy to be extracted and that this dictionary method will accomplish something of that extraction. It is also easy to see that there may be a language, or more exactly a stream, for which chunking doesn't work. Let it be noted that there do exist streams for which it does work, and that natural language texts are among them.
The process of UpWriting a stream into a higher level stream may be iterated. In the case of natural language streams, this yields a progressive decomposition into larger and larger chunks, as has already been described. At each stage there is some compression of the original stream, and each stage can be implemented in the same way by a local predictor of fairly low order. It is easy to persuade oneself that one attains greater compression this way than by having a single high order predictor, at least for some kinds of stream. Let us call such streams hierarchical streams.
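A sketch of the iteration, continuing the code above; representing each level as a tuple of symbols is an assumption made so that the same counting predictor can be reused at every level, and on a toy sample the higher levels will of course degenerate quickly.

    def iterated_up_write(stream, k, levels):
        """Repeatedly chunk the current stream and UpWrite it to the next level."""
        current, history = tuple(stream), []
        for _ in range(levels):
            pred = order_k_predictor(current, k)        # a fresh low order predictor per level
            chunks = entropic_chunks(current, pred, k)
            higher, dictionary = up_write(chunks)
            current = tuple(higher)
            history.append((current, dictionary))       # each level's dictionary records how to expand its chunks
        return history

    history = iterated_up_write(stream, k, levels=2)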
Formally:
Definition 14496
A stream is said to be hierarchical if it allows compression by a sequence of UpWrites by low order predictors.
Since it is apparent that much human information processing involves a hierarchy of structures, hierarchical streams are natural things to study in that they are relatively simple cases of the decomposition process that, presumably, extends to learning to drive a car. The existence of clichés tells us that English is at least partly hierarchical to two levels, but plainly there is more going on than that.