
Chunking by Entropy

Suppose we run a local predictor of order k over a stream sample and consider, at each point in the stream, the probability distribution (or more generally the pdf) which it assigns over the alphabet. One of the important things we can say about the predictor is what its entropy is at each point of the stream. Another thing we can do is to compute the amount of `surprise' which the actual symbol produces at each point of the stream. We can also compute the mean entropy of the stream sample relative to the predictor by summing the surprise over the entire stream sample and dividing by the number of symbols, and we can also compute the mean entropy of the predictor. If the predictor is a good model for the stream, these will be about the same. If one predictor is better than another, we can tell by comparing the mean information (surprise) supplied by the stream to each predictor, and picking the smaller number. God, who is omniscient, is never surprised by what happens, which must make for unspeakable boredom.
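For concreteness, here is a minimal sketch in Python (my illustration, not code from the text) of such a predictor obtained by counting (k-1)-gram contexts in the sample itself, together with the entropy and surprise at each point and their means. The function names and the absence of any smoothing are choices made for the sketch; since the counts come from the same sample that is being scanned, every symbol that actually occurs has a nonzero count and the surprise is always finite.

from collections import Counter, defaultdict
from math import log2

def build_predictor(sample, k):
    """Order-k predictor: count, for each (k-1)-gram context, how often
    each symbol follows it in the sample."""
    counts = defaultdict(Counter)
    for i in range(len(sample) - k + 1):
        context = tuple(sample[i:i + k - 1])
        counts[context][sample[i + k - 1]] += 1
    return counts

def entropy(dist):
    """Entropy, in bits, of a Counter of next-symbol frequencies."""
    total = sum(dist.values())
    return -sum((c / total) * log2(c / total) for c in dist.values())

def surprise(dist, symbol):
    """Information, in bits, supplied by the symbol that actually occurred.
    Assumes the counts were made from the same sample, so the count of the
    actual symbol is never zero."""
    total = sum(dist.values())
    return -log2(dist[symbol] / total)

def mean_entropy_and_surprise(sample, k):
    """Mean entropy of the predictor and mean surprise of the sample
    relative to it; for a good model the two come out about the same."""
    counts = build_predictor(sample, k)
    h_values, s_values = [], []
    for i in range(k - 1, len(sample)):
        dist = counts[tuple(sample[i - k + 1:i])]
        h_values.append(entropy(dist))
        s_values.append(surprise(dist, sample[i]))
    return sum(h_values) / len(h_values), sum(s_values) / len(s_values)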

It was observed by Shannon that as we run along English text, the entropy of a local predictor of order k, for k > 1, decreases inside a word until it comes to the white space. If I give you $by\_c$ as a (k-1)-gram, the number of possible next symbols is large and the probability distribution is broad. If you get $y\_co$ as the (k-1)-gram, the number of possible next symbols is somewhat smaller and their probabilities somewhat higher, so the entropy is less. At $\_com$ there are very few possible next symbols in English, and m and p are much more likely than the others. At $comp$ the next symbol is a vowel and the entropy is very low. At $ring$ the next symbol is white space, comma, s or e with high probability. And at $ing\_$ it is virtually impossible to say what the next symbol will be, and the entropy has shot up again. So even using 5-grams to obtain a local predictor, it is easy to find separators without prior knowledge of which symbols have been selected for that purpose. Alternatively, if someone were to go through text and remove all the white spaces, it would be possible to go through and do a reasonably good job of putting them back in again. Or to put it in a third form, we can use a predictor to go through text and chunk it into words.

Definition 14492

Using a predictor to separate a stream of symbols into consecutive strings as described above is called entropic chunking, and the strings so found are known as chunks.

Once chunks have been found by entropic chunking, the stream of symbols may be reconstructed to become a stream of chunks.
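A sketch of how this might be implemented (my own illustration, reusing build_predictor and entropy from the sketch above): run the predictor along the stream and start a new chunk wherever the entropy of the prediction jumps up sharply. The order k = 5 and the threshold of one bit are arbitrary choices, and the chunks are returned as tuples of symbols so that the same code can be reused on higher level streams later.

def entropic_chunks(sample, k=5, threshold=1.0):
    """Split a stream of symbols into chunks, starting a new chunk
    wherever the entropy of the prediction jumps up sharply."""
    counts = build_predictor(sample, k)
    chunks, current, prev_h = [], list(sample[:k - 1]), None
    for i in range(k - 1, len(sample)):
        h = entropy(counts[tuple(sample[i - k + 1:i])])
        # A rise in entropy of more than the threshold (an arbitrary
        # choice here) is taken to mark the end of the previous chunk.
        if prev_h is not None and h - prev_h > threshold:
            chunks.append(tuple(current))
            current = []
        current.append(sample[i])
        prev_h = h
    chunks.append(tuple(current))
    return chunks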

Definition 14493

The process of mapping a stream of symbols into a stream of chunks is called UpWriting; we say that the stream of chunks is at a higher level than the original stream of symbols. Recovering the stream of symbols from the stream of chunks is called DownWriting. UpWriting entails compiling a dictionary of chunks and assigning a new symbol to each chunk. DownWriting entails looking up the dictionary and replacing each chunk with the corresponding string of symbols.
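A minimal sketch of the dictionary bookkeeping (again mine, not the text's): the new symbols are simply integers handed out in order of first appearance, the chunks are those produced by the chunker above, and DownWriting is the inverse lookup.

def upwrite(chunks):
    """Compile a dictionary of chunks, assign each distinct chunk a new
    symbol (here just an integer), and return the higher level stream."""
    dictionary, stream = {}, []
    for chunk in chunks:
        if chunk not in dictionary:
            dictionary[chunk] = len(dictionary)
        stream.append(dictionary[chunk])
    return stream, dictionary

def downwrite(stream, dictionary):
    """Look up each chunk symbol and replace it by its string of lower
    level symbols."""
    inverse = {symbol: chunk for chunk, symbol in dictionary.items()}
    lower = []
    for symbol in stream:
        lower.extend(inverse[symbol])
    return lower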

It must be conceded that the practical difficulties may be considerable; finding maxima in the entropy may give not words of English but stems and inflections. Thus the s at the end of a word used to denote plurality may be a separate chunk, and similarly with ed to denote tense. Inflected languages simply haven't evolved to the consistency of having a white space used to denote a chunk end.

Experiments make a convincing case that the method of entropic chunking by a local predictor works sufficiently well to be applied to machine language programs, inter alia, and that it discovers the obvious structure of operators and addresses.

As well as chunking by entropy, we can chunk by the amount of information supplied by a character, or the surprise as I have called it earlier. This will, by definition, give the same average result as chunking by entropy if our predictor has been obtained by counting occurrences in the text itself, but it can give different results on different texts.
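The change to the chunker sketched earlier is small: the boundary test watches the surprise of the symbol that actually occurred rather than the entropy of the predicted distribution. Again this is my illustration, reusing build_predictor and surprise from above, with the same arbitrary threshold.

def surprise_chunks(sample, k=5, threshold=1.0):
    """As entropic_chunks, except that the boundary test watches the
    surprise of the symbol that actually occurred rather than the
    entropy of the predicted distribution."""
    counts = build_predictor(sample, k)
    chunks, current, prev_s = [], list(sample[:k - 1]), None
    for i in range(k - 1, len(sample)):
        s = surprise(counts[tuple(sample[i - k + 1:i])], sample[i])
        if prev_s is not None and s - prev_s > threshold:
            chunks.append(tuple(current))
            current = []
        current.append(sample[i])
        prev_s = s
    chunks.append(tuple(current))
    return chunks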

The process of UpWriting yields compression; it is easy to see that knowing the language in which something is written gives a large amount of redundancy to be extracted and that this dictionary method will accomplish something of that extraction. It is also easy to see that there may be a language, or more exactly a stream, for which chunking doesn't work. Let it be noted that there do exist streams for which it does work, and that natural language texts are among them.

The process of UpWriting a stream into a higher level stream may be iterated. In the case of natural language streams, this yields a progressive decomposition into larger and larger chunks, as has already been described. Each stage gives some compression of the original stream, and each stage can be implemented in the same way by a local predictor of fairly low order. It is easy to persuade oneself that, at least for some kinds of stream, this attains greater compression than a single high order predictor would. Let us call such streams hierarchical streams.

Formally:

Definition 14496

A stream is said to be hierarchical if it allows compression by a sequence of UpWrites by low order predictors.
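Putting the earlier sketches together, the iteration itself is short (mine, with an arbitrary choice of order and number of levels); one crude way of judging whether a stream is hierarchical in this sense is to see whether each level, stream plus dictionary, comes out smaller than the one before.

def upwrite_repeatedly(stream, k=5, levels=3):
    """Iterate chunk-then-UpWrite, using a low order predictor at every
    level; each level's stream of chunk symbols is chunked again at the
    next level."""
    stream, dictionaries = list(stream), []
    for _ in range(levels):
        chunks = entropic_chunks(stream, k)
        stream, dictionary = upwrite(chunks)
        dictionaries.append(dictionary)
    return stream, dictionaries

# Example of use, assuming `text` is a stream sample:
#   top, dicts = upwrite_repeatedly(text, k=5, levels=3)
#   print(len(text), [len(d) for d in dicts], len(top))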

Since it is apparent that much human information processing involves a hierarchy of structures, hierarchical streams are natural things to study in that they are relatively simple cases of the decomposition process that, presumably, extends to learning to drive a car. The existence of clichés tells us that English is at least partly hierarchical to two levels, but plainly there is more going on than that.

