
Stochastic Equivalence

Consider the natural language English with symbols at the word level. Take a stream, any stream, and consider such sentences as `The cat sat on the mat'. Regarding this as a kgram for k=6, suppose we collect all other such 6grams which agree with the given one except in the second symbol. Thus `The dog sat on the mat' will be in the list. We have something very like the local predictor of order 6, except that we take account of context on both sides. We will get a similar object out: a probability distribution over the alphabet (in this case the dictionary) of symbols. Of course, I didn't have to pick the second place as the `variable'; it could have been any location.

Definition

A context of order k for a stream ${\cal S}$ over an alphabet A is a string of length k from the stream together with a designated place between 1 and k whose symbol is replaced with a placeholder symbol. The symbols of the alphabet which occur at the placeholder when all the other symbols of the context are matched, together with their relative frequencies, define the context distribution for the stream and the context.
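
As a minimal sketch of how such a context distribution might be counted up from a stream sample (the function name, the zero-based indexing of the designated place, and the toy stream are merely illustrative):

\begin{verbatim}
from collections import Counter

def context_distribution(stream, context, hole):
    """Estimate a context distribution: `context' is a string of k symbols,
    `hole' the designated place (counted from 0 here, rather than from 1 as
    in the definition).  Every window of the stream which matches the context
    at all the other places contributes one count to the symbol at the hole."""
    k = len(context)
    counts = Counter()
    for i in range(len(stream) - k + 1):
        window = stream[i:i + k]
        if all(window[j] == context[j] for j in range(k) if j != hole):
            counts[window[hole]] += 1
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()} if total else {}

# Word-level toy stream in the spirit of the text:
stream = "the cat sat on the mat the dog sat on the mat".split()
print(context_distribution(stream, ("the", "-?-", "sat", "on", "the", "mat"), hole=1))
# {'cat': 0.5, 'dog': 0.5}
\end{verbatim}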

Suppose I have two context distributions arising from different contexts. Then it might happen that they give rise to distributions which are pretty much the same; as distributions, they may not differ by more than I would expect from sampling variation alone. There are various measures of the difference between distributions, from $\chi^2$ to Kullback-Leibler via a straight Euclidean metric, and I shall not go into any of them here. It suffices to observe that it makes sense to decide that some context distributions are not distinguishable from others. It is clear that this will happen in English; a cat and a dog are both things that can sit on mats, but they can do quite a lot of other things in common too. So the context distribution for `The -?- sat on the mat' is probably not discriminable from that for `She played with the young -?-' for most stream samples. If we have a distribution arising from a context or a local predictor (a particular sort of context with the placeholder at the right hand end) which is not well represented because the context is a rare one, this observation could be useful: we could get a better estimate of the true distribution by pooling with a distribution obtained from another context. This is fraught with risk, of course, but there is a risk involved in having too few data to make a reliable estimate. Life is pretty fraught all round.
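
As a sketch of one of the possible measures, here is the Kullback-Leibler divergence with a little smoothing, together with the obvious pooling operation (the smoothing constant and the decision to weight by sample counts are merely one plausible choice among many):

\begin{verbatim}
import math

def kl_divergence(p, q, alphabet, eps=1e-9):
    """D(p || q) over a shared alphabet; both distributions are smoothed a
    little so that a symbol seen in one context but not the other does not
    give an infinite divergence.  Symmetrise (add D(q || p)) if preferred."""
    return sum((p.get(a, 0.0) + eps) * math.log((p.get(a, 0.0) + eps) /
                                                (q.get(a, 0.0) + eps))
               for a in alphabet)

def pool(p, q, n_p, n_q):
    """Pool two context distributions judged indistinguishable, weighting
    each by the number of samples it was estimated from."""
    alphabet = set(p) | set(q)
    return {a: (n_p * p.get(a, 0.0) + n_q * q.get(a, 0.0)) / (n_p + n_q)
            for a in alphabet}
\end{verbatim}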

At the letter level in English text, there are some very distinctive distributions which may be found by examining different contexts: separators and punctuation symbols, for example, and digits also form a cluster in the space of probability distributions over the alphabet of symbols. (This space looks exactly like a piece of $\mathbb{R}^n$, where n is the size of the alphabet.) Vowels form a distinguishable cluster in this space, with a consistent ratio of counts in a variety of different contexts. And some reflection suggests that this feature of natural language is not confined to English, nor is it confined to the letter or word level.

Any such regularity as this can be exploited in a grammar (regarded as a data compression device for specifying streams) or language model. Suppose I were to assign labels to the distributions. If there were too many of them and no grounds for pooling them, then I could have as many distributions as contexts, which is many more symbols than my alphabet originally contained. If, on the other hand, there are very few, it might be desirable to do a new sort of UpWrite to the names of the distributions. The DownWrite from this would require making some choice of which symbol from the distribution to select. Instead of storing the symbol names and assigning them equal length, we would do well to send the distribution and a derived coding for the symbols, exploiting the fact that commonly occurring ones should have short names. This could be a useful feature of any device which had to remember lots of symbol strings. A stream or language might have the property that it would be more economical to code clusters and places within the clusters than just to send the stream; whether it is would depend on the extent to which clustering in the space of context distributions occurred.
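
Here is a minimal sketch of such a derived coding, using the standard Huffman construction on a toy distribution (the distribution itself is invented purely for illustration):

\begin{verbatim}
import heapq

def huffman_code(dist):
    """Build a prefix code from a probability distribution, so that the
    commonly occurring symbols get the short names."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(dist.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

dist = {"cat": 0.5, "dog": 0.25, "child": 0.125, "goat": 0.125}
code = huffman_code(dist)
print(code)                                            # e.g. cat -> 0, dog -> 10, ...
print(sum(p * len(code[s]) for s, p in dist.items()))  # 1.75 bits, against 2 for equal-length names
\end{verbatim}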

Example

The stream is obtained using 16 symbols a through to p. There are auxiliary symbols A, B, C, D which are `bags'. A is the bag containing a, b, c and d, and so on down to D, which contains m, n, o, p. The bags are selected cyclically, A, B, C, D, A, B, C, D, ... until 1024 have been listed altogether. Each time a bag is selected, one of the symbols inside the bag is picked with equal probability one quarter. So something like aeimbfjnc... comes out of the process. Now it takes 4 bits to specify one of the 16 symbols, and there are 1024 of them, so a straight listing of the stream sample would occupy 4 Kbits. But this does not exploit the fact that the bag structure is cyclic. If we have four bags, A, B, C, D, then we can code each symbol by coding the bag it comes from with 2 bits, and the symbol within the bag by another 2 bits. Now the cyclic bag structure allows us to specify the sequence of bag codes in much less than 2 Kbits: we need only code the information that the cycle ABCD recurs 256 times. A thousand bits or so would seem more than adequate for this. So we save at least 1 Kbit by using the structure of the stream.
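
A quick sketch of the process and the arithmetic (the thousand bits allowed for describing the cycle is taken straight from the text, not computed):

\begin{verbatim}
import random

bags = {"A": "abcd", "B": "efgh", "C": "ijkl", "D": "mnop"}
n = 1024

# Bags chosen cyclically, one symbol drawn uniformly from each chosen bag.
stream = [random.choice(bags["ABCD"[i % 4]]) for i in range(n)]

naive_bits = 4 * n          # 4 bits per symbol, structure ignored: 4096
within_bag_bits = 2 * n     # 2 bits pick the symbol inside its bag: 2048
cycle_bits = 1000           # generous allowance for `ABCD repeated 256 times'
print(naive_bits, within_bag_bits + cycle_bits)   # 4096 against roughly 3048
\end{verbatim}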

The discerning and reflective reader will note that this may be thought of as a Hidden Markov Model for the language. The `bags' are the hidden states, and the state diagram is particularly simple.
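
To make the Hidden Markov Model reading explicit (the formalisation below is the obvious one, with the states taken in the order A, B, C, D): the transitions are deterministic and cyclic, and each state emits the four symbols in its bag with equal probability,

\begin{displaymath}
T \;=\; \left( \begin{array}{cccc}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{array} \right),
\qquad
P(x \mid A) = \textstyle\frac{1}{4} \ \mbox{for } x \in \{a,b,c,d\},
\end{displaymath}

and similarly for B, C and D.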

Definition

A stream is said to have a stochastic equivalence structure when it is the case that the context distributions for some order k of context form clusters in the space of probability distributions over the alphabet. If two symbols a and b occur with similar probabilities in a significant number of distributions which are all close in the space of all distributions over the alphabet, we may say that a and b are stochastically equivalent.

It is clear, just by stopping and thinking for a while, that English at the letter level has clusters for digits, for punctuation and for vowels. It is easy to confirm, by computation on English text, that for k = 4 these clusters actually exist. It is rather more expensive to compute the distributions for English at the word level, particularly for large k, but it is possible for k = 3, and again one's intuitions are borne out: synonyms and also antonyms are discovered sitting in the same distribution; class name exemplars seem to be implemented as elements of such distributions. For example, the class of domestic animals (including children) constitutes a distribution which occurs in many texts in many different contexts. Buses, taxis, cars, boats and aeroplanes figure as prominent elements of a distribution which again recurs in many stream samples and many contexts. This may not be altogether surprising, but it is hardly a necessary feature of a stream.

The above definition lacks the precision which one might reasonably expect in a book written by a mathematician, even one who has gone downhill into engineering, and it is a matter of some irritation that it is not easy to give a satisfactory definition of what it means for a set of points to be clustered. (A probability distribution over an alphabet of N symbols is, of course, a point in $\mathbb{R}^N$.) There are many different definitions, see any of the books on clustering, and although they usually agree on obvious cases, they may differ on the marginal ones. For present purposes, I suggest the reader thinks of modelling by a mixture of gaussians and uses Rissanen's stochastic complexity criterion to decide on the optimal number of gaussians. If the optimal number is zero or one, then we can say that there is no clustering. If the number is small compared with the number of data points, so that many distributions find themselves packed closely to other distributions, then we obtain a working definition of stochastic equivalence. I do not propose this as any more than an ad hoc heuristic at this point.
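
A sketch of this heuristic, using scikit-learn's mixture-of-gaussians fit and the BIC score as a computable stand-in for Rissanen's stochastic complexity criterion (the function name, the diagonal covariances and the cap on the number of components are all just convenient choices):

\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

def optimal_gaussian_count(distributions, max_components=10):
    """Fit mixtures with 1..max_components gaussians to the context
    distributions (each a point in R^N) and keep the count minimising BIC."""
    X = np.asarray(distributions)
    best = None
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(X)
        score = gmm.bic(X)
        if best is None or score < best[0]:
            best = (score, k, gmm)
    return best[1], best[2]
\end{verbatim}

If the count which comes out is small compared with the number of distributions, many contexts share a cluster, which is the working sign of stochastic equivalence; if it is one, there is no clustering worth speaking of.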

If there is a clustering structure for the context distributions, then we can take all the kgram predictors which have `close' distributions in the last place, and deem them to be `close' also. After all, they ought to be stored together in any sensible prediction scheme precisely because they predict the same sequent. So we can cluster the k-1grams. This gives us the possibility of a new type of UpWrite: a stochastic equivalence class of kgrams gets written to a new symbol.

Definition

An UpWrite may be of two types: the chunking UpWrite assigns a symbol in the UpWrite stream to a substring in the base stream. The clustering UpWrite assigns a symbol in the UpWrite stream to a stochastic equivalence class in the base level stream.
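
As a toy sketch of the clustering UpWrite (the choice of non-overlapping kgrams, the dictionary mapping kgrams to class labels, and the convention of passing unknown kgrams through unchanged are all merely one way of setting it up):

\begin{verbatim}
def clustering_upwrite(stream, class_of, k):
    """Rewrite the base stream by replacing each kgram with the symbol naming
    its stochastic equivalence class.  `class_of' maps kgrams (tuples) to
    class labels, for instance as obtained from the clustering above."""
    upper = []
    for i in range(0, len(stream) - k + 1, k):   # non-overlapping kgrams
        gram = tuple(stream[i:i + k])
        upper.append(class_of.get(gram, gram))
    return upper
\end{verbatim}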

