Consider the natural language English with symbols at the word level. Take a stream, any stream, and consider such sentences as `The cat sat on the mat'. Regarding this as a kgram for k = 6, suppose we collect all other such 6grams which agree with the given one except in the second symbol. Thus `The dog sat on the mat' will be in the list. We have something very like the local predictor of order 6, except that we take account of context on both sides. We will get a similar object out, a probability distribution over the alphabet (in this case dictionary) of symbols. Of course, I didn't have to pick the second place as the `variable'; it could have been any location.
Definition 14505
A context of order k for a stream over an alphabet A is a string of length k from the stream, together with a designated place between 1 and k which is replaced with a placeholder symbol. The symbols of the alphabet, together with the relative frequencies with which they match the placeholder when all the other symbols of the context are matched, define the context distribution for the stream and the context.
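By way of illustration, here is a minimal sketch in Python of how such a context distribution might be estimated from a word-level stream; the toy corpus, the function name and the choice of placeholder position are mine, purely for illustration.

```python
from collections import Counter

def context_distribution(stream, context, hole):
    """Estimate the context distribution for `context` (a tuple of k symbols)
    with the symbol at index `hole` treated as the placeholder.
    `stream` is a list of symbols; returns relative frequencies."""
    k = len(context)
    counts = Counter()
    for i in range(len(stream) - k + 1):
        window = stream[i:i + k]
        # Match every position of the context except the placeholder.
        if all(window[j] == context[j] for j in range(k) if j != hole):
            counts[window[hole]] += 1
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()} if total else {}

# Toy word-level stream; in practice this would be a large text sample.
text = "the cat sat on the mat . the dog sat on the mat . the dog ran on the grass ."
stream = text.split()

# Order-6 context `the -?- sat on the mat' with the second place as the hole.
dist = context_distribution(stream, ("the", None, "sat", "on", "the", "mat"), hole=1)
print(dist)   # e.g. {'cat': 0.5, 'dog': 0.5}
```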
Suppose I have two context distributions arising from different contexts. Then it might happen that they give rise to distributions which are pretty much the same; as distributions, they may not differ by more than I would expect from sampling variation alone. There are various measures of the difference between distributions, from the Kullback-Leibler divergence to a straight euclidean metric, and I shall not go into any of them here. It suffices to observe that it makes sense to decide that some context distributions are not distinguishable from others. It is clear that this will happen in English; a cat and a dog are both things that can sit on mats, but they can do quite a lot of other things in common too. So the context distribution for `The -?- sat on the mat' is probably not discriminable from that for `She played with the young -?-' for most stream samples. If we have a distribution arising from a context or a local predictor (a particular sort of context with the placeholder at the right-hand end) which is not well represented because the context is a rare one, this could be useful: we could get a better estimate of the true distribution by pooling with a distribution obtained from another context. This is fraught with risk, of course, but there is a risk involved in having too few data to make a reliable estimate. Life is pretty fraught all round.
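For the curious, here is a small sketch of the sort of comparison I have in mind, using a euclidean metric and a (smoothed) Kullback-Leibler divergence; the two distributions and the thresholds are invented numbers, and a serious test would take sample sizes into account.

```python
import math

def as_vector(dist, alphabet, eps=1e-9):
    """Turn a {symbol: frequency} dict into a smoothed probability vector
    over a fixed alphabet, so two distributions can be compared directly."""
    v = [dist.get(a, 0.0) + eps for a in alphabet]
    s = sum(v)
    return [x / s for x in v]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Two context distributions over a shared alphabet (made-up numbers).
alphabet = ["cat", "dog", "parrot", "bus"]
d1 = {"cat": 0.5, "dog": 0.4, "parrot": 0.1}      # `The -?- sat on the mat'
d2 = {"cat": 0.45, "dog": 0.45, "parrot": 0.1}    # `She played with the young -?-'
p, q = as_vector(d1, alphabet), as_vector(d2, alphabet)

# Declare the contexts indistinguishable if both measures fall under
# (entirely arbitrary) thresholds.
print(euclidean(p, q) < 0.1 and kl_divergence(p, q) < 0.05)   # True
```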
At the letter level in English text, there are some very distinctive distributions which may be found by examining different contexts: separators and punctuation symbols, for example, and digits also form a cluster in the space of probability distributions over the alphabet of symbols. (Which looks exactly like a piece of $\mathbb{R}^{n-1}$, where n is the size of the alphabet.) Vowels form a distinguishable cluster in this space, with a consistent ratio of counts in a variety of different contexts. And some reflection suggests that this feature of natural language is not confined to English, nor is it confined to the letter or word level.
Any such regularity as this can be exploited in a grammar (regarded as a data compression device for specifying streams) or language model. Suppose I were to assign labels to the distributions. If there were too many of them and no grounds for pooling them, then I could have as many distributions as contexts, which is many more symbols than my alphabet originally contained. If, on the other hand, there were very few, it might be desirable to do a new sort of UpWrite to the names of the distributions. The DownWrite from this would require making some choice of which symbol from the distribution to select. Instead of storing the symbol names and assigning them equal length, we would do well to send the distribution and a derived coding for the symbols, exploiting the fact that commonly occurring ones should have short names. This could be a useful feature of any device which had to remember lots of symbol strings. A stream or language might have the property that it would be more economical to code clusters and places within the clusters than just to send the stream; it would depend on the extent to which clustering in the space of context distributions occurred.
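To make the point about short names concrete, here is a toy calculation of Shannon code lengths for a made-up distribution; the symbols and probabilities are mine, and nothing hangs on the particular numbers.

```python
import math

# A context distribution; frequent symbols should receive short code words.
dist = {"cat": 0.5, "dog": 0.25, "parrot": 0.125, "wombat": 0.125}

# Ideal (Shannon) code lengths in bits: -log2 p, rounded up to an integer.
lengths = {s: math.ceil(-math.log2(p)) for s, p in dist.items()}
print(lengths)        # {'cat': 1, 'dog': 2, 'parrot': 3, 'wombat': 3}

# Expected bits per symbol under this coding, versus 2 bits for equal-length names.
expected = sum(p * lengths[s] for s, p in dist.items())
print(expected)       # 1.75 < 2.0
```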
Example 14524
The stream is obtained using 16 symbols a through to p. There are auxiliary symbols A, B, C, D which are `bags'. A is the bag containing a, b, c and d, and so on down to D, which contains m, n, o, p. The bags are selected cyclically, A, B, C, D, A, B, C, D, ... until 1024 have been listed altogether. Each time a bag is selected, one of the symbols inside the bag is picked with equal probability one quarter. So something like aeimbfjnc... comes out of the process. Now it takes 4 bits to specify one of the 16 symbols, and there are 1024 of them, so a straight listing of the stream sample would occupy 4 Kbits. But this does not exploit the fact that the bag structure is cyclic. If we have four bags, A, B, C, D, then we can code each symbol by coding the bag it comes from with 2 bits, and the symbol in the bag by another 2 bits. Now the cyclic bag structure allows us to specify the sequence of bags in much less than the 2 Kbits that 2 bits per bag would cost: we need only code the information that the cycle ABCD recurs 256 times. A thousand bits or so would seem more than adequate for this. So we save at least 1 Kbit by using the structure of the stream.
The discerning and reflective reader will note that this may be thought of as a Hidden Markov Model for the language. The `bags' are the hidden states, and the state diagram is particularly simple.
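Here is a sketch which reproduces the arithmetic of the example; the generation of the stream and the thousand-bit allowance for the cyclic bag structure follow the text, the rest is bookkeeping.

```python
import random

bags = {"A": "abcd", "B": "efgh", "C": "ijkl", "D": "mnop"}
cycle = "ABCD"

# Generate 1024 symbols: bags chosen cyclically, symbol within each bag uniformly.
stream = [random.choice(bags[cycle[i % 4]]) for i in range(1024)]
print("".join(stream[:12]))       # something like 'aeimbfjnc...'

# Naive coding: 4 bits for each of the 16 symbols.
naive_bits = 1024 * 4                        # 4096 bits = 4 Kbits
# Structured coding: 2 bits per symbol for the choice within its bag, plus
# the information that the cycle ABCD recurs 256 times (a thousand bits is
# a generous allowance, as in the text).
structured_bits = 1024 * 2 + 1000            # 3048 bits
print(naive_bits - structured_bits)          # at least 1 Kbit saved
```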
Definition 14530
A stream is said to have a stochastic equivalence structure when it is the case that the context distributions for some order k of context form clusters in the space of probability distributions over the alphabet. If two symbols a and b occur with similar probabilities in a significant number of distributions which are all close in the space of all distributions over the alphabet, we may say that a and b are stochastically equivalent.
It is clear, just by stopping and thinking for a while, that English at the letter level has clusters for digits, for punctuation and for vowels. It is easy to confirm, by computation on English text, that for k = 4 these clusters actually exist. It is rather more expensive to compute the distributions for English at the word level, particularly for large k, but it is possible for k = 3, and again one's intuitions are borne out: synonyms and also antonyms are discovered sitting in the same distribution; class name exemplars seem to be implemented as elements of such distributions. For example, the class of domestic animals (including children) constitutes a distribution which occurs in many texts in many different contexts. Buses, taxis, cars, boats and aeroplanes figure as prominent elements of a distribution which again recurs in many stream samples and many contexts. This may not be altogether surprising, but it is hardly a necessary feature of a stream.
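A sketch of the kind of computation involved at the letter level for k = 4 follows; the file name is a stand-in for whatever sample of English text is to hand, the placeholder position is my own choice, and the particular contexts printed are merely ones I would expect to be vowel-heavy.

```python
from collections import defaultdict, Counter

def context_distributions(text, k=4, hole=1):
    """Map each order-k context (with place `hole` blanked out) to the
    relative frequencies of the symbols which fill the blank."""
    by_context = defaultdict(Counter)
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        context = window[:hole] + "_" + window[hole + 1:]
        by_context[context][window[hole]] += 1
    dists = {}
    for ctx, counts in by_context.items():
        total = sum(counts.values())
        dists[ctx] = {s: c / total for s, c in counts.items()}
    return dists

# `corpus.txt' is a stand-in for any reasonable sample of English text.
text = open("corpus.txt").read().lower()
dists = context_distributions(text)

# Contexts such as `b_t ' or `s_t ', if present, tend to give distributions
# concentrated on the vowels, with fairly consistent ratios; contexts around
# digits or punctuation give quite different, equally recognisable shapes.
for ctx in ["b_t ", "s_t "]:
    if ctx in dists:
        print(ctx, sorted(dists[ctx].items(), key=lambda kv: -kv[1])[:5])
```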
The above definition lacks the precision which one might reasonably expect in a book written by a mathematician, even one who has gone downhill into engineering, and it is a matter of some irritation that it is not easy to give a satisfactory definition of what it means for a set of points to be clustered. (A probability distribution over an alphabet of N symbols is, of course, a point in $\mathbb{R}^N$.) There are many different definitions (see any of the books on clustering), and although they usually agree on obvious cases, they may differ on the marginal ones. For present purposes, I suggest the reader thinks of modelling by a mixture of gaussians and uses Rissanen's stochastic complexity criterion to decide on the optimal number of gaussians. If the optimal number is zero or one, then we can say that there is no clustering. If the number is small compared with the number of data points, so that many distributions find themselves packed closely to other distributions, then we obtain a working definition of stochastic equivalence. I do not propose this as any more than an ad hoc heuristic at this point.
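A sketch of this heuristic: I use scikit-learn's GaussianMixture and the BIC as a convenient stand-in for Rissanen's stochastic complexity, and the data here are synthetic stand-ins for a collection of context distributions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def optimal_gaussian_count(X, max_components=10, random_state=0):
    """Fit mixtures of gaussians with 1..max_components components and return
    the count minimising BIC, used here as a rough stand-in for Rissanen's
    stochastic complexity criterion."""
    best_k, best_bic = 1, np.inf
    for k in range(1, max_components + 1):
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             random_state=random_state).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

# X: one row per context distribution, one column per alphabet symbol.
# Here a synthetic stand-in with three deliberately distinct clusters.
rng = np.random.default_rng(0)
centres = rng.dirichlet(np.ones(6), size=3)
X = np.vstack([rng.dirichlet(c * 200, size=100) for c in centres])

k = optimal_gaussian_count(X)
print("clusters found:", k)   # more than one suggests stochastic equivalence structure
```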
If there is a clustering structure for the context distributions, then we can take all the kgram predictors which have `close' distributions in the last place, and deem them to be `close' also. After all, they ought to be stored together in any sensible prediction scheme precisely because they predict the same sequent. So we can cluster the (k-1)grams. This gives us the possibility of a new type of UpWrite: a stochastic equivalence class of kgrams gets written to a new symbol.
Definition 14539
An UpWrite may be of two types: the chunking UpWrite assigns a symbol in the UpWrite stream to a substring in the base stream. The clustering UpWrite assigns a symbol in the UpWrite stream to a stochastic equivalence class in the base level stream.
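A schematic sketch of the two kinds of UpWrite; the chunk table and the equivalence classes are toy inputs standing in for what would, in practice, be extracted from the stream statistics.

```python
def chunking_upwrite(stream, chunks):
    """Replace known substrings (chunks) by single higher-level symbols,
    scanning left to right and preferring longer chunks."""
    out, i = [], 0
    keys = sorted(chunks, key=len, reverse=True)
    while i < len(stream):
        for key in keys:
            if stream.startswith(key, i):
                out.append(chunks[key])
                i += len(key)
                break
        else:
            out.append(stream[i])
            i += 1
    return out

def clustering_upwrite(stream, classes):
    """Replace each symbol by the name of its stochastic equivalence class."""
    return [classes.get(s, s) for s in stream]

# Toy inputs standing in for what the stream statistics would really provide.
chunks = {"the": "T", "cat": "N", "dog": "N"}                   # substring -> symbol
classes = {"a": "V", "e": "V", "i": "V", "o": "V", "u": "V"}    # symbol -> class

print(chunking_upwrite("thecatsatonthedog", chunks))   # ['T', 'N', 's', 'a', 't', 'o', 'n', 'T', 'N']
print(clustering_upwrite(list("thecat"), classes))     # ['t', 'h', 'V', 'c', 'V', 't']
```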