
Streams, predictors and smoothers

A language as we have defined it is a set (usually infinite in principle) of strings. It may be noted that natural language fits (uneasily) into this definition in several different ways; we may take the alphabet to be the printed characters, in which case we can take the words as the strings, or the sentences, or even the paragraphs. If we take the words, then the strings of one language become the symbols of the next, whose strings may be sentences. Or paragraphs, or chapters, or books, or for that matter libraries... So natural languages are actually hierarchies of languages as we have defined them, which suggests that we may not be defining them very intelligently.

What we are confronted with on opening a book is, after we have mastered the convention of scanning from left to right across a line, down a page and up to the next page or turning over, a stream of characters, or a very long string indeed. Since we don't read it all at once, and don't know precisely how long it is, it makes sense to idealise it as we do a signal received from a television transmitter: as an infinitely long sequence of things, in this case an infinitely long sequence of symbols. This is how text files are thought of as stored on disk in a computer, although they are not in fact, happily, infinitely long. Then a white space is a separator symbol and a full stop, `.', is another, while a double carriage-return/linefeed is yet another. This allows us to chop the stream up into strings, which then provides us with a language; in fact several, neatly nested.
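To make the chopping up concrete, here is a minimal sketch in Python (the function name is mine, and it is cavalier about full stops inside abbreviations and similar nuisances) which recovers the neatly nested languages from a character stream, using the blank line, the full stop and white space as separators.

def chop_stream(stream):
    # Blank lines (two consecutive linefeeds) separate paragraphs,
    # full stops separate sentences, white space separates words.
    paragraphs = [p.strip() for p in stream.split("\n\n") if p.strip()]
    sentences = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]
    words = [w for s in sentences for w in s.split()]
    return paragraphs, sentences, words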

Definition 14456

A stream or time series of symbols on an alphabet A is a map from $\mathbb{Z}^+$ into A.

A local grammar of order k makes just as much sense for a stream as it does for a stochastic language. In fact, if we are given a sequence of strings from a language, we can turn them into a stream either by putting begin `<' and end `>' symbols around them and just concatenating, or perhaps by replacing `><' by a white space in this arrangement to save on symbols.
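A minimal sketch of this construction, assuming the strings do not themselves contain `<' or `>' (the function name and the economise flag are mine, not part of the definition):

def strings_to_stream(strings, economise=True):
    # Wrap each string in begin '<' and end '>' markers and concatenate;
    # with economise=True every internal '><' boundary is then replaced
    # by a single white space to save on symbols.
    stream = "".join("<" + s + ">" for s in strings)
    return stream.replace("><", " ") if economise else stream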

Definition 14461

A stream sample on an alphabet A is a map from the set [1..N] to A, where [1..N] is the set of all positive integers less than N+1.

In real life, we are given stream samples. N may be several million or more. A may be $\{0,1\}$ in computer applications. A local grammar becomes a predictor of what the next symbol may be as it moves along (up? down?) the stream.

Definition 14465

A predictor of order k for a stream over A is a map from the set of k-grams in a local grammar for the stream to the set of pdfs over the alphabet A.

In other words, if I give you k symbols which have been found to occur consecutively, a predictor of order k will accept this as input and return a probability, conditional on the input, for such a k-gram being followed by any given symbol of A.
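In programming terms a predictor of order k is just a table from k-grams to pdfs; the following sketch merely fixes a representation (dicts of floats, with names of my own choosing) which will be reused below.

from typing import Dict

Pdf = Dict[str, float]        # a pdf over the alphabet A
Predictor = Dict[str, Pdf]    # one pdf for each k-gram found in the stream

def predict(predictor: Predictor, last_k_symbols: str) -> Pdf:
    # Given the k most recent symbols, return the conditional pdf for the
    # next symbol; empty if the k-gram never occurred in the stream.
    return predictor.get(last_k_symbols, {})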

Example 14470

Go through the stream or stream sample and obtain a local grammar of order 2 by listing all trigrams. Now collect together all trigrams which agree in the first two places, and count for each symbol in the alphabet the number of times it occurs in the last place. For example, the preceding two sentences would have the trigrams

_an
_al
_ag
_a_
which start with a white space (replaced here by an underscore) and have `a' as the second symbol. The first and second occur twice and three times respectively, and the remaining two once each. Thus the probability estimate that `_a' is followed by `n' is $\frac{2}{7}$.
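The count is easily checked by machine; the following sketch takes the two sentences above as a small stream sample and collects the trigrams beginning with `_a'.

from collections import Counter

text = ("Go through the stream or stream sample and obtain a local grammar "
        "of order 2 by listing all trigrams. Now collect together all "
        "trigrams which agree in the first two places, and count for each "
        "symbol in the alphabet the number of times it occurs in the last "
        "place.")

trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
followers = Counter()
for t, n in trigrams.items():
    if t[:2] == " a":
        followers[t[2]] += n

# followers should come out as {'n': 2, 'l': 3, 'g': 1, ' ': 1}, giving the
# estimate 2/7 for `_a' being followed by `n'.
print(followers["n"] / sum(followers.values()))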

Definition 14474

The predictor which stores the (k+1)-grams by grouping together all those with the same first k symbols and retaining a count of the different possibilities for the last symbol is called a local predictor of order k.
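Such a predictor is easy to build from a stream sample; a sketch (again the names are mine), returning the dict-of-dicts representation fixed earlier:

from collections import Counter, defaultdict

def local_predictor(sample, k):
    # Group every (k+1)-gram of the sample by its first k symbols and count
    # the occurrences of each possible last symbol.
    counts = defaultdict(Counter)
    for i in range(len(sample) - k):
        counts[sample[i:i + k]][sample[i + k]] += 1
    # Normalise each group of counts into a pdf over the alphabet.
    return {kgram: {sym: n / sum(c.values()) for sym, n in c.items()}
            for kgram, c in counts.items()}

Applied to the text variable of the previous sketch, local_predictor(text, 2)[" a"] should reproduce the estimates just computed by hand.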

It is easy to see how such a predictor could have its uses in an OCR program. If you were to collect trigrams from the text you were sure about, you could compare with known statistics for English and decide if it was English, or perhaps some related language such as bureaucrats' gobbledegook. Then, once you had a plausible identification of the language, you could clean up `Profes5or' by using two sets of data: the relative probabilities of `es5', `ess', `esa', and so on from the probabilistic model for the character, and the relative probabilities from the predictor in the absence of any actual information about the character except its context. A little thought will show how to express the net probability of the character being a `5' or an `s'. This is smarter than using a dictionary, since it can cope with new words (such as proper names) which do not occur in the dictionary. Such a scheme will easily improve the performance of an OCR system.
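One way of doing the little thought, offered as a sketch rather than the canonical answer, is to treat the shape of the doubtful character and its textual context as roughly independent sources of evidence, multiply the two sets of relative probabilities, and renormalise. The function below (names mine) assumes both sources have been expressed as dicts over the candidate characters.

def combine(p_from_glyph, p_from_context):
    # p_from_glyph:   relative probabilities of each candidate character
    #                 according to the character (shape) model.
    # p_from_context: relative probabilities of the same candidates according
    #                 to the predictor, given only the surrounding symbols.
    candidates = set(p_from_glyph) & set(p_from_context)
    joint = {c: p_from_glyph[c] * p_from_context[c] for c in candidates}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()} if total else joint

For `Profes5or' one would feed in the shape model's probabilities for `5', `s' and any other candidates, together with the predictor's probabilities for the symbol following `es', and read off the winner.

Indeed it can be used to build a font-independent OCR device as follows: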

Suppose that the text can be segmented into characters with high reliability and that the object language is known. I scan the text and start with the first character. I assign it at random to one of the letters of the alphabet. I move on to the next character. If it is the same as one I have seen before in this data, I assign it to the same letter as last time; if it has not been seen before, I assign it, at random, to a new letter so far unused. I continue in this way until all characters have been assigned letters. This is simple clustering.
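A sketch of this clustering, assuming a hypothetical predicate same_glyph which decides whether two segmented character images are instances of the same symbol (which fresh letter gets chosen does not matter, so the code simply takes the next unused one rather than choosing at random):

import string

def cluster_glyphs(glyphs, same_glyph):
    # Assign each glyph an ASCII letter: reuse the letter of the first
    # matching glyph seen so far, otherwise take a fresh, unused letter.
    unused = iter(string.ascii_lowercase + string.ascii_uppercase)
    prototypes = []            # (representative glyph, letter) pairs
    transcription = []
    for g in glyphs:
        for proto, letter in prototypes:
            if same_glyph(g, proto):
                transcription.append(letter)
                break
        else:
            letter = next(unused)
            prototypes.append((g, letter))
            transcription.append(letter)
    return "".join(transcription)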

I have now got the text transcribed to ASCII characters, but the text has been coded by a `Caesar' code, that is, a 1-1 substitution scheme. Now I use the known statistics of the language to break the code, a matter of a few minutes' work by hand, and much faster on a PC.
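A first cut at the code-breaking, for the sake of definiteness, is to match the observed symbol frequencies against a frequency ordering for the object language. The ordering below (white space first, then a commonly quoted ordering of the English letters) is an assumption, and single-symbol frequencies alone will leave some letters transposed; the digram and trigram statistics we already have are what sort out the remainder.

from collections import Counter

ENGLISH_BY_FREQUENCY = " etaoinshrdlcumwfgypbvkjxqz"  # assumed, commonest first

def first_guess(coded_text):
    # Pair the coded symbols, commonest first, with the letters of the
    # object language, commonest first, and decode using that guess.
    ranked = [sym for sym, _ in Counter(coded_text).most_common()]
    key = dict(zip(ranked, ENGLISH_BY_FREQUENCY))
    return "".join(key.get(c, "?") for c in coded_text)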

Given a book of English text to scan, with the text stored as letters, not characters, the little green thing from outer space would be struck at once by the fact that one character is much more common than any other. I am referring not to the letter `e', but to the white space separating words. These characters, used by us as separators, may be identified as separators without prior knowledge. This is just as well; in speech there are no acoustic cues telling the auditor when a word has ended: there are no white spaces. It would be easy enough to put them back in were they left out of printed text, however. How this may be done will be described next.

