Definition
A stream is said to be quasi-linguistic if it is hierarchical and at least some of the (chunked) UpWritten streams admit a stochastic equivalence structure. In other words, we allow ourselves to do both kinds of UpWrite, into chunks of consecutive symbols or into stochastic equivalence classes of symbols. If any sequence of either UpWrite yields data compression, then the stream is quasi-linguistic.
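By way of illustration, the chunking kind of UpWrite and the compression test can be tried with a few lines of code. The sketch below is a minimal one in Python: it assumes a byte-pair-style merge rule (name the most frequent adjacent pair) and a crude description-length measure (symbols emitted plus two per dictionary entry), neither of which is fixed by the definition; the file `sample.txt` is a hypothetical text sample.

from collections import Counter

def upwrite_chunks(stream, n_merges=50):
    """Chunking UpWrite: repeatedly name the most frequent adjacent pair."""
    stream = list(stream)
    chunks = {}                          # new symbol -> (left, right)
    for i in range(n_merges):
        pairs = Counter(zip(stream, stream[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:                     # nothing repeats; naming it cannot compress
            break
        new_sym = f"<{i}>"
        chunks[new_sym] = (a, b)
        rewritten, j = [], 0
        while j < len(stream):           # rewrite every occurrence of the pair
            if j + 1 < len(stream) and (stream[j], stream[j + 1]) == (a, b):
                rewritten.append(new_sym)
                j += 2
            else:
                rewritten.append(stream[j])
                j += 1
        stream = rewritten
    return stream, chunks

text = open("sample.txt").read()         # hypothetical sample of English text
up, chunks = upwrite_chunks(text)
# Crude description length: symbols emitted plus two per dictionary entry.
print(len(text), "->", len(up) + 2 * len(chunks))

On English text the rewritten count plus the dictionary cost comes out smaller than the original symbol count well before the merge budget is spent, which is the compression criterion of the definition in its crudest form.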
Although the definitions are not as tight as is desirable, they are adequate to give reasonable grounds for consensus on whether quasi-linguistic streams exist, and on whether a particular stream is an example of one. It is easy to convince oneself, by a certain amount of simple programming, that samples of English text of less than a million characters give reasonable grounds for believing that such streams exist. There are many other possible structures which may be found in samples of Natural Language streams, but whereas grammarians have usually been concerned with finding them by eye and articulating them in the form of rules, I am concerned with two other issues: first, the question of inference, of how one extracts the structure from the sample and what algorithms make this feasible; second, the investigation of more powerful methods of specifying structure than rule systems. It should be plain that a stochastic equivalence structure, if natural language exhibits one, is certainly a structural feature of the language, but it is not one which is naturally articulated by giving a set of rules.
If such a structure is found in a stream, then it can be used for smoothing probability estimates of predictors.
Suppose, for example, that we are using a trigrammar at the word level to constrain the acoustic part of a word recognition system, as is done at IBM's Thomas J Watson Research Center in Yorktown Heights. It is common to encounter a new word never seen before (proper names of ships, for example), and also common to encounter two words never seen before in that order. Suppose we have a string ABCDEF? and want to predict the ? from the preceding words, EF, but have never seen EF together, although we have seen E and F. We may first try replacing E by something stochastically equivalent to it in the context CDE. There are, presumably, a number of other symbols which might have occurred in place of E after CD. We identify the probability distribution as belonging to a cluster of other such distributions, and replace the given distribution by the union of all those in the local cluster. This gives us a list of other symbols which are stochastically equivalent to E. We weight them by their relative likelihoods, and list them in the form $(X_j, w_j)$. Now we take the F and replace it by all symbols $Y_i$ which are stochastically equivalent to it in the context $DX_j$. Again we weight by their probabilities, and include the $DX_jF$ case with particularly high weight. We take the union of the resulting distributions for the weighted $X_jY_i$ as the bigram predictor. Then we try to recognise this distribution, i.e. we choose the closest cluster of known distributions. This is our final estimate of the distribution of the successor of ABCDEF. Similar methods apply whenever E or F is itself a new word which has never been seen before.
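The whole smoothing step can be put in one place. The sketch below assumes the equivalence lists are already to hand: the helper `equivalents`, the layout of the trigram table, and the default weight given to the $DX_jF$ case are illustrative assumptions of mine, and the final recognition step, snapping the result to the closest known cluster, is left out.

from collections import defaultdict

def smoothed_successors(trigrams, equivalents, d, e, f, seen_weight=3.0):
    """Predict the symbol after ...D E F when E F has never been seen together.

    trigrams:    dict mapping a word pair (w1, w2) to a Counter of successors.
    equivalents: function (word, context) -> [(word', weight), ...], the
                 stochastic equivalents of `word` in that context.
    """
    prediction = defaultdict(float)
    for x, wx in equivalents(e, (d,)):            # the X_j standing in for E
        # The Y_i standing in for F in the context D X_j; the seen F itself
        # stays in the list with a deliberately high weight.
        for y, wy in [(f, seen_weight)] + equivalents(f, (d, x)):
            successors = trigrams.get((x, y))
            if not successors:
                continue
            total = sum(successors.values())
            for symbol, count in successors.items():
                prediction[symbol] += wx * wy * count / total
    z = sum(prediction.values()) or 1.0           # normalise the union
    return {s: p / z for s, p in prediction.items()}

The returned dictionary is the bigram predictor of the text; matching it against the catalogue of known cluster distributions would then give the final recognised estimate.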
If one reflects on the way in which we use context to determine the meaning of a new word in a page of text, it may be seen that this method holds out some promise. Semantic meaning is inferred from other known meanings and from the syntax of the language. If we take the narrowest of stochastic equivalence sets, we may replace some of the symbols in a context with a stochastic equivalence class, and generate a larger stochastic equivalence class by the process described above. These classes become very large quite quickly, and rapidly tend towards things like the grammatical categories `Noun', `Adjective', `Transitive Verb', and so on.
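A minimal sketch of how the narrowest classes might be extracted from a sample: estimate each symbol's successor distribution from bigram counts, and greedily group symbols whose distributions lie close together. The total variation distance, the threshold, and the single-pass grouping are my choices for illustration; the text requires only some clustering of the distributions, and `sample.txt` is again hypothetical.

from collections import Counter, defaultdict

def successor_distributions(stream):
    """For each symbol, estimate the distribution of the symbol that follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    dists = {}
    for w, c in counts.items():
        total = sum(c.values())
        dists[w] = {s: n / total for s, n in c.items()}
    return dists

def tv_distance(p, q):
    """Total variation distance between two distributions held as dicts."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def equivalence_classes(stream, threshold=0.4):
    dists = successor_distributions(stream)
    classes = []                        # each class is a list of symbols
    for w, p in dists.items():
        for cls in classes:
            if tv_distance(p, dists[cls[0]]) < threshold:
                cls.append(w)           # join the first class close enough
                break
        else:
            classes.append([w])
    return classes

words = open("sample.txt").read().split()
for cls in equivalence_classes(words)[:5]:
    print(cls)

Even this crude grouping, run on modest amounts of text, begins to pull words of the same broad grammatical category into the same class.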
Quasi-linguistic stream samples are of interest largely because they constitute abstractions of natural languages, and moreover abstractions constructed so as to facilitate the inference of structures. This `Newtonian' approach to language analysis differs from the traditional `scholastic' approach, which you may find described in the writings of Chomsky onward. The question arises: to what extent do streams of symbols exhibiting this structure occur, other than in natural language?
A real number is a symbol, and so a time series of real numbers may be considered in this manner. There is, of course, a metric structure on the symbols, which is usually presumed to be known. The interested reader is invited to investigate extensions of syntactic methods to real-valued time series.
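As a starting point, one obvious move is to use the metric to quantise the reals into a finite alphabet, after which the series is a symbol stream and the experiments above apply unchanged. A minimal sketch, assuming quantile bins; the bin count is an arbitrary illustrative choice.

def symbolise(series, n_bins=4):
    """Quantise a real-valued series into a stream over {0, ..., n_bins - 1}."""
    ranked = sorted(series)
    # Approximate quantile boundaries between the bins.
    bounds = [ranked[int(len(ranked) * k / n_bins)] for k in range(1, n_bins)]
    def bin_of(x):
        b = 0
        while b < len(bounds) and x >= bounds[b]:
            b += 1
        return b
    return [bin_of(x) for x in series]

stream = symbolise([0.1, 0.4, 0.35, 2.0, 1.7, 0.2, 0.9, 0.05])
print(stream)    # a symbol stream, ready for the UpWrite machinery above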