
Grammatical Inference

It is a very nice thing to have a grammar for a language: it is a compact representation of the language, and it is a model for the language in a sense which you should recognise as congenial to a follower of Rissanen. The question is, how are such things to be obtained?

Definition 14331

The Grammatical Inference Problem is, given a language ${\cal L}$ over an alphabet A, to obtain a grammar for it. This holds for both stochastic and non-stochastic languages; for a stochastic language one seeks a stochastic grammar.

In real life, one is actually given a sample of strings produced by some process, and the sample is finite. The real life problem is to model the process economically. Thus the next example is a sensible problem which can be made absurd with no trouble at all:

Example 14337

You are given what are claimed to be 15 strings produced by a stochastic process of an unknown sort. The strings are aba (8), abba (4), abbba (2) and abbbba (1), where the numbers in parentheses give the count of each string. You are required to model the process and use your model to obtain an estimate of the probability of seeing abbbbbba.

Without some extra constraints, such as a class of models from which to pick, the problem has no unique solution. For example, you might model the data by saying these are the only strings there are. This is just the SureThing or Predestination model again. Your estimate of the probability of getting abbbbbba is that it is zero. This is very incautious of you, since if it occurs your model will get an infinite amount of `surprise', the shock of which might disable you. On the other hand, you might assign a probability of $\frac{1}{2^n}$ to getting $ab^na$, and of zero to getting any other string. This would also be incautious, since if the next string were abab you would get infinite surprise. On the other hand, my instincts and intuitions tell me that I would sooner chance being surprised by abab than by abbbbbba, which looks to me much more like abbbba, which I have seen.
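As a quick check on the arithmetic, here is a minimal Python sketch of the two candidate models; the function names are mine, not the text's, and since abbbbbba is $ab^6a$ the second model gives it probability $\frac{1}{2^6}$.

\begin{verbatim}
# Two candidate models for the fifteen-string sample.  The names are
# illustrative only; nothing here is prescribed by the text.

def p_finite(s, counts={"aba": 8, "abba": 4, "abbba": 2, "abbbba": 1}):
    # The SureThing/Predestination model: only the observed strings exist.
    return counts.get(s, 0) / sum(counts.values())

def p_geometric(s):
    # p(a b^n a) = 1/2^n for n >= 1, zero for everything else.
    n = len(s) - 2
    if n >= 1 and s == "a" + "b" * n + "a":
        return 0.5 ** n
    return 0.0

print(p_finite("abbbbbba"))     # 0.0   -> infinite surprise if it turns up
print(p_geometric("abbbbbba"))  # 1/2**6 = 0.015625
print(p_geometric("abab"))      # 0.0   -> this model is the one shocked here
\end{verbatim}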

This is telling us that my own personal grammatical inference system has more similarities to a model with $p(ab^na) = \frac{1}{2^n}$ for n a positive integer than it does to the finite model with $p(aba) = \frac{8}{15}$, $p(abba) = \frac{4}{15}$, $p(abbba) = \frac{2}{15}$, and $p(abbbba) = \frac{1}{15}$. And I do have some sort of personal grammatical inference system, because I do have different expectations about getting other strings from the same source as produced the 15 samples. The question of what kind of expectations I in fact have, and how I obtain them from contemplating the strings, would appear to be a study of my personal psychology; not, on the face of things, a very interesting object of thought. If it should turn out that your expectations are very similar to mine and may have been obtained in the same way, then the subject immediately becomes more interesting. And if it should turn out that the kinds of expectations about what can occur in future for sequences of events of a kind such as might interest a cockroach are also obtained by a similar process, then that process becomes very interesting indeed. Now even a cockroach experiences a sequence of things happening to it, and has an interest in predicting the next one or two. Cockroaches, in other words, have to perform grammatical inference too, just like you and me. It seems possible then that we might all use similar methods, which means that in examining these methods we should be saying something about how brains use data to obtain expectations of what might happen next. This hypothesis might seem wildly implausible, but it is (a) economical and (b) susceptible to some kind of analysis and experimentation. It is therefore defensible as a first attempt at making sense of the phenomenon of learning as carried out by brains.

Definition 14365

A local (stochastic) grammar of order k for a language ${\cal L}$ is a set of k-grams, i.e. strings of length k, all of which are substrings of strings in ${\cal L}$, and such that every substring of length k of a string of ${\cal L}$ is in the set of k-grams. If the language is stochastic, each k-gram is accompanied by its relative frequency in the language among those k-grams having the same first k-1 symbols. For strings of length less than k, the whole string is stored, with its relative frequency in the stochastic case.

The rule for generating strings from the grammar is to choose, at each step, a k-gram whose left segment agrees with the preceding k-1 symbols, the corresponding relative frequencies being conditional on that preceding (k-1)-gram; the probability of a string is obtained by multiplying these conditional frequencies together. Recognising strings is done in the same way.
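A minimal Python sketch of the inference this definition suggests may help: collect the k-grams of a finite sample, using < and > as start and end markers as in the example below, and turn the counts into conditional relative frequencies. The function name and the padding convention are my own choices, and strings shorter than k-2 symbols are simply skipped here rather than stored whole as the definition requires.

\begin{verbatim}
from collections import defaultdict

def infer_local_grammar(sample, k):
    """Infer a local stochastic grammar of order k from a list of strings.

    Each string is padded with '<' (start) and '>' (end); the result maps
    each observed (k-1)-symbol context to the relative frequencies of the
    symbols that follow it, which is the conditional form used in the
    example that follows.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in sample:
        padded = "<" + s + ">"
        for i in range(len(padded) - k + 1):
            gram = padded[i:i + k]
            counts[gram[:-1]][gram[-1]] += 1
    grammar = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        grammar[context] = {sym: c / total for sym, c in followers.items()}
    return grammar

# For instance, with k = 3 and the sample ['aba', 'abba']:
print(infer_local_grammar(["aba", "abba"], 3))
\end{verbatim}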

Example 14404

Let ${\cal L}$ be the stochastic language $\{ ab^na\ (\frac{1}{2^n}) : n \in \mathbb{Z}^+ \}$ and take k to be 3. I write < for the start of a string and > for its end. Then I get the 3-grams:

$<ab,\ aba,\ ba>$ from the string aba, which occurs half the time in any sufficiently large sample from ${\cal L}$. I also get:

$<ab,\ abb,\ bba,\ ba>$ from the string abba, which makes up about one quarter of any sufficiently large sample from ${\cal L}$. Given a sufficiently large sample, or alternatively by using a little arithmetic, we conclude that the local stochastic grammar of order 3 for the language will be:

\begin{displaymath}
<ab:(1.0), \quad aba:(0.5), \quad abb:(0.5), \quad bba:(0.5), \quad bbb:(0.5), \quad ba>:(1.0)
\end{displaymath}

Now this is both a generative grammar and a recognition grammar, with or without the frequencies. If we wish to obtain the language with the frequencies, we start off at the begin symbol and have no choice about what comes next: it has to be ab, so we write down <ab. Now with ab as the given 2-gram we observe that we have a 0.5 chance of an a following and a 0.5 chance of a b, so half the time we go to <aba and the other half we go to <abb. In the first case, we have no choice but to terminate with <aba> as our string, which therefore happens half the time. In the second case, we have <abb, and there is a probability of 0.5 that we follow with an a, in which case we terminate, with one quarter of our cases being <abba>, and a probability of 0.5 that we follow with another b to get <abbb so far.
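Generation as just described can be sketched in a few lines of Python; the conditional table below is read straight off the displayed order-3 grammar, and the rest is repeated sampling of the next symbol given the last two. The names are mine.

\begin{verbatim}
import random

# Next-symbol distributions, conditional on the previous two symbols,
# read off the order-3 grammar displayed above.
GRAMMAR3 = {
    "<a": {"b": 1.0},
    "ab": {"a": 0.5, "b": 0.5},
    "bb": {"a": 0.5, "b": 0.5},
    "ba": {">": 1.0},
}

def generate(grammar=GRAMMAR3):
    s = "<ab"                     # forced: the only 3-gram starting with '<'
    while not s.endswith(">"):
        dist = grammar[s[-2:]]    # condition on the last two symbols
        symbols, weights = zip(*dist.items())
        s += random.choices(symbols, weights=weights)[0]
    return s.strip("<>")

print([generate() for _ in range(8)])   # e.g. ['aba', 'abba', 'aba', ...]
\end{verbatim}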

It is clear that this will generate the same language as the finite state diagram of fig. 7.1, so we have an alternative class of grammars to the finite state grammars. If we ignore the probabilities, we get the non-stochastic language back. Finite samples will consistently tend to underestimate bbb against bba and abb against aba, but the discrepancies shrink quickly as the sample size grows.

To recognise a string, or to compute its probability, we proceed as follows. Suppose we are given the string abbba. We can go immediately to <ab, match the first three symbols of the string, and write down a probability of 1.0. Next we match abb and assign a probability of 0.5. Next we match bbb with probability 0.5, then bba with probability 0.5, and finally we match ba> with probability 1.0. Multiplying the five numbers together gives 0.125 as the probability of the string. For the non-stochastic case we simply note that matches to $ab^na$ are all possible and no others are.
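The same table supports recognition. A short Python sketch, again with names of my own choosing, confirms the 0.125 above and gives zero to abab, because the 3-gram bab never occurs.

\begin{verbatim}
# Same conditional table as in the generation sketch above.
GRAMMAR3 = {
    "<a": {"b": 1.0},
    "ab": {"a": 0.5, "b": 0.5},
    "bb": {"a": 0.5, "b": 0.5},
    "ba": {">": 1.0},
}

def probability(s, grammar=GRAMMAR3):
    padded = "<" + s + ">"
    p = 1.0
    for i in range(2, len(padded)):
        context, sym = padded[i - 2:i], padded[i]
        p *= grammar.get(context, {}).get(sym, 0.0)   # unseen 3-gram => 0
    return p

print(probability("abbba"))  # 1.0 * 0.5 * 0.5 * 0.5 * 1.0 = 0.125
print(probability("abab"))   # 0.0
\end{verbatim}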

The local grammar, stochastic or not, comes with an inference procedure which is very simple and easy to compute. It assigns a small but non-zero probability to getting abbbbbba on the basis of the sample of fifteen strings discussed above, but a zero probability to abab (for k > 1), which accords tolerably well with my personal prejudices in the matter. For k = 1 we obtain simply the relative frequencies of the symbols, and any string is possible provided that all symbols in the alphabet are included in the sample. It is possible to accomplish grammatical inference for finite state machines with rather more work, and for more general categories of language with more difficulty. The language $\{ a^nba^n : n \in \mathbb{Z}^+ \}$, recall, cannot be described by any finite state machine, although it is easy for a push-down stack machine. The distinctions are dealt with lucidly by Eilenberg; see the bibliography.
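To make the claim about the fifteen-string sample concrete, the following sketch (an assumption of mine about how one would carry out the computation, not code from the text) infers the order-3 counts from the sample and scores the two strings in question.

\begin{verbatim}
from collections import Counter

sample = ["aba"] * 8 + ["abba"] * 4 + ["abbba"] * 2 + ["abbbba"]

# Count every 3-gram of every padded string in the sample.
counts = Counter()
for s in sample:
    padded = "<" + s + ">"
    for i in range(len(padded) - 2):
        counts[padded[i:i + 3]] += 1

def probability(s):
    padded = "<" + s + ">"
    p = 1.0
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        context_total = sum(c for g, c in counts.items() if g[:2] == gram[:2])
        p *= counts[gram] / context_total if context_total else 0.0
    return p

print(probability("abbbbbba"))  # about 0.005: small but non-zero
print(probability("abab"))      # 0.0: 'bab' was never seen
\end{verbatim}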

It is instructive to infer a local grammar of order k, for different k, from the language $\{ a^nba^n : n \in \mathbb{Z}^+ \}$, or from some finite sample of it; likewise for the stochastic language $\{ a^nba^n\ (\frac{1}{2^n}) \}$. For $k \approx 20$ and around a million samples, you are unlikely to be able to tell the difference between the language generated by the grammar and the stochastic language. This has some implications for doing practical inference from finite samples. Note that the language generated by the local grammar of order k, taking the whole language as a `sample', is in general a superset of the original language. When it isn't, that is when the two languages are the same, we say the language is local of order k. Formally:

Definition 14416

A (stochastic) language is local of order k iff it is identical to the language obtained from the local grammar of order k.

It is easy to see that the language $\{ a^nba^n : n \in \mathbb{N} \}$ is not local of order k for any k.
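To see this concretely, here is a small Python check of my own (not an argument from the text): build the order-3 grammar, ignoring probabilities, from the strings $a^nba^n$ for n up to 10, and notice that it accepts strings with unequal numbers of a's on the two sides; the same thing happens for any order k.

\begin{verbatim}
# Order-3 local grammar (non-stochastic) built from a^n b a^n, n = 1..10.
def kgrams(s, k=3):
    padded = "<" + s + ">"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

allowed = set()
for n in range(1, 11):
    allowed |= kgrams("a" * n + "b" + "a" * n)

def accepted(s):
    return kgrams(s) <= allowed      # every 3-gram of s must be allowed

print(accepted("aabaa"))       # True: a^2 b a^2 is in the language
print(accepted("aabaaaaa"))    # also True, yet a^2 b a^5 is not in the language
\end{verbatim}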

It is clear that the whole business of modelling languages, or more realistically language samples, can be done using a variety of possible classes of model, and that the class of local grammars of order k for different k at least makes the job easy if not always altogether convincing.

The disenchantment with local grammars of order k may be reduced slightly by going to a local grammar of order k+1.

Definition 14427

The Shannon filtration for a language ${\cal L}$ is the sequence of local grammars of order k, for all positive integers k.

We can take the Shannon filtration of a natural language sample, at least up to the point where k is as big as the sample, when it gets pretty silly. It gets pretty silly long before that as a rule, and it also becomes impractical to store. For instance, we may do what IBM's Yorktown Heights Speech Group does and store trigrams of words extracted from English text, together with the counts of the last word in the trigram for all possible bigrams. Shannon used almost exclusively the filtration I have named after him for getting better and better approximations to English. Convergence is not very fast and there are better filtrations. Finding a good one is of some importance in the study of natural language and in such applications as speech recognition and OCR.
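As a toy illustration, and only an assumption of mine about how one might set it up, the first few terms of a Shannon filtration over words (rather than single symbols) of a tiny sample can be computed as follows; a serious application would use a large corpus and store conditional counts, much as the trigram systems mentioned above do.

\begin{verbatim}
from collections import Counter

# k-gram counts over words for k = 1, 2, 3: the first few terms of a
# (toy) Shannon filtration of the sample.
text = "the cat sat on the mat and the cat slept".split()

filtration = {
    k: Counter(tuple(text[i:i + k]) for i in range(len(text) - k + 1))
    for k in (1, 2, 3)
}

for k, grams in filtration.items():
    print(k, grams.most_common(2))
\end{verbatim}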


Mike Alder
9/19/1997