It is a very nice thing to have a grammar for a language; it gives a nice compact representation of the language. It is a model for the language in a sense which you should recognise as congenial to a follower of Rissanen. The question is, how are such things to be obtained?
Definition 14331
The Grammatical Inference Problem is, given a language $\mathcal{L}$ over an alphabet A, to obtain a grammar for it. This holds for both stochastic and non-stochastic languages; for a stochastic language one seeks a stochastic grammar.
In real life, one is actually given a sample of strings produced by some process and the sample is finite. The real life problem is to model the process economically. Thus the next example is a sensible problem which can be made absurd with no trouble at all:
Example 14337
You are given what are claimed to be 15 strings produced by a stochastic process of an unknown sort. The strings are aba (8), abba (4), abbba (2) and abbbba (1), where the numbers in parentheses give the counts of the string. You are required to model the process and use your model to obtain an estimate of the probability of seeing abbbbbba.
Without some extra constraints such as a class
of models from which to pick, the problem has
no unique solution. For example, you might model
the data by saying these are the only strings
there are. This is just the SureThing or Predestination
model again. Your estimate of the probability
of getting abbbbbba is that it is zero. This is
very incautious of you, since if it occurs your model
will get an infinite amount of `surprise', the
shock of which might disable you. On the other
hand, you might assign a probability of $2^{-n}$ for getting $ab^na$, and of zero for getting any other string. This would also be incautious, since if the next string were abab
you would get infinite surprise. On the other
hand, my instincts and intuitions tell me that
I would sooner chance getting abab than abbbbbba,
which looks to me much more like abbbba, which
I have seen.
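If we measure `surprise' in the customary information-theoretic way, as the negative base-2 logarithm of the probability the model assigned, then for abbbbbba, which is $ab^6a$,

$$-\log_2 P(\mathrm{abbbbbba}) = \begin{cases} -\log_2 0 = \infty & \text{under the Predestination model,}\\ -\log_2 2^{-6} = 6 \text{ bits} & \text{under the model with } P(ab^na) = 2^{-n},\end{cases}$$

while abab gets probability zero, and hence infinite surprise, under both.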
This is telling us that my own personal grammatical
inference system has more similarities to a
model with $P(ab^na) = 2^{-n}$ for n a positive integer than it does to the finite model with $P(aba) = 8/15$, $P(abba) = 4/15$, $P(abbba) = 2/15$ and $P(abbbba) = 1/15$. And I do have
some sort of personal grammatical inference
system, because I do have different expectations
about getting other strings from the same source
as produced the 15 samples. The question of what
kind of expectations I in fact have, and how
I
obtain them from contemplating the strings would
appear to be a study of my personal psychology;
not, on the face of things, a very interesting
object of thought.
If it should turn out that your expectations are
very similar to mine and may have been obtained
in the same way, then the subject immediately
becomes more interesting. And if it should turn
out
that the kinds of expectations about what can
occur in future for sequences of events of a
kind
such as might interest a cockroach are also obtained
by a similar process, then that process
becomes very interesting indeed. Now even a cockroach
experiences a sequence of things happening to
it,
and has an interest in predicting the next one
or two. Cockroaches, in other words, have to
perform grammatical inference too, just like you
and me. It seems possible then that we might
all use
similar methods, which means that in examining
these methods we should be saying something about
how
brains use data to obtain expectations of what
might happen next. This hypothesis might seem
wildly
implausible, but it is (a) economical and (b)
susceptible to some kind of analysis and experimentation.
It is therefore defensible as a first attempt
at making sense of the phenomenon of learning
as carried
out by brains.
Definition 14365
A local (stochastic) grammar of order k for a language $\mathcal{L}$ is a set of k-grams, i.e. strings of length k, all of which are substrings of strings in $\mathcal{L}$, and such that every substring of length k of a string of $\mathcal{L}$ is in the set of k-grams. If the language is stochastic, each k-gram is accompanied by its relative frequency in the language among those k-grams having the same first k-1 symbols. For strings of length less than k, the whole string is stored, with its relative frequency in the stochastic case.
The rule for generating strings from the grammar is, at each step, to choose a k-gram whose left segment agrees with the preceding k-1 symbols, and to multiply the corresponding relative frequencies conditional on the preceding (k-1)-gram. Recognising strings is done in the same way.
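Here is a minimal sketch of the idea in Python. The class name LocalGrammar, the weights argument, and the use of < and > to mark the start and end of a string (anticipating the notation of the example below) are all illustrative choices of mine, not anything prescribed by the definition.

```python
# A minimal sketch of a local (stochastic) grammar of order k.
from collections import defaultdict

class LocalGrammar:
    """Local stochastic grammar of order k, inferred from a sample of strings."""

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(float)    # k-gram -> weighted count
        self.context = defaultdict(float)   # (k-1)-gram -> total count of its continuations
        self.short = defaultdict(float)     # whole strings shorter than k
        self.n_strings = 0.0

    def train(self, strings, weights=None):
        """Collect the k-grams of a sample; weights allow for repeated strings."""
        weights = weights if weights is not None else [1.0] * len(strings)
        for s, w in zip(strings, weights):
            padded = '<' + s + '>'          # mark the start and end of the string
            self.n_strings += w
            if len(padded) < self.k:
                self.short[padded] += w     # strings of length less than k are stored whole
                continue
            for i in range(len(padded) - self.k + 1):
                gram = padded[i:i + self.k]
                self.counts[gram] += w
                self.context[gram[:-1]] += w

    def prob(self, gram):
        """Relative frequency of a k-gram among those sharing its first k-1 symbols."""
        total = self.context[gram[:-1]]
        return self.counts[gram] / total if total else 0.0

    def score(self, s):
        """Probability of a string: the product of its conditional k-gram frequencies."""
        padded = '<' + s + '>'
        if len(padded) < self.k:
            return self.short[padded] / self.n_strings if self.n_strings else 0.0
        p = 1.0
        for i in range(len(padded) - self.k + 1):
            p *= self.prob(padded[i:i + self.k])
        return p
```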
Example 14404
Let $\mathcal{L}$ be the stochastic language consisting of the strings $ab^na$ for positive n, with $P(ab^na) = 2^{-n}$, and take k to be 3. I write < for the start of a string and > for its end. Then I get the 3-grams

<ab, aba, ba>

from the string aba, which occurs half the time in any sufficiently large sample from $\mathcal{L}$. I also get the 3-grams

<ab, abb, bba, ba>

from the string abba, which occurs about one quarter of the time in any sufficiently large sample from $\mathcal{L}$. Given a sufficiently large sample, or alternatively by using a little arithmetic, we conclude that the local stochastic grammar of order 3 for the language will be:

<ab : 1.0
aba : 0.5
abb : 0.5
bba : 0.5
bbb : 0.5
ba> : 1.0
It is clear that this will generate the same language as the finite state diagram of fig.7.1, so we have an alternative class of grammars to the finite state grammars. If we ignore the probabilities, we get the non-stochastic language back. Finite samples will consistently tend to underestimate bbb against bba and abb against aba, but the discrepancies get small fast with sample size.
To recognise a string, or to compute its probability, we proceed as follows. Suppose we are given the string abbba. We can go immediately to <ab and match the first three symbols of the string, writing down a probability of 1.0. Next we match abb and assign a probability of 0.5. Next we match bbb with probability 0.5, then bba with probability 0.5, and finally we match ba> with probability 1.0. Multiplying the five numbers together gives 0.125 as the probability of the string. For the non-stochastic case we simply note that matches to $ab^na$ are all possible and no others are.
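The same arithmetic can be checked mechanically with the hypothetical LocalGrammar sketch above; training it on a truncation of the language, with weights $2^{-n}$ standing in for the infinite sample, reproduces the grammar of the example closely enough.

```python
# Build the order-3 grammar from (a truncation of) the language ab^n a with
# P(ab^n a) = 2^{-n}, then score abbba as in the worked example above.
g3 = LocalGrammar(3)
g3.train(['a' + 'b' * n + 'a' for n in range(1, 30)],
         [2.0 ** -n for n in range(1, 30)])
print(round(g3.score('abbba'), 3))   # 0.125
```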
The local grammar, stochastic or not, comes with
an inference procedure which is very simple and
easy to compute. It assigns a small but non-zero
probability to getting abbbbbba on the basis
of
the sample of fifteen strings discussed above,
but a zero probability to abab, for k > 1,
which
accords tolerably well with my personal prejudices
in the matter. For k = 1 we obtain simply the
relative frequencies of the symbols, and any string
is possible provided that all symbols in the
alphabet are included in the sample. It is possible
to accomplish grammatical inference for
finite state machines with rather more work, and
for more general categories of language with
more
difficulty. The language consisting of the strings $a^nb^n$ for positive n, recall, cannot be described
by any finite state machine, although it is easy
for a push-down stack machine. The distinctions
are dealt with lucidly by Eilenberg, see the
bibliography.
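The claims above about the fifteen-string sample (a small but non-zero probability for abbbbbba, and zero for abab) are easy to check with the hypothetical LocalGrammar sketch from earlier, using the observed counts as weights.

```python
# Order-3 grammar inferred from the fifteen strings of Example 14337.
g = LocalGrammar(3)
g.train(['aba', 'abba', 'abbba', 'abbbba'], [8, 4, 2, 1])
print(g.score('abbbbbba'))   # small but non-zero (about 0.005)
print(g.score('abab'))       # 0.0: the 3-gram 'bab' never occurs in the sample
```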
It is instructive to infer a local grammar of order k for different k from the language $\{a^nb^n : n \geq 1\}$, or some finite sample of it. The stochastic language with, say, $P(a^nb^n) = 2^{-n}$, likewise. For a largish k and around a million samples, you are unlikely to be able to tell the difference between the language generated by the grammar and the stochastic language.
This has some implications for doing practical
inference from finite samples. Note that the
language generated by the local grammar of order k from the whole language as a `sample' is in general a superset of the original language. When it isn't, that is when the two languages coincide, we say the language is local of order k. Formally:
Definition 14416
A (stochastic) language is local of order k iff it is identical to the language obtained from the local grammar of order k.
It is easy to see that the language $\{a^nb^n : n \geq 1\}$ is not local of order k for any k.
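Both the superset remark and the failure of $\{a^nb^n\}$ to be local of any order can be seen by generating strings from an inferred grammar; the generator below is a sketch of my own, re-using the hypothetical LocalGrammar class from earlier.

```python
import random

def generate(grammar, rng):
    """Sample one string from a LocalGrammar: pick a k-gram beginning with '<',
    then repeatedly append the last symbol of a k-gram whose left segment
    matches the current (k-1)-symbol tail, until the end marker '>' appears."""
    k = grammar.k
    starts = [x for x in grammar.counts if x.startswith('<')]
    s = rng.choices(starts, [grammar.counts[x] for x in starts])[0]
    while not s.endswith('>'):
        ctx = s[-(k - 1):]
        options = [x for x in grammar.counts if x.startswith(ctx)]
        s += rng.choices(options, [grammar.counts[x] for x in options])[0][-1]
    return s[1:-1]   # strip the markers

# Infer an order-3 grammar from a finite sample of a^n b^n and generate from it.
g = LocalGrammar(3)
g.train(['a' * n + 'b' * n for n in range(1, 8)])
rng = random.Random(0)
print(sorted({generate(g, rng) for _ in range(30)}))
# The output typically includes strings such as 'aab' or 'aabbb' which are not
# of the form a^n b^n: the generated language is a proper superset of the sample.
```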
It is clear that the whole business of modelling languages, or more realistically language samples, can be done using a variety of possible classes of model, and that the class of local grammars of order k for different k at least makes the job easy if not always altogether convincing.
The disenchantment with local grammars of order k may be reduced slightly by going to a local grammar of order k+1.
Definition 14427
The Shannon filtration for a language $\mathcal{L}$ is the sequence of local grammars of order k for all positive integers k.
We can take the Shannon filtration of a natural language sample, at least up to the point where k is as big as the sample, when it gets pretty silly. It gets pretty silly long before that as a rule, and also impractical to store. For instance we may do what IBM's Yorktown Heights Speech Group does and store trigrams of words extracted from English text, together with the counts of the last word in the trigram for all possible bigrams. Shannon used almost exclusively the filtration I have named after him for getting better and better approximations to English. Convergence is not very fast and there are better filtrations. Finding a good one matters in the study of natural language and in such applications as speech recognition and OCR.
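A toy version of that trigram bookkeeping (a sketch of mine, nothing like the scale of the real system) keeps, for every bigram of words, the counts of the words that follow it.

```python
from collections import defaultdict

# For each bigram of words, tally the words that follow it in the text.
counts = defaultdict(lambda: defaultdict(int))
words = "the cat sat on the mat and the cat sat still".split()
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    counts[(w1, w2)][w3] += 1

print(dict(counts[('the', 'cat')]))   # {'sat': 2}
```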