The treatment I shall give extracts the juice
from the papers of King Sun Fu on the topic,
and makes it
much plainer what is going on.
The approach is one of compression. It is easiest to illustrate the method with an example, then to contemplate the abstract issues and finally to formulate the rules. This is how algorithms and theorems are devised.
Suppose we are given the language $ab^na$, or if you prefer $\{ab^na : n \geq 1\}$. We write the lot out as a sort of infinite finite state machine which just happens to be disconnected: any finite subset of the strings $ab^na$ for $n \geq 1$ will be a perfectly good finite state diagram. I use the symbol < for the begin states and > for the end states. This gives us fig. 7.2.
The next step is to start gluing together the nodes and arcs, starting with the begin and end symbols, and then proceeding from the left. The next stage is shown in fig. 7.3.
This gives us the possibility of gluing together the first nodes you come to from the start, and identifying all the arcs also. This is safe because all the arcs have the same symbol, a, on them. This gives the diagram of fig. 7.4, where we see that we can proceed to identify from the left to get the 'tower' of b symbols.
Finally, in fig. 7.5 we amalgamate the tower into a self loop, and all the final arcs also get amalgamated. This yields the necessary finite state diagram for the language.
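For readers who like to see such things mechanised, here is a minimal sketch in Python of the whole business for a finite sample drawn from $ab^na$; the Diagram class, its state numbering and the routine names are my own notation rather than anything standard, but the glue operation is exactly the amalgamation of nodes and arcs just described.

    from collections import defaultdict

    class Diagram:
        """Nodes are integers; arcs maps (node, symbol) to a set of target nodes.
        The begin and end sets play the roles of the < and > markers."""

        def __init__(self):
            self.arcs = defaultdict(set)
            self.begin = set()
            self.end = set()
            self.next_id = 0

        def new_node(self):
            n = self.next_id
            self.next_id += 1
            return n

        def add_string(self, string):
            # Write the string out as its own little chain of nodes and arcs.
            s = self.new_node()
            self.begin.add(s)
            for sym in string:
                t = self.new_node()
                self.arcs[(s, sym)].add(t)
                s = t
            self.end.add(s)

        def glue(self, keep, merge):
            # Identify node `merge` with node `keep`, amalgamating nodes and arcs.
            new_arcs = defaultdict(set)
            for (s, sym), targets in self.arcs.items():
                s2 = keep if s == merge else s
                new_arcs[(s2, sym)] |= {keep if t == merge else t for t in targets}
            self.arcs = new_arcs
            self.begin = {keep if s == merge else s for s in self.begin}
            self.end = {keep if s == merge else s for s in self.end}

        def glue_from_left(self):
            # First glue all the begin nodes together, then repeatedly glue the
            # targets of same-symbol arcs leaving a common node: this is the
            # identification from the left of figs. 7.3 and 7.4.
            while len(self.begin) > 1:
                a, b = sorted(self.begin)[:2]
                self.glue(a, b)
            changed = True
            while changed:
                changed = False
                for (s, sym), targets in list(self.arcs.items()):
                    if len(targets) > 1:
                        a, b = sorted(targets)[:2]
                        self.glue(a, b)
                        changed = True
                        break

    d = Diagram()
    for w in ["aba", "abba", "abbba"]:   # a finite sample from ab^n a
        d.add_string(w)
    d.glue_from_left()

Running glue_from_left on the chains for aba, abba and abbba produces the tower of b symbols, as in fig. 7.4; gluing the nodes of the tower together, each onto the node below it, and then calling glue_from_left once more to amalgamate the final arcs yields the self loop diagram of fig. 7.5.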
All that remains is to carefully formulate the gluing rules which make the transition from the original diagram to the final diagram. It seems reasonable that for finite state languages there are such rules; what we need is a procedure which tests to see if the language generated by the transformed network is altered by the gluing operation. This looks to be a finitary business.
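As a crude stand-in for such a test, one can simply compare the sets of strings up to some modest length accepted before and after a gluing operation. The sketch below is only a bounded check, not the exact procedure that Eilenberg gives, and the representation of a machine as a triple (begin states, end states, arcs) is just the one used in the Diagram sketch above.

    from itertools import product

    def accepts(machine, string):
        # machine is a triple (begin, end, arcs) as in the Diagram sketch above.
        begin, end, arcs = machine
        states = set(begin)
        for sym in string:
            states = set().union(*[arcs.get((s, sym), set()) for s in states])
        return bool(states & end)

    def same_language_up_to(m1, m2, alphabet, length):
        # Compare the accepted strings of length at most `length`; return a
        # witness string if the two machines disagree on one.
        for n in range(length + 1):
            for letters in product(alphabet, repeat=n):
                w = "".join(letters)
                if accepts(m1, w) != accepts(m2, w):
                    return False, w
        return True, None

Identifying from the left leaves the accepted strings alone; the final collapse of the tower does not, and the bounded check then returns a witness string, one of the $ab^na$ strings longer than anything in the sample.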
The details may be found in Eilenberg's excellent Automata, Languages and Machines, Vol. A, Chapter III, Section 5, although the terminology requires some translation. Note the reference to Beckman's IBM Research Report.
The ideas outlined can be turned into operations on ReWrite grammars, where the methods rapidly become virtually unintelligible. It is much easier to visualise them in terms of finite state diagrams, or automata as Eilenberg calls them. In the ReWrite grammar form, we have just outlined the ideas for inference of finite state grammars which King Sun Fu described in the papers cited in the bibliography at the end of this chapter.
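For concreteness, if fig. 7.5 has the shape described above, an a arc in, a b self loop, and an a arc out, then the corresponding right-linear ReWrite grammar, with one nonterminal per node and one production per arc (the nonterminal names being my own), is
\[
S \to aA, \qquad A \to bB, \qquad B \to bB, \qquad B \to a,
\]
and the gluing operations become identifications of nonterminals and coalescings of productions, which is where the unintelligibility sets in.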
For the stochastic case, there are some interesting issues: you will recall our earlier inference of a local grammar for a stochastic language. The same process can be applied to a finite sample. It can also be applied to the right segments of a stochastic language obtained by doing identifications from the left, as above. All that is necessary is to keep counts of the number of strings and to add them when we do the amalgamation. Then, given a sample of strings from the language, each labelled with the number of times it occurred, we may do the same process of identifications to give the diagram of fig. 7.6, where the counts on the arcs have been obtained by summing.
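The bookkeeping is easy to add to the earlier sketch; the following variant, once again purely illustrative and with names of my own devising, carries a count on each arc and sums the counts whenever two nodes are glued.

    from collections import defaultdict

    class CountedDiagram:
        """As Diagram above, but each arc carries an occurrence count."""

        def __init__(self):
            self.counts = defaultdict(int)   # (node, symbol, target) -> count
            self.begin, self.end = set(), set()
            self.next_id = 0

        def add_string(self, string, n=1):
            # A chain for a string observed n times; every arc gets count n.
            s = self.next_id; self.next_id += 1
            self.begin.add(s)
            for sym in string:
                t = self.next_id; self.next_id += 1
                self.counts[(s, sym, t)] += n
                s = t
            self.end.add(s)

        def glue(self, keep, merge):
            # Identify `merge` with `keep`, summing counts on amalgamated arcs.
            new = defaultdict(int)
            for (s, sym, t), c in self.counts.items():
                new[(keep if s == merge else s, sym, keep if t == merge else t)] += c
            self.counts = new
            self.begin = {keep if s == merge else s for s in self.begin}
            self.end = {keep if s == merge else s for s in self.end}

Identifying from the left with this glue gives the kind of summed arc counts shown in fig. 7.6.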
Now the diagram has merely summarised the data given, and does not produce anything essentially new. Nevertheless, it is tempting to collapse the column, even though it has finite height. We might justify this by observing that from the top of it, the view backward looks, if we reduce the counts to fractions, exactly the same as the view from the topmost but one, provided we do not look too far back. If we look at most two nodes back, the top of the tower sees behind it a 0.5 chance of coming in from its predecessor by emitting a b, and an equal chance of going home by emitting an a. The predecessor sees exactly the same, as does its predecessor. This suggests that amalgamating the top of the tower with the node preceding it will be harmless, and we might just as well identify the arcs leaving, because they go to the same place with the same symbol. This gives a self loop at the top of the tower. Of course, it is only locally harmless. It actually changes the language generated; it increases it to an infinite set. And having done this, we now discover that the view forward from every node of what is left of the tower is the same as for every other such node, so we might as well amalgamate them all.
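One can turn this observation about similar views into an explicit merging rule. A simple version, simplified in that it compares only the proportions of the symbols on the outgoing arcs rather than a bounded look-back, and with a tolerance eps of my own choosing, is:

    def view(out_counts):
        # Reduce a node's outgoing arc counts, {symbol: count}, to fractions.
        total = sum(out_counts.values())
        return {sym: c / total for sym, c in out_counts.items()}

    def mergeable(out_a, out_b, eps=0.1):
        # Deem two nodes amalgamable if their fractional views agree to within eps.
        va, vb = view(out_a), view(out_b)
        return all(abs(va.get(s, 0.0) - vb.get(s, 0.0)) <= eps
                   for s in set(va) | set(vb))

    print(mergeable({'b': 3, 'a': 3}, {'b': 1, 'a': 1}))   # True: both split half and half
    print(mergeable({'b': 5, 'a': 1}, {'a': 6}))           # False: quite different views

Merging whenever such a test passes is exactly the sort of locally harmless, globally generalising amalgamation just described.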
We see then that we can compress the diagram just
a little more than is altogether good for it,
and if we do, we get a
model out which expects other things than those
we actually observed. We could do even more compression
and get out even larger languages: in the limiting
case we blithely identify every node, and we
expect any non-empty string on the alphabet $\{a, b\}$. If there are $p$ as and $q$ bs, the probability is the product of $p$ copies of the $a$-loop probability, $q$ copies of the $b$-loop probability, and a final stopping probability. The reader
is encouraged to satisfy himself that the result
is a well defined stochastic language and that all
the probabilities sum, in the limit, to 1. This
might suggest a neat way to work out some of those formulae
involving multinomial expansions which are so
tiresome to remember.
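To spell out the check under one concrete set of assumptions (mine, since the arc counts of fig. 7.6 are not reproduced here): suppose the completely collapsed machine at each step emits an a with probability $\alpha$, a b with probability $\beta$, and stops with probability $1-\alpha-\beta$. A string with $p$ as and $q$ bs then has probability $\alpha^{p}\beta^{q}(1-\alpha-\beta)$, and summing over all strings of all lengths,
\[
\sum_{n \ge 0} \sum_{p+q=n} \binom{n}{p}\,\alpha^{p}\beta^{q}\,(1-\alpha-\beta)
\;=\; (1-\alpha-\beta)\sum_{n \ge 0}(\alpha+\beta)^{n}
\;=\; \frac{1-\alpha-\beta}{1-(\alpha+\beta)} \;=\; 1,
\]
which is just the binomial theorem and a geometric series doing the tiresome work.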
This can be done with any language sample on any alphabet; the business of amalgamating nodes and summing counts can be performed with more or less enthusiasm and corresponding disregard for the actual data. This gives an alternative to seeking hidden Markov model initialisations by going into seedy bars, if you prefer, for some reason, to avoid them.
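If the summed counts are wanted as initial hidden Markov model parameters, nothing more is needed than to normalise each node's outgoing counts; a one-function sketch written against the CountedDiagram above, and again only illustrative, is:

    from collections import defaultdict

    def arc_probabilities(d):
        # Normalise each node's outgoing counts so they sum to one; the result
        # is a stochastic machine usable as a starting point for an HMM.
        totals = defaultdict(int)
        for (s, sym, t), c in d.counts.items():
            totals[s] += c
        return {(s, sym, t): c / totals[s] for (s, sym, t), c in d.counts.items()}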
It is quite entertaining to experiment
with some wild language samples (such as you might
get out of a Vector Quantiser for speech) and
see if
you can find interesting and plausible rules for
amalgamating nodes.
It is possible, as was indicated earlier, to modify the finite state diagrams or automata for the case of context free grammars; the idea here is to allow boxes to contain copies of themselves as well as being part of a network. The language discussed above is shown in such a form in fig. 7.7.
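To make the self-containing box concrete, here is a sketch in my own formulation, taking the running example $ab^na$ for definiteness: the box B either emits a single b, or emits a b and then enters a fresh copy of itself.

    def box_B(s, i):
        # Positions at which the box B, entered at position i of s, could exit.
        # B either emits a single b, or emits a b and then enters a copy of itself.
        ends = set()
        if i < len(s) and s[i] == 'b':
            ends.add(i + 1)             # B -> b
            ends |= box_B(s, i + 1)     # B -> b B   (the box contains itself)
        return ends

    def in_language(s):
        # The outer network: an a arc, then the box B, then a final a arc.
        if not s.startswith('a'):
            return False
        return any(j == len(s) - 1 and s[j] == 'a' for j in box_B(s, 1))

    assert in_language("aba") and in_language("abbba") and not in_language("abab")

A box which emits a symbol on the way in, the matching symbol on the way out, and a copy of itself in between is the same device doing genuinely context free, palindromic work.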
Inference for such things is possible in special cases, but one might reasonably doubt if this is a problem that is likely to occur in practice. If we think of the problems faced by a cockroach trying in his limited way to remember a sequence of muscle contractions, say, and if we suppose that allowable sequences that have successfully got the 'roach where he wanted in the past have had some particular form, then we observe that although some small amount of palindromic structure is conceivable, it seems, a priori, somewhat unlikely that it would get carried very far. And we are most likely to be concerned not with the case of a deterministic grammar or language, but a stochastic one, and with a finite sample. The reader might like to try the following experiment: construct two grammars for a small alphabet, generate a dozen strings from each, put the strings from one grammar into a red box and from the other grammar into a blue box. Procure a small child of age three or thereabouts. Read the strings from the red box grammar while ostentatiously pointing to the red box, then read the strings of the other grammar while pointing to the blue box from which they were drawn. Now generate a few more strings, and ask the child to point to the appropriate box. See if the child gets the right answer. If not, is it colour blind? Did you make an injudicious choice of grammars? Try some other strings not generated by either grammar and see if the child is uncertain. The child is doing grammatical inference on the sample provided, and it is of some interest to speculate about the class of grammars it is prone to use. If instead of reading out symbols and making noises, a sequence of brightly coloured blocks is used, does the child still use the same grammatical inference process?
Return child after use.
The grammatical inference problem can be recast to other forms, and other problems, notably a large fraction of those occurring in I.Q. tests, can be viewed as variants on grammatical inference. Short sequences of distinguishable events often occur in nature, and classifying them is something which may be rather desirable for an organism wishing to reproduce itself before being eaten or starving to death. It would be rather nice to know to what extent the central nervous system of animals uses consistent methods for obtaining such classifications, or equivalently, whether some particular class of stochastic grammars is preferred. I shall return to these somewhat speculative matters later when discussing a new class of neural models.