At the bottom of the ladder sit the finite state languages. If we want to, we can think of them as the languages we get by hopping about a finite state diagram, or we can cast them into ReWrite form as in the following example:
The language can be represented by the finite state diagram of fig. 7.1, and it can be expressed rather more elliptically in ReWrite form. The symbols A and B can be rewritten into the last alphabet, the alphabet of the language for which we are trying to obtain a grammar. The allowed ways of doing this, together with the initial ReWrite for completeness' sake, can be read off from fig. 7.1.
It is easy to see that if you simply follow the rules, that is, start with an S, rewrite the strings you get by rewriting the symbols in them, and stop only when there is nothing more to do, then you must generate one of the strings of the given language. Also, every string of the language can be obtained in this way.
It is also easy to see that you could make this into a stochastic grammar for the stochastic language generated with the probabilities given by fig. 7.1, by putting probabilities on which rule you choose when there is a choice of rewriting a particular symbol. This is a somewhat unusual way of representing a Markov Model, but it generalises to other languages which are not describable as Markov Models.
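The construction can be sketched in a few lines of code. The rules and probabilities below are invented for illustration (they are not the ones belonging to fig. 7.1): each intermediate symbol is rewritten into a terminal symbol, optionally followed by another intermediate symbol, and the probabilities decide which rule fires when there is a choice.

```python
import random

# A stochastic right-linear ReWrite grammar, sketched in the spirit of the
# text.  These rules and probabilities are a made-up example: each entry
# maps an intermediate symbol to weighted choices of the form
# (probability, (terminal, next intermediate symbol or None to stop)).
RULES = {
    "S": [(0.5, ("a", "A")), (0.5, ("b", "B"))],
    "A": [(0.7, ("a", "A")), (0.3, ("b", None))],
    "B": [(0.4, ("b", "B")), (0.6, ("a", None))],
}

def sample(rules, start="S", rng=random):
    """Generate one string by repeatedly rewriting the current symbol,
    choosing among the applicable rules with the given probabilities."""
    out, symbol = [], start
    while symbol is not None:
        probs, choices = zip(*rules[symbol])
        terminal, symbol = rng.choices(choices, weights=probs)[0]
        out.append(terminal)
    return "".join(out)
```

Putting the probabilities on the rules in this way is exactly the "stochastic grammar" of the text: dropping the weights recovers the ordinary finite state grammar.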
With only a little thought, you can convince yourself that any finite state diagram can be turned into a ReWrite grammar. The idea is simply to label the nodes of the diagram with letters from a new alphabet, and for every arc labelled x from node A to node B to write the rule A → xB, together with A → x whenever the arc ends at a terminating node. Conversely, if the ReWrite rules all look like A → xB or A → x, with A and B single intermediate symbols and x a terminal symbol, then it is just as easy to construct a finite state diagram which generates precisely the strings the grammar generates.
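A sketch of this node-labelling construction, assuming the diagram is given as a list of arcs; the little two-node automaton below is made up for illustration, and accepts strings of the form a(ba)*.

```python
# Given a finite state diagram with symbols emitted along arcs, label each
# node with a letter of a new alphabet and emit one rule  A -> xB  per arc,
# plus  A -> x  for each arc into an accepting node.  The example automaton
# is invented for illustration.
ARCS = [("S", "a", "F"), ("F", "b", "S")]   # (from-node, symbol, to-node)
ACCEPT = {"F"}

def fsa_to_rules(arcs, accept):
    rules = []
    for src, sym, dst in arcs:
        rules.append((src, (sym, dst)))       # A -> xB
        if dst in accept:
            rules.append((src, (sym, None)))  # A -> x : stop at accept node
    return rules
```

Running `fsa_to_rules(ARCS, ACCEPT)` gives the three rules S → aF, S → a and F → bS, which generate exactly the strings the diagram does.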
In general we may define a ReWrite grammar as follows:

Definition
A ReWrite Grammar for a language L is a finite sequence of alphabets of distinct symbols, where the last alphabet is the alphabet of the language L, together with a finite set of productions or rules of the form α → β, where α and β are strings of symbols from the alphabets.
The alphabets other than the last are called the intermediate alphabets and any symbol from them is called an intermediate symbol. The last alphabet is called the alphabet of terminal symbols, the first alphabet the alphabet of initial symbols. Then we require that every string of the language can be produced by a derivation, and no strings not in the language can be so derived.
This is a little untidy: it suffices to have only one initial symbol, by introducing an even earlier alphabet if necessary, together with the obvious productions to get from my new initial symbol to your old initial symbols. Similarly, we can get by with amalgamating all the intermediate alphabets into just one. Why bother to make the distinction? Ah, well, there may be reasons coming along shortly.
One of the first restrictions which we could make on the ReWrite Grammars defined above is to stipulate that you can only go down the sequence of alphabets, that is to say, when we have a production, every symbol on the right hand side must come from the same level as, or a level lower than, the symbols on the left hand side.
The most extreme restriction of this kind we might make is that the left hand side of each production is allowed to consist of a single symbol at any level, and the right hand side of a string of symbols at the level one lower. This gives only finite languages such as English. It doesn't do a whole lot for English either. Even a superficial familiarity with the structure of English or any other natural language leaves one feeling that grammars such as this don't get us too far.
An alternative restriction which might be imposed would be to allow the right hand side of a production to contain a string of symbols from the next alphabet lower down followed by an optional symbol from the same level as the left hand side, and to stipulate still only single symbols on the left. This gives us, by the above argument, the finite state languages. This is a weaker restriction, so we get a larger class of languages which can be obtained by this class of ReWrite rules.
To weaken things even further, suppose that you ensure that the left hand side is a single symbol, from some level alphabet, and the right hand side is some mix of symbols from the same level or the next lower level alphabet. I shall call such a grammar a Context Free (ReWrite) grammar. If a language may be obtained from such a grammar but not from a finite state grammar, it is called a context free language.
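A context free derivation is easy to mechanise. In the sketch below (a toy grammar of mine, not one from the text) the intermediate symbols are the keys of the dictionary, every left hand side is a single symbol, and we repeatedly rewrite the leftmost intermediate symbol until only terminal symbols remain.

```python
import random

# A minimal Context Free ReWrite grammar: S -> (S) | ().  The keys of the
# dictionary are the intermediate symbols; everything else is terminal.
CFG = {"S": [["(", "S", ")"], ["(", ")"]]}

def derive(grammar, start="S", rng=random):
    """Run a derivation: keep rewriting the leftmost intermediate symbol,
    choosing a production at random, until only terminals remain."""
    string = [start]
    while any(s in grammar for s in string):
        i = next(i for i, s in enumerate(string) if s in grammar)
        string[i:i + 1] = rng.choice(grammar[string[i]])
    return "".join(string)
```

Each run produces a string of properly nested brackets; the single-symbol left hand sides are what make the grammar context free.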
As an example, consider the language consisting of the strings a^n b a^n for n ≥ 0, that is, some number of as, then a b, then the same number of as again. This language is not finite state.
To convince yourself of this, observe that if you run around a finite state diagram which has no loops in it, then there is some longest string which can be obtained, and all other strings are the same length or shorter. So there can only be a finite number of them. It follows that if the language has an infinite number of strings and is finite state, then the finite state diagram which generates the language must have at least one loop. Extending the argument a little, given such a diagram, there is a longest path (at least one) which does not pass around a loop, and any longer path must go around some loop at least once. So if you have an infinite language which is regular, there is some string in it with the property that generating the string involved going at least once around a loop, so that some of the string represents going around the loop, the bit before (if any) represents getting to the start of the loop, and the part after was obtained by going home after completing the loop. Now if you went around once, you could have gone around twice, three times or any other number of times. Or to say it in algebra: for any finite state language which is infinite, there is a string in that language having a non-empty substring p in it, so that the string can be written apb for strings a and b (possibly empty), and such that ap^n b is also always in the language, for any n ≥ 0. This result is called the pumping lemma, possibly because you get a heart attack if you run around the same loop often enough.
Now apply it to the language of strings a^n b a^n. There is, if the language is finite state, some (possibly longish) string in it and a substring representing an arbitrarily repeatable loop. If this substring contains the b, then there would have to be strings which contain lots of bs in the language, and there aren't. So it must consist entirely of as on the left of the b, or of as on the right of the b. Either way this leads to strings a^n b a^m with n ≠ m, which do not in fact occur. So the language is not finite state.
It can however easily be obtained by the following ReWrite system, among others: S → aSa, S → b. This is evidently context-free.
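One such system is the standard two-rule grammar S → aSa, S → b, which can be written out directly as code (a small sketch of mine): using the first rule n times and then the second yields a^n b a^n, and membership can be checked by peeling matched as off both ends.

```python
# The ReWrite system S -> aSa, S -> b, made concrete: n applications of the
# first rule followed by one of the second give the string a^n b a^n.
def generate(n):
    """Derive the string obtained by using S -> aSa exactly n times."""
    return "a" * n + "b" + "a" * n

def in_language(s):
    """Check membership in { a^n b a^n : n >= 0 } by stripping one a from
    each end while possible, then checking that a lone b remains."""
    while s.startswith("a") and s.endswith("a") and len(s) > 1:
        s = s[1:-1]
    return s == "b"
```

The checker is, in effect, running the derivation backwards, which is why it needs to remember how many as it has stripped only implicitly, in the length of the remaining string.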
Context Free languages can also be described by graphical means; the diagrams between pp 116 and 118 in the PASCAL User Manual and Report by Kathleen Jensen and Niklaus Wirth, Springer Verlag, 1975 follow a Backus-Naur formalism (Rewrite Grammar) description of the syntax of the programming language PASCAL. It is easy to see that we could put all the labelled boxes together into one big diagram. They do not, however, constitute an automaton or finite state diagram (or finite state machine), because a subdiagram consisting of a box called factor can contain inside itself a sequence consisting of the term `not' followed by the box factor. This allows for recursion.

The PASCAL diagrams have symbols associated with nodes rather than arcs, but it is easy to persuade oneself that this is a minor difference of convention, and that if I have a finite state diagram where the symbols are emitted at nodes, you can easily construct one in which symbols are emitted along arcs so that both systems produce the same strings, and vice versa. Similarly with the diagrams for PASCAL, which determine what is called a Push Down Stack Automaton. To represent such a machine we may need to have diagrams containing little boxes with a suitable symbolic label, so that a path can take one into and thence out of each such box, and also a separate diagram telling you what is inside the box. What may be inside the box is a path from input to output which passes through other such boxes or even the same box again.

It is not too hard to prove that such automata can generate and recognise any Context Free Language, and that all languages generated and recognised (or accepted) by such automata are context free. This allows for an extension of the pumping lemma to CF languages; the proofs of all these claims are left to the reader to work out, or he can cheat by examining the standard books on the subject.
Working out the proper definitions of the terms is probably the only bit that is much fun, and it would be wrong of me to take this pleasure from him.
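By way of illustration, here is a minimal sketch of such a machine for the language a^n b a^n discussed above; the coding is mine, not the automaton the PASCAL diagrams determine. The stack remembers how many as preceded the b, and the machine pops one symbol for each a that follows it, accepting when the input and the stack run out together.

```python
# A sketch of a Push Down Stack Automaton recognising { a^n b a^n }:
# push a marker per leading a, pop one per trailing a, accept when the
# stack is empty at the end of the input.
def pda_accepts(s):
    stack, seen_b = [], False
    for c in s:
        if c == "a" and not seen_b:
            stack.append("A")            # push: remember one leading a
        elif c == "b" and not seen_b:
            seen_b = True                # the single middle b
        elif c == "a" and seen_b:
            if not stack:
                return False             # more trailing as than leading
            stack.pop()                  # pop: match one trailing a
        else:
            return False                 # a second b, or a foreign symbol
    return seen_b and not stack
```

The unbounded stack is precisely what the finite state diagram lacked, and is why the pumping argument above cannot be run against this machine.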
The next level of generality in the Chomsky Hierarchy was to simply ensure that the right hand side of any production had to contain at least as many symbols as the left hand side. Such grammars are called Context Sensitive, as are the languages which can only be obtained from such grammars. (It is easy to construct a context sensitive grammar for a finite state language, merely unnecessary.)
And finally, the Unrestricted ReWrite Systems allow anything whatever. It is a fact that any language which can be generated by a Turing Machine can be generated by some ReWrite system, and vice versa, and Church's Thesis is the claim that anything that is effectively computable can be computed by a Turing Machine. So if you are willing to buy Church's Thesis, any algorithm whatever for generating strings can be realised by a ReWrite system if you feel so inclined. This explains why I use the term `algorithm' in the definitions at the beginning of the chapter.
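To see how unrestricted these systems are, here is a sketch of an interpreter for an arbitrary ReWrite system, together with the standard textbook rules for the language a^n b^n c^n, which is not even context free. The enumeration is breadth first, and the length bound is only there to keep the search finite.

```python
from collections import deque

# The standard compact rules for { a^n b^n c^n : n >= 1 }: note the left
# hand sides cB and bB are whole strings, which no context free grammar
# is allowed.
ABC_RULES = [("S", "aSBc"), ("S", "abc"), ("cB", "Bc"), ("bB", "bb")]

def derive_strings(rules, start="S", max_len=12):
    """Breadth first search over everything derivable from the start
    symbol, collecting the fully terminal (all lower case) strings."""
    seen, queue, results = {start}, deque([start]), set()
    while queue:
        s = queue.popleft()
        if s.islower():
            results.add(s)               # no intermediate symbols remain
            continue
        for lhs, rhs in rules:
            for i in range(len(s) - len(lhs) + 1):
                if s.startswith(lhs, i):
                    t = s[:i] + rhs + s[i + len(lhs):]
                    if t not in seen and len(t) <= max_len:
                        seen.add(t)
                        queue.append(t)
    return results
```

The interpreter itself knows nothing about the particular rules, which is the point: any algorithm for generating strings can be fed to it as a rule set, length bounds permitting.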
Computer Scientists are familiar with the Backus-Naur Formalism, much used for specifying the structure of Computer Languages, as a slightly disguised version of the Rewrite form of a grammar; the Metamathematicians got to it first. The principal applications of formal language theory of this sort are (a) in designing Compilers, where there is virtually no application of a body of theory but some value in the discipline of thought forced by the terminology, and (b) keeping a certain sort of Pure Mathematician off the streets.