next up previous contents
Next: Grammatical Inference Up: Alphabets, Languages and Grammars Previous: Definitions and Examples

ReWrite Grammars

Noam Chomsky is a conspiracy theorist who used to be a theoretical linguist; the Chomsky Hierarchy refers to language types rather than degrees of authority. Chomsky's work in formal language theory has been described by some as `seminal' [*].

At the bottom of the ladder sit the finite state languages. If we want to, we can think of them as being the languages we get from hopping about a finite state diagram, or we can cast them into ReWrite form, as in the following example:

The language $\{ ab^na: n \in {\mathbb N}\}$ can be represented by the finite state diagram of fig. 7.1; it can be expressed rather more elliptically as

\begin{displaymath}
\{ aa, aba, abba, abbba, abbbba, \cdots \} \end{displaymath}

Now we use three alphabets to describe it, and ReWrite rules. The first alphabet has only one symbol in it, S (which some say stands for `sentence', and others for `start'), and you are allowed to ReWrite it to the string ABA on a quite different alphabet containing two symbols, A,B. We write this in the form:

\begin{displaymath}
S \longrightarrow ABA \end{displaymath}

The symbols A and B can be rewritten into the last alphabet, the alphabet of the language for which we are trying to obtain a grammar, the alphabet $\{ a, b \}$. The allowed ways of doing this, together with the above ReWrite for completeness' sake, are:

\begin{displaymath}
S \longrightarrow ABA, \quad A \longrightarrow a, \quad B \longrightarrow bB, \quad B \longrightarrow \lambda
\end{displaymath}

where $\lambda$ denotes the empty string.
It is easy to see that if you simply follow the rules (start with an S, rewrite the strings you get by rewriting the symbols in them, and stop only when there is nothing more to do), then you must generate one of the strings of the given language. Also, every string of the language can be obtained in this way.
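For readers who like to see such things mechanised, here is a short Python sketch of the procedure. The rules used are a plausible reconstruction (S → ABA, A → a, B → bB, B → empty; the exact set in the figure may differ): apply every rule everywhere it fits, breadth first, and collect the fully rewritten strings up to a length bound.

```python
from collections import deque

# Reconstructed ReWrite rules for {a b^n a : n in N};
# "" stands for the empty string, so B may vanish.
RULES = [("S", "ABA"), ("A", "a"), ("B", "bB"), ("B", "")]

def derive(max_len=6):
    """Exhaustively rewrite, breadth first, collecting the strings
    that contain no intermediate (upper case) symbols."""
    seen, results = set(), set()
    queue = deque(["S"])
    while queue:
        s = queue.popleft()
        if s in seen or len(s) > max_len + 1:
            continue
        seen.add(s)
        if s.islower():            # all symbols terminal: finished
            results.add(s)
            continue
        for lhs, rhs in RULES:     # rewrite every occurrence of every lhs
            i = s.find(lhs)
            while i != -1:
                queue.append(s[:i] + rhs + s[i + len(lhs):])
                i = s.find(lhs, i + 1)
    return results

print(sorted(derive(), key=len))  # ['aa', 'aba', 'abba', 'abbba', 'abbbba']
```

Every string produced lies in the language, and every string of the language (up to the length bound) is produced, exactly as claimed above.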

It is also easy to see that you could make this into a stochastic grammar for the stochastic language generated with the probabilities given by fig. 7.1, by putting probabilities on which rule you choose whenever there is a choice of rewriting a particular symbol. This is a somewhat unusual way of representing a Markov Model, but it generalises to other languages which are not describable as Markov Models.
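A stochastic version is equally mechanical: wherever there is a choice of rule for a symbol, toss a biased coin. The rules are the reconstruction used above, and the probability of continuing the loop is an illustrative assumption, since the values in fig. 7.1 are not reproduced here.

```python
import random

def sample(p_continue=0.5, rng=random):
    """Draw one string from the stochastic grammar for {a b^n a}.
    p_continue is the assumed probability of choosing B -> bB
    over B -> (empty); the true value would come from fig. 7.1."""
    s = "S"
    while not s.islower():
        if "S" in s:
            s = s.replace("S", "ABA", 1)
        elif "A" in s:
            s = s.replace("A", "a", 1)
        else:  # the only stochastic choice: B -> bB or B -> empty
            rhs = "bB" if rng.random() < p_continue else ""
            s = s.replace("B", rhs, 1)
    return s

random.seed(0)
print([sample() for _ in range(5)])
```

Each draw is a string $ab^na$ with $n$ geometrically distributed, which is exactly the behaviour of the corresponding Markov Model.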

With only a little thought, you can convince yourself that any finite state diagram can be turned into a ReWrite grammar. The idea is simply to label the nodes of the diagram with letters from a new alphabet, and write

\begin{displaymath}
S \longrightarrow xY \end{displaymath}

whenever you can get from the start of the diagram to the node Y by emitting the symbol x, and then to continue similarly for all the other nodes. We may conveniently use a blank for the symbol for the terminating node or nodes, if we have such things.

Conversely, if the ReWrite rules all look like

\begin{displaymath}
X \longrightarrow xY \end{displaymath}

where X and Y may be the same, and Y may be empty, in which case it represents the terminating node of the diagram, then it is easy to build a finite state diagram of nodes and arrows which produces the same output. Clearly, some compression of this description is allowable: we didn't really need the A symbol in the example at all, and x may be a string of terminal symbols, that is, symbols of the alphabet of the language we actually care about.
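The translation from diagram to grammar is purely clerical, as a sketch makes plain. The diagram below for $\{ ab^na \}$ is an assumed reconstruction of fig. 7.1: one rule X → xY is read off per arrow, with a blank Y at the terminating node.

```python
# Arrows of an assumed finite state diagram for {a b^n a}:
# (source node, emitted symbol, destination node).
ARROWS = [
    ("S", "a", "P"),   # from the start, emit a, reach node P
    ("P", "b", "P"),   # loop on b at node P
    ("P", "a", "Q"),   # emit the final a, reach terminating node Q
]
TERMINATING = {"Q"}

def rules_from_diagram(arrows, terminating):
    """One production X -> xY per arrow; Y is blank when the arrow
    lands on a terminating node."""
    return [(src, sym + ("" if dst in terminating else dst))
            for src, sym, dst in arrows]

print(rules_from_diagram(ARROWS, TERMINATING))
# [('S', 'aP'), ('P', 'bP'), ('P', 'a')]
```

Running the construction the other way, each rule X → xY becomes an arrow from node X to node Y labelled x, which is the converse direction argued in the text.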

In general we may define a ReWrite grammar as follows:

Definition

A ReWrite Grammar for a language ${\cal L}$ is a finite sequence of alphabets of distinct symbols, where the last alphabet is the alphabet of the language ${\cal L}$, together with a finite set of productions or rules of the form

\begin{displaymath}
P \longrightarrow Q \end{displaymath}

where P and Q are strings of symbols from any of the alphabets. A derivation of a string $\ell$ of ${\cal L}$ is a sequence of productions and a sequence of strings, such that the first element of the sequence of strings is a single symbol from the first alphabet, the last string is precisely $\ell$, and the move from each string to its successor is sanctioned by the corresponding rewrite rule: if the kth production is $P \longrightarrow Q$, then the kth string contains P as a substring, and the (k+1)th string is obtained from the kth by replacing one occurrence of that substring P with Q.

The alphabets other than the last are called the intermediate alphabets and any symbol from them is called an intermediate symbol. The last alphabet is called the alphabet of terminal symbols, the first alphabet the alphabet of initial symbols. Then we require that every string of the language can be produced by a derivation, and no strings not in the language can be so derived.
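The definition is mechanical enough to check by machine. A short sketch, using the example grammar reconstructed earlier (S → ABA, A → a, B → bB, B → empty), verifies that each step of a candidate derivation is sanctioned by some production:

```python
def sanctioned(before, after, rules):
    """Is the move from `before` to `after` licensed by some rule
    P -> Q, i.e. is `after` obtained from `before` by replacing one
    occurrence of P with Q?"""
    for p, q in rules:
        i = before.find(p)
        while i != -1:
            if before[:i] + q + before[i + len(p):] == after:
                return True
            i = before.find(p, i + 1)
    return False

def is_derivation(strings, rules, initial="S"):
    """A derivation starts at the initial symbol and every
    consecutive pair of strings is sanctioned by a rule."""
    return strings[0] == initial and all(
        sanctioned(a, b, rules) for a, b in zip(strings, strings[1:]))

RULES = [("S", "ABA"), ("A", "a"), ("B", "bB"), ("B", "")]
print(is_derivation(["S", "ABA", "aBA", "abBA", "abBa", "aba"], RULES))  # True
```

Note that the checker makes no assumption about levels of alphabets; the restrictions discussed below simply cut down the set of rules it is given.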

This is a little untidy: it suffices to have only one initial symbol, by introducing an even earlier alphabet if necessary, together with the obvious productions to get from my new initial symbol to your old initial symbols. Similarly, we can get by with amalgamating all the intermediate alphabets into just one. Why bother to make the distinction? Ah, well, there may be reasons coming along shortly.

One of the first restrictions which we could make on the ReWrite Grammars defined above is to stipulate that you can only go down the sequence of alphabets, that is to say, when we have a production

\begin{displaymath}
P \longrightarrow Q \end{displaymath}

Q is not allowed to contain symbols from earlier alphabets than P. If you reflect on natural language, where you can ReWrite the S symbol as, perhaps, $NP\ast VP\ast NP$ where NP is a new symbol (standing for Noun Phrase) and the sequence of ReWrites which can go on to generate a sequence of ASCII characters constituting a sentence in the English language, you will perhaps find this stipulation attractive. It gives some kind of hierarchical structure to the system.

The most extreme restriction of this kind we might make is that the left hand side of each production is allowed to consist of a single symbol at any level, and the right hand side of a string of symbols at the level one lower. This gives only finite languages such as English. It doesn't do a whole lot for English either. [*] Even a superficial familiarity with the structure of English or any other natural language leaves one feeling that grammars such as this don't get us too far.

An alternative restriction which might be imposed would be to allow the right hand side of a production to contain a string of symbols from the next alphabet lower down followed by an optional symbol from the same level as the left hand side, and to stipulate still only single symbols on the left. This gives us, by the above argument, the finite state languages. This is a weaker restriction, so we get a larger class of languages which can be obtained by this class of ReWrite rules.

To weaken things even further, suppose that you ensure that the left hand side is a single symbol, from some level alphabet, and the right hand side is some mix of symbols from the same level or the next lower level alphabet. I shall call such a grammar a Context Free (ReWrite) grammar. If a language may be obtained from such a grammar but not from a finite state grammar, it is called a context free language.

As an example, consider the language ${\cal L}_{cf}$ defined by

\begin{displaymath}
{\cal L}_{cf} = \{ a^nba^n: n \in {\mathbb N}\} \end{displaymath}

Now it is not too hard to see that this is not a finite state language, that is to say there is no finite diagram such that running round it will generate precisely this set of strings.

To convince yourself of this, observe that if you run around a finite state diagram which has no loops in it, then there is some longest string which can be obtained, and all other strings are the same length or shorter. So there can only be a finite number of them. It follows that if the language has an infinite number of strings and is finite state, then the finite state diagram which generates the language must have at least one loop. Extending the argument a little, given such a diagram, there is a longest path (at least one) which does not pass around a loop, and any longer path must go around some loop at least once. So if you have an infinite language which is regular, there is some string in it with the property that generating the string involved going at least once around a loop: some of the string represents going around the loop, the bit before (if any) represents getting to the start of the loop, and the part after was obtained by going home after completing the loop. Now if you went around once, you could have gone around twice, three times or any other number of times. Or to say it in algebra: for any finite state language which is infinite, there is a string $\ell$ in that language having a substring p in it, so that $\ell = apb$ for other strings a, b (possibly empty), and such that $ap^nb$ is also always in the language, for any $n \in {\mathbb Z}^+$. This result is called the pumping lemma, possibly because you get a heart attack if you run around the same loop often enough.
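A few lines of Python make the pumping concrete for the finite state language $\{ ab^na \}$: take the string abba, decomposed with loop substring p = b, and pump it as often as you please.

```python
import re

# The finite state language {a b^n a} written as a regular expression.
LANGUAGE = re.compile(r"ab*a")

# Decompose the string "abba" as x p y, where p = "b" is the part
# contributed by the loop of the diagram.
x, p, y = "a", "b", "ba"

# Going around the loop n times instead of once must stay in the language.
pumped = [x + p * n + y for n in range(6)]
print(pumped)
print(all(LANGUAGE.fullmatch(s) for s in pumped))  # True
```

For a genuinely finite state language the pumped strings all belong; the next paragraph shows that for ${\cal L}_{cf}$ no choice of loop substring can survive this test.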

Now apply it to the language ${\cal L}_{cf}$. If the language were finite state, there would be some (possibly longish) string in it with a substring p representing an arbitrarily repeatable loop. If this substring p contained the b, then there would have to be strings containing lots of bs in the language, and there aren't. So p must consist entirely of as on the left of the b, or entirely of as on the right of the b. Either way, pumping leads to strings $a^nba^m$ for $n \neq m$, which do not in fact occur. So the language is not finite state.

It can however easily be obtained by the following ReWrite system

\begin{displaymath}
S \longrightarrow aSa, \quad S \longrightarrow b
\end{displaymath}

among others. This is evidently context free.
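Taking the rules to be S → aSa and S → b (one natural choice of context free system for this language), a derivation of $a^nba^n$ is simply n applications of the first rule followed by one application of the second:

```python
def generate(n):
    """Derive a^n b a^n: apply S -> aSa n times, then S -> b."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "aSa")  # the single S sits in the middle
    return s.replace("S", "b")     # finish with S -> b

print([generate(n) for n in range(4)])  # ['b', 'aba', 'aabaa', 'aaabaaa']
```

Notice that at every stage there is exactly one intermediate symbol, flanked by equal numbers of as, which is why the counting comes out right.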

Context Free languages can also be described by graphical means; the diagrams between pp. 116 and 118 of the PASCAL User Manual and Report by Kathleen Jensen and Niklaus Wirth, Springer Verlag, 1975, follow a Backus-Naur formalism (ReWrite Grammar) description of the syntax of the programming language PASCAL. It is easy to see that we could put all the labelled boxes together into one big diagram. They do not, however, constitute an automaton or finite state diagram (or finite state machine), because a subdiagram consisting of a box called factor can contain inside itself a sequence consisting of the symbol `not' followed by the box factor. This allows for recursion.

The PASCAL diagrams have symbols associated with nodes rather than arcs, but it is easy to persuade oneself that this is a minor difference of convention: if I have a finite state diagram where the symbols are emitted at nodes, you can easily construct one in which symbols are emitted along arcs so that both systems produce the same strings, and vice versa. Similarly with the diagrams for PASCAL, which determine what is called a Push Down Stack Automaton. To represent such a machine we may need to have diagrams containing little boxes with a suitable symbolic label, so that a path can take one into and thence out of each such box, and also a separate diagram telling you what is inside the box. What may be inside the box is a path from input to output which passes through other such boxes, or even the same box again.

It is not too hard to prove that such automata can generate and recognise any Context Free language, and that all languages generated and recognised (or accepted) by such automata are context free. This allows for an extension of the pumping lemma to CF languages; the proofs of all these claims are left to the reader to work out, or he can cheat by examining the standard books on the subject. Working out the proper definitions of the terms is probably the only bit that is much fun, and it would be wrong of me to take this pleasure from him.
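A minimal sketch of such a machine for $\{ a^nba^n \}$ shows the idea: the stack does exactly the counting that no finite state diagram can manage, pushing the as before the b and popping one for each a after it.

```python
def accepts(s):
    """A push down (stack) recognizer for {a^n b a^n}."""
    stack, seen_b = [], False
    for c in s:
        if c == "a" and not seen_b:
            stack.append("a")      # push phase: count as before the b
        elif c == "b" and not seen_b:
            seen_b = True          # the middle marker, exactly once
        elif c == "a" and seen_b:
            if not stack:
                return False       # more as after the b than before it
            stack.pop()            # pop phase: match as after the b
        else:
            return False           # second b, or any other symbol
    return seen_b and not stack    # all pushed as matched

print([s for s in ["b", "aba", "aabaa", "aabab", "aaba"] if accepts(s)])
# ['b', 'aba', 'aabaa']
```

The finite control here has just two states (before and after the b); all the unbounded memory lives in the stack, which is precisely what separates these machines from finite state diagrams.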

The next level of generality in the Chomsky Hierarchy was to simply ensure that the right hand side of any production had to contain at least as many symbols as the left hand side. Such grammars are called Context Sensitive, as are the languages which can only be obtained from such grammars. (It is easy to construct a context sensitive grammar for a finite state language, merely unnecessary.)
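The defining condition is easy to check mechanically; here is a two-line sketch, taking the condition to be that every right hand side is at least as long as its left hand side:

```python
def non_contracting(rules):
    """Context sensitive in the sense used here: no production may
    shorten the string it is applied to."""
    return all(len(q) >= len(p) for p, q in rules)

print(non_contracting([("S", "aSa"), ("S", "b")]))  # True
print(non_contracting([("S", "ABA"), ("B", "")]))   # False: B -> empty contracts
```

Strings in a derivation can then never shrink, which is why membership in a context sensitive language is decidable by brute force, unlike the unrestricted case below.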

And finally, the Unrestricted ReWrite Systems allow anything whatever. It is a fact that any language which can be generated by a Turing Machine can be generated by some ReWrite system, and vice versa, and Church's Thesis is the claim that anything that is effectively computable can be computed by a Turing Machine. So if you are willing to buy Church's Thesis, any algorithm whatever for generating strings can be realised by a ReWrite system if you feel so inclined. This explains why I use the term `algorithm' in the definitions at the beginning of the chapter. [*]

Computer Scientists are familiar with the Backus-Naur Formalism, much used for specifying the structure of Computer Languages, as a slightly disguised version of the Rewrite form of a grammar; the Metamathematicians got to it first. The principal applications of formal language theory of this sort are (a) in designing Compilers, where there is virtually no application of a body of theory but some value in the discipline of thought forced by the terminology, and (b) keeping a certain sort of Pure Mathematician off the streets.[*]


Mike Alder
9/19/1997