next up previous contents
Next: ReWrite Grammars Up: Alphabets, Languages and Grammars Previous: Alphabets, Languages and Grammars

Definitions and Examples

Definition 14226

An alphabet is a set of objects called symbols.

Alphabets are usually finite, but I shall not so limit myself. Think of the ASCII character set to make your ideas concrete, but watch out for more bizarre examples coming up.

Definition 14227

A string on an alphabet is a finite sequence of symbols from the alphabet.

Definition 14228

A language over an alphabet A is a set of strings on A.

Languages may be finite, but these aren't the interesting ones. It is usual to define A* to be the set of all possible strings over A, which makes A* a language. This allows us to define a language over A as a subset of A*.
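For a finite alphabet, A* can indeed be enumerated explicitly, one length at a time. A minimal Python sketch of this enumeration (my own illustration, not part of the text):

```python
from itertools import count, product

def all_strings(alphabet):
    """Yield every string on the alphabet in order of increasing length
    (shortlex order), starting with the empty string.  This runs forever:
    A* is infinite whenever the alphabet is non-empty."""
    for n in count(0):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

gen = all_strings("ab")
first_seven = [next(gen) for _ in range(7)]
# first_seven == ["", "a", "b", "aa", "ab", "ba", "bb"]
```

Any language over A then corresponds to some subset of the strings this generator produces.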

Example 14229

If A is the set of ASCII characters, the set of correctly spelled English words is a language. The set of syntactically correct PASCAL programs is another. The set of English sentences is a third.

Example 14230

If A is the set of words in the Oxford English Dictionary, then the set of English sentences is a language over A. It is a different language from the same set of sentences over the ASCII character set.

Definition 14233

A recognition grammar for a language $\cal L$ over A is an algorithm which, when given a string on A, will output either a 1 or a 0: a 1 whenever the string is in $\cal L$ and a 0 when it isn't.

I shall not define an algorithm here; you can think of a computer program of some sort if you want to make it reasonably definite.

The definition I give is not quite standard, because the usual next move in the literature is to go to some kinds of restrictions on the algorithms which might be employed and I do not wish to consider the usual class of restrictions. I shall discuss this further below.
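As a concrete instance, here is a recognition grammar in the sense above for the toy language $\{ab^n a : n \geq 1\}$, which appears again in Example 14251. The regular-expression implementation is just one convenient choice of algorithm, not the only one:

```python
import re

def recognise(s):
    """A recognition grammar for L = { ab^n a : n >= 1 }: output 1 if
    the string is in L and 0 if it is not."""
    return 1 if re.fullmatch(r"ab+a", s) else 0

recognise("abba")   # 1: the string is in L
recognise("abab")   # 0: it is not
```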

Definition 14236

A generative grammar for a language $\cal L$ over A is an algorithm which has no input but produces strings on A, such that only strings in $\cal L$ are produced and every string in $\cal L$ is produced eventually.

If the alphabet is finite, A* is countable, so generative grammars may exist. Indeed it is hard at first glance to see how they might not, since how can I tell you what $\cal L$ is, except by specifying a generative grammar? You may wish to brood upon this point.
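For a language with an obvious pattern, such as the $\{ab^n a : n \geq 1\}$ of Example 14251, a generative grammar can be as simple as an enumeration. A Python sketch of one such algorithm:

```python
from itertools import count

def generate_L():
    """A generative grammar for L = { ab^n a : n >= 1 }: no input, only
    strings of L come out, and every string of L is produced eventually
    (here, each exactly once, in order of length)."""
    for n in count(1):
        yield "a" + "b" * n + "a"

g = generate_L()
first_three = [next(g) for _ in range(3)]
# first_three == ["aba", "abba", "abbba"]
```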

Definition 14244

A stochastic language over A is a language over A together with a real number p(w) in [0,1] assigned to each string w, so that (a) if w < w' (w is a substring of w') then $p(w') \leq p(w)$, and (b) the limit of the sums of these numbers over the entire language exists and is 1.

Example 14251

Let A consist of the set $\{a, b\}$ and $\cal L$ be the set

\begin{displaymath}
\{ aba, abba, \ldots, ab^n a, \ldots \} \end{displaymath}

for every positive integer n. Let the number $p(ab^n a)$ assigned to the string $ab^n a$ be $\frac{1}{2^n}$.
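Condition (b) of Definition 14244 can be checked numerically for this example: the partial sums of $\frac{1}{2^n}$ tend to 1. A quick sketch:

```python
# Partial sums of p(ab^n a) = 1/2^n over the language; they approach 1,
# as condition (b) of the definition of a stochastic language requires.
partial_sums = [sum(1 / 2**n for n in range(1, N + 1)) for N in (1, 2, 5, 20)]
# 0.5, 0.75, 0.96875, and a value within 1e-6 of 1
```

Condition (a) holds vacuously here, since no string $ab^n a$ is a substring of any other.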

Example 14254

Let A be the ASCII character set, $\cal L$ the set of English words in the Oxford English Dictionary, and for $w \in {\cal L}$, let p(w) be defined as the relative frequency of occurrence of the word in the present book.

Definition 14258

A stochastic recognition grammar for a stochastic language $\cal L$ over A is an algorithm which, when given a string w on A, returns p(w).

Definition 14261

A stochastic generative grammar for a stochastic language $\cal L$ over A is an algorithm with no input which produces strings on A so that the relative frequency of a string w in a sample of N strings produced by the algorithm tends to p(w) as N gets larger.

Example 14270

Suppose we are given the finite state diagram of Fig. 7.1 and a pseudo-random number generator. The algorithm traces a path through the diagram from < to > in order to generate a string, and whenever it makes a jump from one state to another, or back to itself via a self loop, it produces the symbol attached to the transition arc. As soon as it gets to > it starts over again to get a new string. The algorithm decides whether to go by one arc or another out of a state by running its random number generator and deciding in accord with the probabilities marked on the arcs out of that state. Thus if there is only one arc out of a node or state it moves along the single arc, while if there are two with equal probabilities 0.5, as in the diagram, it does the pseudo-random number equivalent of tossing a coin. It is easy to persuade oneself that it produces only strings $ab^n a$ for positive integer n, and that the frequencies of these strings will tend to $p(ab^n a) = \frac{1}{2^n}$, given only moderate optimism about the pseudo-random number generator's capacity to reliably simulate a 0.5 probability choice.
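The procedure just described can be sketched in a few lines of Python. The transition structure below is one reading of Fig. 7.1 consistent with the text: an initial a, a b state with a 0.5 self loop, and an exit arc emitting the final a; the function name is mine:

```python
import random

def markov_generate():
    """One trace through the diagram of Fig. 7.1: emit 'a', then at least
    one 'b', looping on the b state with probability 0.5 each time, then
    the final 'a' on the way to the terminal state >."""
    out = "ab"                    # arcs from < into the b state
    while random.random() < 0.5:  # the 0.5 self loop emits another 'b'
        out += "b"
    return out + "a"              # the exit arc to >

random.seed(0)
sample = [markov_generate() for _ in range(50_000)]
freq = sample.count("aba") / len(sample)   # tends to p(aba) = 1/2
```

Every string produced has the form $ab^n a$, and the relative frequency of $ab^n a$ tends to $\frac{1}{2^n}$, since each extra b costs one more 0.5-probability loop.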


  
Figure 7.1: A Markov Process for generating a stochastic language.

Definition 14271

A finite network of nodes and arcs, with probabilities attached and symbols attached to the arcs, such as that illustrated in the last example, is known as a first order Markov process for generating the stochastic language. If we take the probabilities away and forget about frequency counts, the diagram is known as a finite state grammar or regular automaton for the language.

Clearly, some languages can have finite state generative grammars and others may not. The set of finite state languages, otherwise known in some quarters as the regular languages, consists of those for which a finite state grammar can be found. This set is widely regarded as an important subset of all possible languages, and deciding whether a language is regular or not is a problem of some interest. It is trivial that all finite languages are finite state, but the converse is obviously false.

The distinction between generative grammars and recognition grammars can be thought to be important, or regarded as an inadequacy of the language and a sign of a poorly constructed framework for talking about things. The distinction is nugatory for stochastic grammars for languages over a finite alphabet. If I have a recognition grammar for such a language, then I merely need to produce all the strings on the alphabet in some order and test each to see if it is in the language. If it is, I put it in the box $\cal L$, and I have generated a string of the language. This clearly works for the stochastic and non-stochastic cases. Conversely, if I have a generative grammar and you give me a string w and want to know if it is in $\cal L$, I need to generate strings until I either match the given string or fail to match it. But a failure to match in any finite number of moves is inconclusive in the non-stochastic case, whereas in the stochastic case I can obtain estimates for p(w) by generating sequences of strings and counting occurrences of w.
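The stochastic half of this argument can be sketched directly: given a stochastic generative grammar as a Python function, one estimates p(w) as a relative frequency in a finite sample. This yields only estimates, of course, a point taken up in the discussion of what counts as an algorithm; the grammar used here is the one from Example 14251:

```python
import random

def generate():
    """Stochastic generative grammar for { ab^n a } with p(ab^n a) = 1/2^n,
    as in Example 14251: after the initial 'ab', keep adding a 'b' with
    probability 0.5, then close with the final 'a'."""
    s = "ab"
    while random.random() < 0.5:
        s += "b"
    return s + "a"

def estimate_p(w, n_samples=100_000):
    """An approximate stochastic recognition grammar built from the
    generative one: estimate p(w) as the relative frequency of w
    in a finite sample of generated strings."""
    return sum(generate() == w for _ in range(n_samples)) / n_samples

random.seed(0)
estimate_p("abba")   # close to p(abba) = 1/4
```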

Of course, one might doubt whether this is an algorithm for assigning a probability or relative frequency of occurrence to a string; it only yields estimates based on finite samples. A platonist philosopher or even a logician might argue that only production of the whole infinite set would yield the procedure, and this cannot be accomplished by a real program running on a real machine. Much depends here on what one accepts as an algorithm, and also on what one regards as an algorithm producing a real number measuring a relative frequency of occurrence. I apply a commonsense test of meaning here. For my purposes, a program which assures me that the probability of an event is 0.3333 is (a) feasible to write and (b) may be useful to run. Likewise, being told that the probability of occurrence of a string is less than 0.01 is helpful, whereas being told that I haven't got a stochastic recognition grammar unless presentation of a string of symbols yields an infinite decimal string, specifying the probability, would be bad for my blood pressure. My choices are either to specify what I mean very carefully indeed, as I should have to if I were writing a program for a computer or a text for Pure Mathematicians, or rely on the maturity and common sense of the reader to work out what I intend for practical purposes, which is quicker but a little risky. I choose to take a chance. I may regret this later.

By the trace of an algorithm I mean the sequence of operations it performs on a particular input.

Definition 14277

The trace of a recognition grammar for a particular string is called a parse of the string.


Mike Alder
9/19/1997