Definition 14226
An alphabet is a set of objects called symbols.
Alphabets are usually finite, but I shall not so limit myself. Think of the ASCII character set to make your ideas concrete, but watch out for more bizarre examples coming up.
Definition 14227
A string on an alphabet is a finite sequence of symbols from the alphabet.
Definition 14228
A language over an alphabet A is a set of strings on A.
Languages may be finite, but these aren't the interesting ones. It is usual to define A* to be the set of all possible strings over A, which makes A* a language. This allows us to define a language over A as a subset of A*.
Example 14229
If A is the set of ASCII characters, the set of correctly spelled English words is a language. The set of syntactically correct PASCAL programs is another. The set of English sentences is a third.
Example 14230
If A is the set of words in the Oxford English Dictionary, then the set of English sentences is a language over A. It is a different language from the same set of sentences over the ASCII character set.
Definition 14233
A recognition grammar for a language L over A is an algorithm which, when given a string on A, will output either a 1 or a 0: it outputs a 1 whenever the string is in L and a 0 when it isn't.
I shall not define an algorithm here; you can think of a computer program of some sort if you want to make it reasonably definite.
The definition I give is not quite standard, because the usual next move in the literature is to go to some kinds of restrictions on the algorithms which might be employed and I do not wish to consider the usual class of restrictions. I shall discuss this further below.
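To make the idea of an algorithm-as-recognition-grammar concrete, here is a minimal sketch in Python. The language chosen is the set of strings ab^n a (n a positive integer) which appears in a later example; the function name `recognise` is my own invention, not anything from the text.

```python
# A sketch of a recognition grammar, taking L to be the language of
# strings a b^n a for positive integer n. The function plays the role
# of the algorithm: it returns 1 if the string is in L, 0 if not.
def recognise(w: str) -> int:
    # The string must be an a, then at least one b, then an a.
    if len(w) >= 3 and w[0] == "a" and w[-1] == "a" and set(w[1:-1]) == {"b"}:
        return 1
    return 0

print(recognise("abba"))  # 1
print(recognise("abab"))  # 0
```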
Definition 14236
A generative grammar for a language L over A is an algorithm which has no input but produces strings on A, such that only strings in L are produced and every string in L is produced eventually.
If the alphabet is finite, A* is countable, so generative grammars may exist. Indeed it is hard at first glance to see how they might not, since how can I tell you what L is, except by specifying a generative grammar? You may wish to brood upon this point.
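The no-input requirement is naturally sketched with a generator, again using the language ab^n a from a later example; `generate` is a hypothetical name of my own.

```python
# A sketch of a generative grammar: an algorithm with no input that
# emits the strings a b^n a in order of increasing n, so that only
# strings of the language appear and each appears eventually.
def generate():
    n = 1
    while True:
        yield "a" + "b" * n + "a"
        n += 1

g = generate()
print([next(g) for _ in range(3)])  # ['aba', 'abba', 'abbba']
```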
Definition 14244
A stochastic language over A is a language L over A together with a real number in [0,1], p(w), assigned to each string w, so that (a) if w < w' (w is a substring of w') then p(w) ≥ p(w'), and (b) the limit of the sums of these numbers over the entire language exists and is 1.
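Condition (b) can be checked in numbers for a hypothetical assignment p_n = 2^{-n} to the n-th string of a countably infinite language: the partial sums increase and converge to 1.

```python
# Partial sums of the hypothetical assignment p_n = 2^{-n}.
# Each partial sum stays below 1, and the sums tend to 1.
partials = []
s = 0.0
for n in range(1, 30):
    s += 2.0 ** -n
    partials.append(s)

print(partials[0])   # 0.5
print(partials[4])   # 0.96875
print(partials[-1])  # very close to 1.0
```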
Example 14251
Let A consist of the set {a, b} and L be the set of strings ab^n a for positive integer n, with p(ab^n a) = 2^{-n}.
Example 14254
Let A be the ASCII character set, L the set of English words in the Oxford English Dictionary, and for each word w in L let p(w) be defined as the relative frequency of occurrence of the word w in the present book.
Definition 14258
A stochastic recognition grammar for a stochastic language L over A is an algorithm which, when given a string w on A, returns p(w).
Definition 14261
A stochastic generative grammar for a stochastic language L over A is an algorithm with no input which produces strings on A in such a way that the relative frequency of the string w in a sample of N strings produced by the algorithm tends to p(w) as N gets larger.
Example 14270
Suppose we are given the finite state diagram of fig. 7.1 and a pseudo-random number generator. The algorithm traces a path through the diagram from < to > in order to generate a string, and whenever it makes a jump from one state to another, or back to itself via a self loop, it produces the symbol attached to the transition arc. As soon as it gets to > it starts over again to get a new string. The algorithm decides whether to go by one arc or another out of a state by running its random number generator and deciding in accord with the probabilities marked on the arcs out of that state. Thus if there is only one arc out of a node or state it moves along the single arc, while if there are two with equal probabilities 0.5, as in the diagram, it does the pseudo-random number equivalent of tossing a coin. It is easy to persuade oneself that it produces only strings ab^n a for positive integer n, and that the frequencies of these strings will tend to 2^{-n}, given only moderate optimism about the pseudo-random number generator's capacity to reliably simulate a 0.5 probability choice.
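The example can be simulated directly. The sketch below assumes a particular reading of fig. 7.1 (which is not reproduced here): the path emits an a, then a first b, and at each subsequent step either takes the self loop (another b) or the exit arc (a final a), each with probability 0.5; the function name `emit` is mine.

```python
import random
from collections import Counter

# Simulate the stochastic generative grammar of fig. 7.1, under the
# assumption that after the first 'b' each step loops (emitting 'b')
# or exits (emitting the final 'a') with probability 0.5 each.
def emit(rng: random.Random) -> str:
    out = ["a", "b"]
    while rng.random() < 0.5:  # self loop: another b
        out.append("b")
    out.append("a")            # exit arc: final a
    return "".join(out)

rng = random.Random(0)
N = 100_000
counts = Counter(emit(rng) for _ in range(N))

# Relative frequencies should approach 2^{-n} for the string a b^n a.
print(counts["aba"] / N)   # near 0.5
print(counts["abba"] / N)  # near 0.25
```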
Definition 14271
A finite network of nodes and arcs, with probabilities and symbols attached to the arcs, such as that illustrated in the last example, is known as a first order Markov process for generating the stochastic language. If we take the probabilities away and forget about frequency counts, the diagram is known as a finite state grammar or regular automaton for the language.
Clearly, some languages can have finite state generative grammars and others may not have such grammars. The set of finite state languages, otherwise known in some quarters as the regular languages, are those for which a finite state grammar can be found. The set is widely regarded as an important subset of all possible languages, and deciding whether a language is regular or not is a problem of some interest. It is trivial that all finite languages are finite state, but the converse is obviously false.
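One convenient way to exhibit a finite state grammar in practice is a classical regular expression, since these describe exactly the regular languages; the sketch below uses Python's re module for the language ab^n a of the example.

```python
import re

# Dropping the probabilities from the diagram of fig. 7.1 leaves a
# finite state grammar; the regular expression below describes the
# same language, the strings a b^n a for positive integer n.
grammar = re.compile(r"^ab+a$")

print(bool(grammar.match("abbba")))  # True
print(bool(grammar.match("abab")))   # False
```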
The distinction between generative grammars and recognition grammars can be thought to be important, or it can be regarded as an inadequacy of the language and a sign of a poorly constructed framework for talking about things. The distinction is nugatory for stochastic grammars for languages over a finite alphabet. If I have a recognition grammar for such a language, then I merely need to produce all the strings on the alphabet in some order and test each one to see if it is in the language. If it is, I put it in the box and I have generated a string of the language. This clearly works for the stochastic and non-stochastic cases. Conversely, if I have a generative grammar and you give me a string w and want to know if it is in L, I need to generate strings until I either match the given string or fail to match it. But a failure to match it in any finite number of moves is inconclusive for the non-stochastic case, whereas in the stochastic case I can obtain estimates for p(w) by generating sequences of strings and counting occurrences of w.
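The recognition-to-generation direction can be sketched in miniature, assuming the alphabet {a, b} and a hypothetical recogniser for the language ab^n a: enumerate all strings on A in order of increasing length and keep the ones the recogniser accepts, so every string of the language is reached eventually.

```python
from itertools import count, product

# A hypothetical recogniser for the language of strings a b^n a.
def recognise(w: str) -> int:
    return int(len(w) >= 3 and w[0] == "a" and w[-1] == "a"
               and set(w[1:-1]) == {"b"})

# Turn the recogniser into a generative grammar: enumerate all strings
# on {a, b} in order of length and emit those the recogniser accepts.
def generate_from_recogniser():
    for n in count(1):
        for tup in product("ab", repeat=n):
            w = "".join(tup)
            if recognise(w):
                yield w

g = generate_from_recogniser()
print([next(g) for _ in range(3)])  # ['aba', 'abba', 'abbba']
```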
Of course, one might doubt whether this is an algorithm for assigning a probability or relative frequency of occurrence to a string; it only yields estimates based on finite samples. A platonist philosopher or even a logician might argue that only production of the whole infinite set would yield the procedure, and this cannot be accomplished by a real program running on a real machine. Much depends here on what one accepts as an algorithm, and also on what one regards as an algorithm producing a real number measuring a relative frequency of occurrence.

I apply a commonsense test of meaning here. For my purposes, a program which assures me that the probability of an event is 0.3333 is (a) feasible to write and (b) may be useful to run. Likewise, being told that the probability of occurrence of a string is less than 0.01 is helpful, whereas being told that I haven't got a stochastic recognition grammar unless presentation of a string of symbols yields an infinite decimal string, specifying the probability, would be bad for my blood pressure. My choices are either to specify what I mean very carefully indeed, as I should have to if I were writing a program for a computer or a text for Pure Mathematicians, or rely on the maturity and common sense of the reader to work out what I intend for practical purposes, which is quicker but a little risky. I choose to take a chance. I may regret this later.
By the trace of an algorithm I mean the sequence of operations it performs on a particular input.
Definition 14277
The trace of a recognition grammar for a particular string is called a parse of the string.