Ken ewe won ah star net.
Car knew one a stair nit.
Care new oner sten at.
Can you understand it?
The last is the most likely fit to an actual utterance
in English, but the earlier versions are likely
to be better from an acoustic point of view. In
order to have good grounds for preferring the
last, one
needs to know something of the allowable syntax
of the spoken language. The constraints, given
by
a stochastic grammar of some sort, a kind of statistical
model for the strings of the language,
are needed in order to control the horrendous
number of possibilities. And even working out
what the
grammar for spoken language is can be rather
trying. For instance,
the most common word in the
spoken English language is `uh' or `ah' or `um'
or `er'. This is a transactional word which means,
roughly,
`shut up, I haven't finished talking yet'. It
does not occur, for obvious reasons, in the written
language at all. So when a system has had the
constraints of the written language put in and
then
somebody says to the system `Tell me about, uh,
China', the system may well start telling him
about
America, since there is no country called Uhchina
and the system `knows' it has to be a proper
name in its
data base beginning and ending with a schwa.
There is an apocryphal tale about such a system
which asked
the user which day he wanted to come back and
he sneezed. After thirty seconds of agonising
the system
replied: `We're closed Saturdays'. So the business
of using constraints to determine which word
string
was used can run into trouble.
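To make the point concrete, here is a toy sketch (not from the text) of how a stochastic grammar of the simplest sort, a bigram word model estimated from a sample, can prefer the grammatical transcription over its homophone rivals. The corpus and the add-one smoothing are assumptions made purely for the illustration; a real system would use a vastly larger sample and a less crude model.

```python
import math
from collections import defaultdict

# A tiny invented corpus standing in for a sample of spoken English.
corpus = ("can you understand it . can you hear it . "
          "you can understand me . do you understand .").split()

# Estimate the model: counts of word pairs and of single words.
bigram = defaultdict(int)
unigram = defaultdict(int)
for a, b in zip(corpus, corpus[1:]):
    bigram[(a, b)] += 1
    unigram[a] += 1

vocab = len(set(corpus))  # vocabulary size, for add-one smoothing

def score(sentence):
    """Log-probability of a word string under the smoothed bigram model."""
    words = sentence.split()
    return sum(math.log((bigram[(a, b)] + 1) / (unigram[a] + vocab))
               for a, b in zip(words, words[1:]))

candidates = ["ken ewe won ah star net",
              "car knew one a stair nit",
              "can you understand it"]
best = max(candidates, key=score)  # the grammatical string wins
```

The homophone strings contain no word pair ever seen in the sample, so every one of their bigrams gets only the smoothed floor probability, while the grammatical string is built from pairs the model has actually counted.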
The basic problem of sentence recognition is that people don't speak in words. There is no acoustic cue corresponding to the white space in text. This leads the reflective person to wonder if we could cope were all the white spaces and punctuation to be removed from English text, something which invites a little experiment.
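The little experiment the text invites can be run in a line or two; the sample sentence below is our own choice, picked for the illustration.

```python
import re

# Strip everything except the letters, as happens to the acoustic
# signal: no white space, no punctuation, no capitalisation.
text = "The basic problem of sentence recognition is that people don't speak in words."
stripped = re.sub(r"[^a-z]", "", text.lower())
# → "thebasicproblemofsentencerecognitionisthatpeopledontspeakinwords"
```

A fluent reader copes with the result rather well, which suggests that the segmentation is being supplied by the reader's knowledge of the language, not by the signal.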
In general, the attempt to use the constraints of natural language to improve recognition systems has succeeded only because the systems that didn't use them were unbelievably bad. There are many different kinds of information which human beings are known to use, from syntactic at the phonemic level (knowing what strings of noises can occur) to syntactic at the lexical level (knowing what strings of words can occur) to semantic (knowing what the words and sentences mean), pragmatic (knowing what the speaker is talking about) and prosodic (stress: pitch and loudness variation). Workers are still exploring the issues involved in using these sources of information intelligently. They have been since the seventies; see the many reports on the HARPY and HEARSAY projects from Carnegie-Mellon University. A discussion on the subject of inferring (stochastic) grammars for a language from a sample will be a feature of the next chapter. You have already seen it done once, rather unconvincingly, using the EM algorithm for Hidden Markov Models.
It should be plain to the thoughtful person that we are a long way from telling how to build the ultimate speech recognition system, and that this discussion merely outlines some of the difficulties. The situation is not actually hopeless, but the problem is difficult, and it would be as well not to underestimate its complexity. The AI fraternity did a magnificent job of underestimating the complexity of just about all the problems they have looked at since the sixties: see the reference to Herbert Simon in the bibliography to chapter one. Engineering builds wonderful devices once the basic science has been done. Without it, engineers are apt to want to plunge swords into the bodies of slaves at the last stage of manufacture so as to improve the quality of the blade: sea water or suet pudding would work at least as well and be much cheaper. The business of discovering this and similar things is called `metallurgy' and was poorly funded at the time in question. Similarly, the basic science of what goes on when we hear some spoken language at the most primitive levels is in its infancy. Well, not even that. It's positively foetal and barely post-conception. It's an exciting area, as are most of the areas involved in serious understanding of human information processing, and progress is being made. But the man or woman who wants to make a robot behave intelligently should abandon, for now, the thought of having a discussion with it, and hope in the immediate future only for it to handle single words, or sentences with an artificial and restricted syntax.