
Invariance

Although no longer the state of the art, it is as well to remember that VQ and HMM are the staple of single word recognition systems and rather more. The combination works reasonably well provided that the system is trained on the speakers it is proposed to use it on; this makes it a dead loss for selling to banks to allow people to tell an automatic system their Mastercard number. Even a modest amount of speaker independence has to be purchased by training on a whole raft of people and hoping that your speaker is in there somewhere, or at least a clone or close relative of the same sex. The standard el cheapo boxes for single word recognition which you buy for your PC or Mac work this way. They often solemnly advertise that the system will respond only to your voice, as though this were a great virtue, when in fact it doesn't even do that particularly well.

I have mentioned one way of improving the VQ side of this approach. Another is to recognise explicitly that the gaussian mixture model is an estimate of a continuous pdf over the states. The only difference in practice is that instead of choosing the single closest gaussian, we take account of the weights associated with all of them and compute likelihoods over the whole mixture. This yields a semi-continuous Hidden Markov Model.

There are some simple ways to improve the HMM side of it too: one can go further and try to model the probability density function for trajectories in the space directly, by training a gaussian mixture model on the data consisting of the graphs of repeated utterances of a single word, and then computing likelihoods for each new trajectory relative to the family of models. After all, it is the set of trajectories with which one is concerned; this is the data. The natural desire to model the pdfs for the different words is thwarted in practice by the number of transformations that have been done to the data, which means we never have enough data to get sensible estimates. It is rather as if you wanted to tell a letter A from a letter B when they had been scribbled on a wall, by storing information about which bricks were painted on. Represented in this form, the replications of the A's give you little help if they occur in different places; whereas if you could obtain a description which was shift invariant, it would be possible to pool the data provided by the different A's and B's.
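Before moving on, the semi-continuous computation is simple enough to sketch in code. The shared codebook of gaussians, the function names, and the use of scipy here are illustrative assumptions on my part, not anything the text prescribes:

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical shared codebook: K gaussians fitted to all training
# frames, given as lists of means and covariance matrices.

def vq_symbol(frame, means, covs):
    """Plain VQ: replace the frame by the index of the most likely gaussian."""
    densities = [multivariate_normal.pdf(frame, mean=m, cov=c)
                 for m, c in zip(means, covs)]
    return int(np.argmax(densities))

def semi_continuous_likelihood(frame, means, covs, state_weights):
    """Semi-continuous emission: state-specific weighting of the whole codebook."""
    densities = np.array([multivariate_normal.pdf(frame, mean=m, cov=c)
                          for m, c in zip(means, covs)])
    return float(state_weights @ densities)  # sum_k w_k N(frame; mu_k, C_k)
\end{verbatim}

The plain VQ system keeps only the argmax and throws the rest away; the semi-continuous one keeps the whole vector of densities, so close calls between neighbouring gaussians are no longer decided arbitrarily.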

But pooling data in that way requires knowing the transformation we want to factor out, and in general we don't.

The only thing the prudent human being can do to make life easier for his automatic system is to build in all the invariance he expects to be useful.

One thing, then, that ought to make the ASR practitioners nervous is the big, fat, fundamental assumption that phonemic elements are regions of the space.

To see what other possibilities there are, consider the problem of asking a small girl and a large man to say the words `yes' and `no'. They have different sized and shaped vocal tracts, and it is clear as day that their utterances will occupy very different regions of the speech space. Had we given them a piece of chalk apiece and asked them to write the words upon a wall instead of speaking them, we could reasonably have expected that the two-dimensional trajectories of the moving chalk would also have occupied different parts of the space. Trying to distinguish the written `yes' from the written `no' by looking to see which bricks they are written upon would not, on the face of things, be a good way to go.


  
Figure 6.2: Translation and scale invariance.

The problem arises because the fundamental assumption is plainly wrong in the case of handwritten words, and it might be just as wrong for spoken words too. It isn't the regions of the space you pass through, it's the shape of the trajectory, in some scale invariant way, that determines what a written word is. An HMM run on the bricks of the wall to determine written words might more or less work for one little girl and one large man, but would fail when confronted with a writer of intermediate size.
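For the chalk trajectories, the remedy is easy to state: factor the shifts and scalings out before comparing anything. A minimal sketch; the particular normalisation (centroid at the origin, unit RMS radius) is my choice of convention, and any fixed one would do:

\begin{verbatim}
import numpy as np

def normalise_trajectory(points):
    """Remove translation and overall scale from a 2D chalk trajectory.

    points: array of shape (n, 2). Subtracting the centroid factors out
    where on the wall the word was written; dividing by the RMS radius
    factors out how large it was written. Only the shape is left.
    """
    points = np.asarray(points, dtype=float)
    centred = points - points.mean(axis=0)
    scale = np.sqrt((centred ** 2).sum(axis=1).mean())
    return centred / scale if scale > 0 else centred
\end{verbatim}

After this, the little girl's `yes' and the large man's `yes' trace out comparable shapes, and pooling their data starts to make sense.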

The problem is that the written word is invariant under both shifts and scaling. And the spoken word is also invariant under some part of a group of transformations: simply making the sound louder throughout will not change the word. Lowering or raising the pitch, within reason, will not change the word. And it is easy to believe that continuous enlargement of the vocal tract won't change the word, either. Adding white noise by breathing heavily all over the microphone, as favoured by certain pop-stars, might also be said to constitute a direction of transformation of a word which leaves the category invariant.
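One of these directions can be factored out cheaply. If the frames are vectors of log power, as with the $\mathbb{R}^{16}$ filter bank trajectories mentioned below (an assumption on my part, not something the text fixes here), then making the sound uniformly louder adds the same constant to every entry, and subtracting the per-utterance mean removes it. A minimal sketch:

\begin{verbatim}
import numpy as np

def remove_loudness(frames):
    """Factor uniform loudness out of log-power frames.

    frames: array of shape (n_frames, n_channels) of log power values.
    A uniform gain change adds the same constant everywhere, so removing
    the per-channel means over the utterance leaves the word invariant
    under that transformation (in effect, log-spectral mean subtraction).
    """
    frames = np.asarray(frames, dtype=float)
    return frames - frames.mean(axis=0, keepdims=True)
\end{verbatim}

Pitch and vocal tract size are far less tractable: they move the trajectory in directions that are neither additive constants nor known in advance.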

If one excises vowels from the speech of many different speakers, plots the short segments of trajectory which one then obtains as short sequences in $\mathbb{R}^{16}$, and then projects down onto the screen of a computer, one sees what looks like a good approximation to a separate gaussian cluster for each vowel. But each cluster has quite a large extension, close vowel sounds appear to overlap, and the distance between centres is not as big as the extension along the major axes. It is tempting to try to split the space into a part which tells you what vowel is being articulated, and an orthogonal part which tells you who is speaking, how loudly, and how hoarse they are, along with much other data which seems to be largely irrelevant to working out what is being said. It would be nice to decompose the space in this way, and it is possible, if not cheap, to attack the problem.
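The text does not say how the projection onto the screen is done; principal components are the natural choice, so here is a minimal sketch under that assumption:

\begin{verbatim}
import numpy as np

def project_to_screen(vectors, dim=2):
    """Project 16-dimensional frames onto their top principal components.

    vectors: array of shape (n, 16). Centre the data, diagonalise the
    covariance matrix, and project onto the leading eigenvectors.
    """
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(X.T))  # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :dim]           # leading principal directions
    return X @ top
\end{verbatim}

Splitting off the speaker-dependent part is much harder; linear discriminant analysis, with the vowel identities as class labels, is one standard way of attacking it, though certainly not the only one.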

Consider, as another basis for doubting the fundamental assumption when applied to speech, the various sign languages used by deaf people and others. These consist of stylised gestures of various sorts, performed in succession, at rates comparable with ordinary speech as far as the rate of transfer of ideas is concerned. The question is, are the target gestures actually made in Sign? It can be plausibly argued that they are hardly ever actually `stated' except in teaching the language, and that fluent signers only sketch out enough of a gesture for the signee to be able to decode it, after which they move on rapidly to the next gesture. So signed `words' are idealisations never actually attained. Moreover, the space of gestures references such things as heads and faces, and is thus not readily reducible to trajectories in an absolute space.

Speech has been described as `gestures of the vocal tract', and we have to consider the possibility that it resembles sign languages in its trajectory structure rather more than meets the eye. If so, the crudity of the VQ/HMM model becomes embarrassing. One would have to be desperate to suggest this approach to sign languages. In fact the problem of automatic reading of sign via a camera is an interesting if little explored area for research which might well illuminate other languages. Perhaps the difficulties are simply more apparent for sign languages than for spoken ones.

In the case of speech, it is generally argued that the path is what is wanted, not the trajectory, since going over the same part of the space at a different rate will not change what is said, only how fast you say it. This gives another family of transformations of the space of trajectories which would, if this view is correct, leave the word itself invariant.
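If that view is right, two utterances should be compared only after allowing the best monotone reparametrisation of time. Dynamic time warping is the standard device for this; the text does not prescribe it, but a minimal sketch shows the idea:

\begin{verbatim}
import numpy as np

def dtw_distance(a, b):
    """Distance between two trajectories, minimised over monotone time warps.

    a, b: arrays of shape (n, d) and (m, d). The recurrence lets each
    frame stretch or compress in time, so traversing the same path at a
    different rate adds little to the cost.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
\end{verbatim}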

In general, the problem of speaker independent speech recognition is profoundly difficult, and one which still needs critical thought as well as more creative approaches.

For confirmation of some of the pessimistic observations of this section, see the program fview by Gareth Lee on the ftp site ciips.ee.uwa.edu.au, referenced in the bibliography. Pull it back and play with it: it is the best free gift of the decade.

