
Talking into a microphone

If you say a word, perhaps the name of a digit, for example `one', into a microphone, then it is straightforward to sample and digitise the resulting signal, and feed it into a computer as a longish sequence of numbers measuring the voltage generated by the microphone and your voice. Typically, a word may take one third to half a second to enunciate, and the signal is sampled perhaps twenty thousand times a second, giving around seven to ten thousand numbers. Each number will be quantised to perhaps 12 or 16 bit precision. Thus we may be looking at a data rate of around 30 to 40 kilobytes per second. This present paragraph would, if spoken at a reasonable reading rate, occupy over two megabytes of disk space. If printed, it would occupy around a kilobyte. There is therefore a considerable amount of compression involved in ASR.
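
To make the arithmetic explicit, here is a small illustrative calculation in Python, using the representative figures just quoted; the exact rates and precisions will of course vary:

    # Back-of-envelope data rates for digitised speech; the figures are
    # the representative ones from the text, not measurements.
    sample_rate = 20_000       # samples per second
    bits_per_sample = 16       # quantisation precision
    word_duration = 0.4        # seconds, roughly the time to say a digit

    samples_per_word = int(sample_rate * word_duration)    # about 8000 numbers
    bytes_per_second = sample_rate * bits_per_sample // 8  # 40 000 bytes per second

    print(samples_per_word, "samples for one word")
    print(bytes_per_second / 1000, "kilobytes per second")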

There are various methods of proceeding from this point, but the most fundamental and conceptually simplest is to take a Discrete Fourier Transform (DFT) of a short chunk of the signal, referred to as the part of the signal inside a window. The FFT or Fast Fourier Transform of Cooley and Tukey accomplishes this with admirable efficiency and is much used for these purposes. I have already discussed the ideas involved in taking a Fourier Transform: there are two ways of thinking about it which are of value. The first is to imagine that we have a sound and something like a harp, the strings of which can resonate to particular frequencies. For any sound whatever, each string of the harp will resonate to some extent, as it absorbs energy at its resonant frequency from the input sound. So we can represent the input sound by giving the amount of energy at each frequency which the harp extracts, the so-called energy spectrum. This explanation serves to keep the very young happy, but raises questions in the minds of those with a predisposition to thought: what if the spectrum of the sound changes in time? If it sweeps through the frequency range fast enough, there may not be enough time to say it has got a frequency, for example. And what kind of frequency resolution is possible in principle? These considerations raise all sorts of questions about how filters work that need more thought than many writers consider worthwhile.

The second way to look at the DFT and FFT is to go back to the old Fourier Series, where we are simply expanding one function from $[-\pi, \pi]$ to $\mathbb{R}$ in terms of an orthogonal set of functions, the sine and cosine functions. This is just linear algebra, although admittedly in infinite dimensions.
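
For the record, the expansion in question, for a function $f$ defined on $[-\pi, \pi]$, is the classical one

\begin{displaymath}
f(t) \;\sim\; \frac{a_0}{2} + \sum_{k=1}^{\infty}\left(a_k \cos kt + b_k \sin kt\right),
\qquad
a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\cos kt \, dt,
\quad
b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\sin kt \, dt,
\end{displaymath}

and the linear algebra point is that the coefficients $a_k, b_k$ are just inner products of $f$ with the members of an orthogonal family of functions, exactly as one computes the components of a vector with respect to an orthogonal basis.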

Now we have to reckon with two issues: one is that the signal is sampled discretely, so in fact we have not an algebraically expressed function but a finite set of values, and the other is that the time window on which we do our expansion is constantly changing as we slide it down the signal. What sense does it make to take out a chunk of a continuously changing signal, pretend it is periodic, analyse it on that assumption, and then slide the window down to a different part of the signal and do the same again? Of course, these two ways of looking at the problem give equivalent results, the principal one being some honest doubt as to whether this is the way to go. Alternative transforms such as the Wigner transform and various wavelet families are available for those who want to follow this line of thought. They are beyond the scope of this book, but the reader who is of a reflective disposition will be ready for them when he meets them. I shall skip these interesting issues on the grounds that they are somewhat too technical for the present work, which is going to concern itself with more mundane matters, but the reader needs to know that there are problems about fixing up the window so that the FFT gives acceptable answers reasonably free of artefacts of the analysis process. See the standard works on Speech Recognition and many, many issues of the IEEE Transactions on Signal Processing, and the IEE Proceedings Part I for relevant papers.

We take, then, some time interval, compute the FFT and then obtain the power spectrum of the waveform of the speech signal in that time interval of, perhaps, 32 msec. Then we slide the time interval, the window, down the signal, leaving some overlap in general, and repeat. We do this for the entire length of the signal, thus getting a sequence of perhaps ninety vectors, each vector of dimension perhaps 256, each of the 256 components being an estimate of the energy in some frequency interval between, say, 80 Hertz and ten kHz. The FFT follows a divide and conquer strategy, and when it divides it divides by two, so it works best when the number of samples in the window, and hence the number of frequency bands, is a power of two; 256 or 512 are common. Instead of dealing with the raw signal values in the window, we may first multiply them by a window shaping function such as a Hamming Window, which is there to accommodate the inevitable discontinuity of the function when it is regarded as periodic. Usually this just squashes the part of the signal near the edge of the window down to zero while leaving the bit in the middle essentially unchanged.
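
For concreteness, here is a minimal sketch in Python, using NumPy, of the framewise analysis just described. The frame length of 512 samples, the 50% overlap and the 20 kHz sampling rate are merely illustrative choices, not recommendations:

    import numpy as np

    def power_spectra(signal, frame_len=512, step=256):
        """Slide a Hamming window along the sampled signal and return one
        power spectrum per frame; 512 samples is about 26 msec at 20 kHz,
        and a step of 256 samples gives 50% overlap."""
        window = np.hamming(frame_len)             # squashes the frame edges towards zero
        spectra = []
        for start in range(0, len(signal) - frame_len + 1, step):
            frame = signal[start:start + frame_len] * window
            spectrum = np.fft.rfft(frame)          # FFT of the windowed frame
            spectra.append(np.abs(spectrum) ** 2)  # energy in each frequency band
        return np.array(spectra)                   # shape: (number of frames, frame_len//2 + 1)

    # A 0.4 second 'word' of noise at 20 kHz gives around thirty frames.
    utterance = np.random.randn(8000)
    print(power_spectra(utterance).shape)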

Practical problems arise from trying to sample a signal having one frequency with a sampling rate at another; this is called `aliasing' in the trade, and is most commonly detected when the waggon wheels on the Deadwood Stage go backwards, or a news program cameraman points his camera at somebody's computer terminal and gets that infuriating black band drifting across the screen and the flickering that makes the thing unwatchable. There is a risk that frequencies in the speech signal higher than half the sampling rate will masquerade as lower frequencies and manifest themselves as a sort of flicker. So it is usual to kill off all frequencies not being explicitly looked for, by passing the signal through a filter which will not pass very high or very low frequencies. Very high usually means more than half the sampling frequency, and very low means little more than the mains frequency.
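
Strictly speaking, the anti-aliasing filter has to act on the analogue signal before it is ever sampled; what follows is only a sketch, in Python with SciPy, of the corresponding digital band-limiting one might apply to an already-sampled signal, with illustrative cut-off frequencies:

    import numpy as np
    from scipy.signal import butter, lfilter

    def band_limit(signal, fs=20_000, low_hz=60.0, high_hz=9_000.0, order=4):
        """Suppress frequencies below roughly the mains frequency and above
        a little under half the sampling rate, using a Butterworth band-pass
        filter; the cut-offs are representative, not prescriptive."""
        b, a = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs)
        return lfilter(b, a, signal)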

The 256 numbers may usefully be `binned' into some smaller number of frequency bands, perhaps sixteen of them, also covering the acoustic frequency range. The frequency bands may have their centres selected judiciously in a more or less logarithmic division of the frequency range, their widths also adjusted accordingly, and the result referred to as a simulated filterbank of sixteen filters covering the audio spectrum. Alternatively, you could have a bank of 16 bandpass filters, each passing a different part of the audio spectrum, made up from hardware. This would be rather old fashioned of you, but it would be faster and produce smoother results. The hardware option would be more popular were it not for the tendency of hardware to evolve, or just as often devolve, in time.
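
A crude simulated filterbank of this kind might be sketched as follows; the logarithmic band edges from 80 Hz to 10 kHz are assumptions for illustration, and real systems generally use overlapping, weighted bands rather than the hard-edged bins used here:

    import numpy as np

    def log_filterbank(power_spectrum, num_bands=16, fs=20_000,
                       f_lo=80.0, f_hi=10_000.0):
        """Bin a one-sided power spectrum into num_bands bands whose edges are
        spaced logarithmically between f_lo and f_hi, giving one 16-dimensional
        vector per frame.  With only a few hundred FFT bins the lowest bands are
        very narrow; a serious implementation would use overlapping triangular
        weights rather than hard edges."""
        bin_freqs = np.linspace(0, fs / 2, len(power_spectrum))  # frequency of each FFT bin
        edges = np.geomspace(f_lo, f_hi, num_bands + 1)          # logarithmic band edges
        bands = np.empty(num_bands)
        for i in range(num_bands):
            in_band = (bin_freqs >= edges[i]) & (bin_freqs < edges[i + 1])
            bands[i] = power_spectrum[in_band].sum()             # energy falling in the band
        return bands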

The so called `Bark' scale, slightly different from logarithmic and popularly supposed to correspond more closely to perceptual differences, is used by the more sophisticated, and, since speech has been studied since Helmholtz, there is an extensive literature on these matters. Most of the literature, it must be confessed, appears to have had minimal effect on the quality of Speech Recognition Systems.
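
For reference, one commonly quoted approximation to the Bark scale, due to Zwicker and Terhardt, maps a frequency $f$ in Hertz to

\begin{displaymath}
z(f) \;=\; 13\arctan(0.00076\,f) \;+\; 3.5\arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right) \quad \mbox{Bark},
\end{displaymath}

a scale which is roughly linear at low frequencies and roughly logarithmic at high ones.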

Either of these approaches turns the utterance into a longish sequence of vectors in $\mathbb{R}^{16}$ representing the time development of the utterance, or, more productively, into a trajectory, discretely sampled, in $\mathbb{R}^{16}$. Many repetitions of the same word by the same speaker might reasonably be expected to be described as trajectories which are fairly close together in $\mathbb{R}^{16}$. If I have a family of trajectories corresponding to one person saying `yes' and another family corresponding to the same person saying `no', and am then given an utterance of one of those words by the same speaker and wish to know which it is, some comparison between the new trajectory and the two families I already have should allow us to make some sort of decision as to which of the two words we think most likely to have been uttered.
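
To fix ideas, here is about the simplest comparison one could make, sketched in Python. It assumes, quite unrealistically, that every trajectory has already been brought to the same number of frames; how to allow for the fact that they generally have not is precisely the difficulty taken up below:

    import numpy as np

    def trajectory_distance(a, b):
        """Sum of frame-to-frame Euclidean distances between two trajectories in
        R^16, assuming (unrealistically) they have the same number of frames."""
        return np.linalg.norm(a - b, axis=1).sum()

    def classify(new_trajectory, families):
        """families maps each word, e.g. 'yes' and 'no', to a list of stored
        trajectories; return the word whose examples lie closest on average."""
        scores = {word: np.mean([trajectory_distance(new_trajectory, t) for t in examples])
                  for word, examples in families.items()}
        return min(scores, key=scores.get)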

Put in this form, we have opened up a variant of traditional pattern recognition which consists of distinguishing not between different categories of point in a space, but between different categories of trajectory in the space. Everything has become time dependent; we deal with changing states.

Note that when I say `trajectory' I mean the entire map of the time (discretely sampled) into the space, and not just the set of points in the path. These are very different things. If you walk across the highway and your path intersects that of a bus, this may not be important as the bus may have long since gone by. On the other hand if your trajectory intersects that of the bus, then unless it does so when you are both at rest, you are most unlikely to come back to reading this book or, indeed, any other. I am prepared to reconstruct the time origin of any two utterances so that they both start from time zero, but one may travel through the same set of points twice as fast as another, and the trajectory information will record this, while the image of the two trajectories will be the same. A trajectory is a function of time, the path is the image of this function. If you want to think of the path as having clock ticks marked on it in order to specify the trajectory, that's alright with me. Now it might be the case, and many argue that in the case of speech it is the case, that it is the path and not the trajectory that matters. People seldom mean it in these terms, since the words /we/ and /you/ are almost the same path, but the direction is reversed. Still, if two trajectories differ only in the speed along the path, it can be argued that they must sound near enough the same. This may or may not be the case; attempts to compare two trajectories so as to allow for differences in rate which are not significant are commonly implemented in what is known as Dynamic Time Warping (DTW). I shall not describe this method, because even if it applies to speech, it may not hold for everything, and DTW is more commonly replaced by Hidden Markov Modelling these days.
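
Although the method is left aside here, the curious reader may like to see the shape of the idea; the following is a bare-bones sketch in Python of the dynamic programming recursion that underlies DTW, not a serious implementation:

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping distance between trajectories a and b, arrays of
        shape (n, 16) and (m, 16): frames may be stretched or compressed in time
        so that the two sequences are matched as cheaply as possible."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: cheapest match of a[:i] with b[:j]
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]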

On first inspection, one might argue that a trajectory in one space is simply a point in a function space. This is true, but not immediately helpful, since different trajectories may not lie in the same space, as function spaces are traditionally defined. It is rather hard to put a sensible metric structure on the set of maps from arbitrary intervals of the real numbers into $\mathbb{R}^n$ without any other considerations. So the abstract extension from points to trajectories needs some extra thought, which may depend upon the nature of the data.

It would be a mistake to think that binning the power spectrum into some number n of intervals is the only way of turning speech into a trajectory in $\mathbb{R}^n$; there are others, involving so-called cepstra or LPC coefficients, which are believed in some quarters to be intrinsically superior.
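
By way of illustration only, the simplest member of the cepstral family, the so-called real cepstrum, can be obtained from a frame's power spectrum roughly as follows; practical systems generally prefer mel-frequency or LPC-derived cepstra with assorted refinements:

    import numpy as np

    def real_cepstrum(power_spectrum, num_coeffs=16):
        """Return the first num_coeffs coefficients of the inverse FFT of the log
        power spectrum of one frame; a crude sketch of a cepstral feature vector."""
        log_spectrum = np.log(power_spectrum + 1e-12)   # avoid taking the log of zero
        cepstrum = np.fft.irfft(log_spectrum)           # back to the 'quefrency' domain
        return cepstrum[:num_coeffs]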


Mike Alder
9/19/1997