If you say a word, perhaps the name of a digit, for example `one', into a microphone, then it is straightforward to sample and digitise the resulting signal, and feed it into a computer as a longish sequence of numbers measuring the voltage generated by the microphone and your voice. Typically, a word may take one third to half a second to enunciate, and the signal is sampled
perhaps twenty thousand times a second, giving around seven thousand numbers. Each number will be quantised to perhaps 12- or 16-bit precision. Thus we may be looking at a data rate of around 30 to 40 kilobytes per second. This paragraph would, if spoken at a reasonable reading rate, occupy over two megabytes of disk space. If printed, it would occupy around a kilobyte. There is therefore a considerable amount of compression involved in ASR.
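For the reader who likes to see the arithmetic done, a few lines of Python (my choice of language here, nothing sacred about it) do the sums, using the rough values assumed above:

```python
# Back-of-envelope data rates for sampled speech, using the rough
# values assumed in the text.
sample_rate = 20_000      # samples per second
bits_per_sample = 16      # quantisation precision
word_duration = 0.35      # seconds; roughly a third of a second

numbers_per_word = int(sample_rate * word_duration)
bytes_per_second = sample_rate * bits_per_sample // 8

print(f"{numbers_per_word} samples per word")    # about 7000
print(f"{bytes_per_second} bytes per second")    # 40000, i.e. ~40 KB/s
```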
There are various methods of proceeding from this point, but the most fundamental and conceptually simplest is to take a Discrete Fourier Transform (DFT) of a short chunk of the signal, referred to as the part of the signal inside a window. The FFT or Fast Fourier Transform of Cooley and Tukey accomplishes this with admirable efficiency and is much used for these purposes. I have already discussed the ideas involved in taking a Fourier Transform: there are two ways of thinking about it which are of value. The first is to imagine that we have a sound and something like a harp, the strings of which can resonate to particular frequencies. For any sound whatever, each string of the harp will resonate to some extent, as it absorbs energy at its resonant frequency from the input sound. So we can represent the input sound by giving the amount of energy the harp extracts at each frequency, the so-called energy spectrum. This explanation serves to keep the very young happy, but raises questions in the mind of those with a predisposition to thought: what if the spectrum of the sound changes in time? If the sound sweeps through the frequency range fast enough, there may not be enough time to say it has a frequency at all, for example. And what kind of frequency resolution is possible in principle? These considerations raise all sorts of questions about how filters work that need more thought than many writers consider worthwhile.
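The harp story is easy to try out numerically; the following sketch (Python with numpy, the tone and sampling values chosen merely for illustration) takes the DFT of a pure tone and confirms that the energy piles up in the one string tuned to it:

```python
import numpy as np

# A pure 440 Hz tone, sampled at 20 kHz for a tenth of a second.
fs = 20_000
t = np.arange(0, 0.1, 1 / fs)
signal = np.sin(2 * np.pi * 440 * t)

# The DFT plays the role of the harp: one coefficient per resonant frequency.
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# Essentially all the energy sits in the bin nearest 440 Hz.
print(f"peak at {freqs[np.argmax(spectrum)]:.0f} Hz")
```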
The second way to look at the DFT and FFT is to go back to the old Fourier Series, where we are simply expanding a function defined on the interval from $-\pi$ to $\pi$ in terms of an orthogonal set of functions, the sine and cosine functions. This is just linear algebra, although admittedly in infinite dimensions.
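Explicitly, for a function $f$ on $[-\pi,\pi]$ the expansion and its coefficients, which are nothing but inner products of $f$ with the basis functions, are
$$
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\bigl(a_n\cos nt + b_n\sin nt\bigr),
\qquad
a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\cos nt\,dt,
\quad
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\sin nt\,dt.
$$
The DFT does the same job with finite sums in place of integrals.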
Now we have to reckon with two issues: one is that the signal is sampled discretely, so in fact we have not an algebraically expressed function but a finite set of values, and the other is that the time window on which we do our expansion is constantly changing as we slide it down the signal. What sense does it make to take out a chunk of a continuously changing signal, pretend it is periodic, analyse it on that assumption, and then slide the window down to a different signal and do the same again? Of course, these two ways of looking at the problem give equivalent results, principally some honest doubt as to whether this is the way to go. Alternative transforms such as the Wigner transform and various wavelet families are available for those who want to follow this line of thought. They are beyond the scope of this book, but the reader who is of a reflective disposition will be ready for them when he meets them. I shall skip these interesting issues on the grounds that they are somewhat too technical for the present work, which is going to concern itself with more mundane matters, but the reader needs to know that there are problems about fixing up the window so that the FFT gives acceptable answers
reasonably free of artefacts of the analysis process. See the standard works on Speech Recognition and many, many issues of the IEEE Transactions on Signal Processing, and the IEE Proceedings Part I, for relevant papers.
We take, then, some time interval, compute the FFT and then obtain the power spectrum of the waveform of the speech signal in that time interval of, perhaps, 32 msec. Then we slide the time interval, the window, down the signal, leaving some overlap in general, and repeat. We do this for the entire length of the signal, thus getting a sequence of perhaps ninety vectors, each vector in dimension perhaps 256, each of the 256 components being an estimate of the energy in some frequency interval between, say, 80 Hz and 10 kHz. The FFT follows a divide and conquer strategy, and when it divides it divides by two, so it works best when the number of samples in the window, and hence the number of frequency bands, is a power of two; 256 or 512 are common. Instead of dealing with the raw signal values in the window, we may first multiply them by a window shaping function such as a Hamming window, which is there to accommodate the inevitable discontinuity of the function when it is regarded as periodic. Usually this just squashes the part of the signal near the edge of the window down to zero while leaving the bit in the middle essentially unchanged.
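To fix the procedure in the mind, here is a minimal sketch in Python with numpy; the window length of 512 samples and the hop of 256 are merely plausible values of the sort discussed above, and the function name is my own:

```python
import numpy as np

def power_spectra(signal, window_len=512, hop=256):
    """Slide a Hamming window along the signal; return one power
    spectrum per window position (a crude spectrogram)."""
    window = np.hamming(window_len)   # squashes the edges towards zero
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        chunk = signal[start:start + window_len] * window
        spectrum = np.abs(np.fft.rfft(chunk)) ** 2   # energy per frequency bin
        frames.append(spectrum)
    return np.array(frames)   # shape: (num_windows, window_len // 2 + 1)
```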
Practical problems arise when the signal contains frequencies too high for the sampling rate to capture faithfully, so that they masquerade as lower ones; this is called `aliasing' in the trade,
and is most commonly detected when the waggon
wheels on the Deadwood Stage go backwards, or
a news program cameraman points his camera at
somebody's
computer terminal and gets that infuriating black
band drifting across the screen and the flickering
that makes the thing unwatchable. There is a risk
that high frequencies in the speech signal will be aliased down to spurious lower frequencies and manifest themselves as a sort of flicker. So it is usual
to kill off all frequencies not being explicitly
looked for, by passing the signal through a filter
which will not pass very high or very low frequencies.
Very high usually means more than half the sampling frequency, and very low means little more than the mains frequency.
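Strictly, this filtering must be done to the analogue signal before it is sampled, but the frequency response involved can be sketched digitally; the following rough illustration uses scipy, with cutoff values merely of the sort just described, and the function name is mine:

```python
from scipy.signal import butter, filtfilt

def anti_alias(signal, fs=20_000, low=60.0, high=None):
    """Band-pass the signal: remove hum near the mains frequency and
    everything approaching the Nyquist frequency (half the sampling rate)."""
    if high is None:
        high = 0.45 * fs              # a little below fs / 2 for safety
    # 4th-order Butterworth band-pass; cutoffs normalised to Nyquist.
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)     # zero-phase filtering
```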
The 256 numbers may usefully be `binned' into
some smaller number of frequency bands, perhaps
sixteen of them, also covering the acoustic frequency
range. The frequency bands may have their
centres selected judiciously in a more or less
logarithmic division of the frequency range,
their widths also adjusted accordingly, and the result
referred to as a simulated filterbank of sixteen
filters covering the audio spectrum. Alternatively,
you could have a bank of 16 bandpass filters,
each passing a different part of the audio spectrum,
made up from hardware. This would be rather old-fashioned of you, but it would be faster and produce
smoother results. The hardware option would
be more popular were it not for the tendency of hardware
to evolve, or just as often devolve, in time.
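A simulated filterbank of the kind just described might be sketched as follows, consuming the power spectra computed earlier; the sixteen bands and the 80 Hz to 10 kHz range are the illustrative values from the text, and the function name is mine:

```python
import numpy as np

def filterbank(power_spectrum, fs=20_000, n_bands=16,
               f_lo=80.0, f_hi=10_000.0):
    """Bin an FFT power spectrum into n_bands bands whose edges are
    logarithmically spaced between f_lo and f_hi."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0, fs / 2, n_bins)          # frequency of each FFT bin
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # logarithmic band edges
    band_energy = np.empty(n_bands)
    for i in range(n_bands):
        in_band = (freqs >= edges[i]) & (freqs < edges[i + 1])
        band_energy[i] = power_spectrum[in_band].sum()
    return band_energy     # one energy value per simulated filter
```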
The so-called `Bark' scale, slightly different from logarithmic and popularly supposed to correspond more closely to perceptual differences, is used by the more sophisticated, and, since speech has been studied since Helmholtz, there is an extensive literature on these matters. Most of the literature, it must be confessed, appears to have had minimal effect on the quality of Speech Recognition Systems.
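For the record, one common closed-form approximation to the Bark scale, due to Traunmüller, is easily stated; I give it only as a sketch, since several variants circulate in the literature:

```python
def hz_to_bark(f):
    """Traunmüller's (1990) approximation to the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53
```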
Either of these approaches turns the utterance into a longish sequence of vectors in $\mathbb{R}^{16}$ representing the time development of the utterance, or, more productively, as a trajectory, discretely sampled, in $\mathbb{R}^{16}$. Many repetitions of the same word by the same speaker might reasonably be expected to be described as trajectories which are fairly close together in $\mathbb{R}^{16}$. If I have a family
of trajectories corresponding to one person saying
`yes' and another family corresponding to the same person saying `no', and I am then given an utterance of one of those words by the same speaker and wish to know which it is, some comparison between the new trajectory and the two families I already have should allow us to make some sort of decision as to which of the two words we think most likely to have been uttered.
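In the crudest possible terms, and sweeping under the carpet the fact that utterances differ in length, the comparison might go as follows; this sketch assumes each utterance has somehow been reduced to the same number of frames, and the names are all mine:

```python
import numpy as np

def classify(new_traj, yes_family, no_family):
    """Assign new_traj to whichever family contains the closer trajectory.
    All trajectories are arrays of shape (frames, 16) with equal frames."""
    def dist(a, b):
        return np.linalg.norm(a - b)    # Euclidean distance between trajectories
    d_yes = min(dist(new_traj, t) for t in yes_family)
    d_no = min(dist(new_traj, t) for t in no_family)
    return "yes" if d_yes < d_no else "no"
```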
Put in this form, we have opened up a variant of traditional pattern recognition which consists of distinguishing not between different categories of point in a space, but different categories of trajectory in the space. Everything has become time dependent; we deal with changing states.
Note that when I say `trajectory' I mean the entire
map of the time (discretely sampled) into the
space, and not just the set of points in the path.
These are very different things. If you walk
across the highway and your path intersects that of a
bus, this may not be important as the bus may
have long since gone by. On the other hand if your
trajectory intersects that of the bus,
then unless it does so when you are both at rest, you
are most unlikely to come back to reading this
book or,
indeed, any other. I am prepared to reconstruct
the time origin of any two utterances so that
they both start from time zero, but one may travel
through the same set of points twice as fast
as another, and the trajectory information will record
this, while the image of the two
trajectories will be the same. A trajectory is
a function of time; the path is the image of this function.
If you want to think of the path as having clock
ticks marked on it in order to specify the
trajectory, that's alright with me. Now it might
be the case, and many argue that in the case
of speech
it is the case, that it is the path and
not the trajectory that matters. People can seldom mean it quite literally in these terms, since the words /we/ and /you/ trace out almost the same path, but with the direction reversed.
Still, if two trajectories differ only in the
speed along the path, it can be argued that they
must
sound near enough the same. This may or may not
be the case; attempts to compare two trajectories
so as
to allow for differences in rate which are not
significant are commonly implemented in what
is known as
Dynamic Time Warping (DTW). I shall not describe this method in detail, because even if the underlying assumption applies to speech, it may not hold for everything, and DTW has largely been replaced by Hidden Markov Modelling these days.
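All the same, the idea behind DTW is easily sketched for the curious reader: a dynamic programming table in which either trajectory is allowed to dawdle or hurry relative to the other. A minimal version, with names of my own choosing:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between trajectories a (shape (n, d))
    and b (shape (m, d)): the cheapest way of aligning them when either
    is allowed to run fast or slow along its path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            # advance in a, in b, or in both: this is the time warp
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]
```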
On first inspection, one might argue that a trajectory
in one space is simply a point in a function
space. This is true, but not immediately helpful, since different trajectories may not lie in the same space, at least as function spaces are traditionally defined.
It is rather hard to put a sensible metric
structure on the set of maps from any interval
of the real numbers into $\mathbb{R}^n$ without any other considerations. So the abstract extension from
points to trajectories needs some extra thought,
which may depend upon the nature of the data.
It would be a mistake to think that binning the power spectrum into some number $n$ of intervals is the only way of turning speech into a trajectory in $\mathbb{R}^n$; there are others involving so-called cepstra or LPC coefficients, which are believed in some quarters to be intrinsically superior.
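By way of illustration only, a crude real cepstrum can be had from the power spectra computed earlier by taking logarithms and transforming again; the recipe below is one simple version, and the function name and the choice of twelve coefficients are mine:

```python
import numpy as np

def cepstrum(power_spectrum, n_coeffs=12):
    """Crude cepstral coefficients: the inverse FFT of the log power
    spectrum, keeping only the first few (low 'quefrency') terms."""
    log_spectrum = np.log(power_spectrum + 1e-10)   # avoid log of zero
    ceps = np.fft.irfft(log_spectrum)
    return ceps[:n_coeffs]
```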