It appears to be an article of faith with practitioners of Automatic Speech Recognition (ASR) that an utterance is a string of phonemes and a phoneme is a region of the space of speech sounds. This is not quite what a phonetician means by a phoneme, but phoneticians are people who know, roughly, how to tell each other how to hold their faces so as to make noises like foreigners talking, and are generally quite unable to say much about the speech space. The anthropologists who feel the clash of disparate cultures have nothing on a speech engineer going to a meeting of phoneticians or vice-versa. The more courageous find it exhilarating, the less, terrifying.
If we follow, temporarily and without prejudice, the ASR engineers, then we solve the problem of recognition of trajectories by first chopping up the space into regions, and labelling each region with a symbol. Ideally, each region of the space would correspond to what a phonetician would call a phoneme. In practice the regions may be rather arbitrary in position and shape. A trajectory, complete with clock ticks, then gets turned into a sequence of symbols, the labels of the region it is in at each clock tick. If we have a large number of utterances, we obtain from our trajectories, and our way of chopping the space up into regions, a corresponding number of symbol strings. We then try to model the strings corresponding to a word such as `yes' by means of a stochastic grammar, the strings corresponding to the word `no' by a different stochastic grammar, and a newly uttered word has its probability computed by applying each grammar in turn. Then we use Bayesian arguments to get back from the probability of each grammar having generated the string, to the probability, given the string, of one grammar or another having been the source. A stochastic grammar, to anticipate the next chapter somewhat, is an algorithm which assigns a probability to a string of symbols to say how likely it is to occur in a sample from a particular language, and a (stochastic) language is a set of strings of symbols with a frequency attached to each of them. It is clear that a certain collection of assumptions about the nature of speech has been employed to come to this approach.
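To make the Bayesian step explicit in symbols: if $P(s \mid w)$ is the probability which the grammar for word $w$ assigns to the symbol string $s$, and $P(w)$ is a prior probability for the word, then
$$
P(w \mid s) \;=\; \frac{P(s \mid w)\, P(w)}{\sum_{w'} P(s \mid w')\, P(w')},
$$
and we report the word whose posterior probability is largest.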
Chopping up the space into lumps is known in the trade as Vector Quantisation, which no doubt sounds a little classier put that way, and may be accomplished in a variety of ways. Some of these are so arbitrary and whimsical as to be embarrassing. One of the better ways is actually to take the speech data as just a big stack of points in $\mathbb{R}^n$ or whatever, and use a gaussian mixture model with some judiciously chosen number of gaussians. Then each gaussian may be used to compute the likelihood of a given piece of speech sound having been obtained from it, and the largest likelihood yields the name of that particular gaussian as the symbol. Alternatively we can use the Mahalanobis distance from each of the centres. This gives somewhat better results than the conventionally used LBG algorithm, a variant of k-means clustering which I shall now describe.
Suppose we have a set of points in $\mathbb{R}^n$ and we wish to chop them up into clusters. We first decide how many clusters we propose to have, say k, and choose ourselves k different kinds of points, call them centres. Now we throw the centres down at random in the space, or perhaps uniformly, and then partition the points so as to assign each point to the closest centre, thus constructing subclusters. Then we move the centres to the centroids of the subclusters, and recompute the assignment of points to subclusters, and the shifting of centres to centroids of subclusters, until the process stabilises. We may have to get rid of centres which coincide, in extreme cases. This may be said in algebra:
Given $D \subset \mathbb{R}^n$, $D$ finite, and $k \in \mathbb{Z}^+$, let $c^0 : \{1, \ldots, k\} \rightarrow \mathbb{R}^n$ be an initial assignment of centres; this might be uniform on the region containing $D$ or random. Now let $P^m$, for $m = 0$ in the first instance, be the partition of $D$ which has
$$
P^m_j \;=\; \bigl\{\, x \in D \;:\; \|x - c^m(j)\| \le \|x - c^m(i)\| \ \text{for all } i \in \{1, \ldots, k\} \,\bigr\}, \qquad 1 \le j \le k,
$$
ties being broken arbitrarily. Then let $c^{m+1}(j)$ be the centroid of $P^m_j$, and repeat, incrementing $m$, until the partition no longer changes.
It turns out to be rather too dependent on the initial assignment of centres to leave the thoughtful reader altogether happy.
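A minimal sketch of the procedure just described might look as follows; the function name, the choice of uniform initialisation over the box containing the data, and the reseeding of empty subclusters are my own expedients, not anything prescribed above.

```python
import numpy as np

def k_means(D, k, iterations=100, rng=None):
    """Plain k-means on a finite set D of points in R^n.

    D : (N, n) array of data points.
    k : number of centres.
    Returns the centres and the index of the closest centre for each point.
    """
    rng = np.random.default_rng(rng)
    # Initial centres: thrown down uniformly over the box containing D.
    lo, hi = D.min(axis=0), D.max(axis=0)
    centres = rng.uniform(lo, hi, size=(k, D.shape[1]))
    for _ in range(iterations):
        # Assign each point to its closest centre (euclidean distance).
        d2 = ((D[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centre to the centroid of its subcluster; reseed any
        # centre which captured no points (the coinciding-centres problem).
        new_centres = centres.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centres[j] = D[labels == j].mean(axis=0)
            else:
                new_centres[j] = D[rng.integers(len(D))]
        if np.allclose(new_centres, centres):
            break   # the process has stabilised
        centres = new_centres
    return centres, labels
```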
In the LBG (Linde, Buzo, Gray, q.v.) algorithm, the k means are obtained by a splitting process: first we go to the centroid of the whole data set D. Then we choose a random line and split the centre into two centres, each moved apart a little distance along the line. Then we partition the data set by assigning each point of it to the closest centre, just as before. Then we move the centres to the centroids of the partitioned bits, as before. And we proceed to replicate the splitting process until we get bored or run out of points. This leads to k being always a power of two. The method again is dependent on the orientation of the lines used and often leads to bad clusters, which is to say clusters that don't look like natural clusters to the naked eye. If we generate the data by a gaussian mixture, we can get some exceedingly unappealing results for the final LBG assignment of points to clusters, given that we know how we in fact did it. In abstract terms, this method places more significance on the metric than is normally warranted. It is true that many metrics will agree about where the centroid of a set of points should be, but they usually have a depressing indifference to the distribution of the data. Fitting gaussian clusters and computing Mahalanobis distances or likelihoods actually gives somewhat better performance in recognition systems, which suggests that reality doesn't love the standard metrics all that much. We can think of the quadratic form as being a local estimate of the metric tensor for the generating process, or if you prefer, the data is trying to tell you something about what `close' means, and it would be advisable to listen to it.
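A sketch of the splitting idea follows; the size of the perturbation and the use of a single random direction per splitting round are my own choices, not anything laid down by Linde, Buzo and Gray.

```python
import numpy as np

def lbg(D, k, perturbation=1e-3, iterations=100, rng=None):
    """LBG-style codebook by repeated splitting; k should be a power of two."""
    rng = np.random.default_rng(rng)
    centres = D.mean(axis=0, keepdims=True)      # centroid of the whole data set
    while len(centres) < k:
        # Split every centre into two, moved a little apart along a random line.
        direction = rng.normal(size=D.shape[1])
        direction *= perturbation / np.linalg.norm(direction)
        centres = np.vstack([centres - direction, centres + direction])
        # Re-partition the data and move centres to subcluster centroids.
        for _ in range(iterations):
            d2 = ((D[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new_centres = np.array([
                D[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                for j in range(len(centres))
            ])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
    return centres
```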
As a Vector Quantisation (VQ) system, gaussian mixture modelling by fairly crude means is as quick and easy as most, and more defensible. The simplest method is to replace the k-means approach by computing not just the mean of the subcluster assigned to the centre, but also its covariance matrix, and then to use that in order to compute a Mahalanobis distance when doing the next partitioning of the data points. We can think of this as taking a local estimate of the metric, or a second order approximation to the distribution of data points. You will note that this is a sort of quick and dirty version of the EM algorithm.
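A sketch of that quick and dirty version; the initialisation from randomly chosen data points and the small ridge term used to keep the covariance matrices invertible are my own expedients.

```python
import numpy as np

def mahalanobis_k_means(D, k, iterations=50, ridge=1e-6, rng=None):
    """k-means where each subcluster also carries a covariance matrix, and
    points are re-assigned by Mahalanobis rather than euclidean distance.
    A quick and dirty relative of EM for a gaussian mixture."""
    rng = np.random.default_rng(rng)
    N, n = D.shape
    # Initial centres: k points drawn from the data; initial metric: euclidean.
    centres = D[rng.choice(N, size=k, replace=False)].astype(float)
    covariances = np.array([np.eye(n)] * k)
    labels = np.full(N, -1)
    for _ in range(iterations):
        # Assign each point to the centre of smallest Mahalanobis distance.
        inverses = np.array([np.linalg.inv(c) for c in covariances])
        diffs = D[:, None, :] - centres[None, :, :]          # shape (N, k, n)
        d2 = np.einsum('nki,kij,nkj->nk', diffs, inverses, diffs)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                            # stabilised
        labels = new_labels
        # Re-estimate the centre and covariance of each subcluster.
        for j in range(k):
            members = D[labels == j]
            if len(members) > n:     # need enough points for a covariance
                centres[j] = members.mean(axis=0)
                covariances[j] = np.cov(members.T) + ridge * np.eye(n)
    return centres, covariances, labels
```

Each centre and covariance then gives us, for a new point of speech, either a Mahalanobis distance or a likelihood, and the name of the winning gaussian is the symbol, as described above.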
The labels for the different parts of the space are known as codewords and the list of them all is known as a codebook, for reasons which derive from coding theory. The general procedure of replacing points in a space by some set of central approximating points is used in many areas, for example in image compression, and it is the main idea in coding against noise, where the hope is that if you send the centres and they get corrupted, the points will be moved a short distance, but providing we choose to replace each point by the closest centre, we usually get back to where we started from. In this case a point is usually in a finite space of bytes.
It is common to try to estimate the extent to which replacing the actual point by the closest centre has deformed the trajectory, by adding up the squares of the distances of the actual points from the closest centres. This is called the distortion measure of the codebook. Of course, this presupposes that we know the right way to measure distances in the space, which we don't, so there is a certain fatuity in this approach. In image compression, for example, the Discrete Cosine Transform or DCT, a relative of the DFT which expands in terms of cosine functions only, is used on the two dimensional signal that is an image, and then the resulting function is quantised. Now the DCT produces a spectrum of spatial frequencies in the original image, and the eye is relatively insensitive to high spatial frequencies and low spatial frequencies, and is most sensitive somewhere in the middle. This is telling you that the euclidean metric on the space of DCT transforms is a particularly useless one to use if you want to say when two images look similar to human beings. Nor is there any reason to believe that the euclidean metric on a space of power spectra is of much use as a measure of acoustic similarity as perceived by people. So attempts to obtain distortion measurements by this method betray a certain naivete on the part of those making them. There's a lot of it about.
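For the record, in symbols: if $q(x)$ denotes the index of the centre closest to the point $x$, the distortion of a codebook $\{c_1, \ldots, c_k\}$ on the data $D$ is
$$
\sum_{x \in D} \bigl\| x - c_{q(x)} \bigr\|^2,
$$
with the caveat, as just noted, that the norm itself is of doubtful standing.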
Having now chopped the space up into lumps and replaced the original trajectory by the sequence of codewords to which each point of the trajectory is closest (using, of course, a Mahalanobis distance in the case of the gaussian mixture modelling) we have the problem of representing the resulting strings by means of stochastic grammars. The stochastic grammars most commonly employed are known as Hidden Markov Models or HMMs for short. A hidden Markov model is offered here, heuristically, as a description of the vocal tract going from one state to another. Typically ASR workers use a state diagram such as Fig. 6.1.
Here the states have been labelled for ease of talking about the diagram. Variants having more states or additional arcs (such as from A to D) are common.
We turn this diagram into a stochastic grammar for a set of strings, a language sample from a stochastic language, over the alphabet of labels of the regions into which the space has been chopped up, sorry, vector quantised. The stochastic grammar is obtained as follows.
Imagine someone in the next room to you stepping over the arcs represented by arrowed lines, between discs represented by the circles of Fig. 6.1. Let them be supposed to have a handful of coins or dice or other means of generating random numbers. Suppose the diagram over which they trip their light fantastic has been augmented in two ways. The first is that for each arc out of a node or disc, there has been a probability assigned. In going from node < to node A there is no choice, so we can put a probability of 1 on this arrow, but once at node A we have three choices: jumping up in the air and coming back down on node A, hopping to node B, or skipping to node C. In order to decide which to do, your colleague throws a die or dice, and makes a decision depending on the outcome. If the three probabilities (AA), (AB), (AC) were the same, then your friend might hop if the die showed one or two, skip if it showed three or four, and jump if it showed five or six. The probabilities out of each node all add up to one, and we shall suppose that every node with an arc leading out of it has had probabilities assigned to it.
The other augmentation is to suppose that for each arc there is a probability distribution over the alphabet which assigns a probability of your friend shouting out each symbol. Again, having chosen probabilistically an arc to move along, your friend now has, while moving along it, to make another throw of coins or dice in order to decide which symbol to call out. Tricky, but not impossible. You may recognise this as a first order markov process for the grammar.
As your friend next door starts from the < node and proceeds through the diagram, you hear the names of symbols being shouted out (along with the clatter of falling dice and coins), with a pause as she comes to the end node, >, and marches back to the beginning to do it again.
The system may be simplified, at modest cost, by ensuring that your friend calls out a symbol on arriving at a node, which means that the probability distributions over the symbols of the alphabet are the same for any two arcs terminating in the same node.
It would be easy to write a program to do the same, and less taxing for your friend.
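Such a program, for the simplified symbols-on-nodes version just described, might look roughly as follows; the state ordering, the matrix layout and the toy diagram at the end are my own inventions, not a transcription of Fig. 6.1.

```python
import numpy as np

def generate_string(A, B, alphabet, rng=None):
    """Generate one symbol string from a hidden Markov model.

    A : (S, S) transition matrix; A[i, j] is the probability of moving from
        state i to state j.  State 0 is the start '<', state S-1 the end '>'.
    B : (S, M) emission matrix; B[j, m] is the probability of calling out
        alphabet[m] on arriving at state j.
    """
    rng = np.random.default_rng(rng)
    S = A.shape[0]
    state, symbols = 0, []
    while True:
        # Throw the dice to choose the next arc out of the current state...
        state = rng.choice(S, p=A[state])
        if state == S - 1:
            break                        # reached the end node '>'
        # ...and throw again to choose which symbol to shout out on arrival.
        symbols.append(alphabet[rng.choice(len(alphabet), p=B[state])])
    return symbols

# A small made-up five-state diagram (states <, A, B, C, >), not necessarily
# the topology of Fig. 6.1.
alphabet = ['a', 'b', 'c']
A = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],   # < goes to A with probability 1
              [0.0, 1/3, 1/3, 1/3, 0.0],   # A: jump to A, hop to B, skip to C
              [0.0, 0.0, 0.5, 0.3, 0.2],
              [0.0, 0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
B = np.full((5, 3), 1/3)                   # emissions uniform over the alphabet
print(generate_string(A, B, alphabet))
```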
The thing that is hidden about this markov model, is that you hear the string of symbols, but you don't know by which particular path through the network it was obtained. It may be possible that each string has a unique parse; that is to say, for any string there is either no way it could be obtained, or there is a unique sequence of hops, skips and jumps which was responsible. On the other hand there might be a large number of different possible routes which could have yielded the string.
To set the model up by choosing probabilities assigned to arcs or symbols according to one's whim is easy enough but unprofitable. The question of interest is, suppose we have a diagram such as the last figure and a set of strings, how do we select the probabilities of arcs and symbols so as to get a good stochastic grammar for our sample? And other questions of interest are, where did that diagram come from, and is there a better choice?
Avoiding the embarrassment of the last two questions pro tem., let us suppose that we have obtained the diagram itself by inspecting the backs of the tablets on which the ten commandments were inscribed, but that the creator has neglected to insert the probability distributions (stored usually as tables and called the A-matrix and B-matrix in the literature). We may obtain some suitable numbers by the following procedure, which the astute reader may recognise.