
Lots of Gaussians: The EM algorithm

Fitting a single gaussian to the point set of Fig.4.1 would leave out a lot of the structure. One way to improve the description of the point set would be to go to higher order moments, since we are using only second order moments here. If we do this we certainly get more information about the distribution of the points. Another approach is to localise the data: this amounts to fitting several ellipses, or, if you prefer, several gaussians.

If you think of how you might describe handprinted characters as vectors, either by moments or some other means, then it is reasonable that much of the time two exemplars of the same character will be fairly close in the representation space. On the other hand, there are equally clearly cases where they won't; if for example half of the digits of a postcode are written by Europeans with a predilection for putting a horizontal bar across the vertical stroke of a seven, and the other half are written by Americans or British who have no such urge, then the sevens will form two quite distinct clusters in the representation space. Each cluster might, if we are optimistic, be representable by a single gaussian, but the two clusters together might not. So it would be as well to have several gaussians available for modelling digits, several for each digit according to the various strategies which are available for forming them.

The use of a mixture of gaussians is well recognised as a statistical technique, and the so-called EM algorithm can be used to fit a fixed number of gaussians to a data set such as Fig.4.1. The bibliography contains pointers to books telling you the theory of handling mixture models quite generally. I shall concentrate exclusively on mixtures of gaussians, although it is not hard to see how to adapt the methods to other pdfs. It is found experimentally that the EM algorithm, although very fashionable, is very vulnerable to initialisation, and converges at a rate somewhat slower than a slug bungee-jumping in treacle. I shall have a little to say about methods of doing something about this a little later.

Given a data set such as that of Fig.4.1, it is not at all unreasonable to decide to fit six gaussians. It is also reasonable to wonder what one ought to do in higher dimensions where it might not be so easy to decide what a good number of gaussians would be. This is an important point I shall come to later, but for the time being let us suppose that we have a data set X of points all of the same category in ${\mathbb R}^n$, some number, k, of gaussians for distributing over the points, and we seek the maximum likelihood model. Algebraically, we have to find the maximum likelihood model for

\begin{displaymath}
\sum_{i=1}^k w_i \, g_{[{\bf m}_i,{\bf V}_i]}({\bf x})
\end{displaymath}

where the $w_i$ are weights which have to be chosen so that the integral over the whole space is 1; since each gaussian integrates to 1, this means the weights are non-negative and sum to 1. So we have k weights, each a real number, k centres, each a vector in ${\mathbb R}^n$, and k symmetric positive definite matrices, each n by n, all to be found so as to maximise the product of the likelihoods of the data points, that is:

\begin{displaymath}
\prod_{{\bf x}\in {\cal X}} \sum_{i=1}^k w_i \, g_{[{\bf m}_i,{\bf V}_i]}({\bf x})
\end{displaymath}

where of course each gaussian $g_{[{\bf m}_i,{\bf V}_i]}$ is a function of the form:

\begin{displaymath}
g_{[{\bf m}_i,{\bf V}_i]}({\bf x}) =
 \frac{1}{(\sqrt{2\pi})^n \sqrt{\det {\bf V}_i}}\;
 e^{-\frac{({\bf x}-{\bf m}_i)^T {\bf V}_i^{-1} ({\bf x}-{\bf m}_i)}{2}}
\end{displaymath}

and the whole has to be maximised by suitable choice of the weights and parameters for the gaussians. This is a rather horrid problem in constrained optimisation. Rather than try to solve it exactly, it is common to use the EM algorithm to find a solution. EM stands for `Expectation Maximisation' and describes the two steps in the algorithm. See the references for the origins and generality of the EM algorithm; here I shall give a procedure for obtaining the Maximum Likelihood Gaussian mixture model. I am most grateful to Arthur Nadas of the Automatic Speech Recognition group, IBM's Thomas J. Watson Research Center, Yorktown Heights for explaining this to me, among other things, in the dim and distant past.
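To make this concrete, here is a minimal sketch, in Python with numpy, of one way to evaluate the gaussian densities above and to carry out the E and M steps of the EM algorithm for a gaussian mixture. The function names, the crude initialisation from randomly chosen data points, and the small regularising term added to each covariance matrix are illustrative choices of mine, not prescriptions from the text.

\begin{verbatim}
# Sketch of EM for a mixture of k gaussians.  X is an array of shape
# (N, n): N data points in R^n.  Names and initialisation are illustrative.

import numpy as np

def gaussian_density(X, m, V):
    """Value of g_[m,V] at each row of X."""
    n = X.shape[1]
    diff = X - m
    # Mahalanobis distances (x - m)^T V^{-1} (x - m), one per data point.
    mahal = np.einsum('ij,ij->i', diff, np.linalg.solve(V, diff.T).T)
    _, logdet = np.linalg.slogdet(V)
    return np.exp(-0.5 * mahal) / np.sqrt((2 * np.pi) ** n * np.exp(logdet))

def em_gaussian_mixture(X, k, iterations=100, seed=0):
    N, n = X.shape
    rng = np.random.default_rng(seed)
    # Crude initialisation: equal weights, k random data points as centres,
    # the global covariance (slightly regularised) for every component.
    w = np.full(k, 1.0 / k)
    m = X[rng.choice(N, k, replace=False)].astype(float)
    V = np.array([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(k)])
    for _ in range(iterations):
        # E step: responsibility of gaussian i for each data point x,
        # i.e. w_i g_i(x) normalised over the k gaussians.
        dens = np.stack([w[i] * gaussian_density(X, m[i], V[i])
                         for i in range(k)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, centres and covariances from the
        # responsibility-weighted data.
        Nk = resp.sum(axis=0)
        w = Nk / N
        m = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - m[i]
            V[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] \
                   + 1e-6 * np.eye(n)
    return w, m, V
\end{verbatim}

For the data of Fig.4.1 one would call something like em_gaussian_mixture(X, 6): each E step asks, for every point, how much of it each gaussian should take responsibility for, and each M step re-estimates the weights, centres and covariance matrices from those responsibilities. In practice one monitors the likelihood and stops when it ceases to improve appreciably, bearing in mind the remarks above about slow convergence and sensitivity to the initialisation.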



 