1. Initialise the k gaussians so that each of the covariance matrices $\Sigma_i$ is the identity matrix, each of the weights $w_i$ is $\frac{1}{k}$, and the k centres $m_i$ are taken to be some small random step in $\mathbb{R}^n$ away from a randomly selected point of the data set X.
2. For each point $x \in X$, and for each $i$ between 1 and $k$, compute the weighted likelihood $w_i\,g_i(x)$ using the last estimate of the parameters $(w_i, m_i, \Sigma_i)$. For each point $x \in X$, let $p_i(x)$ be the likelihood assigned to it by the $i$th member of the mixture:
$$ p_i(x) = w_i\,g_i(x), $$
and let $S_x$ be the sum
$$ S_x = \sum_{i=1}^{k} p_i(x), $$
and let
$$ r_i(x) = \frac{p_i(x)}{S_x}. $$
We think of the point $x$ as ready to be split up between the k gaussians according to the k fractions $r_i(x)$. In consequence, we have that
$$ \sum_{x \in X} \sum_{i=1}^{k} r_i(x) = |X|, $$
the cardinality of $X$.
3. Now re-estimate the $w_i$ as
$$ w_i = \frac{1}{|X|} \sum_{x \in X} r_i(x), $$
and the centres $m_i$ as the weighted centroids
$$ m_i = \frac{\sum_{x \in X} r_i(x)\,x}{\sum_{x \in X} r_i(x)}, $$
and finally, the $i$th covariance matrix, $\Sigma_i$, is re-estimated as
$$ \Sigma_i = \frac{\sum_{x \in X} r_i(x)\,(x - m_i)(x - m_i)^T}{\sum_{x \in X} r_i(x)}. $$
4. Run through steps 2 and 3 until there is no significant change in any of the numbers being estimated, then stop.
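The iteration above is easy to implement. The following is a minimal Python (numpy) sketch of it; the function name em\_gmm, the size of the initial random step, and the stopping test on the log-likelihood are illustrative choices of mine rather than anything fixed by the description above.
\begin{verbatim}
import numpy as np

def em_gmm(X, k, n_iter=100, tol=1e-6, seed=0):
    # X is an (N, n) array of N points in R^n; k is the number of gaussians.
    rng = np.random.default_rng(seed)
    N, n = X.shape

    # Step 1: identity covariances, equal weights, centres at randomly
    # chosen data points plus a small random step.
    w = np.full(k, 1.0 / k)
    m = X[rng.choice(N, size=k, replace=False)] \
        + 0.01 * rng.standard_normal((k, n))
    S = np.array([np.eye(n) for _ in range(k)])

    def density(X, mu, Sigma):
        # gaussian density g_i evaluated at every row of X
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        quad = np.einsum('ab,bc,ac->a', diff, inv, diff)
        norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm

    prev = -np.inf
    for _ in range(n_iter):
        # Step 2: p_i(x) = w_i g_i(x), the sums S_x, and the fractions r_i(x)
        p = np.stack([w[i] * density(X, m[i], S[i]) for i in range(k)], axis=1)
        Sx = p.sum(axis=1, keepdims=True)
        r = p / Sx

        # Step 3: re-estimate weights, centres and covariances
        Nk = r.sum(axis=0)                      # column sums of r
        w = Nk / N
        m = (r.T @ X) / Nk[:, None]
        for i in range(k):
            d = X - m[i]
            S[i] = (r[:, i, None] * d).T @ d / Nk[i]

        # Step 4: stop when the log-likelihood no longer changes appreciably
        loglik = np.log(Sx).sum()
        if abs(loglik - prev) < tol:
            break
        prev = loglik

    return w, m, S
\end{verbatim}
Running such a sketch from several different seeds on data like that of Fig.4.1 and comparing the final log-likelihoods is a quick way of seeing how strongly the outcome depends on the initialisation.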
It may be shown that each iteration does not decrease the likelihood, so that in principle this converges to a (locally) maximum likelihood mixture model for the data set X with k gaussians in the mixture. In practice it is found that the initialisation is rather critical and one does not always get good results. It is easy to experiment with data such as Fig.4.1 in two dimensions, where the data comes from a known number of gaussians, and to discover that very unsatisfactory `solutions' are found by the above process, as Fig.4.3 shows.
The problem of suitably initialising the EM algorithm will be discussed later, as will the issue of finding the `right' value of k, the number of gaussians. In the chapter on Quadratic Neural Nets I shall describe a wholly dynamical approach to the business of modelling point sets which has affinities with mixture modelling.
Inspecting the data by projection can help in suggesting a suitable number of gaussians to employ in a mixture model, and can also suggest reasonable initialisations of the centres.
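One convenient projection, among many possible, is onto the first two principal components of the data; the following short sketch (the function name principal\_plane and the use of matplotlib are my own illustrative choices) produces such a view.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

def principal_plane(X):
    # project the mean-centred data onto its first two principal components
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Y = principal_plane(X)
# plt.scatter(Y[:, 0], Y[:, 1], s=5)
# plt.show()
# Visible clusters in the plot suggest a value of k, and their apparent
# centres give starting points for the m_i.
\end{verbatim}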
It is worth noting that the estimates of the eigenvalues obtained from diagonalising the covariance matrix are biased in an interesting way: the true ellipsoid is likely to be shorter and fatter than the estimate obtained from a data set of limited size. In other words, the larger eigenvalues are overestimated and the smaller ones underestimated. One can accommodate this in various ways which are left to the imagination and enterprise of the reader.
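A small simulation makes the bias visible; the dimension and sample size below are arbitrary choices for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, N = 10, 50                               # dimension and sample size
X = rng.standard_normal((N, n))             # true covariance is the identity
C = np.cov(X, rowvar=False)
print(np.sort(np.linalg.eigvalsh(C))[::-1])
# The true eigenvalues are all 1, yet the largest sample eigenvalues come
# out well above 1 and the smallest well below: the fitted ellipsoid is
# longer and thinner than the true one.
\end{verbatim}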