Suppose we are agreed that we are going to use k gaussians in $\mathbb{R}^n$ for modelling our data, subject to the caveats given above on boundedness, and that we have no deep feelings about what k ought to be. If we are going to encode our data set relative to the pdf given by the k gaussians and their weights, we need to transmit or store not just the data encoded (N data points, each comprising n real numbers, each to precision P) using the gaussian mixture model, but also (1) the number of gaussians k, (2) the k weights, real numbers encoded to some precision P', (3) the k centres, each a vector in $\mathbb{R}^n$, and (4) the k covariance matrices, each having n(n+1)/2 values.
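To keep the bookkeeping straight, here is a minimal sketch, in Python, of the values that make up this header for a mixture of k gaussians in n dimensions (the function name and interface are my own, purely for illustration):

```python
def mixture_header_counts(k: int, n: int) -> dict:
    """Count the values that must be stored besides the encoded data
    for a mixture of k gaussians in n dimensions."""
    return {
        "number_of_gaussians": 1,                     # the integer k itself
        "weights": k,                                 # one weight per gaussian
        "centre_components": k * n,                   # each centre lives in R^n
        "covariance_entries": k * n * (n + 1) // 2,   # symmetric n x n matrices
    }

# Example: 3 gaussians in 2 dimensions -> 3 weights, 6 centre components
# and 3 * 3 = 9 covariance entries.
print(mixture_header_counts(3, 2))
```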
We can make some plausible simplifications: let us assume that the precision of the parameters, P', is the same as the precision of the components of the data vectors, P. This merits some thought, but let's duck the issue. Let us also suppose that the parameters can take any values inside some bounded region of the parameter space, with a uniform pdf. Let us further suppose that k is going to vary within some small range, from, say, 1 to 20, uniformly.
Thus we lose generality but make the sums
easier for the kinds of cases we need to consider.
Doing this in full generality is of interest to statisticians, but we have more concrete problems in mind at present.
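Under these assumptions the cost of specifying k is a small constant: choosing k uniformly from 1 to 20 costs $\log_2 20 \approx 4.32$ bits, the same for every candidate model, which is why it can safely be ignored when comparing them.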
I shall suppose that the receiver knows what a gaussian is, and that I don't have to teach him what the idea of a function is, and so on. This is not a trivial matter if we believe that we learn to use some compression systems, but I simplify down to comparing between gaussian mixtures only. Extra overheads will therefore be the same and can be neglected in the calculation.
Then with these crude assumptions, I can code N data points in $\mathbb{R}^n$ directly by sending NnP bits. I assume here that the precision P
affects all the components of the vectors, including
the digits representing the integer part. If necessary,
fill to the left with zeros. This is
wasteful, but not as wasteful as sending you the
length each time if the length variation is
small.
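A minimal sketch of this naive coding, assuming (my assumption, for simplicity) that the data have already been scaled into the unit hypercube so that P bits per component suffice:

```python
import numpy as np

def naive_code_length_bits(N: int, n: int, P: int) -> int:
    """Cost in bits of sending N points of R^n directly, at P bits per component."""
    return N * n * P

def quantise_component(x: np.ndarray, P: int) -> np.ndarray:
    """Map each component of x (assumed to lie in [0, 1)) to one of the 2**P
    equally spaced cells, i.e. a P-bit integer (written with leading zeros
    where necessary)."""
    return np.floor(x * (2 ** P)).astype(np.int64)

# Example: 1000 points in R^3 at P = 16 bits per component -> 48000 bits.
print(naive_code_length_bits(1000, 3, 16))
```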
If I send the mixture model, I get a fixed number of bits to specify k, and then kP bits for the weights, nkP bits for the centres, and kPn(n+1)/2 bits for the covariance matrices.
The total number of bits I get to use is therefore approximately
$$
kP\left(1 + n + \frac{n(n+1)}{2}\right) \;-\; N\int_B f(x)\,\log_2\!\left(2^{-nP} f(x)\right)\,dx
$$
where f(x) is the gaussian mixture model discussed
earlier, and where an additive constant for
specifying k, which I shall suppose does not
depend on k (because it will be the same for
all
the values of k within our range) has been omitted.
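In code, the parameter (header) part of that total is just the arithmetic above; a sketch, with a hypothetical function name of my own:

```python
def model_header_bits(k: int, n: int, P: int) -> int:
    """Bits needed to describe the mixture itself: k weights, k centres in R^n
    and k symmetric covariance matrices, all at P bits per real number.
    The constant for specifying k is omitted, as in the text."""
    return k * P * (1 + n + n * (n + 1) // 2)

# Example: k = 5, n = 2, P = 16 -> 5 * 16 * (1 + 2 + 3) = 480 bits.
print(model_header_bits(5, 2, 16))
```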
I also take it that I am integrating the gaussian f over some bounded region B in $\mathbb{R}^n$. Naturally, I should take
B to be centred on the region where the data
is. The obvious choice for B is to make it
some
fixed number of `standard deviations' from the
k centres. A simpler choice is to take some
hypercube within which the data may be found and
use that. This is what I shall do.
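One plausible way to set up this hypercube in practice (my own sketch, not part of the argument) is simply to rescale each coordinate of the data into the unit interval and take B to be the unit hypercube:

```python
import numpy as np

def rescale_to_unit_hypercube(X: np.ndarray, margin: float = 1e-9) -> np.ndarray:
    """Affinely map an N x n data array into the unit hypercube [0, 1)^n,
    which then serves as the bounded region B."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo + margin)
```

The endpoints lo and hi would presumably also have to be known to the receiver, but, like the overheads mentioned earlier, that cost is the same for every candidate mixture.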
If I have faith in my (truncated) model, then the amount of compression can be computed directly from the above formula: it represents the best compression for which I can hope, for data sets of size N generated according to the model. A more cautious approach is to take the actual data and merely try to compress that, using the model as a convenient device for doing so.
What we do here is to sum the logarithms of the improbabilities of each data point according to the model. If each real number is specified to P bits, and if we have normalised everything inside the unit hypercube of dimension n, then there are $2^{nP}$ cells, each having measure $2^{-nP}$. The probabilities are the likelihoods multiplied by this last number. So the compressed data will have size in bits given by the sum
$$
-\sum_{j=1}^{N} \log_2\!\left(2^{-nP}\,f(x_j)\right) \;=\; NnP \;-\; \sum_{j=1}^{N}\log_2 f(x_j)
$$
It must be remembered that the function f needs to be adjusted to have integral 1 over the hypercube into which the data has been compressed.
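A sketch of this computation, assuming the data have already been rescaled into the unit hypercube and, as an approximation, neglecting the renormalisation of f over the cube just mentioned (function name is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def data_code_length_bits(X, weights, means, covs, P):
    """Compressed size of the data: -sum_j log2( 2**(-n*P) * f(x_j) ),
    which is N*n*P minus the sum of the log2 likelihoods under the mixture f."""
    N, n = X.shape
    f = np.zeros(N)
    for w, mu, S in zip(weights, means, covs):
        f += w * multivariate_normal.pdf(X, mean=mu, cov=S)
    return N * n * P - np.sum(np.log2(f))
```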
Now given two such functions, all we need to do to get maximum compression, in the case where k is the same and the data set fixed, is to make the sum of the log likelihoods as big as possible.
This of course maximises the likelihood product. If the two fs also differ in respect of the number k, the penalty for increasing k has to be balanced against an increase in the log likelihood sum. This gives you, according to Rissanen, the better of the two models.
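Combining the two costs gives the comparison in a few lines; this sketch repeats the pieces above so that it stands alone, and the candidate with the smaller total is the one Rissanen's criterion prefers:

```python
import numpy as np
from scipy.stats import multivariate_normal

def total_description_length(X, weights, means, covs, P):
    """Header bits for the mixture plus the compressed size of the (rescaled) data."""
    N, n = X.shape
    k = len(weights)
    header_bits = k * P * (1 + n + n * (n + 1) // 2)
    f = np.zeros(N)
    for w, mu, S in zip(weights, means, covs):
        f += w * multivariate_normal.pdf(X, mean=mu, cov=S)
    data_bits = N * n * P - np.sum(np.log2(f))
    return header_bits + data_bits
```

A larger k buys a bigger log likelihood sum but pays kP(1 + n + n(n+1)/2) extra header bits; whichever of the two candidate mixtures ends up with the smaller total is, on this accounting, the better model.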