
Overhead

Suppose we are agreed that we are going to use k gaussians in ${\mathbb R}^n$ for modelling our data, subject to the caveats given above on boundedness; we have no deep feelings about what k ought to be. If we are going to encode our data set relative to the pdf given by the k gaussians and their weights, we need to transmit or store not just the data encoded using the gaussian mixture model (N data points, each comprising n real numbers, each to precision P), but also (1) the number of gaussians k, (2) the k weights, real numbers encoded to some precision P', (3) the k centres, each a vector in ${\mathbb R}^n$, and (4) the k covariance matrices, each having n(n+1)/2 values. We can make some plausible simplifications: let us assume the precision of the parameters, P', to be the same as the precision of the components of the vectors, P. This merits some thought, but let us duck the issue. Let us also suppose that the parameters can take any values inside some bounded region of ${\mathbb R}$ with a uniform pdf. Let us also suppose that k is going to vary within some small range, from, say, 1 to 20, uniformly. Thus we lose generality but make the sums easier for the kinds of cases we need to consider. Doing this in full generality is interesting to statisticians, but we have more concrete problems in mind at present.

I shall suppose that the receiver knows what a gaussian is, and that I don't have to teach him what the idea of a function is, and so on. This is not a trivial matter if we believe that we learn to use some compression systems, but I shall simplify matters by comparing only between gaussian mixture models. The extra overheads will therefore be the same in all cases and can be neglected in the calculation.

Then with these crude assumptions, I can code N data points in ${\mathbb R}^n$ directly by sending NnP bits. I assume here that the precision P affects all the components of the vectors, including the digits representing the integer part. If necessary, fill to the left with zeros. This is wasteful, but not as wasteful as sending you the length each time if the length variation is small.
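
For instance (the numbers are purely illustrative), N = 1000 points in ${\mathbb R}^2$ at P = 16 bits per component cost $NnP = 1000 \times 2 \times 16 = 32000$ bits sent this way.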

If I send the mixture model, I get a fixed number of bits to specify k, and then kP bits for the weights, nkP bits for the centres, and kPn(n+1)/2 bits for the covariance matrices.
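
As a concrete check on this bookkeeping, here is a minimal sketch in Python; the function name and the example values of k, n and P are mine, chosen purely for illustration.

    # Overhead, in bits, of transmitting the mixture parameters, assuming
    # (as in the text) that every parameter is stored to the same precision P.
    def overhead_bits(k, n, P):
        weight_bits = k * P                        # the k weights
        centre_bits = k * n * P                    # the k centres in R^n
        covar_bits = k * (n * (n + 1) // 2) * P    # the k covariance matrices
        return weight_bits + centre_bits + covar_bits   # = kP(n^2 + 3n + 2)/2

    # For example, k = 3 gaussians in the plane at P = 16 bits per parameter:
    print(overhead_bits(3, 2, 16))   # 288, and indeed 3*16*(4 + 6 + 2)/2 = 288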

The total number of bits I get to use is therefore approximately

\begin{displaymath}
NPn + N\int_{B} f(x)\log_2\!\left(\frac{1}{f(x)}\right)\,dx + kP\left(\frac{n^2+3n+2}{2}\right)
\end{displaymath}

where f(x) is the gaussian mixture model discussed earlier, and where an additive constant for specifying k, which I shall suppose does not depend on k (because it will be the same for all the values of k within our range), has been omitted. I also take it that I am integrating the gaussian f over some bounded region B in ${\mathbb R}^n$. Naturally, I should take B to be centred on the region where the data is. The obvious choice for B is to make it some fixed number of `standard deviations' from the k centres. A simpler choice is to take some hypercube within which the data may be found and use that. This is what I shall do.
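
As a sanity check on the middle term, consider the special case k = 1, with B taken so large that the truncation can be neglected (an assumption made only for this aside). The integral is then the differential entropy, in bits, of a single gaussian with covariance matrix $\Sigma$:

\begin{displaymath}
\int f(x)\log_2\!\left(\frac{1}{f(x)}\right)\,dx = \frac{1}{2}\log_2\left((2\pi e)^n \det \Sigma \right)
\end{displaymath}

so more tightly concentrated gaussians (smaller $\det \Sigma$) give a smaller middle term and hence better compression, as one would hope.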

If I have faith in my (truncated) model, then the amount of compression can be computed directly from the above formula: it represents the best compression for which I can hope, for data sets of size N generated according to the model. A more cautious approach is to take the actual data and merely try to compress that, using the model as a convenient device for doing so.

What we do here is to sum the logarithms of the improbabilities of each data point according to the model. If each real number is specified to P bits, and if we have normalised everything inside the unit hypercube of dimension n, then there are $2^{nP}$ cells, each having measure $2^{-nP}$. The probabilities are the likelihoods multiplied by this last number, which means a data point ${\bf x}$ costs about $nP - \log_2(f({\bf x}))$ bits. So the compressed data will have size in bits given by the sum

\begin{displaymath}
NPn - \sum_{{\bf x}\in {\cal X}} \log_2(f({\bf x})) + kP\left(\frac{n^2+3n+2}{2}\right)
\end{displaymath}

It must be remembered that the function f needs to be adjusted to have integral 1 over the hypercube into which the data has been compressed.
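
To make the recipe concrete, here is a hedged sketch in Python of that sum for an actual data set. The function and its argument names are mine, and the rescaling of f to have integral 1 over the hypercube is done here by a crude Monte Carlo estimate, which is a convenience of this sketch rather than anything prescribed above.

    import numpy as np
    from scipy.stats import multivariate_normal

    def description_length(X, P, weights, means, covs, rng=None):
        # X: N by n array of data points, already rescaled into the unit hypercube
        N, n = X.shape
        k = len(weights)
        comps = [multivariate_normal(mean=m, cov=c) for m, c in zip(means, covs)]

        def f(x):
            # gaussian mixture density at the points x
            return sum(w * c.pdf(x) for w, c in zip(weights, comps))

        # crude Monte Carlo estimate of the mixture's mass inside the unit
        # hypercube (volume 1), used to rescale f to integrate to 1 over it
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform(0.0, 1.0, size=(50_000, n))
        mass = np.mean(f(u))

        data_bits = N * P * n - np.sum(np.log2(f(X) / mass))
        model_bits = k * P * (n * n + 3 * n + 2) / 2
        return data_bits + model_bits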

Now given two such functions, all we need to do to get maximum compression, in the case where k is the same and the data set fixed, is to make the sum of the log likelihoods as big as possible.

This of course maximises the likelihood product. If the two fs also differ in respect of the number k, the penalty for increasing k has to be balanced against an increase in the log likelihood sum. This gives you, according to Rissanen, the better of the two models.
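
A hedged sketch of that comparison, with the log likelihood sums invented purely for illustration:

    # Total description length under the crude assumptions above; the two
    # log likelihood sums passed in below are made up, not fitted to data.
    def total_bits(N, n, P, k, sum_log2_likelihood):
        return N * P * n - sum_log2_likelihood + k * P * (n * n + 3 * n + 2) / 2

    bits_a = total_bits(N=1000, n=2, P=16, k=3, sum_log2_likelihood=2500.0)
    bits_b = total_bits(N=1000, n=2, P=16, k=5, sum_log2_likelihood=2650.0)
    print("prefer A" if bits_a < bits_b else "prefer B")   # prefer A

Here the two extra gaussians of model B buy 150 bits of log likelihood but cost a further 192 bits of parameter overhead, so the smaller model wins; a larger gain in likelihood would have tipped the verdict the other way.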

