
Compression for pdfs

Although the ideas are clear enough in the case of discrete distributions, it is not quite so easy to see how to use a gaussian pdf, for example, to compress a set of data. And yet in real life we always specify our real numbers to some precision, and the use of an infinite string of digits is a pretence. It simplifies some aspects of thinking about the world if we decline to say how many digits following the decimal point are going to be used, but it brings us some formidable complications in the language at a later date. (All the paradoxes of infinite sets, the existence of uncountable sets, arise from a desire to go beyond the confines of a finite language for essentially aesthetic reasons.) It is clear that to send one real number alone would require, for almost all real numbers, an infinite amount of information. So let us choose, from the beginning, some sensible bound on the precision of all the numbers we are going to transmit.

Now suppose we decide to work to one part in $2^P$ for some positive integer P. And suppose we are dealing with a set of points on the real line, in order to keep the example simple, and let us choose the five points {0.999, 0.099, 0.009, 0.001, 0.000}.

Suppose we decide to send these points to three decimal places; each place is a digit, and since there are ten possible digits, each takes four bits to code. So there are (5)(4)(3) = 60 bits. If we found a more efficient coding of the digits, we would be able to get $\log_2(10) = 3.32193$ bits per digit, giving 49.8289 as the total number of bits. Can we do better with an improved model which takes into account the fact that most of the density is near 0?
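
Just to keep the arithmetic honest, here is a small Python sketch reproducing these two baseline figures; the variable names are my own and nothing here is part of the coding scheme itself.

from math import log2

points = [0.999, 0.099, 0.009, 0.001, 0.000]

# three decimal digits per point, four bits per digit
naive_bits = len(points) * 3 * 4                    # 60 bits
# a better digit code costs log2(10) bits per decimal digit
ideal_digit_bits = len(points) * 3 * log2(10)       # about 49.8289 bits

print(naive_bits, round(ideal_digit_bits, 4))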

Suppose we define a pdf $f(x)$ over the interval [0,1] by taking the height of f(x) to be

$\frac{1}{3.7}$ for $x \in (0.1, 1.0]$,
$\frac{10}{3.7}$ for $x \in (0.01, 0.10]$,
$\frac{100}{3.7}$ for $x \in (0.001, 0.01]$,
$\frac{1000}{3.7}$ for $x \in [0, 0.001]$.
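
As a quick sanity check that these heights do define a pdf, one can sum height times width over the four bands; the little Python sketch below (the band list is just a transcription of the definition) gives 1, as it should.

bands = [(0.1, 1.0, 1/3.7), (0.01, 0.1, 10/3.7),
         (0.001, 0.01, 100/3.7), (0.0, 0.001, 1000/3.7)]

# integral of f over [0,1]: sum of height times width over the bands
total = sum((hi - lo) * h for lo, hi, h in bands)
print(total)    # 1.0, up to rounding error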

Then let us divide the unit interval up into the two parts from 0 to 0.1 and from 0.1 to 1, and for each point write <.0???...> if the point is in the left tenth and <.1???...> if it is in the right 90%. The question marks are filled in either by repeating the process on the left tenth, or, once the point has been located in one of the bands on which the pdf is uniform, by straight equal-interval binary, as described below.

It is clear that repeating it three times to get <.000> locates the point as being in the interval from 0.000... to 0.001 (using decimal notation), so we have correctly specified the first three digits of 0.000. To write <.001> and stop locates a point in the interval from 0.001 to 0.00999... in decimal; strictly that is all it tells us, but with the convention (the sneaky one owned up to below) that a string which stops is read as the left-hand end of its interval, it adequately specifies the first three digits of 0.001.

We can specify then in binary the points 0.000 and 0.001 using the bit strings <.000> and <.001> respectively, thus getting two of the points in just 6 bits. The point 0.009 starts off in binary as <.001???>, which tells us that it lies between 0.001 and 0.01. Since the pdf is uniform in this region, I simply code the interval in straight, equal interval, binary. I need to divide the region into 10 intervals to get 0.009. This can be done with an extra four bits, with a little wastage. Thus 0.009 is coded as <.0011001>. So 7 bits suffice for the third point out from the origin.

0.099 is the next point, and we start off with <.01???> to say it lies within the interval from 0.01 to 0.1. This has to be divided uniformly into 100 intervals which can be done using an extra 7 bits, again using equal interval binary, giving a total of 9 bits for this point.

Finally we have to code 0.999 and this starts off <.1????> to tell us it lies in the interval from 0.1000.. to 1.0000... If we divide this interval into 1000 intervals we can specify the location to the required precision with an extra 10 bits. This gives 11 bits for the last number. The total number of bits is thus 3 + 3 + 7 + 9 + 11 = 33 bits. I have used a sneaky convention to cut down on the bit lengths, and a pessimist might throw in another few bits, but this is a saving of over 13 bits in anybody's language.

If we consider the uniform probability distribution, recalling that we quantised to a precision of three figures, we see that the uniform distribution would give each of the five points a probability of $\frac{1}{1000}$, and so the information is $5 \log_2(1000)$, which is the 49.8289 of the starting point. Using the assumed pdf we get that for the point <.000>, for example, the probability is the height of the function multiplied by the width, which I shall generously put at $\frac{1}{2000}$, and the improbability is the reciprocal of this, which is $(3.7)(\frac{2000}{1000}) = 7.4$, so the surprise or information content is $\log_2(7.4) = 2.8875$. We get the same for the string <.001>, while for the point at 0.009 we get $\log_2(74) = 6.20945$. These correspond to the 3 bit and 7 bit codings of the first three points. Continuing we get:

\begin{displaymath}
2 \log_2(7.4) + \log_2(74) + \log_2(740) + 
\log_2(7400) = 34.3692 \end{displaymath}

which is a little worse than the 33 bits our coding scheme actually realised; that is the price of the generous halving of the cell width. If we use the honest cell width of $\frac{1}{1000}$ throughout, the improbabilities become 3.7, 3.7, 37, 370 and 3700, and the total is $2\log_2(3.7) + \log_2(37) + \log_2(370) + \log_2(3700) = 29.3692$ bits, a little better than we can realise with our coding scheme, but either way it shows the principle at work.
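
For anyone who wants to check the per-point arithmetic, here is a Python sketch which computes $\log_2$ of the improbability of each of the five points under the pdf, for both choices of cell width; the helper function and its name are mine, not part of the argument.

from math import log2

bands = [(0.1, 1.0, 1/3.7), (0.01, 0.1, 10/3.7),
         (0.001, 0.01, 100/3.7), (0.0, 0.001, 1000/3.7)]
points = [0.999, 0.099, 0.009, 0.001, 0.000]

def height(x):
    # height of the pdf f at x; the point 0.000 falls in the last, densest band
    for lo, hi, h in bands:
        if lo < x <= hi:
            return h
    return 1000 / 3.7

for width in (1/2000, 1/1000):
    bits = sum(log2(1 / (height(x) * width)) for x in points)
    print(width, round(bits, 4))    # 34.3692 bits and 29.3692 bits respectively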

This, of course, all assumes that the overhead is negligible, which is hardly reasonable. On the other hand, the overhead can be specified by giving the pdf, and in transmitting the data in any other system there is some convention of coding employed. If I send it all in base 10, I ought to send also a translation of the ten digits in binary and the convention for using decimals if I want to compare like with like. I need to say something about the intended precision and the number of data points, too, so you will not need separators in the code to tell you when one stops and another starts. These issues apply to both the uniform coding and the non-uniform coding, so neglecting the overhead is unreasonable only if you take the view that there is a unique way to code real numbers and it presupposes a uniform density function.

It is plain, on a little reflection, that sending a list of N real numbers in the interval [0,1] by a process such as the above, to a precision of $2^{-P}$ for positive integers N and P, will require, if we use the uniform pdf, NP bits. If we have a pdf $f(x)$ defined on the interval and replace it with an approximation which is constant on intervals of size $2^{-P}$, and if we suppose that the model is a good representation of the distribution of the data, then the number of data points in the interval containing the value x will be about $Nf(x)2^{-P}$, and each of them can be coded using $\log_2(\frac{1}{f(x)2^{-P}})$ bits, i.e. $P + \log_2(\frac{1}{f(x)})$ bits each, so the total number of bits will be

\begin{displaymath}
\sum_{i=1}^{2^P} \frac{N}{2^P} f(i)( P + \log_2(\frac{1}{f(i)})) \end{displaymath}

Since $ \sum_{i=1}^{2^P} f(i) = 2^P $ this simplifies to:

\begin{displaymath}
NP + N\sum_{i=1}^{2^P} f(i) \log_2(\frac{1}{f(i)}) 
\frac{1}{2^P} \end{displaymath}

which in the limit for a finer and finer approximation to an integrable function f is:

\begin{displaymath}
NP + N\int_0^1 f(x)\log_2(\frac{1}{f(x)}) 
dx \end{displaymath}
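
In code, this bound is easy to estimate for any pdf on [0,1] by numerical integration. The sketch below is mine: the function name is invented, the midpoint rule is an arbitrary choice of quadrature, and $P = \log_2(1000)$ stands in for the three-decimal-place precision of the earlier example.

from math import log2

def mdl_bound_bits(f, N, P, M=100000):
    # N*P + N * integral over [0,1] of f(x)*log2(1/f(x)) dx, by the midpoint rule
    w = 1.0 / M
    entropy = 0.0
    for i in range(M):
        v = f((i + 0.5) * w)
        if v > 0:                 # a zero value contributes nothing
            entropy += v * log2(1 / v) * w
    return N * P + N * entropy

# the staircase pdf of the earlier example
staircase = lambda x: (1/3.7 if x > 0.1 else 10/3.7 if x > 0.01
                       else 100/3.7 if x > 0.001 else 1000/3.7)
print(round(mdl_bound_bits(staircase, N=5, P=log2(1000)), 2))   # about 33.68 bits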

If we compute $\int_0^1 f(x)\log_2(\frac{1}{f(x)}) 
dx$ for the pdf of the example, we get :

\begin{displaymath}
\left(\frac{1000}{3.7}\frac{1}{1000} \log_2(\frac{3.7}{1000})\right) +
\left(\frac{100}{3.7}\frac{9}{1000} \log_2(\frac{3.7}{100})\right) +
\left(\frac{10}{3.7}\frac{9}{100} \log_2(\frac{3.7}{10})\right) +
\left(\frac{1}{3.7}\frac{9}{10} \log_2(\frac{3.7}{1})\right) \end{displaymath}

which comes out as

-2.183313 -1.156945 - 0.348909 + 0.4591278 = -3.2300396

which multiplied by the number of points, 5, gives -16.1502, a fair approximation to the number of bits saved by our choice of pdf. Purists might say I have been overly generous to myself here and there, but I merely wish to illustrate the principle.
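
The band-by-band arithmetic can be checked mechanically; in the sketch below the band list is again just a transcription of the pdf, and since f is constant on each band the sum is exact.

from math import log2

bands = [(0.1, 1.0, 1/3.7), (0.01, 0.1, 10/3.7),
         (0.001, 0.01, 100/3.7), (0.0, 0.001, 1000/3.7)]

# integral of f(x) * log2(1/f(x)) dx, band by band
entropy_bits = sum((hi - lo) * h * log2(1 / h) for lo, hi, h in bands)
print(round(entropy_bits, 5), round(5 * entropy_bits, 4))   # about -3.23 and -16.15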

The formula

\begin{displaymath}
NP + N\int_0^1 f(x)\log_2(\frac{1}{f(x)}) 
dx \end{displaymath}

or equivalently

\begin{displaymath}
NP - N\int_0^1 f(x)\log_2(f(x)) dx \end{displaymath}

is of some use in estimating potential savings. It should be noted that all the savings come from the regions where the value of f(x) is greater than 1, since there the contribution to the integral is negative and so there is a saving. It is essential to remember that this is the minimum possible bit length for representing the data, on the assumption that the model is the best possible probabilistic model for the data.

Everyone will, of course, recognise the expression for the integral as defining the entropy of the pdf $f$. If we use natural logarithms instead of logarithms to base 2, we merely change the units of measurement from bits to nats (or, for some less serious writers, nits).

If instead of working over the interval [0,1] we were to work over, say, [-1,1], we would need an extra bit to determine the sign of the number. On the other hand, if we are prepared to work to the same relative precision, we can give up a bit on the location, since knowing where things are on a region of twice the size means we can go to fewer binary places. It really rather depends therefore on what we mean by the precision of specification of a point. This in turn hangs rather on why we want to send a data set in the first place.

If you ask me to measure the diameter of some coins and to send you the results, I might give results in centimetres or inches, and it is likely to matter which. So the choice of a unit is vital and I have to send this information somehow. If you received a value in centimetres of 1.823456789, it is idle to pretend that you are going to care equally about the different digits. You will care, perhaps, about the first digit, the 1, more than the second digit, unless as a result of having some crude measurements yourself you have already guessed the first will be 1. And you are likely to care less and less about succeeding digits until you don't care at all about the digits past some point. In fact you probably wouldn't believe them. If you discovered I had been using interferometric methods to get the eighth and ninth decimal places, you might think I was pretty daft, because those extra decimal places cost money and effort to obtain and normally do not carry a good deal of information, in the sense that you probably don't care too much about their values. Let us assume that you really do need the answers correct to one part in $2^P$. If some third party were to choose units which were half the size of mine, he would be producing answers which were numerically twice as large, and in order to give the same relative precision, he would need one less binary place than me. My interval of points will be, say, between 0 and 1, his will be between 0 and 2, and if our intervals are divided up into $2^P$ subintervals, his will be twice as big as mine to convey the same amount of information. This has to be taken into account when specifying a pdf $f$, because part of the specification is the domain space.

Suppose I consider a set of points given by a triangular pdf on the interval [0,1],

\begin{displaymath}
f(x) = 4x \ if\ 0 \leq x \leq 0.5; f(x) = 
4-4x \ if\ 0.5 \leq x \leq 1 \end{displaymath}

Suppose we choose to specify N points each to P binary places. I shall measure the information in nats to make the integration simpler, where 1 bit = $\ln(2) \approx 0.6931472$ nats.

Then the best possible compression on this model is

\begin{displaymath}
NP\ln(2) - N\int_0^{\frac{1}{2}} 4x \ln(4x) 
dx -N \int_{\frac{1}{2}}^1(4-4x)\ln(4-4x) dx \end{displaymath}

nats.

This becomes

\begin{displaymath}
NP\ln(2) - N\frac{2}{4} \left[ \frac{u^2}{2} 
\ln(u) - \frac{u^2}{4}\right]_{u=0}^{u=2} \end{displaymath}

which is $NP\ln(2) + \frac{1}{2}N(1-2\ln(2))$ nats: a saving of $\ln(2)-\frac{1}{2} \approx 0.19$ nats, or about 0.28 bits, per point, which is rather small.
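
A crude numerical check of this, in Python; the quadrature and the constant 100000 are arbitrary choices of mine.

from math import log

# triangular pdf on [0,1]: f(x) = 4x on [0, 0.5], 4 - 4x on [0.5, 1]
def f(x):
    return 4 * x if x <= 0.5 else 4 - 4 * x

# midpoint-rule estimate of the integral of f(x)*ln(f(x)) dx, in nats
M = 100000
integral_nats = sum(f((i + 0.5) / M) * log(f((i + 0.5) / M)) for i in range(M)) / M

print(round(integral_nats, 5))                  # about 0.19315
print(round(log(2) - 0.5, 5))                   # the exact value ln(2) - 1/2
print(round(integral_nats / log(2), 4))         # about 0.28 bits saved per point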

Now suppose the measurement units are changed so that the interval is halved. The first term, the cost of specifying the data to the same physical precision, increases to $N(P+1)\ln(2)$ nats. The function f(x) now becomes

\begin{displaymath}
f(x) = 16x \ if\ 0 \leq x \leq 0.25; f(x) 
= 8-16x \ if\ 0.25 \leq x \leq 0.5 \end{displaymath}

This still has to have integral 1 over an interval which is half the size. The total description length then comes out to be $N(P+1)\ln(2) + N\frac{1}{2}(1-2\ln(4))$ nats, which is the same $NP\ln(2) + \frac{1}{2}N(1-2\ln(2))$ nats as before. So we have lost, for every point, one bit on the first term, but gained it back on the second, giving no net change. It is easy to persuade yourself, by a generalisation of this reasoning, that linear scaling does not change the compression resulting from using the model f, which is moderately comforting.
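
The scale invariance can also be checked numerically: the sketch below computes the two description lengths for an arbitrary choice of N and P (the numbers 100 and 10 mean nothing in particular) and finds them equal up to quadrature error.

from math import log

def entropy_term_nats(f, a, b, M=200000):
    # midpoint-rule estimate of the integral of f(x)*ln(f(x)) over [a, b]
    w = (b - a) / M
    return sum(f(a + (i + 0.5) * w) * log(f(a + (i + 0.5) * w)) for i in range(M)) * w

N, P = 100, 10
tri  = lambda x: 4 * x if x <= 0.5 else 4 - 4 * x       # pdf on [0, 1]
tri2 = lambda x: 16 * x if x <= 0.25 else 8 - 16 * x    # the same model after halving the interval

len_1 = N * P * log(2)       - N * entropy_term_nats(tri,  0.0, 1.0)
len_2 = N * (P + 1) * log(2) - N * entropy_term_nats(tri2, 0.0, 0.5)
print(round(len_1, 3), round(len_2, 3))                 # equal, up to quadrature error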

It follows that for a model defined over any bounded interval [a,b] of $\mathbb{R}$, it is safe to use the formula

\begin{displaymath}
NP + N\int_0^1 f(x)\log_2(\frac{1}{f(x)}) 
dx \end{displaymath}

as an estimate of a bound on the compression attainable by using model f, provided we simply change the limits of the integral and bear in mind the implicit conditions on the precision with which we expect to specify the data points.

Intuitively, one would expect that there is always a saving, the worst case being when the function f is constant, when the saving is clearly zero. In other words, reflecting on the function f as distorting the density of points in the space, and on coding this information suitably as a way of expressing the data more efficiently, leads one to the belief that (once the interval has been rescaled to [0,1], as the scaling argument above allows) $\int_0^1 f(x)\log(f(x)) dx$ will be non-negative for any function f such that $\int_0^1 f(x)dx = 1$, and for any logarithm base. This is indeed the case, and I shall now prove it.

Proposition

If data is generated in a bounded interval with density given by a pdf f and if the pdf is used to code the data, then for sufficiently large amounts of data f may be used to compress it except when the distribution is uniform.

Proof By the earlier observations, without loss of generality, we may suppose the interval is [0,1]. First, if the interval [0,1] is divided up into M equal subintervals and used to represent a finite probability distribution $\{ p(i): 1 \leq i \leq M\}$, it is easy to see (by a Lagrange multiplier argument, or from Jensen's inequality) that the expression $ -\sum_{i=1}^M p(i) \log p(i) $ is maximised when $\forall i, p(i) = \frac{1}{M}$. It has a value in this case of $\log(M)$. It may be minimised by making any one of the p(i) 1 and the rest zero, when the minimum value of 0 is attained. If we transfer this to a function f on the unit interval, we have the same result for $f(i) \frac{1}{M} = p(i)$. Thus the value of

$ \sum p(i) \log(p(i))$ lies between -log(M) and 0:

\begin{displaymath}
-\log(M) \leq \sum_{i=1}^M p(i) \log p(i) \leq 
0 \end{displaymath}

and since we can model this by the piecewise constant function f with $p(i) = f(i)\frac{1}{M}$,

\begin{displaymath}
-\log(M) \leq \sum_{i=1}^M f(i)\log(\frac{f(i)}{M}) 
\frac{1}{M} \leq 0 \end{displaymath}

which is to say

\begin{displaymath}
-\log(M) \leq \sum_{i=1}^M f(i)\log(f(i)) \frac{1}{M} 
- \sum_{i=1}^M f(i)\log(M)\frac{1}{M} \leq 0 \end{displaymath}

which becomes

\begin{displaymath}
-\log(M) \leq \sum_{i=1}^M (f(i)\log(f(i))) 
\frac{1}{M} - \log(M) \leq 0 \end{displaymath}

i.e.

\begin{displaymath}
0 \leq \sum_{i=1}^M (f(i)\log(f(i))) \frac{1}{M} 
 \leq \log(M) \end{displaymath}

And in the limit, when we replace the sum by an integral and the $\frac{1}{M}$ by dx we get

\begin{displaymath}
0 \leq \int_0^1 f(x)\log(f(x))dx < \infty\end{displaymath}

The zero value is attained only for the uniform distribution. $\Box$
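
The discretised inequality at the heart of the proof is easily illustrated numerically; the three pdfs below are merely examples of my own choosing.

from math import log

M = 1000
width = 1 / M

def discretised_sum(f):
    # sum over the M subintervals of f(i) * log(f(i)) * (1/M), evaluating f at midpoints;
    # subintervals where f is zero contribute nothing
    vals = [f((i + 0.5) * width) for i in range(M)]
    return sum(v * log(v) * width for v in vals if v > 0)

uniform  = lambda x: 1.0
triangle = lambda x: 4 * x if x <= 0.5 else 4 - 4 * x
peaked   = lambda x: 100.0 if x < 0.01 else 0.0          # all the mass piled near 0

for f in (uniform, triangle, peaked):
    s = discretised_sum(f)
    print(round(s, 4), 0 <= s <= log(M) + 1e-9)          # non-negative and at most log(M)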

The cheerful, and some would say irresponsible, treatment of limits will give a severe case of the uglies to analysts, who may divert themselves by showing at length how it should be done.

It is easy to see that the value of $\int_0^1 f(x)\log(f(x))dx$ can be made arbitrarily large by concentrating f at some point. The wonderful Dirac delta function would allow an infinite amount of compression, if it only existed. Of course, if we agree to consider finite precision in the specification of the data, it makes little sense to worry overmuch about such things as limits or Dirac delta functions; they are accidents of the language to which we have become committed for historical reasons that no longer apply. An obsession with adventitious aspects of the linguistic apparatus makes sense only for lunatic Platonists. [*]

It is plain that you never get the total number of bits reduced below zero, but it nevertheless follows that any pdf on a bounded interval which is not constant does some good in allowing compression of a data set described closely by the pdf. Moreover, any reasonable scheme for compressing a set of real numbers in a bounded interval into binary strings, which makes sense before you know what the numbers are, must assign binary strings differentially to intervals and hence determines a pdf.

If we want to work over the whole real line, the situation is somewhat more complicated. For points that are a long way from the origin, the number of bits necessary to get just the integer part goes up without bound. This, of course, means that it makes sense to shift the origin to where the points are and store the shift; this amounts to subtracting off the centroid and this obviously gets you some compression, although it is less than clear what model this corresponds to.

It is easy to compute $ \int_{-\infty}^{\infty} f(x)\ln(\frac{1}{f(x)}) dx $ for the case where f(x) is the gaussian or normal function $g_{[0,\sigma]}(x)$ and obtain $\ln(\sigma\sqrt{2\pi e})$. This is negative when $\sigma\sqrt{2\pi e} < 1$, i.e. when $ \sigma < \frac{1}{\sqrt{2\pi e}}$. The conclusion is that for fat gaussians there is no advantage to be gained over just sending the data, and a fat gaussian is one where $\sigma > \frac{1}{\sqrt{2 \pi e}} \approx 0.2419707$. This is quite different from the case where the pdf has compact support. It is easy to persuade yourself that the trouble is that unbounded intervals provide you, for any given size of data set, with some residual probability of the occurrence of a datum that nobody would believe for a minute belongs in with the rest of the data set. The pragmatic treatment of outliers has almost always been to pretend they didn't occur, but rationales for this conduct tend to lack something.

A gaussian pdf for the height of adult males fits the data quite well over several standard deviations, but the total number of adult males of negative height will always be zero, no matter how large the population. The reduction in overhead that comes from specifying f by a simple formula therefore has some cost in compression.

In practice, one can cheat quite sensibly by allowing the same number of bits for coding every data point past some distance from the origin, which means that for points which are very far from the origin there may be too few bits to specify them at all: a sort of overflow condition.

This is equivalent to making a decision simply to ignore some data if it is too far from the rest. Alternatively, I can decide on the precision I need by looking at the biggest distance from the origin of any of the data points, or I can have different precisions for different points, in which case I need a pdf on $\mathbb{R}$ which falls off according to the number of bits needed to specify the integer part. This may be done rather neatly: see Rissanen's universal pdf on $\mathbb{R}$.

The fact that we get a sensible scale invariance for finite precision on compact intervals and no such result when the precision is allowed to become infinite and the domain of the pdf unbounded, suggests that the simplifications which we introduced into our lives many years ago in order to have very general theorems (of a rather impractical sort)[*] may have been a horrible mistake. Maybe we should never have messed with infinite sets at all. Certainly we might argue for finite precision results, since nobody except a lunatic or a platonist philosopher thinks he is going to measure anything exactly with a ruler or any other sort of meter. And there is something metaphysical about unbounded pdfs in that no finite amount of data is going to demand them, and finite data is all we'll ever get.

My solution to the problem is going to be to choose some number k of standard deviations of a gaussian and use it to compress data within k standard deviations of the centre. If I ever get a datum outside my limit of k standard deviations, I shall simply ignore it. I shall choose k so that I get some compression from using the gaussian. How much compression I require will depend on how sober I am and other such matters contingent on the problem details.
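
To get a feel for what the truncation does to the entropy term, one can compute it for a gaussian restricted to k standard deviations of the centre and renormalised. The sketch below is only an illustration under assumptions of my own (the function name, $\sigma = 1$, and the cut-offs chosen); the point is that the entropy term shrinks somewhat as the cut-off tightens, and, exactly as for the full gaussian, whether the model compresses at all depends on the size of $\sigma$ in the chosen units.

from math import erf, exp, log2, pi, sqrt

def truncated_gaussian_entropy_bits(sigma, k, M=200000):
    # entropy, in bits, of a zero-mean gaussian restricted to [-k*sigma, k*sigma]
    # and renormalised to integrate to 1 there
    mass = erf(k / sqrt(2))          # probability the gaussian puts within k standard deviations
    a, b = -k * sigma, k * sigma
    w = (b - a) / M
    total = 0.0
    for i in range(M):
        x = a + (i + 0.5) * w
        f = exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi)) / mass
        total += f * log2(1 / f) * w
    return total

for k in (1, 2, 3):
    print(k, round(truncated_gaussian_entropy_bits(1.0, k), 4))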

This is ${\cal SIN}$ to platonist philosophers, and may give them conniption fits, which is good for them and stimulates their livers.

The extension to higher dimensions than one is essentially straightforward; in dimension n you need to send n numbers instead of just 1. The reader is invited to decide whether it is ever worth storing multivariate gaussians. Note that provided the entropy of the pdf is negative, there is some number of points at which the saving beats the overhead of sending the pdf itself, an overhead which we have not, so far, considered.


Mike Alder
9/19/1997