Although the ideas are clear enough in the case of discrete distributions, it is not quite so easy to see how to use a gaussian pdf, for example, to compress a set of data. And yet in real life we always specify our real numbers to some precision, and the use of an infinite string of digits is a pretence. It simplifies some aspects of thinking about the world if we decline to say how many digits following the decimal point are going to be used, but it brings us some formidable complications in the language at a later date. (All the paradoxes of infinite sets, the existence of uncountable sets, arise from a desire to go beyond the confines of a finite language for essentially aesthetic reasons.) It is clear that to send one real number alone would require, for almost all real numbers, an infinite amount of information. So let us choose, from the beginning, some sensible bound on the precision of all the numbers we are going to transmit.
Now suppose we decide to work to one part in $2^P$ for some positive integer P. And suppose we are dealing with a set of points on the real line, in order to keep the example simple, and let us choose the five points {0.999, 0.099, 0.009, 0.001, 0.000}.
Suppose we decide to send these points to three decimal places; each place is a digit which takes four bits to code, since there are ten of them. So there are (5)(4)(3) = 60 bits. If we found a more efficient coding of the digits, we would be able to get $\log_2(10) \approx 3.3219$ bits per digit, giving 49.8289 as the total number of bits. Can we do better with an improved model which takes into account the fact that most of the density is near 0?
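(For the record, the two counts just quoted are
\begin{displaymath}
5\times 3\times 4 = 60\ \mbox{bits} \qquad\mbox{and}\qquad 5\times 3\times\log_2 10 \approx 49.8289\ \mbox{bits},
\end{displaymath}
and it is the second figure that any improved model has to beat.)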
Suppose we define a pdf, f, over the interval [0,1] by having the height of f(x) as $\frac{10000}{37}$ for $0 \leq x \leq 0.001$, $\frac{1000}{37}$ for $0.001 < x \leq 0.01$, $\frac{100}{37}$ for $0.01 < x \leq 0.1$, and $\frac{10}{37}$ for $0.1 < x \leq 1$, so that each step towards the origin multiplies the height by ten.
Then let us divide the unit interval up into the two parts from 0 to 0.1 and from 0.1 to 1, and for each point, write <.0??????> if the point is in the left tenth and <.1???????> if it is in the right 90%. The question marks have to be filled in by repeating the process a few times.
It is clear that repeating it three times to get <.000> locates the point as being in the interval from 0.00000.. to 0.001 (using decimal notation) so we have correctly specified the first three digits of 0.000. To write <.001> specifies a number in the interval from 0.001000... to 0.001999.. in decimal, which adequately specifies the first three digits of 0.001.
We can then specify in binary the points 0.000 and 0.001 using the bit strings <.000> and <.001> respectively, thus getting two of the points in just 6 bits. The point 0.009 starts off in binary as <.001???>, which tells us that it lies between 0.001 and 0.01. Since the pdf is uniform in this region, I simply code the interval in straight, equal interval, binary. I need to divide the region into 10 intervals to get 0.009. This can be done with an extra four bits, with a little wastage. Thus 0.009 is coded as <.0011001>. So 7 bits suffice for the third point out from the origin.
0.099 is the next point, and we start off with <.01???> to say it lies within the interval from 0.01 to 0.1. This has to be divided uniformly into 100 intervals which can be done using an extra 7 bits, again using equal interval binary, giving a total of 9 bits for this point.
Finally we have to code 0.999 and this starts off <.1????> to tell us it lies in the interval from 0.1000.. to 1.0000... If we divide this interval into 1000 intervals we can specify the location to the required precision with an extra 10 bits. This gives 11 bits for the last number. The total number of bits is thus 3 + 3 + 7 + 9 + 11 = 33 bits. I have used a sneaky convention to cut down on the bit lengths, and a pessimist might throw in another few bits, but this is a saving of over 13 bits in anybody's language.
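For the concrete-minded, here is the bookkeeping in a few lines of Python. The function is my own sketch, and being literal-minded about the code for 0.001 it spends 37 bits rather than the sneaky 33, still well under the 50 or so needed by the uniform coding.

import math

points = [0.999, 0.099, 0.009, 0.001, 0.000]

def code_length(k, cells=1000):
    """Bits used for the point occupying cell k when [0,1) is cut into `cells`
    equal cells: emit one bit per split (0 = left tenth, 1 = right 90%), then
    code the position inside a right-90% region in equal-interval binary."""
    prefix = 0
    while cells > 1:
        prefix += 1
        if k < cells // 10:                   # left tenth: keep splitting
            cells //= 10
        else:                                 # right 90%: uniform refinement
            return prefix + math.ceil(math.log2(cells - cells // 10))
    return prefix                             # a single cell is left: done

lengths = [code_length(round(x * 1000)) for x in points]
print(lengths, sum(lengths))                  # [11, 9, 7, 7, 3] 37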
If we consider the uniform probability distribution, recalling that we quantised to a precision of three figures, we see that the uniform distribution would give each of the five points a probability of $\frac{1}{1000}$, and so the information is $5\log_2(1000)$ bits, which is the 49.8289 bits of the starting point. Using the assumed pdf, we get that for the point <.000>, for example, the probability is the height of the function multiplied by the width, which I shall generously put at $\frac{10000}{37}\times\frac{1}{1000} = \frac{10}{37}$, and the improbability is the reciprocal of this, which is 3.7, so the surprise or information content is $\log_2(3.7)$, about 1.9 bits. We get the same for the string <.001>, while for the point at 0.009 we get a probability of $\frac{1}{37}$, an improbability of 37, and so $\log_2(37)$, about 5.2 bits. These correspond, roughly, to the 3 bit and 7 bit codings of the first three points. Continuing we get:
$\log_2(370) \approx 8.5$ bits for the point 0.099 and $\log_2(3700) \approx 11.9$ bits for 0.999, to be set against the 9 bit and 11 bit codings found above.
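The same surprise values can be checked mechanically; the helper below is mine, and it follows the convention above of taking the bottom cell to be closed at 0.001.

import math

def height(x):
    """Height of the example pdf at x (bottom cell taken as closed at 0.001)."""
    if x <= 0.001: return 10000 / 37
    if x <= 0.01:  return 1000 / 37
    if x <= 0.1:   return 100 / 37
    return 10 / 37

for x in [0.000, 0.001, 0.009, 0.099, 0.999]:
    print(x, round(-math.log2(height(x) * 0.001), 2))
# about 1.9, 1.9, 5.2, 8.5, 11.9 bits against the 3, 3, 7, 9, 11 bit code words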
This, of course, all assumes that the overhead is negligible, which is hardly reasonable. On the other hand, the overhead can be specified by giving the pdf, and in transmitting the data in any other system there is some convention of coding employed. If I send it all in base 10, I ought to send also a translation of the ten digits in binary and the convention for using decimals if I want to compare like with like. I need to say something about the intended precision and the number of data points, too, so you will not need separators in the code to tell you when one stops and another starts. These issues apply to both the uniform coding and the non-uniform coding, so neglecting the overhead is unreasonable only if you take the view that there is a unique way to code real numbers and it presupposes a uniform density function.
It is plain, on a little reflection, that sending a list of N real numbers in the interval [0,1] by a process such as the above to a precision of $2^{-P}$, for positive integers N and P, will require, if we use the uniform pdf, NP bits. If we have a pdf, f, defined on the interval [0,1] and replace it with an approximation uniform on intervals of size $2^{-P}$, and if we suppose that the model is a good representation of the distribution of the data, then in any such interval containing the number x, the number of data points in that interval will be $Nf(x)2^{-P}$, and they will be able to be coded using $-\log_2\!\left(f(x)2^{-P}\right)$ bits each, i.e. $P - \log_2 f(x)$ each, so the total number of bits will be
\begin{displaymath}
\sum N\,f(x)\,2^{-P}\bigl(P - \log_2 f(x)\bigr) \;=\; NP - N\sum f(x)\log_2\bigl(f(x)\bigr)2^{-P}
\end{displaymath}
(the sum being taken over the intervals, whose masses $f(x)2^{-P}$ add up to 1), which in the limit for a finer and finer approximation to an integrable function f is:
\begin{displaymath}
NP - N\int_0^1 f(x)\log_2 f(x)\,dx.
\end{displaymath}
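As a numerical sanity check on that limit, take a density not used elsewhere in this section, say $f(x) = 3x^2$ on [0,1] (my own choice), cut [0,1] into $2^P$ cells, and compare the cell-by-cell expected code length with $P - \int_0^1 f\log_2 f\,dx$; the two figures agree more and more closely as P grows.

import math

def expected_bits(P, f=lambda x: 3 * x * x):
    """Expected ideal code length per point when [0,1] is cut into 2^P cells."""
    cells, dx, total = 2 ** P, 1.0 / 2 ** P, 0.0
    for i in range(cells):
        p = f((i + 0.5) * dx) * dx            # probability mass of the i-th cell
        total += p * -math.log2(p)            # ideal bits for that cell, weighted
    return total

analytic = math.log2(3) - 2 / (3 * math.log(2))   # integral of 3x^2 log2(3x^2) dx
for P in (6, 10, 14):
    print(P, round(expected_bits(P), 4), round(P - analytic, 4))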
If we compute $\int_0^1 f(x)\log_2\!\left(\frac{1}{f(x)}\right)dx$ for the pdf of the example, we get:
\begin{displaymath}
\frac{10}{37}\log_2\frac{37}{10000} + \frac{9}{37}\log_2\frac{37}{1000} + \frac{9}{37}\log_2\frac{37}{100} + \frac{9}{37}\log_2\frac{37}{10}
= -2.183313 - 1.156945 - 0.348909 + 0.4591278 = -3.2300396
\end{displaymath}
which multiplied by the number of points, 5, gives -16.1502, whose magnitude is a fair approximation to the number of bits saved by our choice of pdf. Purists might say I have been overly generous to myself here and there, but I merely wish to illustrate the principle. The formula
\begin{displaymath}
NP - N\int_0^1 f(x)\log_2 f(x)\,dx
\end{displaymath}
is therefore the thing to look at: it is the number of bits needed to code N data points to precision $2^{-P}$ when the pdf f is used to do the coding.
Everyone will, of course, recognise the expression for the integral as defining the entropy of the pdf f. If we use natural logarithms instead of logarithms to base 2, we merely change the units of measurement from bits to nats (or, for some less serious writers, nits).
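A few lines of Python confirm the arithmetic for the example pdf; the script and its variable names are mine.

import math

heights = [10000 / 37, 1000 / 37, 100 / 37, 10 / 37]   # the example pdf, decade by decade
widths  = [0.001, 0.009, 0.09, 0.9]

entropy_bits = sum(-w * h * math.log2(h) for h, w in zip(heights, widths))
print(round(entropy_bits, 4))        # about -3.23 bits: the entropy of f
print(round(5 * entropy_bits, 3))    # about -16.15, roughly the 16.8 bits actually saved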
If instead of working over the interval [0,1] we were to work over, say, [-1,1], we would need an extra bit to determine the sign of the number. On the other hand, if we are prepared to work to the same relative precision, we can give up a bit on the location, since knowing where things are on a region of twice the size means we can go to fewer binary places. It really rather depends therefore on what we mean by the precision of specification of a point. This in turn hangs rather on why we want to send a data set in the first place.
If you ask me to measure the diameter of some coins and to send you the results, I might give results in centimetres or inches, and it is likely to matter which. So the choice of a unit is vital and I have to send this information somehow. If you received a value in centimetres of 1.823456789, it is idle to pretend that you are going to care equally about the different digits. You will care, perhaps, about the first digit, the 1, more than the second digit, unless as a result of having some crude measurements yourself you have already guessed the first will be 1. And you are likely to care less and less about succeeding digits until you don't care at all about the digits past some point. In fact you probably wouldn't believe them. If you discovered I had been using interferometric methods to get the eighth and ninth decimal places, you might think I was pretty daft, because those extra decimal places cost money and effort to obtain and normally do not carry a good deal of information, in the sense that you probably don't care too much about the values. Let us assume that you really do need the answers correct to one part in $2^P$. If some third party were to choose units which were half the size of mine, he would be producing answers which were numerically twice as large, and in order to give the same precision relatively, he would need one less binary place than me. My interval of points will be, say, between 0 and 1; his will be between 0 and 2, and if our intervals are divided up into $2^P$ subintervals, his will be twice as big as mine to convey the same amount of information. This has to be taken into account when specifying a pdf, f, because part of the specification is the domain space.
Suppose I consider a set of points given by a triangular pdf on the interval [0,1],
\begin{displaymath}
f(x) = 2x, \qquad 0 \leq x \leq 1.
\end{displaymath}
Then the best possible compression on this model gives a description length, in nats, of
\begin{displaymath}
NP\ln 2 - N\int_0^1 2x\ln(2x)\,dx.
\end{displaymath}
This becomes
\begin{displaymath}
NP\ln(2) - N\frac{2}{4} \left[ \frac{u^2}{2}\ln(u) - \frac{u^2}{4}\right]_{u=0}^{u=2}
\end{displaymath}
on substituting $u = 2x$.
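Evaluating the bracket gives the saving explicitly:
\begin{displaymath}
\int_0^1 2x\ln(2x)\,dx = \frac{1}{2}\left(2\ln 2 - 1\right) = \ln 2 - \frac{1}{2} \approx 0.193\ \mbox{nats per point},
\end{displaymath}
so the description length on this model comes to about $NP\ln 2 - 0.193\,N$ nats.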
Now suppose the measurement units are changed so that the interval is halved. The required precision to specify the data is increased to $(P+1)\ln 2$ nats per point. The function f(x) now becomes
\begin{displaymath}
f(x) = 8x, \qquad 0 \leq x \leq \tfrac{1}{2}.
\end{displaymath}
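The extra binary place costs an extra $\ln 2$ nats per point, but the new density repays it exactly:
\begin{displaymath}
\int_0^{1/2} 8x\ln(8x)\,dx = 2\ln 2 - \frac{1}{2},
\end{displaymath}
which exceeds $\ln 2 - \frac{1}{2}$ by precisely $\ln 2$, so that $N(P+1)\ln 2 - N\left(2\ln 2 - \frac{1}{2}\right) = NP\ln 2 - N\left(\ln 2 - \frac{1}{2}\right)$ and the total description length is unchanged.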
It follows that for a model defined over any bounded interval [a,b] of the real line, it is safe to use the formula
\begin{displaymath}
NP\ln 2 - N\int_a^b f(x)\ln f(x)\,dx
\end{displaymath}
for the description length in nats.
Intuitively, one would expect that there would always be a saving, and that the worst case is when the function f is constant, in which case the saving is clearly zero. In other words, reflecting on the function f as distorting the density of points in the space, and on coding this information suitably as a way of expressing the data more efficiently, leads one to the belief that $\int_0^1 f(x)\log(f(x))\,dx$ will be non-negative for any function f such that $f \geq 0$ and $\int_0^1 f(x)\,dx = 1$, and for any logarithm base. This is indeed the case, and I shall now prove it.
Proposition
If data is generated in a bounded interval with density given by a pdf f and if the pdf is used to code the data, then for sufficiently large amounts of data f may be used to compress it except when the distribution is uniform.
Proof
By the earlier observations, without loss of generality, we may suppose the interval is [0,1]. First, if the interval [0,1] is divided up into M equal subintervals and used to represent a finite probability distribution $(p(1), p(2), \ldots, p(M))$, it is easy to see that the expression $-\sum_{i=1}^{M} p(i)\log(p(i))$ is maximised when every $p(i) = \frac{1}{M}$. It has a value in this case of $\log(M)$. It may be minimised by making any one of the p(i) 1 and the rest zero, when the minimum value of 0 is attained. If we transfer this to a function f on the unit interval, constant on each subinterval and with $p(i) = f(x_i)\frac{1}{M}$, we have the same result for $-\sum_{i=1}^{M} f(x_i)\frac{1}{M}\log\!\left(f(x_i)\frac{1}{M}\right)$. Thus the value of $\sum_{i=1}^{M} f(x_i)\frac{1}{M}\log\!\left(f(x_i)\frac{1}{M}\right)$ lies between $-\log(M)$ and 0:
\begin{displaymath}
-\log(M) \;\leq\; \sum_{i=1}^{M} f(x_i)\frac{1}{M}\log\!\left(f(x_i)\frac{1}{M}\right) \;\leq\; 0
\end{displaymath}
which becomes, on expanding the logarithm and using $\sum_{i=1}^{M} f(x_i)\frac{1}{M} = 1$,
\begin{displaymath}
0 \;\leq\; \sum_{i=1}^{M} f(x_i)\frac{1}{M}\log\bigl(f(x_i)\bigr) \;\leq\; \log(M).
\end{displaymath}
And in the limit, when we replace the sum by an integral and the $\frac{1}{M}$ by dx, we get
\begin{displaymath}
0 \;\leq\; \int_0^1 f(x)\log(f(x))\,dx.
\end{displaymath}
The zero value is attained only for the uniform distribution. $\Box$
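A numerical spot-check of the inequality, on randomly generated piecewise-constant pdfs, is entirely illustrative but reassuring.

import math, random

random.seed(0)
M = 50
for _ in range(5):
    raw = [random.uniform(0.1, 2.0) for _ in range(M)]
    total = sum(raw)
    f = [M * r / total for r in raw]          # heights of a pdf constant on M cells
    value = sum(fi / M * math.log2(fi) for fi in f)
    print(round(value, 4))                    # never negative; 0 only for constant f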
The cheerful, and some would say irresponsible, treatment of limits will give a severe case of the uglies to analysts, who may divert themselves by showing at length how it should be done.
It is easy to see that the value of $\int_0^1 f(x)\log(f(x))\,dx$ can be made arbitrarily large by concentrating f at some point. The wonderful Dirac delta function would allow an infinite amount of compression, if it only existed. Of course, if we agree to consider finite precision in the specification of the data, it makes little sense to worry overmuch about such things as limits or Dirac delta functions; they are accidents of the language to which we have become committed for historical reasons that no longer apply. An obsession with adventitious aspects of the linguistic apparatus makes sense only for lunatic Platonists.
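A concrete instance, with a density of my own choosing: squeeze the mass into an interval of width $\epsilon$,
\begin{displaymath}
f_\epsilon(x) = \frac{1}{\epsilon}\ \mbox{on}\ [0,\epsilon],\ 0\ \mbox{elsewhere}
\quad\Longrightarrow\quad
\int_0^1 f_\epsilon(x)\log_2 f_\epsilon(x)\,dx = \log_2\frac{1}{\epsilon} \longrightarrow \infty\ \mbox{as}\ \epsilon \to 0.
\end{displaymath}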
It is plain that you never get the total number of bits reduced below zero, but it nevertheless follows that any pdf on a bounded interval which is not constant does some good in allowing compression of a data set described closely by the pdf. Moreover, any reasonable scheme for compressing a set of real numbers in a bounded interval into binary strings, which makes sense before you know what the numbers are, must assign binary strings differentially to intervals and hence determines a pdf.
If we want to work over the whole real line, the situation is somewhat more complicated. For points that are a long way from the origin, the number of bits necessary to get just the integer part goes up without bound. This, of course, means that it makes sense to shift the origin to where the points are and store the shift; this amounts to subtracting off the centroid and this obviously gets you some compression, although it is less than clear what model this corresponds to.
It is easy to compute $\int f(x)\ln f(x)\,dx$ for the case where f(x) is the gaussian or normal function with mean $\mu$ and standard deviation $\sigma$, and obtain $-\ln\!\left(\sigma\sqrt{2\pi e}\right)$. This is negative when $\sigma\sqrt{2\pi e} > 1$, i.e. when $\sigma > \frac{1}{\sqrt{2\pi e}} \approx 0.242$. The conclusion is that for fat gaussians there is no advantage to be gained over just sending the data, and a fat gaussian is one where $\sigma > \frac{1}{\sqrt{2\pi e}}$.
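The computation itself is short, using $\int f\,dx = 1$ and $\int (x-\mu)^2 f\,dx = \sigma^2$:
\begin{displaymath}
\int f\ln f\,dx = \int f(x)\left(-\ln\!\left(\sigma\sqrt{2\pi}\right) - \frac{(x-\mu)^2}{2\sigma^2}\right)dx
= -\ln\!\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2} = -\ln\!\left(\sigma\sqrt{2\pi e}\right).
\end{displaymath}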
This is quite different from the case where
the pdf has compact support. It
is easy to persuade yourself that the trouble
is that unbounded intervals provide you, for
any
given size of data set, some residual probability
of occurrence of a datum that nobody would
believe for a minute belongs in with the rest
of the data set. The pragmatic treatment of outliers has almost always been to pretend they didn't occur, but rationales for this conduct tend to lack something.
A gaussian pdf for the height of adult males fits the data quite well over several standard deviations, but the total number of male adults of negative height will always be zero, no matter how large the population. The reduction of the overhead in specifying f in a simple formula will have a cost in compression.
In practice, one can cheat quite sensibly by allotting the same number of bits for coding every data point past some distance from the origin, which means that for points which are very far from the origin there may be too few bits to specify them at all: a sort of overflow condition. This is equivalent to making a decision to simply ignore some data if it is too far from the rest. Alternatively, I can decide on the precision I need by looking at the biggest distance from the origin of any of the data points, or I can have different precisions for different points, and in this case I need a pdf on the whole real line which falls off as the number of bits needed to specify the integer part. This may be done rather neatly: see Rissanen's universal pdf on the positive integers.
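A sketch of the code lengths involved follows; the function names are mine, and the constant $c \approx 2.865$ is the normaliser which makes the implied probabilities sum to one, as in Rissanen's 1983 construction.

import math

def log_star(n):
    """Sum of the positive iterated base-2 logarithms of n (n a positive integer)."""
    total, t = 0.0, float(n)
    while True:
        t = math.log2(t)
        if t <= 0:
            break
        total += t
    return total

C = 2.865064                        # normaliser, so that the 2^-L(n) sum to 1
def universal_bits(n):
    return math.log2(C) + log_star(n)

for n in (1, 2, 10, 1000, 10**6):
    print(n, round(universal_bits(n), 2))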
The fact that we get a sensible scale invariance
for finite precision on compact intervals and
no such
result when the precision is allowed to become
infinite and the domain of the pdf unbounded,
suggests
that the simplifications which we introduced into
our lives many years ago in order to have very
general theorems (of a rather impractical sort)
may have been a horrible mistake.
Maybe
we should never have messed with infinite sets
at all. Certainly we might argue for finite precision
results, since nobody except a lunatic or a platonist
philosopher thinks he is going to measure anything
exactly with a ruler or any other sort of
meter. And there is something metaphysical about
unbounded
pdfs in that no finite amount of data is going
to demand them, and finite data is all we'll
ever get.
My solution to the problem is going to be to choose some number N of standard deviations of a gaussian and use it to compress data within N standard deviations of the centre. If I ever get a datum outside my limit of N standard deviations, I shall simply ignore it. I shall choose N so that I get some compression from using the gaussian. How much compression I require will depend on how sober I am and other such matters contingent on the problem details.
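A minimal sketch of that policy, in which the function and parameter names are mine and the business of actually coding the surviving points is left out, might look like this:

import math, random

def trim_and_assess(data, n_sigma=3.0):
    """Drop data beyond n_sigma standard deviations of the mean, then report how
    many points survive and the nats-per-point saving -ln(sigma*sqrt(2*pi*e))
    of the gaussian refitted to the survivors."""
    m = sum(data) / len(data)
    s = math.sqrt(sum((x - m) ** 2 for x in data) / len(data))
    kept = [x for x in data if abs(x - m) <= n_sigma * s]
    m2 = sum(kept) / len(kept)
    s2 = math.sqrt(sum((x - m2) ** 2 for x in kept) / len(kept))
    saving = -math.log(s2 * math.sqrt(2 * math.pi * math.e))
    return len(kept), round(saving, 3)

random.seed(1)
data = [random.gauss(0.0, 0.1) for _ in range(1000)] + [25.0]   # one absurd outlier
print(trim_and_assess(data))        # the outlier goes, and the saving is positive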
Such a policy is anathema to platonist philosophers, and may give them conniption fits, which is good for them and stimulates their livers.
The extension to higher dimensions than one is essentially straightforward; in dimension n you need to send n numbers instead of just 1. The reader is invited to decide whether it is ever worth storing multivariate gaussians. Note that provided the entropy of the pdf is negative, there is some number of points at which the saving beats the overhead on sending the pdf, which we have not, so far, considered.
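For anyone taking up that invitation, the relevant standard fact is the entropy of an n dimensional gaussian with covariance matrix $\Sigma$:
\begin{displaymath}
-\int f\ln f = \frac{1}{2}\ln\!\left((2\pi e)^n\det\Sigma\right),
\end{displaymath}
so there is a saving only when $\det\Sigma < (2\pi e)^{-n}$, which is, once again, a statement about the units of measurement.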