Once one has taken the step of extending finite maps having values in $S^0$ to smooth maps into $[-1,1]$, the smooth maps being drawn from some manifold of such functions, the subject starts to attract the interest of Functional Analysts, a collection of Mathematicians who, having reduced much of the theory of functional approximation to linear algebra, find themselves underemployed and reduced to manufacturing rather artificial problems, those real problems that are left being much too hard. There have been a number of theorems showing that any function from $\mathbb{R}^n$ to $[-1,1]$ which is sufficiently well behaved (for example, continuous and having compact support) may be approximated arbitrarily well by three layer neural nets. Such statements are true but of very
limited practical utility. They are not offered
to the engineer because they have practical
utility, but because they are publishable, and
they are publishable because engineers who are
not familiar with a branch of mathematics are
more likely to be impressed with it than they
are
with familiar tools. It is desirable therefore
that the elements be explained in order to
reduce the reader's gullibility quotient.
Whenever one talks of approximating one thing with another, one has some kind of a notion of a distance between the things, and this distance being small. More precisely, to be told that one can approximate a function arbitrarily well by polynomials is to be told that there is a sequence of polynomials, and that the further you go down the sequence, the smaller the distance between the polynomial you choose and the function you are getting closer to. And in fact the function you are approximating has to be the limit of the sequence of polynomials; the distance between polynomials in the sequence and the function you are trying to sneak up on can be made as small as anyone wants just by going far enough down the sequence. This leads us to ask how you measure the distance between functions. Suggesting, as it does, that there is one and only one `right' way, this is actually the wrong question, but only a bit wrong.
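In symbols, the standard definition runs as follows: to say that a sequence of polynomials $(p_n)$ approximates $f$ arbitrarily well, in whatever distance $d$ between functions we have settled on, is to say that

$$\forall \varepsilon > 0 \;\; \exists N \;\; \forall n \geq N : \quad d(p_n, f) < \varepsilon,$$

that is, $\lim_{n\to\infty} d(p_n, f) = 0$.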
There are many ways of measuring the distance between functions, but two are of particular importance. The first is the so called $L_\infty$ or sup norm, which measures the distance of $f$ from $g$ by taking

$$d_\infty(f,g) = \sup_x |f(x) - g(x)|,$$

the worst case discrepancy between the two. The second is the $L_2$ norm, which measures it by taking

$$d_2(f,g) = \left( \int |f(x) - g(x)|^2 \, dx \right)^{1/2},$$

a kind of overall average discrepancy.
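For the concretely minded, both distances are easy to estimate numerically; here is a minimal sketch (the function names are mine), approximating the sup by a maximum over grid points and the integral by a Riemann sum:

```python
import numpy as np

def sup_distance(f, g, xs):
    """L-infinity distance, estimated on a grid: the worst pointwise gap."""
    return np.max(np.abs(f(xs) - g(xs)))

def l2_distance(f, g, xs):
    """L2 distance, estimated on a grid: root of the integrated squared gap."""
    dx = xs[1] - xs[0]
    return np.sqrt(np.sum((f(xs) - g(xs)) ** 2) * dx)

xs = np.linspace(0.0, 1.0, 10001)
f = np.sin
g = lambda x: x - x**3 / 6            # cubic Taylor approximation to sin
print(sup_distance(f, g, xs))         # worst case discrepancy on [0, 1]
print(l2_distance(f, g, xs))          # average-sized discrepancy on [0, 1]
```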
In either sense, it is often useful to be able
to find out what functions can be expressed as
limits of some class of functions. If, for example,
we take functions on the real line which
are pdfs, i.e. they are non-negative and have
integral 1, then it might be cheering to be
able to prove rigorously the statement I threw
off rather casually some time back, that such
a
thing can be approximated as well as you wish
by a mixture of gaussians. To do this, you need
a
careful definition of what it means to approximate a function $g$ by some family of functions $f_n$, and of course it means I can choose members of the family such that the distance $d(f_n, g)$ can be made less than any given number $\varepsilon > 0$ by choosing $n$ large enough. Thus, it is easy to prove that in the space of continuous functions from $[0,1]$ to $\mathbb{R}$ with the $L_\infty$ norm, the family of polynomials can be used to approximate any continuous function. This is a celebrated theorem of Weierstrass.
The theorem of Weierstrass must be compared with the well known Taylor expansion of a function about a point, which for the case of infinitely differentiable functions actually gives you a polynomial and bounds on the difference between the function and the approximating polynomial. Sometimes the bounds can be estimated, which makes for a really practical theorem. You can replace one, possibly horrid, function by a polynomial and work out the worst case discrepancy. The Weierstrass theorem works for a larger class of functions, but doesn't tell you how bad or good a particular approximation is. (Although there are constructive proofs which do give the information. Functional Analysts often seem quite indifferent to these practicalities, alas.)
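One such constructive proof uses Bernstein polynomials, and anyone with a computer can watch it at work; a minimal sketch (names mine) on a continuous function with a kink, where the sup-norm error shrinks only like $1/\sqrt{n}$:

```python
import numpy as np
from math import comb

def bernstein(f, n, xs):
    """Evaluate the degree-n Bernstein polynomial of f at the points xs:
    B_n(f)(x) = sum over k of f(k/n) C(n,k) x^k (1-x)^(n-k)."""
    ks = np.arange(n + 1)
    coeffs = np.array([f(k / n) for k in ks])
    binoms = np.array([comb(n, k) for k in ks], dtype=float)
    x = xs[:, None]
    return (coeffs * binoms * x**ks * (1.0 - x)**(n - ks)).sum(axis=1)

f = lambda x: np.abs(x - 0.5)      # continuous, but with a kink at 1/2
xs = np.linspace(0.0, 1.0, 2001)
for n in (10, 100, 1000):
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print(n, err)                  # error falls roughly like 1/sqrt(n)
```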
Moving away from the real line into $\mathbb{R}^n$ and even more awful generality, the generalisation of Weierstrass' theorem which appeals to Analysts is the Stone-Weierstrass Theorem. It may be used to show that if a family of continuous functions on a compact domain is an Algebra, which means that the sum of any two functions in the family is in the family, any scalar multiple of a function in the family is in the family, and the product of any two functions in the family is in the family, if there are enough functions to separate distinct points (i.e. if there are two distinct points x and y in the domain of the algebra functions then there is a function $f$ in the algebra such that $f(x) \neq f(y)$), and if, finally, the algebra contains the constant functions, then any continuous function with the same (compact) domain can be approximated as closely as you like, in the $L_\infty$ sense, by functions from the algebra. Well, a little work has to be done to bend the Stone-Weierstrass theorem to make it go, but it can be used to show that for compact subsets of $\mathbb{R}^n$, any continuous function can be approximated as closely as you like by the functions obtained using three layer neural nets with any continuous sigmoid function.
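None of this need be taken on faith; the following minimal sketch (not the Stone-Weierstrass construction, and every name in it mine) fits a three layer net to a continuous function on $[0,1]$ by choosing the hidden weights at random and solving for the output weights by linear least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_three_layer(xs, ys, m):
    """m tanh units in the hidden layer, random input weights and biases;
    only the output layer is fitted, by linear least squares."""
    w = rng.normal(scale=5.0, size=m)
    b = rng.uniform(-5.0, 5.0, size=m)
    design = lambda x: np.column_stack([np.tanh(np.outer(x, w) + b),
                                        np.ones_like(x)])
    c, *_ = np.linalg.lstsq(design(xs), ys, rcond=None)
    return lambda x: design(x) @ c

xs = np.linspace(0.0, 1.0, 400)
target = np.sin(2 * np.pi * xs) * np.exp(-xs)   # any continuous function
for m in (2, 8, 32):
    net = fit_three_layer(xs, target, m)
    print(m, np.max(np.abs(net(xs) - target)))  # discrepancy shrinks with m
```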
It is worth drawing some pictures here to see what is being said. If you take a square in $\mathbb{R}^2$, any single unit will implement a function that is just a smoothed version of the function that assigned +1 to Fred and -1 to Gladys, with a dividing line down in between them that went right across the whole plane. Fig.5.12 tries to give an incompetent artist's impression of what the function looks like, a sort of Wave Rock effect.
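Readers who prefer their own pictures to Fig.5.12 can conjure the surface up directly; a sketch, with weights chosen arbitrarily:

```python
import numpy as np
import matplotlib.pyplot as plt

# A single unit on the plane: a smoothed step, the 'Wave Rock',
# with its dividing line running right across the whole plane.
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.tanh(2.0 * x + 1.0 * y + 0.5)    # arbitrary weights and bias

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(x, y, z)
plt.show()
```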
Now putting three of these in at the first processing layer of a three layer neural net, and then following them with one output unit, means that we can get scaled multiples of translates of things like the Wave Rock. We can even turn one upside down by multiplying by a negative number. So we take the result of stretching such a thing out in the x-y plane and multiplying its height by some number, and add it to the result of doing the same thing to two other such functions. It is clear that the result must have three waves reaching right across the plane.
If the lines are parallel, we could use two of
them to make an infinitely long ridge. Four of
them could make two ridges with a sort of tower
where they cross. But the waves go all across
the plane.
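The ridge and the tower are just as easy to exhibit; a sketch, again with arbitrary weights, using the fact that the difference of two parallel waves is a ridge:

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-6, 6, 300), np.linspace(-6, 6, 300))

# Two parallel waves, subtracted: an infinitely long ridge along y.
ridge_x = np.tanh(x + 1.0) - np.tanh(x - 1.0)
# Two more at right angles: a ridge along x.
ridge_y = np.tanh(y + 1.0) - np.tanh(y - 1.0)

# Four units in all: two crossing ridges, with a tower where they meet.
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(x, y, ridge_x + ridge_y)
plt.show()
```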
Just as we politely rejected the three layer committee net for the generalised XOR data set, on the grounds that a model whose complexity goes up with the amount of data leaves something to be desired, so we should
be rather inclined to abandon the three layer
smoothed version as a suitable basis for modelling
functions, unless the behaviour whereby
there are waves which extend across the space
were a feature of the functions we wanted to
model. The analysts' claim that it can always be done is correct, but the implied claim that it is worth doing is deeply suspect. The analysts will cheerfully explain that they never said it was a good idea to actually do it; their theorems only say it can be done. The engineer will
reply that they should, in that case, publish
these results in some Pure Mathematical journal
dedicated to useless trivia. This is an argument
that can go on a long time. The moral is that
although the analysts and other pure mathematicians
are all God's creatures, and although they
often say things that are of interest, one should
read what they say carefully and very, very
literally. They are not necessarily on your side.
Of course the one dimensional case is quite useful since it can be used to give a quick proof that any pdf can be approximated by gaussians, as you can use half a gaussian turned upside down and glued to the other half as a (sort of) sigmoid.
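For anyone who finds the gluing hard to picture, here is a sketch of the construction; the difference of two translates of the glued function is a bump which is not a gaussian, but is close enough in spirit for the quick proof:

```python
import numpy as np

def glued_sigmoid(x):
    """Half a gaussian turned upside down, glued to the other half:
    increases from -1 at minus infinity to +1 at plus infinity."""
    return np.sign(x) * (1.0 - np.exp(-x**2))

xs = np.linspace(-5.0, 5.0, 11)
bump = glued_sigmoid(xs + 1.0) - glued_sigmoid(xs - 1.0)
print(bump)    # non-negative, concentrated around 0: a gaussian-ish bump
```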
The theory of Fourier Analysis and the Fourier series, whereby we build up a function out of sines and cosines, extends the Weierstrass result in a certain sense to the case where we may use the $L_2$ norm, and hence to a larger class of functions. The theory can be modified to give results for three layer nets, but such results are no more practical than the Stone-Weierstrass Theorem. The question of interest is, how fast does the discrepancy disappear with the number of units used? Contemplation of Fig.5.13 suggests that discrepancies are likely to be of the order of $1/m$, where $m$ is the number of units. This does not compare well with, say, Taylor's theorem for most analytic functions encountered in practice. Analysts are not usually interested in what functions are actually encountered in practice, and it is as well to remember this. Mathematics is a splendid servant but a thoroughly foolish master, and it is essential to let it know who is the boss.
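The $1/m$ behaviour can be checked in a simple case, at least for Fourier terms rather than units; a sketch using the classical cosine series of $|x|$ on $[-\pi, \pi]$, where the sup-norm error of the partial sum with $m$ terms settles down to about $1/(\pi m)$:

```python
import numpy as np

def partial_sum(m, xs):
    """Fourier series of |x| on [-pi, pi], truncated to m cosine terms:
    pi/2 - (4/pi) * sum over odd k of cos(k x) / k^2."""
    ks = np.arange(1, 2 * m, 2)          # the m lowest odd frequencies
    return np.pi / 2 - (4 / np.pi) * np.sum(
        np.cos(np.outer(xs, ks)) / ks**2, axis=1)

xs = np.linspace(-np.pi, np.pi, 4001)
for m in (1, 10, 100):
    err = np.max(np.abs(partial_sum(m, xs) - np.abs(xs)))
    print(m, err, err * m)   # err * m settles near 1/pi: error of order 1/m
```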
What we did with four layer nets and the hard limiter sgn function can be done with smooth sigmoids and function approximation, if you should feel a need to approximate functions. Since reality always gives you finite data sets to finite precision when you measure it, the need is seldom irresistible, although there are many circumstances when the procedure is useful.