
Mysteries of Functional Analysis

Once one has taken the step of extending finite maps having values in $S^0$ by smooth maps into $[-1,1]$, the smooth maps being drawn from some manifold of such functions, the subject starts to attract the interest of Functional Analysts, a collection of Mathematicians who, having reduced much of the theory of functional approximation to linear algebra, find themselves underemployed and reduced to manufacturing rather artificial problems, those real problems that are left being much too hard. There have been a number of theorems showing that any function from ${\mathbb R}^n$ to $[-1,1]$ which is sufficiently well behaved (for example, continuous and having compact support[*]) may be approximated arbitrarily well by three layer neural nets. Such statements are true but of very limited practical utility. They are not offered to the engineer because they have practical utility, but because they are publishable, and they are publishable because engineers who are not familiar with a branch of mathematics are more likely to be impressed with it than they are with familiar tools. It is desirable therefore that the elements be explained in order to reduce the reader's gullibility quotient.

Whenever one talks of approximating one thing with another, one has some kind of a notion of a distance between the things, and this distance being small. More precisely, to be told that one can approximate a function arbitrarily well by polynomials is to be told that there is a sequence of polynomials, and that the further you go down the sequence, the smaller the distance between the polynomial you choose and the function you are getting closer to. And in fact the function you are approximating has to be the limit of the sequence of polynomials; the distance between polynomials in the sequence and the function you are trying to sneak up on can be made as small as anyone wants just by going far enough down the sequence. This leads us to ask how you measure the distance between functions. Suggesting, as it does, that there is one and only one `right' way, this is actually the wrong question, but only a bit wrong.

There are many ways of measuring the distance between functions, but two are of particular importance. The first is the so called $L_\infty$ norm, which measures the distance of f from g by taking

\begin{displaymath}
\Vert f-g\Vert _{\infty} = \sup_x \vert f(x) - g(x)\vert
\end{displaymath}

If f and g are continuous and the domain of both is compact, then this is always defined. The second is the so called $L_2$ norm, in which the distance of f from g is obtained by taking

\begin{displaymath}
\Vert f-g\Vert _2 = \sqrt{\int (f(x) - g(x))^2 \, dx}
\end{displaymath}

This is defined in the case where f and g are continuous and the support is compact, but in much more general cases also. If you don't know what compact means, the excellent book by G.F. Simmons given in the references at the chapter's end will tell you, along with much more material worth knowing if only so as not to be intimidated by it. Although the two norms generally give different numbers for the same pair of functions, they are related on the space of continuous functions on a fixed compact domain: if a sequence of functions converges to a limit function in the $L_\infty$ sense, then the same sequence also converges to the same limit in the $L_2$ sense. The converse can fail: ever thinner spikes of height 1 converge to the zero function in the $L_2$ sense but stay at distance 1 in the $L_\infty$ sense. More bizarre possibilities exist, so watch out, but for the cases discussed, the different norms are not too different.
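To make the two numbers concrete, take $f(x) = x$ and $g(x) = x^2$ on $[0,1]$. Then

\begin{displaymath}
\Vert f-g\Vert _{\infty} = \sup_{x \in [0,1]} \vert x - x^2 \vert = \frac{1}{4} \qquad \mbox{(attained at $x = \frac{1}{2}$)}
\end{displaymath}

while

\begin{displaymath}
\Vert f-g\Vert _2 = \sqrt{\int_0^1 (x - x^2)^2 \, dx} = \sqrt{\frac{1}{3} - \frac{1}{2} + \frac{1}{5}} = \frac{1}{\sqrt{30}} \approx 0.18 ,
\end{displaymath}

two different numbers measuring the same disagreement in two different ways.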

In either sense, it is often useful to be able to find out what functions can be expressed as limits of some class of functions. If, for example, we take functions on the real line which are pdfs, i.e. they are non-negative and have integral 1, then it might be cheering to be able to prove rigorously the statement I threw off rather casually some time back, that such a thing can be approximated as well as you wish by a mixture of gaussians. To do this, you need a careful definition of what it means to approximate a function g by some family of functions $f_p$, and of course it means I can choose members of the family $f_1, f_2, \ldots, f_n, \ldots$ such that the distance $\Vert f_n - g\Vert$ can be made less than any given number $\varepsilon$ by choosing n large enough. Thus, it is easy[*] to prove that in the space ${\cal C}^0([0,1])$ of continuous functions from [0,1] to ${\mathbb R}$ with the $L_\infty$ norm, the family of polynomials can be used to approximate any continuous function. This is a celebrated theorem of Weierstrass.
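For the curious, here is a small computational illustration of the Weierstrass statement; it is mine, not the book's. The Bernstein polynomials $B_n(f)(x)=\sum_{k=0}^{n} f(k/n)\,{n \choose k}\, x^k (1-x)^{n-k}$ give one constructive route to the theorem, and a few lines of Python show the $L_\infty$ discrepancy shrinking as the degree grows.

\begin{verbatim}
# A minimal sketch, not from the text: Bernstein polynomial approximation
# on [0,1].  B_n(f)(x) = sum_k f(k/n) * C(n,k) * x^k * (1-x)^(n-k); the
# sup-norm error shrinks as n grows, which is one constructive proof of
# Weierstrass' theorem.
from math import comb
import numpy as np

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f at the points x."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)          # continuous, but not smooth at 1/2
xs = np.linspace(0.0, 1.0, 1001)

for n in (5, 20, 80, 320):
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print("n = %4d   sup|B_n(f) - f| = %.4f" % (n, err))
\end{verbatim}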

The theorem of Weierstrass must be compared with the well known Taylor expansion of a function about a point, which for the case of infinitely differentiable functions actually gives you a polynomial and bounds on the difference between the function and the approximating polynomial. Sometimes the bounds can be estimated, which makes for a really practical theorem. You can replace one, possibly horrid, function by a polynomial and work out the worst case discrepancy. The Weierstrass theorem works for a larger class of functions, but doesn't tell you how bad or good a particular approximation is. (Although there are constructive proofs which do give the information. Functional Analysts often seem quite indifferent to these practicalities, alas.)
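For example, for $f(x) = e^x$ on $[0,1]$, Taylor's theorem with remainder about 0 says that the degree-n Taylor polynomial $p_n(x) = \sum_{k=0}^{n} x^k/k!$ satisfies

\begin{displaymath}
\vert e^x - p_n(x) \vert = \frac{e^{\xi}\, x^{n+1}}{(n+1)!} \le \frac{e}{(n+1)!} \qquad \mbox{for some $\xi$ between 0 and $x$,}
\end{displaymath}

so with $n = 10$ the worst case discrepancy over the whole interval is already below $10^{-7}$. That is the sort of guarantee the bare Weierstrass theorem does not give you.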

Moving away from the real line into ${\fam11\tenbbb R}^n$ and even more awful generality, the generalisation of Weierstrass' theorem which appeals to Analysts is the Stone-Weierstrass Theorem which may be used to show that if any family of functions on a compact domain is an Algebra of continuous functions, which means that the sum of functions in the family is in, the scalar multiple of any functions in the algebra is in, and the product of any functions in the algebra is in the algebra, and if there are enough functions to separate distinct points (i.e. if there are two distinct points x and y in the domain of the algebra functions then there are functions f and g such that $f(x) \neq g(x)$) and if, finally, the algebra contains the constant functions, then any continuous function with the same (compact) domain can be approximated as closely as you like in the $L_\infty$ sense. Well, a little work has to be done to bend the Stone-Weierstrass theorem to make it go, but it can be used to show that for compact subsets of ${\fam11\tenbbb R}^n$, any continuous function can be approximated as closely as you like by the functions obtained using three layer neural nets with any continuous sigmoid function.
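To be explicit about what `the functions obtained using three layer neural nets' are, reading the output unit as one which simply adds up its weighted inputs (a harmless simplification for present purposes, and my reading rather than a quotation): they are the functions

\begin{displaymath}
F(x) = w_0 + \sum_{i=1}^{m} w_i \, \sigma(a_i \cdot x + b_i), \qquad a_i \in {\mathbb R}^n, \quad b_i, w_i \in {\mathbb R},
\end{displaymath}

for some number m of hidden units, and the theorem says such sums are dense in the continuous functions on a compact set. As a rough numerical sketch of the claim, and only a sketch (mine, not the book's, and emphatically not the back-propagation of the previous section): fix the hidden weights at random, choose the output weights by least squares, and the worst discrepancy on a grid typically shrinks as m grows.

\begin{verbatim}
# A rough numerical sketch (mine, not the book's): approximate a continuous
# function on the compact square [0,1]^2 by a sum of sigmoid units,
# F(x) = sum_i w_i * sigmoid(a_i . x + b_i), fixing the hidden weights a_i,
# b_i at random and solving for the output weights w_i by least squares.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = np.tanh                                      # any continuous sigmoid will do
target = lambda x, y: np.sin(3 * x) * np.cos(2 * y)    # an arbitrary continuous function

xs, ys = np.meshgrid(np.linspace(0, 1, 40), np.linspace(0, 1, 40))
X = np.column_stack([xs.ravel(), ys.ravel()])
t = target(X[:, 0], X[:, 1])

for m in (5, 20, 80, 320):                     # number of hidden units
    A = rng.normal(size=(m, 2)) * 3.0          # hidden weights a_i
    b = rng.normal(size=m) * 3.0               # hidden offsets b_i
    H = sigmoid(X @ A.T + b)                   # hidden layer outputs on the grid
    w, *_ = np.linalg.lstsq(H, t, rcond=None)  # output weights by least squares
    err = np.max(np.abs(H @ w - t))            # worst discrepancy on the grid
    print("m = %4d   worst grid discrepancy = %.4f" % (m, err))
\end{verbatim}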

It is worth drawing some pictures here to see what is being said. If you take a square in ${\mathbb R}^2$, any single unit will implement a function that is just a smoothed version of the function that assigned +1 to Fred and -1 to Gladys, with a dividing line down in between them that went right across the whole plane. Fig.5.12 tries to give an incompetent artist's impression of what the function looks like, a sort of Wave Rock effect.


 
Figure 5.12: A sigmoid classifying.
\begin{figure}
\vspace{8cm}
\special{psfile=patrecfig5.12.ps}
\end{figure}

Now putting three of these in at the first processing layer of a three layer neural net, and then following them with one output unit, means that we can get scaled multiples of translates of things like the Wave Rock [*]. We can even turn one upside down by multiplying by a negative number. So we take such a thing, stretched out in the x-y plane and with its height multiplied by some number, and add it to the result of doing the same thing to two other such functions. It is clear that the result must have three waves reaching right across the plane. If the lines are parallel, we could use two of them to make an infinitely long ridge. Four of them could make two ridges with a sort of tower where they cross. But the waves go all across the plane. Just as we politely rejected the three layer committee net for the generalised XOR data set, on the grounds that a model where the complexity goes up with the amount of data leaves something to be desired, so we should be rather inclined to abandon the three layer smoothed version as a suitable basis for modelling functions, unless waves which extend right across the space were a feature of the functions we wanted to model. The analysts' claim that it can always be done is correct, but the implied claim that it is worth doing is deeply suspect. The analysts will cheerfully explain that they never said it was a good idea to actually do it, their theorems only say it can be done. The engineer will reply that they should, in that case, publish these results in some Pure Mathematical journal dedicated to useless trivia. This is an argument that can go on a long time. The moral is that although the analysts and other pure mathematicians are all God's creatures, and although they often say things that are of interest, one should read what they say carefully and very, very literally. They are not necessarily on your side.
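In symbols, the reason the waves cannot be confined: a single hidden unit computes

\begin{displaymath}
\sigma(a x + b y + c),
\end{displaymath}

which depends on the point $(x,y)$ only through the number $ax + by$, so it is constant along every line $ax + by = \mbox{const}$ and takes the same values arbitrarily far out along those lines. Scaling it, turning it upside down, or adding two more such terms cannot change that; each wave still runs right across the plane.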


 
Figure 5.13: Sigmoids summing.
\begin{figure}
\vspace{8cm}
\special{psfile=patrecfig5.13.ps}
\end{figure}

Of course the one dimensional case is quite useful since it can be used to give a quick proof that any pdf can be approximated by gaussians, as you can use half a gaussian turned upside down and glued to the other half as a (sort of) sigmoid.
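One way to write the glued object down explicitly, on this reading of the recipe: take the gaussian bump $e^{-x^2}$, keep its left half, and for $x > 0$ use the right half turned upside down and shifted up, giving

\begin{displaymath}
s(x) = \left\{ \begin{array}{ll} e^{-x^2} & x \le 0 \\ 2 - e^{-x^2} & x > 0 \end{array} \right.
\end{displaymath}

which rises monotonically from 0 to 2, is continuously differentiable at the join, and is built entirely out of gaussian pieces: a (sort of) sigmoid, as promised.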

The theory of Fourier Analysis and the Fourier series whereby we build up a function out of sines and cosines extends the Weierstrass result in a certain sense to the case where we may use the $L_2$ norm, and hence to a larger class of functions. The theory can be modified to give results for three layer nets, but such results are no more practical than the Stone-Weierstrass Theorem. The question of interest is, how fast does the discrepancy disappear with the number of units used? Contemplation of Fig.5.13 suggests that discrepancies are likely to be of the order of 1/m, where m is the number of units. This does not compare well with, say, Taylor's theorem for most analytic functions encountered in practice. Analysts are not usually interested in what functions are actually encountered in practice, and it is as well to remember this. Mathematics is a splendid servant but a thoroughly foolish master and it is essential to let it know who is the boss.

What we did with four layer nets and the hard limiter sgn function can be done with smooth sigmoids and function approximation, if you should feel a need to approximate functions. Since reality always gives you finite data sets to finite precision when you measure it, the need is seldom irresistible, although there are many circumstances when the procedure is useful.

