
Building up functions

Any net implements a function from the space of inputs to the space of outputs. A single unit in $\mathbb{R}^2$, as in Fig.5.4, is a map from $\mathbb{R}^2$ to $S^0$ which sends the point $\left(\begin{smallmatrix} x \\ y \end{smallmatrix}\right)$ to the value $\mathrm{sgn}(ax + by + c)$. A three layer net such as Fig.5.6 implements a function

$$\mathrm{sgn}\bigl(u\,\mathrm{sgn}(a_1 x + b_1 y + c_1) + v\,\mathrm{sgn}(a_2 x + b_2 y + c_2) + w\,\mathrm{sgn}(a_3 x + b_3 y + c_3)\bigr)$$

and it is easy to see that this is a weighted composite of the functions implemented by the units in the first layer of processing elements. This sort of thing can be carried on for as many layers as you like, although don't get too carried away just yet.
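To make the formulas concrete, here is a minimal sketch in Python; the weight names follow the formulas above, the particular values in the example are arbitrary, and the convention $\mathrm{sgn}(0) = +1$ is my choice rather than anything the text fixes.

    def sgn(t):
        # Hard threshold into S^0 = {-1, +1}; sgn(0) = +1 is an arbitrary choice.
        return 1.0 if t >= 0 else -1.0

    def unit(x, y, a, b, c):
        # A single unit in R^2: the map (x, y) |-> sgn(ax + by + c).
        return sgn(a * x + b * y + c)

    def three_layer(x, y, first_layer, u, v, w):
        # The composite above: sgn of a weighted sum of the three
        # first-layer unit outputs.
        h = [unit(x, y, a, b, c) for (a, b, c) in first_layer]
        return sgn(u * h[0] + v * h[1] + w * h[2])

Evaluating, say, three_layer(0.2, 0.7, [(1, 1, -0.5), (1, -1, 0.5), (-1, 1, 0.5)], 1, 1, 1) walks the point through both layers of thresholding.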

The data set constitutes a sort of function: it may be taken to be a function defined on a finite set of points and having values in $S^0$. And a neural net is not just one function from $\mathbb{R}^2$ to $S^0$, it is an infinite family of them, parametrised by all those weights. In other words, there is a manifold of functions, in general from $\mathbb{R}^n$ to $S^0$, and a data set, a function defined on a finite subset of $\mathbb{R}^n$ taking values in $S^0$, and the problem of training the net on the data set is to find one of the functions in the family which agrees with the data set function. The data set functions we consider in pattern recognition are binary valued of course, and the convention I use here is that they take values in $S^0$, although there are other applications where we take real valued functions. I shall not treat such cases on the grounds that you are already getting good value for money and if you have too many things to think about you might get depressed, but this section and the next two comprise a brief aside on the issues involved.
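Viewed this way, checking whether a particular point of weight space fits the data is a one-liner. A small sketch, reusing the unit function above (the data set and the weights are made up for illustration):

    # The data set as a function: a finite domain in R^2, values in S^0.
    data = {
        (0.0, 0.0): -1.0,
        (1.0, 0.0): +1.0,
        (0.0, 1.0): +1.0,
    }

    def agrees(f, data):
        # True when f matches the data set function at every point of its domain.
        return all(f(x, y) == label for (x, y), label in data.items())

    # One member of the parametrised family: a single unit with weights a, b, c.
    f = lambda x, y: unit(x, y, a=1.0, b=1.0, c=-0.5)
    print(agrees(f, data))   # True: this particular choice of weights fits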

Any training procedure can be viewed as a process of starting off with some arbitrary function from the family and moving about the weight space until we get agreement, until we fit the data. In general we have no guarantee that there is such a function in the family; the XOR problem with one unit is an example. Here one must think of fitting step functions defined on the plane, $\mathbb{R}^2$, and taking values $\pm 1$. It is easy to see that for three or four layer nets of the sort we have described, the complexity of the data set, however defined, is the critical consideration in whether or not the problem is solvable at all. Every problem has a net of some size which can solve it; equally, every net has some problems it cannot solve. This is not a particularly profound observation, but it raises the ante from a sort of Minskian disenchantment with the capacity of the net to other, more interesting issues.
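The XOR claim is easy to check by brute force. The sketch below, reusing sgn from above, searches a coarse grid of weights for a single unit; the grid is arbitrary, and exhausting a grid is of course not a proof, though for XOR the impossibility in fact holds for all weights. No grid point fits XOR, while the same search does fit AND.

    import itertools

    xor_data = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}
    and_data = {(-1, -1): -1, (-1, 1): -1, (1, -1): -1, (1, 1): 1}

    def fits_single_unit(data, grid):
        # Is there a single unit sgn(ax + by + c) agreeing with the data?
        return any(
            all(sgn(a * x + b * y + c) == label
                for (x, y), label in data.items())
            for a, b, c in itertools.product(grid, repeat=3))

    grid = [k / 4 for k in range(-8, 9)]      # -2.0 to 2.0 in steps of 0.25
    print(fits_single_unit(xor_data, grid))   # False: XOR defeats one unit
    print(fits_single_unit(and_data, grid))   # True: e.g. a = b = 1, c = -1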

If I put the problem in the frame of function fitting from a known family of functions, then I raise immediately the question of whether the family is a good one to try to fit from. The question of what makes a sensible basis of functions for describing point sets came up in chapter two, where we were trying to describe characters, and we meditated on the merits of the Fourier basis of sines and cosines against polynomials. For a given type of data, some basis functions make more sense than others. Just choosing a basis set because it is there is not particularly intelligent; it may make sense to choose one because it is easy to do the sums with, or because the kinds of functions you want to fit are likely to be well suited to the family you are fitting with. But a little thought can save a lot of computing time.
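As a crude illustration of the point, the following sketch (the signal and the two five-function bases are invented for the example) least-squares fits the same periodic data from a polynomial family and from a trigonometric one; on data like this the trigonometric basis typically leaves the smaller residual.

    import numpy as np

    x = np.linspace(0.0, 2 * np.pi, 20)
    y = np.sign(np.sin(x))                  # square-ish wave: periodic data

    def residual(x, y, basis):
        # Least-squares fit from the span of the basis; return the misfit.
        A = np.column_stack([f(x) for f in basis])
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.linalg.norm(A @ coeffs - y)

    poly_basis = [lambda t, k=k: t ** k for k in range(5)]
    trig_basis = ([lambda t: np.ones_like(t)]
                  + [lambda t, k=k: np.sin(k * t) for k in (1, 2, 3)]
                  + [lambda t: np.cos(t)])

    print("polynomial:", residual(x, y, poly_basis))
    print("trigonometric:", residual(x, y, trig_basis))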

The geometry of a neural net determines the function class from which you are trying to fit the data. This function class is the important thing, and the net diagram is otherwise largely uninformative, although it may make the programming task of evaluating the function somewhat simpler. The geometry determines the dimension of the space of weights through which you are proposing to step in order to find a solution. Stepping blindly may be the best you can do, but it is to be avoided if possible. It is clear that the time to find a solution goes up with the dimension of the search space, and in general rather worse than linearly. There is a trade-off between finding, on the one hand, that there is no solution with your net because it isn't big enough, and, on the other, at least discovering this uncomfortable fact fairly quickly. Bunging everything into a huge net, running it on a supercomputer and leaving it to chug away for a few weeks (or years) is the sort of cretinous stupidity one associates with......, well, not with engineers, I hope.
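Counting that dimension is easy enough to do in code. A sketch, under the assumption that every unit takes one weight from each unit of the previous layer plus a threshold term (the layer sizes below are just examples):

    def weight_space_dim(layer_sizes):
        # Dimension of the weight space for a fully connected layered net:
        # one weight per input from the previous layer, plus a threshold.
        return sum((n_in + 1) * n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    print(weight_space_dim([2, 3, 1]))      # 13: the three layer net above
                                            # (one more than the formula in the
                                            # text, which omits an output threshold)
    print(weight_space_dim([100, 50, 10]))  # 5560: already a big space to search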

