Any net implements a function from the space of inputs to the space of outputs. A single unit in R^2, as in Fig. 5.4, is a map from R^2 to S^0 which sends the point (x, y) to the value sgn(ax + by + c). A three layer net such as Fig. 5.6 implements a function

    sgn(u sgn(a1 x + b1 y + c1) + v sgn(a2 x + b2 y + c2) + w sgn(a3 x + b3 y + c3))
and it is easy to see that this is the weighted composite of functions implemented by each unit in the first layer of processing elements. This sort of thing can be carried on for as many layers as you like, although don't get too carried away just yet.
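As a concrete sketch of the formula above, here is a minimal Python version; the function and parameter names are my own, and the convention that sgn(0) = +1 is an assumption rather than anything forced by the text.

    def sgn(t):
        # sign function with values in S^0 = {-1, +1}; sgn(0) = +1 is a convention chosen here
        return 1 if t >= 0 else -1

    def three_layer_net(x, y, first_layer, output_weights):
        # Evaluates sgn(u sgn(a1 x + b1 y + c1) + v sgn(a2 x + b2 y + c2) + ...).
        # first_layer:    list of (a, b, c) triples, one per first-layer unit
        # output_weights: the weights u, v, w, ... of the output unit (no output
        #                 threshold, matching the formula in the text)
        hidden = [sgn(a * x + b * y + c) for (a, b, c) in first_layer]
        return sgn(sum(w * h for w, h in zip(output_weights, hidden)))

    # Three first-layer units and output weights u = v = w = 1:
    print(three_layer_net(0.5, 0.5, [(1, 1, -0.25), (1, -1, 0), (-1, 1, 0)], [1, 1, 1]))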
The data set constitutes a sort of function; it may be taken to be a function defined on the finite set of points and having values in S^0. And a neural net is not just one function from R^2 to S^0, it is an infinite family of them, parametrised by all those weights. In other words, there is a manifold of functions, in general from R^n to S^0, and a data set, a function defined on a finite subset of R^n taking values in S^0, and the problem of training the net on a data set is to find one of the functions in the family which agrees with the data set function. The data set functions we consider in pattern recognition are binary valued of course, and the convention I use here is that they take values in S^0, although there are other applications where we take real valued functions.
I shall not treat such cases on the grounds that
you are already getting good value for money and
if you have too many things to think about you
might get depressed, but this section and the
next two comprise a brief aside on the issues
involved.
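To make the picture concrete, here is a small sketch, again with names of my own choosing: a member of the single-unit family is picked out by its weights, a data set is a finite dictionary of points with values in S^0, and training succeeds when some member of the family agrees with it.

    def sgn(t):
        # values in S^0 = {-1, +1}; sgn(0) = +1 is a convention chosen here
        return 1 if t >= 0 else -1

    def unit(a, b, c):
        # one member of the infinite family of functions R^2 -> S^0, parametrised by (a, b, c)
        return lambda x, y: sgn(a * x + b * y + c)

    # A data set: a function on a finite subset of R^2 with values in S^0
    # (the points and labels here are invented purely for illustration).
    data_set = {(0.0, 0.0): -1, (1.0, 2.0): 1, (2.0, 0.5): 1, (-1.0, 1.0): -1}

    def agrees(f, data):
        # the net fits the data when f agrees with the data set function at every point
        return all(f(x, y) == label for (x, y), label in data.items())

    print(agrees(unit(1.0, 1.0, -1.5), data_set))   # this particular member happens to fit: True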
Any training procedure can be viewed as a process
of starting off with some arbitrary function
from the family and moving about the weight space
until we get agreement, until we fit the
data. In general we have no guarantee that there
is such a function in the family.
The XOR problem with one unit is an example. Here
one must think of fitting step functions defined on the plane, R^2, and taking values in S^0 = {-1, +1}.
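A crude way to see on a machine that no single unit fits the XOR data: step through a coarse grid in the three-dimensional weight space and check each candidate against the data set. This is a sketch, not a proof, and not any particular training algorithm; the usual placement of the four XOR points at the corners of the unit square is assumed.

    import itertools

    def sgn(t):
        return 1 if t >= 0 else -1

    # XOR as a data set function on four points of the plane, values in S^0 = {-1, +1}
    xor_data = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}

    def fits(a, b, c):
        # does the single unit sgn(a x + b y + c) agree with the XOR data set function?
        return all(sgn(a * x + b * y + c) == label for (x, y), label in xor_data.items())

    grid = [i / 4 for i in range(-12, 13)]   # weights from -3 to 3 in steps of 0.25
    print(any(fits(a, b, c) for a, b, c in itertools.product(grid, repeat=3)))   # False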
It is easy to see that for three or four layer
nets of the sort we have described, the complexity
of the data set, however defined, is the
critical consideration in whether or not the problem
is solvable at all. Every problem has a net of some size which can solve it; equally, every net has some problems it cannot solve. This
is not a particularly profound observation, but
it raises the ante from a sort of Minskian
disenchantment with the capacity of the net to
other more interesting issues.
If I put the problem in the frame of function
fitting from a known family of functions, then
I immediately raise the question of whether the
family is a good one to try to fit from. The
question
of what makes a sensible basis of functions for
describing point sets came up in chapter two,
where we were trying to describe characters, and
we meditated on the merits of the Fourier
basis of sines and cosines against polynomials.
For a given type of data, some basis functions
make more sense than others. Just choosing a basis
set because it is there is not particularly
intelligent; it may be that choosing it because
it is easy to do the sums with it makes sense,
or it may be that the kinds of functions you want
to fit are likely to be well suited to the
family you want to fit with. But a little thought
can save a lot of computing
time.
The geometry of a neural net determines the function class from which you are trying to fit the data. This function class is the important thing; the net diagram is otherwise largely uninformative, although it may make the programming task of evaluating the function somewhat simpler. The geometry also determines the dimension of the space of weights through which you are proposing to step in order to find a solution. Stepping blindly may be the best you can do, but it is to be avoided if possible. The time to find a solution clearly grows with the dimension of the search space, and in general rather worse than linearly. There is a trade-off between, on the one hand, finding that there is no solution with your net because it isn't big enough, and, on the other, at least discovering this uncomfortable fact fairly quickly. Bunging everything into a huge net, running it on a supercomputer and leaving it to chug away for a few weeks (or years) is the sort of cretinous stupidity one associates with......, well, not with engineers, I hope.
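As a back-of-envelope illustration of how the search space grows, here is a small helper of my own; it assumes every unit, the output unit included, carries a threshold term, which the formula earlier did not.

    def weight_space_dimension(layer_sizes):
        # layer_sizes: unit counts per layer, e.g. [2, 3, 1] for two inputs,
        # three first-layer units and one output unit; each unit gets one
        # weight per input plus a threshold term.
        return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    print(weight_space_dimension([2, 3, 1]))    # the net of Fig. 5.6 with an output threshold added: 13 weights
    print(weight_space_dimension([2, 100, 1]))  # a much fatter first layer: 401 weights to search through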