There is a simple solution to the problem of the last subsection: we simply add another layer. The next diagram, Fig.5.10, shows a net which solves the generalised XOR problem of Fig.5.8. Each node is supposed to do a weighted linear summation of its inputs and to follow this with the sgn function. I have left out all arcs where the weight is zero, and also the constant in the input layer, because it looked lonely with no arcs coming from it.
Now suppose I give the net as input a point in the first quadrant, that is, with both x and y positive. Then A outputs -1, B outputs +1, C outputs +1 and D outputs -1. E and F each receive an input sum of zero and, having a negative threshold, output -1. G has two -1 inputs, which overcome its constant weight of +1, so the output of the net is -1.
If I input a point in the second quadrant, x is negative and y is positive. Both A and B output +1, while C and D output -1, so E outputs +1 and F outputs -1. With that input and its positive `threshold', G outputs +1, which is the output of the net. If the input has both x and y negative, from the third quadrant, then just as in the first quadrant E and F output -1, and the net output is again -1. Finally, in the fourth quadrant, where x is positive and y is negative, C and D output +1 while A and B output -1, so E outputs -1 and F outputs +1, and hence the net outputs +1. So for input in the first or third quadrants the net outputs -1, while in the second and fourth it outputs +1. In other words, it solves the generalised XOR problem.
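The arithmetic above is easy to check mechanically. Since Fig.5.10 is not reproduced here, the following sketch uses one choice of weights consistent with the account just given; take the particular weights as an illustrative assumption, not a transcription of the figure.

```python
def sgn(t):
    # The step non-linearity: +1 for t >= 0, -1 otherwise.
    return 1 if t >= 0 else -1

def net(x, y):
    # First layer: four lines through the origin (one consistent choice;
    # the actual weights in Fig.5.10 may differ).
    A = sgn(-x)          # fires on the left half plane
    B = sgn(y)           # fires on the upper half plane
    C = sgn(x)           # fires on the right half plane
    D = sgn(-y)          # fires on the lower half plane
    # Second layer: AND gates, arranged by a threshold (bias) of -1.
    E = sgn(A + B - 1)   # fires iff the input is in the second quadrant
    F = sgn(C + D - 1)   # fires iff the input is in the fourth quadrant
    # Output layer: an OR gate, arranged by a bias of +1.
    return sgn(E + F + 1)

# One point from each quadrant; expected outputs -1, +1, -1, +1.
for x, y in [(2, 3), (-2, 3), (-2, -3), (2, -3)]:
    print((x, y), '->', net(x, y))
```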
Inspection of the net and a little reflection show how this is accomplished. It is useful to draw the lines in the plane implemented by the first layer units; they are indicated, somewhat displaced from their true positions, in Fig.5.11. A and B return positive for the second quadrant, C and D for the fourth. Unit E is an AND gate which fires (+1) if and only if the input is in the second quadrant; unit F is an AND gate which fires iff the input is in the fourth quadrant. Unit G is an OR gate which fires if E fires or if F fires. It is easy to see that AND gates and OR gates can be arranged, for any number of inputs, by judicious choice of the threshold. The effect of this is to partition the space so that we no longer have the extreme annoyance of having to put up with hyperplanes that run right across the whole of it.
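To make the judicious choice of threshold concrete: with n inputs each in {-1, +1}, a bias of -(n-1) gives an AND gate and a bias of +(n-1) gives an OR gate. A minimal sketch (the function names are mine, not from the text):

```python
def sgn(t):
    # The step non-linearity used throughout: +1 for t >= 0, else -1.
    return 1 if t >= 0 else -1

def and_gate(inputs):
    # The sum of n inputs from {-1, +1} reaches n only when all fire;
    # a threshold (bias) of -(n - 1) lets the unit fire iff all do.
    n = len(inputs)
    return sgn(sum(inputs) - (n - 1))

def or_gate(inputs):
    # A bias of +(n - 1) means a single +1 input already tips the sum
    # past zero, so the unit fires iff at least one input fires.
    n = len(inputs)
    return sgn(sum(inputs) + (n - 1))

assert and_gate([1, 1, 1]) == 1 and and_gate([1, -1, 1]) == -1
assert or_gate([-1, -1, -1]) == -1 and or_gate([-1, 1, -1]) == 1
```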
Some more reflection makes it plain that it is easy to take any finite data set and chop it up into regions bounded by convex polytopes, each region containing points of only one category. This holds in any dimension and for any number of categories.
Thus we conclude several things. First, four layer nets (three layers of processing units) suffice for any data set, with the advantage over three layer nets (two layers of processing units) that the number of units required is of the order of the complexity of the data set, not of the number of points, where the complexity bears some relationship to the geometry. This means that we do not feel any particular attraction towards nets with many more layers. The simpler models are more appealing to anyone who has read the last chapter thoughtfully.
Second, it is tempting to fix the weights on all but the first layer, and to tackle a problem with some fixed number of convex sets ready to wrap around the regions. It is true that we do not know in advance whether we can do the job with a given number of units arranged in a particular configuration, but that is the nature of the beast: we never know, in general, whether there is a solution space for our net. There is a trade-off. If we allow the weights between the second and third layers to change, we have more flexibility in the number of units making up any polytope, and if we allow changes in the weights generally we may have more flexibility in the way the gates work; but flexibility is not always a virtue. Worms are flexible, but cockroaches can run faster. Jelly is flexible, but not the medium of choice for building houses unless you are very young indeed. There is a case for having articulated rigid parts. It is found experimentally that an articulated neural net can be trained ten times as fast as a flexible one on problems of low complexity; the flexible ones, in many cases, fail to converge at all. If you contemplate lines hurtling over the initial space, with the next layer doing the same to the outputs of those lines in a higher level space, and so on, you are probably inclined to find this unsurprising.
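As a sketch of the articulated idea: freeze the AND and OR layers of the net above and let only the first layer lines move. The details here, four first-layer units paired into two AND gates and a crude random hill-climb standing in for a proper training rule, are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgn(t):
    return np.where(t >= 0, 1.0, -1.0)

def articulated_net(W, X):
    """First layer trainable, upper layers rigid.

    W : (4, 3) array, one row per first-layer unit (weights for
        x, y and the constant input), grouped in pairs.
    X : (m, 2) array of input points.
    """
    Xc = np.hstack([X, np.ones((len(X), 1))])   # append constant input
    H = sgn(Xc @ W.T)                           # first-layer outputs
    # Rigid second layer: AND each consecutive pair of line units.
    E = sgn(H[:, 0::2] + H[:, 1::2] - 1)        # convex `wedges'
    # Rigid output layer: OR over the wedges.
    p = E.shape[1]
    return sgn(E.sum(axis=1) + (p - 1))

def errors(W, X, y):
    return np.sum(articulated_net(W, X) != y)

# Toy data: the generalised XOR labelling of random points,
# +1 in the second and fourth quadrants.
X = rng.uniform(-1, 1, size=(200, 2))
y = sgn(-X[:, 0] * X[:, 1])

# Crude hill climbing over the first-layer weights only.
W = rng.normal(size=(4, 3))
best = errors(W, X, y)
for _ in range(5000):
    W_try = W + 0.1 * rng.normal(size=W.shape)
    e = errors(W_try, X, y)
    if e <= best:
        W, best = W_try, e
print("misclassified:", best, "of", len(X))
```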
In conclusion, both three and four layer nets using a step function after a weighted summation of the input values can classify any finite data set. The three layer net will, in the case where the data set has non-trivial complexity, that is to say where it does not fall into bands which extend across the space, need a number of units of the order of the number of points. The four layer net can do the job with a number of units of the order of the simplicial complexity of the data set, where the simplicial complexity is defined as the smallest number of simplices which can enclose the data in such a way that inside each simplex the data points all have the same category. Approaching a classification problem with the AND and OR weights all fixed can speed up convergence by an order of magnitude or more when the problem is solvable by such a net. We remind you that visual inspection of the data by a projection program can aid in deciding what class of net to use, how many convex regions are likely to be necessary, and how complex each convex region may need to be. Such simple methods can reduce enormously the amount of time spent making essentially random hops around a huge weight space, a practice which has in recent times used up, one hesitates to say `wasted', almost as much computer time as playing arcade games.
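For definiteness, the simplicial complexity just described can be put in symbols; the notation $\mathrm{sc}$ is introduced here only for precision and is not standard:
\[
\mathrm{sc}(D) \;=\; \min\Bigl\{\, k \in \mathbb{N} \;:\; \exists\ \text{simplices } S_1,\dots,S_k \subset \mathbb{R}^n
\text{ with } D \subseteq \bigcup_{i=1}^{k} S_i,\ \text{each } D \cap S_i \text{ of a single category} \,\Bigr\}.
\]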