First, if the net is wrong about a point, we may suppose without loss of generality that the net thinks it is a -1 when it is really a +1. So if the weights on the last unit, call them u,v,w, are all +1, then either two or three of the lines have the point on the wrong side. We could therefore adapt the system by taking one of the lines that is wrong about the point and simply using the Perceptron Training rule to change that line.
As an alternative, we could change the weights u,v,w. If we had all three lines wrong, we should have to change the sign of at least one weight, which is exactly equivalent to reversing the corresponding line's orientation while leaving it in the same place. It seems unreasonable to keep the lines fixed and merely adjust the voting weights, since if the lines are simply in the wrong places, they will stay in the wrong places. In effect we have applied a fixed transform to the input and then fed the result into a single unit. We might be lucky and have picked a transform which improves things, but it would be silly to bet on it. So mucking about with the weights u,v,w is not a particularly attractive option, and we may reasonably prefer to keep the voting strengths fixed at +1 and concentrate on moving the lines.
This raises the question of which line to move. Now a line is either right or wrong about a point, but it can be very wrong in some cases and only slightly wrong in others. The conservative thing to do is to make whichever change is smallest. The rationale for this is that presumably some of the data set has already been learnt by the system, even if only by chance, and it would be silly to throw everything away; so work out which change in the state space is least and make that one. This amounts to the following strategy:
take the lines $w_i$ and the point $x$ such that $w_i \cdot x < 0$ when the category of $x$ is $+1$ (or $w_i \cdot x > 0$ when the category is $-1$), and find the line $w_j$ among these such that $|w_j \cdot x|$ is least.
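The following is a minimal sketch of this correction step in Python, assuming augmented input vectors (with a trailing 1 for the constant term), categories coded as +1 or -1, and a standard Perceptron-style step applied to the single chosen line; the function and variable names are illustrative rather than taken from the text.

```python
import numpy as np

def committee_output(W, x):
    """Committee decision: each row of W is a 'line'; the final unit
    sums the +1-weighted votes sign(W_i . x) and takes the sign."""
    votes = np.sign(W @ x)
    return np.sign(votes.sum())

def correct_closest_wrong_line(W, x, category, step=0.1):
    """If the committee misclassifies x, apply a Perceptron-style
    correction to the wrong line whose value |w . x| is least."""
    if committee_output(W, x) == category:
        return W                                  # nothing to do
    activations = W @ x
    wrong = np.where(np.sign(activations) != category)[0]
    j = wrong[np.argmin(np.abs(activations[wrong]))]   # the least wrong line
    W = W.copy()
    W[j] += step * category * x                   # move that line towards the point
    return W
```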
Experiments show that this works fairly well, and much better than such strategies as choosing the `maximally wrong' line. A committee of three members such as that of Fig. 5.6 can find a solution fairly quickly in practice, for such trivial data sets as the XOR. It is easy to write a little program which draws the lines on the screen, and you can watch them hop about spasmodically until they converge to a solution. There is no advantage, and some disadvantage, in also modifying the weights u,v,w. Higher dimensional analogues of XOR can be constructed by taking a binary string of length n and assigning the category +1 to it if there is an even number of 1s in the string, and -1 if there is an odd number. This `parity' data set has $2^n$ points, and it is an amusing exercise to compute the smallest number of units in the second layer (committee members) required to ensure a solution.
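By way of illustration, the sketch below constructs the parity data set just described; encoding the strings as 0/1 coordinates is one reasonable choice among several.

```python
from itertools import product
import numpy as np

def parity_data(n):
    """All 2^n binary strings of length n, with category +1 for an even
    number of 1s and -1 for an odd number (the n-dimensional analogue of XOR)."""
    points, categories = [], []
    for bits in product((0, 1), repeat=n):
        points.append(bits)
        categories.append(+1 if sum(bits) % 2 == 0 else -1)
    return np.array(points, dtype=float), np.array(categories)

X, y = parity_data(2)   # n = 2 gives the XOR problem (up to sign)
```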
Such committees, started off from random initial locations, can sometimes fail to converge, and a little innocent experimenting rapidly shows why. Suppose that two of the three wrong lines are `a long way away' from the datum $x$, and indeed from all the data. Then the same line may be corrected every time a datum is presented to the net, and the two remote units might just as well not be there: the correction process ignores them. In this case we are trying to solve XOR with a single unit, which is impossible. `A long way away' simply means that $|w \cdot x|$ is large. This will happen for XOR if the constant term of the line is big in absolute value.
There are several ways of fixing things up so that all units are employed usefully. One quite attractive procedure is the following: we do not simply select the closest unit for correction; we correct all of them, but we correct the closest unit most and the remoter units much less. Thus even the most remote units will eventually be pulled into the region where the data is being classified by the hyperplanes, until the data is classified correctly, except where there are simply not enough units to do the job. Since we seldom know what this minimum number of units is, except in rather trivial problems such as the parity problem, the prudent thing to do is to start with a lot and keep your fingers crossed.
The correction rule which does this differential weighting of the correction, according to how remote the line is from the datum in the weight space, needs a function which starts off at some positive value and then decreases monotonically to zero. Something like $e^{-x}$ springs immediately to mind (since distances are always positive!). It is not a particularly good choice, because it takes much longer to compute transcendental functions than polynomials or rational functions, and something like $1/(1+x)$ is easier to compute. This function has the property that remote objects get pulled in rather fast, which may be a bad idea if they are not so remote as all that, and are in fact doing a good job of classifying some other data points. One can therefore have a preference for, say, $1/(1+x)^2$. The possibilities are endless.
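A sketch of how this differential correction might be implemented is given below, with the weighting function left as a parameter so that $e^{-x}$, $1/(1+x)$ and $1/(1+x)^2$ can be compared; the step size and the default choice of function are assumptions made for illustration only.

```python
import numpy as np

def correct_all_wrong_lines(W, x, category, step=0.1,
                            weight_fn=lambda d: 1.0 / (1.0 + d) ** 2):
    """Correct every wrong line, scaling each correction by a decreasing
    function of |w . x|, so the closest line moves most and remote lines
    move least.  Typically called only when the committee misclassifies x."""
    activations = W @ x
    wrong = np.where(np.sign(activations) != category)[0]
    W = W.copy()
    for i in wrong:
        distance = abs(activations[i])            # how far x is on the wrong side
        W[i] += step * weight_fn(distance) * category * x
    return W
```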
The end result of this purely heuristic analysis, then, is that a workable correction process for the committee net can be obtained by keeping the final weights fixed at +1 and adapting the wrong units by taking a step in the appropriate direction (i.e. along the direction $x$ or in the opposite direction, according to the sign of the category) of size some constant times a function which decreases to zero from 1 (or some other positive constant) and might conveniently be chosen to be $e^{-x}$ or $1/(1+x)^2$. It is easy to explore this training rule on real data, and it seems to behave itself sensibly. We shall see later that this rule is intimately related to the back-propagation procedure.
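Putting the pieces together, a complete training loop might look like the sketch below, which presents the data repeatedly and applies the weighted correction until the committee classifies everything or an iteration limit is reached. The committee size, step size, weighting function and stopping rule are all assumptions made for illustration, not the author's program.

```python
import numpy as np

def train_committee(X, y, n_units=3, step=0.1, max_epochs=10_000, seed=0,
                    weight_fn=lambda d: 1.0 / (1.0 + d) ** 2):
    """Train a committee of n_units lines on data X (one row per point)
    with categories y in {+1, -1}, using the weighted correction rule."""
    rng = np.random.default_rng(seed)
    Xa = np.hstack([X, np.ones((len(X), 1))])        # augment with constant term
    W = rng.normal(size=(n_units, Xa.shape[1]))      # random initial lines
    for _ in range(max_epochs):
        errors = 0
        for x, cat in zip(Xa, y):
            votes = np.sign(W @ x)
            if np.sign(votes.sum()) != cat:          # committee is wrong about x
                errors += 1
                acts = W @ x
                wrong = np.sign(acts) != cat
                scale = step * weight_fn(np.abs(acts[wrong]))
                W[wrong] += scale[:, None] * cat * x # closest line moves most
        if errors == 0:
            break
    return W

# e.g. on the parity data built in the earlier sketch:
# X, y = parity_data(2)
# W = train_committee(X, y)
```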