
Capacities of Committees: generalised XOR

It is intuitively clear, and easy to prove, that some committee can classify any finite data set in $\mathbb{R}^n$. I do this now, so you can brood over the proof.

Proposition 5.1

Any finite binary data set in $\mathbb{R}^n$ can be classified by some committee net.

Proof We induct on the number of data points, N. It is clear that when N = 1 there is no problem; indeed the classification can be accomplished with a single unit whenever $N \leq n+1$ (for points in general position). Suppose then that all data sets having N data points in $\mathbb{R}^n$ can be classified by some committee, and consider a data set having N+1 points. One of these, call it p, has to be on the outside, that is, it must be separable from the remaining data set by a single hyperplane.[*] By the induction hypothesis, there is a committee net which can correctly classify the remaining N data points. If the same net is also correct in its classification of p there is nothing to prove, so we suppose that the committee assigns the wrong category to p.

Since p is separable from the rest of the data set, we may surround p by hyperplanes which do not intersect the convex hull of the remainder of the data set and which change the net's opinion of p, by putting them between p and the rest of the data; and we may put an equal number on the other side of p to ensure that outside this band containing p no change is made. The result is a committee which is correct about all N+1 data points. The Principle of Mathematical Induction completes the proof. $\Box$
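To make the proposition concrete, here is a minimal numerical sketch of a committee net: a majority vote of linear threshold units, applied to the four-point XOR set. The weights are hand-chosen and purely illustrative; the proposition only asserts that *some* committee exists.

```python
import numpy as np

def committee(points, units):
    """Majority vote of linear threshold units: sign of the summed votes."""
    votes = sum(np.sign(points @ w + b) for w, b in units)
    return np.sign(votes)

# XOR: first and third quadrants -> -1, second and fourth -> +1.
X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
y = np.array([-1, -1, 1, 1])

# Hand-chosen (illustrative) units (w, b):
units = [(np.array([-1, -1]), 1),   # votes -1 at (1,1), +1 at the other three points
         (np.array([ 1,  1]), 1),   # votes -1 at (-1,-1), +1 at the other three points
         (np.array([ 0,  0]), -1)]  # constant -1 vote, breaking the ties

print(committee(X, units))
```

The three votes sum to -1 on the first and third quadrant points and to +1 on the other two, reproducing y on all four points.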

Note that the size of the committee may have to be increased faster than the size of the data set. This should, after some exposure to the ideas of Rissanen, make one sceptical about the extent to which such modelling actually accomplishes anything. I shall quantify this later.

The point is brought home if one considers a more `realistic' data set than XOR. Define a generalised XOR data set in $\mathbb{R}^2$ by taking a large number of points in the plane: all the points in the first and third quadrants have category -1, and all the points in the second and fourth quadrants have category +1.
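Such a data set is easy to generate. The following sketch labels uniform random points in the plane by quadrant; the sample size and the small margin keeping points off the axes are arbitrary choices for the illustration.

```python
import numpy as np

# Generalised XOR: random points in the plane, labelled by quadrant.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 2))
pts = pts[np.all(np.abs(pts) > 0.05, axis=1)]   # avoid ambiguous points on the axes

# First and third quadrants (x*y > 0): category -1; second and fourth: +1.
labels = np.where(pts[:, 0] * pts[:, 1] > 0, -1, 1)
```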


 
Figure 5.8: Generalised XOR.
[figure omitted: patrecfig5.8.ps]

An example of such a data set is sketched in Fig. 5.8, where the two categories are represented by black circles and white circles. Generalisation to higher dimensions, and to more than two categories, is left as an easy exercise for the diligent reader.

Now the smallest number of oriented lines which can correctly classify the data set looks, in general, to be on the large side. A solution for the particular distribution of points of Fig. 5.8 is offered in the next sketch, Fig. 5.9.


 
Figure 5.9: ....and a solution...
[figure omitted: patrecfig5.9.ps]

The numbers refer to the votes for and against. An additional unit or a suitably chosen threshold of 1 would make the discrimination.

Now it is easy to see that putting more data points in the four quadrants, according to the arrangement whereby points in the first and third quadrants are of type -1 and points in the second and fourth are of type +1, will soon get this model into trouble. The difficulty is quite plain: the units define lines across the whole plane, while the rule for alternate quadrants has different regimes in different half planes, so a line that continues briskly across the whole space is inevitably going to screw things up. Moreover, this is a generic problem, not confined to the generalised XOR data set: if a data set consists of a collection of convex regions in $\mathbb{R}^n$, with each convex set containing points of a single category, and if the convex sets are chosen so as to make their number minimal, then as soon as there are more than four of them they will look like the generalised XOR set or worse.
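The failure is easy to exhibit numerically. The sketch below builds a small committee of threshold units (hand-chosen, illustrative weights) which is exact on the four XOR points $(\pm 1, \pm 1)$, and then tests it on a wider spread of generalised XOR points; because its lines run across the whole plane, it predicts +1 only in the diagonal band $-1 < x + y < 1$, and so misclassifies a substantial fraction of the larger sample.

```python
import numpy as np

def committee(points, units):
    """Majority vote of linear threshold units."""
    votes = sum(np.sign(points @ w + b) for w, b in units)
    return np.sign(votes)

# Hand-chosen (illustrative) committee, exact on the four XOR points
# (+-1, +-1): it predicts +1 exactly in the band -1 < x + y < 1.
units = [(np.array([-1, -1]), 1),
         (np.array([ 1,  1]), 1),
         (np.array([ 0,  0]), -1)]

# A wider spread of generalised XOR points, labelled by quadrant.
rng = np.random.default_rng(1)
pts = rng.uniform(-3, 3, size=(400, 2))
pts = pts[np.all(np.abs(pts) > 0.1, axis=1)]
labels = np.where(pts[:, 0] * pts[:, 1] > 0, -1, 1)

error_rate = np.mean(committee(pts, units) != labels)
print(f"error rate on generalised XOR: {error_rate:.0%}")
```

The committee that was perfect near the origin now fails on the quadrant points far from the band, which is precisely the point of the discussion above.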

The situation is evidently most unsatisfactory; it is easy to arrive at the situation where one is trying to fit pairs of lines so as to sneak through the regions largely occupied by black circles, without actually enclosing any of them between the lines, with the intention of enclosing a single white circle. And the problem arises from the nature of the committee. Proposition 5.1 is therefore of limited practical value; the existence of generalised XOR, and the fact that we can expect structurally similar data sets whenever the complexity of the pattern is high enough, tells us that three layer committees can be expected to perform badly, since the number of units will be of the order of the number of data points. As we shall see later, the three layer MultiLayer Perceptron with a sigmoidal squashing function (and any algorithm for finding a solution) is subject to the same carping comments. The predictive power of such a system will be close to zilch. The thoughts on compression and information extraction lead us to doubt whether the model implemented by such a net is worth having, a point to which I shall return later.


Mike Alder
9/19/1997