Proposition 5.1
Any finite binary data set in $\mathbb{R}^n$ can be classified by some committee net.
Proof We induct on the number of data points, N. It is clear that when N = 1 there is no problem; indeed, a single unit suffices. Suppose then that all data sets having N data points in $\mathbb{R}^n$ can be classified by some committee, and consider a data set having N+1 points. One of these, call it p, has to be on the outside, that is, it must be separable from the remaining data set by a single hyperplane.
There is a committee net which can correctly classify the remaining N data points, by the induction hypothesis. If the same net is correct in its classification of p there is nothing to prove, so we suppose that the committee assigns the wrong category to p. Since p is separable from the rest of the data set, we may surround p by hyperplanes which do not intersect the convex hull of the remainder of the data set and which change the net's opinion of p, by putting them between p and the rest of the data; and we may put an equal number on the other side of p to ensure that outside this band containing p no change is made (the two members of each such pair are oriented oppositely, so their votes cancel outside the band and reinforce inside it). The result is a committee which is correct about all the N+1 data points. The Principle of Mathematical Induction completes the proof. ∎
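The band construction in the proof is exactly what solves XOR. As a concrete illustration, here is a minimal sketch in Python (with NumPy; the convention that a committee outputs the sign of the sum of its units' votes, and the particular hand-picked weights, are my assumptions for the example) of a three-unit committee classifying the XOR data set:

```python
import numpy as np

def committee_classify(units, x):
    """Majority vote: each unit (w, b) votes sign(w . x + b);
    the committee returns the sign of the summed votes."""
    votes = [np.sign(np.dot(w, x) + b) for w, b in units]
    return np.sign(sum(votes))

# XOR data: (0,0) and (1,1) carry category -1, (0,1) and (1,0) carry +1.
data = {(0, 0): -1, (1, 1): -1, (0, 1): +1, (1, 0): +1}

# Three hand-picked units (an odd number, so the vote cannot tie):
# two oppositely oriented lines x + y = 0.5 and x + y = 1.5 bounding
# the band that contains the +1 points, plus a constant dissenter.
units = [
    (np.array([1.0, 1.0]), -0.5),   # votes +1 above x + y = 0.5
    (np.array([-1.0, -1.0]), 1.5),  # votes +1 below x + y = 1.5
    (np.array([0.0, 0.0]), -1.0),   # always votes -1
]

for point, label in data.items():
    assert committee_classify(units, np.array(point, dtype=float)) == label
print("committee classifies XOR correctly")
```

The two oppositely oriented units agree only inside the band containing the +1 points; the constant dissenter stands in, in miniature, for the proof's mirrored hyperplanes, guaranteeing the vote comes out -1 everywhere outside the band.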
Note that the size of the committee may have to be increased faster than the size of the data set. This should, after some exposure to the ideas of Rissanen, make one sceptical about the extent to which such modelling actually accomplishes anything. I shall quantify this later.
The point is brought home if one considers a more `realistic' data set than XOR. Suppose one defines a generalised XOR data set in $\mathbb{R}^2$ by assuming a large number of points in the plane: all the points in the first and third quadrants have category -1 and all the points in the second and fourth quadrants have category +1.
An example of such a data set is sketched in Fig. 5.8, where the different categories are represented by black circles and white circles. Generalisation to higher dimensions is left as an easy exercise for the diligent student; generalisation to more than two categories is also left as an exercise for the diligent reader.
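For concreteness, here is one way such a data set might be generated (the uniform sampling over a square and the margin keeping points clear of the axes are my assumptions; the labelling rule is the one just stated):

```python
import numpy as np

rng = np.random.default_rng(0)

def generalised_xor(n_points, margin=0.05):
    """Uniform points in the square [-1, 1]^2, kept clear of the axes;
    category -1 in quadrants one and three, +1 in quadrants two and four."""
    pts = rng.uniform(-1.0, 1.0, size=(n_points, 2))
    pts = pts[np.all(np.abs(pts) > margin, axis=1)]  # stay off the axes
    labels = -np.sign(pts[:, 0] * pts[:, 1])  # -1 where the signs agree
    return pts, labels

pts, labels = generalised_xor(200)
```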
Now the smallest number of oriented lines which can correctly classify the data set looks to be, in general, on the large side. A solution to the particular distribution of points of Fig. 5.8 is offered in the next sketch, Fig. 5.9. The numbers refer to the votes for and against; an additional unit, or a suitably chosen threshold of 1, would make the discrimination.
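Reading the committee's output as a tally of votes for and against, as in Fig. 5.9, rather than as a bare sign, is easily done in code. A small sketch (reusing the hand-picked band committee from the XOR example above; the helper's name and the choice of discriminating on `for > against` are mine):

```python
import numpy as np

def votes_for_and_against(units, x):
    """Tally the committee's raw votes at x as a (for, against) pair."""
    votes = [np.sign(w @ x + b) for w, b in units]
    n_for = sum(1 for v in votes if v > 0)
    return n_for, len(votes) - n_for

# The three-unit band committee from the XOR sketch above.
units = [
    (np.array([1.0, 1.0]), -0.5),
    (np.array([-1.0, -1.0]), 1.5),
    (np.array([0.0, 0.0]), -1.0),
]

for point in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    n_for, n_against = votes_for_and_against(units, np.array(point, dtype=float))
    print(point, (n_for, n_against), "+1" if n_for > n_against else "-1")
```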
Now it is easy to see that putting more data points in the four quadrants, according to the arrangement whereby points in the first and third quadrants are of type -1 and points in the second and fourth are of type +1, will soon get this model into trouble. The difficulty is quite plain: the units define lines across the whole plane, while the rule for alternate quadrants has different regimes in different half-planes, so a line that continues briskly across the whole space is inevitably going to screw things up. Moreover, this is a generic problem not confined to the generalised XOR data set: if a data set consists of a collection of convex regions in $\mathbb{R}^n$, with each convex set containing points of the same category, and if the convex sets are chosen so as to make their number minimal, then if there are more than four of them, they will look like the generalised XOR set or worse.
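One can see the half-plane difficulty in its purest form: if every unit in the committee is a copy of one of the two axis lines (in either orientation), or a constant, then the committee's verdict at a point depends only on sign(x) and sign(y), and no weighting of such votes realises the alternate-quadrant rule; summing the four resulting constraints forces a contradiction. A brute-force check (the search range is mine and arbitrary, but the contradiction rules out every range):

```python
import itertools

# Votes: a copies of the unit for the line x = 0, b copies of the unit
# for y = 0, c constant votes (negative counts mean the opposite
# orientation).  At a point off the axes the committee's output is
# sign(a*u + b*v + c) with u = sign(x), v = sign(y); the alternate-
# quadrant rule demands this equal -u*v at all four sign patterns.
corners = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

solutions = [
    (a, b, c)
    for a, b, c in itertools.product(range(-5, 6), repeat=3)
    if all((a * u + b * v + c) * (-u * v) > 0 for u, v in corners)
]
print(solutions)  # [] -- no weighting of axis-parallel votes will do
```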
The situation is evidently most unsatisfactory; it is easy to arrive at the situation where one is trying to fit pairs of lines in $\mathbb{R}^2$ so as to sneak through the regions largely occupied by the black circles, without actually enclosing any between the lines, with the intention of actually enclosing a single white circle. And the problem arises from the nature of the committee. Proposition 5.1 is therefore of limited practical value; the existence of generalised XOR, and the fact that we can expect data sets which are structurally similar whenever the complexity of the pattern is high enough, tells us that three-layer committees can be expected to perform badly, as the number of units will be of the order of the number of data points. As we shall see later, the three-layer MultiLayer Perceptron with a sigmoidal squashing function (and any algorithm for finding a solution) is subject to the same carping comments. The predictive power of such a system will be close to zilch. The thoughts on compression and information extraction lead us to doubt if the model implemented by such a net is worth having, a point to which I shall return later.