The purpose of using neural nets is to try to extract the maximum amount of information about the category of the data, on the hypothesis that there will be some more points along soon and we want to make the most intelligent guess at their category. If the complexity of the data is sufficiently high compared with the number of points, then the simplest neural net is worse than a simple list of the points and their categories. It is true that such a procedure gives no suggestion as to what to make of the category of the next point along; on the other hand, if an attempt to find structure in the data has failed, then we might as well guess. And a data set which has shown no compression using a neural net is one where guessing merely confronts the fact that we don't have any idea of what the category of a new point is, rather than permitting us to deceive ourselves by using a neural net.
Imagine that we have a data set such as the male and female height-weight points of chapter one. It is possible to visualise an immense amount of such data and to believe that two gaussians, one for each category, fit the data pretty well out to three or four standard deviations. Now a four layer neural net which got every point correct would be hard to believe, and a three layer neural net utterly impossible to take seriously. All those lines which carefully snuck through the enemy camp would lead you to expect points of each category in the most peculiar places. How can we articulate our natural scepticism using Rissanen's ideas?
If we add to each datum in R^n its category, regarded as a number which is either +1 or -1, then it takes one extra bit to do a binary classification, over the bits needed to say where the datum is in the space. Similarly, if we have gaussians doing a classification, we need to add in a bit to say which category of gaussian they are supposed to be.
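As a minimal sketch of this accounting (the function name and the numbers are my own, chosen only for illustration), the cost of a binary-labelled data set is whatever the positions cost plus exactly one further bit per point:

    # Hypothetical accounting helper: positions at whatever they cost,
    # plus one category bit per datum for a binary classification.
    def labelled_description_bits(position_bits, N):
        return position_bits + N * 1

    # e.g. 1000 points whose positions cost, say, 10 bits each:
    print(labelled_description_bits(position_bits=1000 * 10, N=1000))   # 11000 bits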
In order to specify a single unit in R^n we need, on the face of things, (n+1)P bits, where P is the precision used to specify each coefficient. It is slightly better than this in practice; multiplying the n+1 numbers by a positive constant will not change the output of the net or the position of the line in any way; we might therefore stipulate that the coefficients be normalised, say by making the sum of the squares of them equal to 1. This ensures that the coefficients or weights always lie between +1 and -1, and that if we have n of them we can calculate the last. We therefore need nP bits to specify a single unit in the second layer.
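A small sketch may make the normalisation point concrete; the particular unit, the scaling factor and the figures n = 3, P = 10 below are my own choices, not anything fixed by the argument. Scaling the n+1 coefficients by a positive constant leaves every decision unchanged, so once they are normalised only n of them need to be stored, at P bits each:

    import numpy as np

    def unit_output(coeffs, x):
        """coeffs = (w_1, ..., w_n, b); the unit fires sign(w.x + b)."""
        w, b = coeffs[:-1], coeffs[-1]
        return 1 if np.dot(w, x) + b >= 0 else -1

    rng = np.random.default_rng(0)
    coeffs = rng.normal(size=4)                  # a unit in R^3: three weights and a bias
    scaled = 7.3 * coeffs                        # multiply by a positive constant
    normed = coeffs / np.linalg.norm(coeffs)     # sum of squares of coefficients = 1

    x = rng.normal(size=3)
    assert unit_output(coeffs, x) == unit_output(scaled, x) == unit_output(normed, x)

    n, P = 3, 10
    print("bits to specify the unit:", n * P)    # nP rather than (n+1)P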
If n = 2 and P = 10, we specify the line with coefficients which are accurate to one part in a thousand. This is often excessive, since the amount of wobble which can be applied to the line while still having it implement the dichotomy may be quite large. To analyse the situation, suppose we have a binary data set in the unit square with points of category +1 at values of x and y greater than 0.6 and points of category -1 with both x and y less than 0.3. Then a specification of the position of the line needs only about three bits of precision on each coefficient, if the coefficients are chosen to be between +1 and -1. To specify the data set takes some number of bits to specify the positions, say NP, and to specify the category takes a further N bits when there are N points. To use a line instead for the specification of the category takes 6 bits, three for each (normalised) coefficient. So provided there are more than 6 data points, we have made a saving by using the line. More generally, we have that the number of points needed in dimension n in order to justify using a hyperplane is nP, where P is the precision required. If the points form two clusters which are well separated, then P goes down, which seems reasonable. There is no attempt to specify the location of the data by the model in this analysis; we only use it to store the category.
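The saving can be checked directly. In the sketch below (my own construction: the uniform sampling, the particular line x + y = 0.9 and the 3-bit quantisation scheme are all assumptions made for illustration), the line still dichotomises the two clusters after its normalised coefficients have been rounded to three bits each, so the categories of N points cost 6 bits rather than N:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 50
    plus  = rng.uniform(0.6, 1.0, size=(N // 2, 2))    # category +1: x and y above 0.6
    minus = rng.uniform(0.0, 0.3, size=(N // 2, 2))    # category -1: x and y below 0.3
    X = np.vstack([plus, minus])
    y = np.array([1] * (N // 2) + [-1] * (N // 2))

    coeffs = np.array([1.0, 1.0, -0.9])                # the line x + y - 0.9 = 0
    coeffs = coeffs / np.linalg.norm(coeffs)           # normalised coefficients in [-1, +1]

    def quantise(c, bits=3):
        """Round each coefficient in [-1, +1] to the nearest of 2**bits levels."""
        levels = 2 ** bits
        return np.round((c + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

    q = quantise(coeffs)
    assert (np.sign(X @ q[:2] + q[2]) == y).all()      # the coarse line still dichotomises

    print("bits to list the categories:", N)           # one bit per point
    print("bits for the quantised line:", 2 * 3)       # two stored coefficients, 3 bits each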
If we use a committee net with k units to store the category for a binary data set in R^n, then we have knP bits required to specify the state of the net, and again N bits for the data. If we have m categories instead of just two, we need N log2(m) bits for the data set. In addition, there is some overhead needed to specify the network topology.
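Putting the counting together gives a crude threshold: the net only begins to compress the category information once N log2(m) exceeds knP. The function below is a sketch of that inequality (the name is mine, and the topology overhead just mentioned is ignored); the second call reproduces the half-million figure quoted in the next paragraph for hundreds of units in dimension 100 with P = 10.

    import math

    def min_points_to_justify(k, n, P=10, m=2):
        """Smallest N for which listing the categories, at N*log2(m) bits,
        costs more than the k*n*P bits needed to write down the net."""
        return math.ceil(k * n * P / math.log2(m))

    print(min_points_to_justify(k=3, n=2))        # a three-unit committee in the plane: 60 points
    print(min_points_to_justify(k=500, n=100))    # hundreds of units in dimension 100: 500000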
A further problem is that we do not know in advance what the precision of the coefficients of the net should be: we can put a bound on this by pointing out that it is senseless to have the precision of the coefficients much greater than that of the data, but the precision of the data is largely irrelevant in the model I propose. It is infeasible to go much further without some delicate arguments which might be thought tendentious, but it is clear that a neural net having k nodes in R^n must classify at least knP binary data points, for P at least of order 10, before it can be said to be extracting any structure from the data set. Papers discussing the classification of data in dimension 100 with nets having hundreds of units must therefore have about half a million data points if the application is not to be an exercise in self-deception. This criterion would exclude some well-known papers in the literature. Since P is often 64 bits, and the specification of the network topology may be of order k(k-1) bits, this is an optimistic analysis of the state of affairs. And there are papers in which thousands of units in dimension several hundred have been cheerfully applied to data sets having only a few hundred points in them. The naive belief that in doing this the protagonists were contributing anything to the advance of human knowledge would appear to be sadly in error. Had they spent the computer time playing Super Mario Brothers they might have had more fun and not wasted the resources any more flagrantly. The figure of 10 for P agrees with rules of thumb used by the neural net community for some time, but there has been little rationale offered until recently.
It is important to understand, then, that when you read a paper describing a neural net application in which there are fewer than knP binary data points classified by a net, the writer has not used his net to extract information from the data. On the contrary, he has put information in. The extra information is not about the data; it is about the deficiencies of his education.
It is very hard to believe that a model of a process which has failed to extract any information from the data given will have any real capacity to predict the results of more data measurements. In the case of neural nets which, like Feed-forward Back-Propagation nets, start with some random initialisation, there is not, in general, any unique solution. Moreover, the less information is extracted from the data in finding a solution, the larger the range of solutions there will be. Now two different solutions will disagree with each other in their predictions, by definition, and the larger the space of different possible solutions, the greater will be the extent to which solutions can disagree with each other about the result of classifying a new datum. We cannot therefore feel any confidence about any one prediction which such a net provides, because we know that there are other solutions, just as compatible with the data, which yield other, contradictory, predictions.
In the XOR case, for example, we have four data points and three units in a committee which can solve the problem.
If we consider the two solutions of the committee net shown in fig. 5.15, we see that they differ about the category of a point in the middle, as well as in other regions. Now which solution you get will depend on the initialisation of the weights and on the random sequence in which the data points are called. So if you ask for the category of a point in the middle at coordinates (1/2,1/2), you will find that the answer supplied by the net will depend on the random initialisation. In other words, you are consulting a rather bizarre random number generator. It would be simpler to toss a coin. Instead of initialising a coin with an unknown starting velocity and momentum, one has initialised a back-propagation network with random values, and the resulting prediction for a new datum depends on just which initialisation was chosen. Coins are quicker, cheaper and no less likely to give the right answer.
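The disagreement is easy to exhibit. The two committees below are hand-built stand-ins of my own for the pair in fig. 5.15: each takes a majority vote over three threshold units (w1, w2, b) firing sign(w1 x + w2 y + b), both classify the four XOR points correctly, and they give opposite answers at (1/2, 1/2). Which of the two a training run converges to is settled by the random initialisation, which is exactly the point.

    import numpy as np

    def committee(units, point):
        """Majority vote of the units' outputs on the given point."""
        votes = [np.sign(np.dot(u[:2], point) + u[2]) for u in units]
        return int(np.sign(sum(votes)))

    solution_a = [( 1,  1, -0.5),    # votes -1 only at (0,0)
                  (-1, -1,  1.5),    # votes -1 only at (1,1)
                  ( 1, -1, -0.5)]    # breaks the ties
    solution_b = [( 1, -1, -0.5),
                  (-1,  1, -0.5),
                  ( 1,  1, -0.5)]

    xor = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}
    for p, cat in xor.items():
        assert committee(solution_a, p) == cat and committee(solution_b, p) == cat

    print(committee(solution_a, (0.5, 0.5)),       # +1
          committee(solution_b, (0.5, 0.5)))       # -1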
Anyone who has any kind of model fitted to data where there are fewer data than parameters to be fitted is deceiving himself if he imagines that his model has any useful predictive power, or indeed any value other than as an indicator of incompetence. All he is doing is the equivalent of simulating a coin toss, with the drawback that it takes much longer and wastes a good deal of computer time.