
Compression: is the model worth the computation?

The purpose of using neural nets is to try to extract the maximum amount of information about the category of the data, on the hypothesis that more points will be along soon and we want to make the most intelligent guess at their category. If the complexity of the data is sufficiently high compared with the number of points, then the simplest neural net is worse than a simple list of the points and their categories. It is true that such a list gives no suggestion as to what to make of the category of the next point along; on the other hand, if an attempt to find structure in the data has failed, then we might as well guess. And a data set which has shown no compression using a neural net is one where guessing merely confronts the fact that we have no idea what the category of a new point is, rather than permitting us to deceive ourselves by using a neural net.

Imagine that we have a data set such as the male and female height-weight points of chapter one. It is possible to visualise an immense amount of such data and to believe that two gaussians, one for each category, fit the data pretty well out to three or four standard deviations. Now a four layer neural net which got every point correct would be hard to believe, and a three layer neural net utterly impossible to take seriously. All those lines which carefully snuck through the enemy camp would lead you to expect points of each category in the most peculiar places. How can we articulate our natural scepticism using Rissanen's ideas?


 
Figure 5.14: An implausibly complicated theory.

If we add to each datum in $\mathbb{R}^n$ its category, regarded as a number, either $+1$ or $-1$, then it takes one extra bit to do a binary classification, over the bits needed to say where the datum is in the space. Similarly, if we have gaussians doing a classification, we need to add in a bit to say which category of gaussian they are supposed to be.

In order to specify a single unit in $\mathbb{R}^n$ we need, on the face of things, $(n+1)P$ bits, where $P$ is the precision used to specify each coefficient. It is slightly better than this in practice; multiplying the $n+1$ numbers by a positive constant will not change the output of the net or the position of the line in any way; we might therefore stipulate that the coefficients be normalised, say by making the sum of their squares equal to 1. This ensures that the coefficients or weights always lie between $-1$ and $+1$, and that if we know $n$ of them we can calculate the last (up to sign). We therefore need $nP$ bits to specify a single unit in the second layer.
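As a concrete illustration of the normalisation argument, here is a minimal sketch in Python with numpy; the names decision, normalise and quantise are mine, not part of the text. Positive rescaling leaves the unit's decisions untouched, so the coefficients can be normalised and then rounded to $P$ bits each.

\begin{verbatim}
import numpy as np

def decision(coeffs, x):
    # Sign of the affine form w . x + b, with coeffs = (w_1, ..., w_n, b).
    return np.sign(np.dot(coeffs[:-1], x) + coeffs[-1])

def normalise(coeffs):
    # A positive rescaling changes nothing, so scale the n+1 coefficients
    # to have their squares summing to 1.
    return coeffs / np.linalg.norm(coeffs)

def quantise(coeffs, P):
    # Round each coefficient, now lying in [-1, 1], to P bits.
    levels = 2 ** (P - 1)
    return np.round(coeffs * levels) / levels

rng = np.random.default_rng(0)
coeffs = rng.normal(size=4)   # one unit in R^3: three weights and a threshold
x = rng.normal(size=3)

# Scaling by any positive constant does not move the separating hyperplane:
print(decision(coeffs, x) == decision(3.7 * coeffs, x))   # True
# After normalisation, P bits per stored coefficient suffice; only n of the
# n+1 normalised coefficients need storing, the last follows up to sign.
print(quantise(normalise(coeffs), P=10))
\end{verbatim}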

If $n = 2$ and $P = 10$, we specify the line with coefficients which are accurate to one part in a thousand. This is often excessive, since the amount of wobble which can be applied to the line while still implementing the dichotomy may be quite large. To analyse the situation, suppose we have a binary data set in the unit square with points of category $+1$ at values of $x$ and $y$ greater than 0.6, and points of category $-1$ with both $x$ and $y$ less than 0.3. Then a specification of the position of the line needs only about three bits of precision on each coefficient, if the coefficients are chosen to lie between $+1$ and $-1$. To specify the data set takes some number of bits to specify the positions, say $NP$, and to specify the categories takes a further $N$ bits when there are $N$ points. To use a line instead for the specification of the category takes 6 bits, three for each (normalised) coefficient. So provided there are more than 6 data points, we have made a saving by using the line. More generally, the number of points needed in dimension $n$ in order to justify using a hyperplane is $nP$, where $P$ is the precision required. If the points form two clusters which are well separated, then $P$ goes down, which seems reasonable. There is no attempt in this analysis to specify the location of the data by the model; we use the model only to store the category.

If we use a committee net with $k$ units to store the category for a binary data set in $\mathbb{R}^n$, then we need $knP$ bits to specify the state of the net, and again $N$ bits for the data. If we have $m$ categories instead of just two, we need $N \log_2(m)$ bits for the data set. In addition, there is some overhead needed to specify the network topology.
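The bookkeeping of the last two paragraphs is easily mechanised. The following sketch, again in Python, compares the cost of simply listing the categories with the cost of stating a committee net that encodes them; the function names are mine, invented for the illustration.

\begin{verbatim}
from math import ceil, log2

def bits_raw(N, m=2):
    # Bits needed just to list the category of each of N points, m categories.
    return ceil(N * log2(m))

def bits_via_committee(k, n, P, overhead=0):
    # Bits needed to state a committee of k units in R^n at precision P,
    # plus whatever overhead the network topology costs.
    return k * n * P + overhead

# The toy example above: one line (k = 1) in the plane (n = 2), three bits
# per normalised coefficient.
print(bits_via_committee(k=1, n=2, P=3))   # 6 bits for the line
print(bits_raw(N=7))                       # 7 bits to list 7 binary categories
# With more than six points the line is the shorter description, so some
# compression, and hence some structure, has been found.
\end{verbatim}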

A further problem is that we do not know in advance what the precision of the coefficients of the net should be: we can put a bound on this by pointing out that it is senseless to have the precision of the coefficients much greater than that of the data, but the precision of the data is largely irrelevant in the model I propose. It is infeasible to go much further without some delicate arguments which might be thought tendentious, but it is clear that a neural net having $k$ nodes in $\mathbb{R}^n$ must classify at least $knP$ binary data points, for $P$ at least of order 10, before it can be said to be extracting any structure from the data set. Papers discussing the classification of data in dimension 100 with nets having hundreds of units must therefore have about half a million data points if the application is not to be an exercise in self-deception. This criterion would exclude some well known papers in the literature. Since $P$ is often 64 bits, and the specification of the network topology may be of order $k(k-1)$ bits, this is an optimistic analysis of the state of affairs. And there are papers in which thousands of units in dimension several hundred have been cheerfully applied to data sets having only a few hundred points in them. The naive belief that in doing this the protagonists were contributing anything to the advance of human knowledge would appear to be sadly in error. Had they spent the computer time playing Super Mario Brothers they might have had more fun and not wasted the resources any more flagrantly. The figure of 10 for $P$ agrees with rules of thumb used by the neural net community for some time, but little rationale for it has been offered until recently.
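To check the half-million figure, take "hundreds of units" to mean about 500; that number is my assumption, the text does not fix it. A two-line Python calculation suffices.

\begin{verbatim}
# Minimum number of binary points a net must classify before it can claim
# to have extracted structure, on the knP criterion above.
def minimum_points(k, n, P=10):
    return k * n * P

# Dimension 100, roughly 500 units, P of order 10:
print(minimum_points(k=500, n=100))   # 500000, about half a million points
# A data set of a few hundred points falls short by three orders of magnitude.
\end{verbatim}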

It is important to understand, then, that when you read a paper describing a neural net application in which fewer than $knP$ binary data points are classified by the net, the writer has not used his net to extract information from the data. On the contrary, he has put information in. The extra information is not about the data; it is about the deficiencies of his education.

It is very hard to believe that a model of a process which has failed to extract any information from the data given will have any real capacity to predict the results of further measurements. For neural nets which, like Feed-forward Back-Propagation nets, start with some random initialisation, there is not, in general, any unique solution. Moreover, the less information is extracted from the data in finding a solution, the larger the range of solutions there will be. Now two different solutions will disagree with each other in their predictions, by definition, and the larger the space of possible solutions, the greater the extent to which solutions can disagree with each other about the result of classifying a new datum. We cannot therefore feel any confidence in any one prediction which such a net provides, because we know that there are other solutions, just as compatible with the data, which yield other, contradictory, predictions.

In the XOR case, for example, we have four data points and a committee of three units which can solve the problem.


  
Figure 5.15: Two committee solutions for XOR.

If we consider the two solutions of the committee net shown in fig. 5.15, we see that they differ about the state of a point in the middle, as well as in other regions. Now which solution you get will depend on the initialisation of the weights and on the random order in which the data points are presented. So if you ask for the category of a point in the middle, at coordinates $(1/2, 1/2)$, you will find that the answer supplied by the net depends on the random initialisation. In other words, you are consulting a rather bizarre random number generator. It would be simpler to toss a coin. Instead of initialising a coin with an unknown starting velocity and momentum, one has initialised a back-propagation network with random values, and the resulting prediction for a new datum depends on just which initialisation was chosen. Coins are quicker, cheaper and no less likely to give the right answer.
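The disagreement is easy to exhibit concretely. The sketch below, in Python, builds two committees of three threshold units by hand (they are not necessarily the pair drawn in fig. 5.15, but they make the same point), checks that both classify the four XOR points correctly, and then shows that they contradict each other at $(1/2, 1/2)$.

\begin{verbatim}
import numpy as np

def unit(w, b):
    # A linear threshold unit: sign(w . x + b), output +1 or -1.
    return lambda x: 1 if np.dot(w, x) + b > 0 else -1

def committee(units):
    # Majority vote of an odd number of threshold units.
    return lambda x: 1 if sum(u(x) for u in units) > 0 else -1

# XOR: (0,1) and (1,0) are category +1, (0,0) and (1,1) are category -1.
xor_data = {(0, 0): -1, (0, 1): +1, (1, 0): +1, (1, 1): -1}

# Two committees of three units each; both classify all four points correctly.
A = committee([unit((1, 1), -0.5), unit((-1, -1), 1.5), unit((-1, 1), -0.5)])
B = committee([unit((1, -1), -0.5), unit((-1, 1), -0.5), unit((1, 1), -0.5)])

for net in (A, B):
    assert all(net(x) == c for x, c in xor_data.items())

# ...and yet they contradict each other about the point in the middle.
print(A((0.5, 0.5)), B((0.5, 0.5)))   # +1 -1
\end{verbatim}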

Anyone who has any kind of model fitted to data where there are fewer data than parameters to be fitted is deceiving himself if he imagines that his model has any useful predictive power, or indeed any value other than as an indicator of incompetence. All he is doing is the equivalent of simulating a coin toss, with the drawback that it takes much longer and wastes a good deal of computer time.


Mike Alder
9/19/1997