
Probabilistic Neural Nets

In 1990 Donald Specht published, in the journal Neural Networks (mentioned in the bibliography), a paper called Probabilistic Neural Networks. This followed earlier work of his, dating from the late sixties, which was not widely known to the Neural Net community. The earlier work talked of polynomial discriminant functions and didn't mention neural nets, which indeed were barely off the ground in those days.

One of the problems associated with MLPs, complained about from time to time by the more thoughtful users, is that whenever there is much structure in the data, the wildly hopping hyperplanes do not do a good job of collaborating to find it. A well-known test case is the so-called double spiral data set, shown in Fig. 5.19.


  
Figure 5.19: The Double Spiral Data Set.
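For concreteness, here is a minimal sketch, in Python, of how a two-spiral data set of this general kind can be generated. The function name, parameter values and noise level are my own choices for illustration, not taken from the figure.

import numpy as np

def double_spiral(n_per_class=200, turns=2.0, noise=0.05, seed=0):
    """Two interleaved spirals in the plane, one per class."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.25, 1.0, n_per_class)       # radius grows along the spiral
    theta = 2.0 * np.pi * turns * t               # angle grows with the radius
    spiral = np.stack([t * np.cos(theta), t * np.sin(theta)], axis=1)
    X = np.concatenate([spiral, -spiral])         # second class: the same curve rotated 180 degrees
    X += noise * rng.standard_normal(X.shape)     # jitter the points a little
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y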

The prospect of training a four layer neural net to learn this arrangement of data is depressing to anyone who has ever tried, and also to the even smaller group who have thought about what it entails.

The difficulty of course lies in the restrictions of thought which Neural Netters have imposed upon themselves. Some of these restrictions arise from a training in mathematics which has emphasised the formal at the expense of the intuitive; some from an inability to abstract the essentials of a problem. What a Multi-layer Perceptron does is to chop up the space into chunks, each chunk containing points from the data set of the same class. There is no compelling reason why the chunks should have boundaries which are parts of hyperplanes, unless there is compelling reason to think that real neurons do in fact implement such sets, and there isn't. So we can choose any set which is convenient.

Specht elects, in effect, to put spheres around the data points, and to label the spheres with the class of the data. He then takes the radius of the spheres as a parameter to be played with, and pumps it up or down until it interferes with spheres of a different class. He has the same radius for every sphere; he can interpret the sphere as one standard deviation of a spherical gaussian distribution and claim that he has a sort of gaussian mixture model for the data, constrained so as to have a covariance matrix which is some constant times the identity matrix. This allows him not only to classify the data, but also to have a probabilistic assessment of the category of a new point.

All the algorithm actually does is to take each data point as it comes in and centre a sphere on it; then there is some messing about to decide what the radii of the spheres ought to be. By interpreting the sphere as one standard deviation of a gaussian distribution, it is possible to compute likelihoods for subsequent data. One usually avoids thinking of the geometry by describing the process by means of a covariance matrix which is some constant times the identity matrix, but only a small amount of thought shows that this defines a sphere, with the constant specifying the radius. It is easy to visualise the algorithm applied to the double spiral problem, and this allows one to see its merits and drawbacks rather clearly.
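To make the sphere picture concrete, here is a minimal sketch in Python of the classification rule just described: a spherical gaussian of common width sigma is centred on every training point, the bumps are summed class by class, and a new point is assigned to the class with the largest total. The code and the particular value of sigma are mine, offered as an illustration rather than as Specht's implementation.

import numpy as np

def pnn_scores(X_train, y_train, x, sigma=0.1):
    """Averaged gaussian bump heights at x, one score per class."""
    d2 = np.sum((X_train - x) ** 2, axis=1)        # squared distances from x to every training point
    bumps = np.exp(-d2 / (2.0 * sigma ** 2))       # one spherical gaussian bump per training point
    classes = np.unique(y_train)
    scores = np.array([bumps[y_train == c].mean() for c in classes])
    return classes, scores

def pnn_classify(X_train, y_train, x, sigma=0.1):
    classes, scores = pnn_scores(X_train, y_train, x, sigma)
    return classes[np.argmax(scores)]

Run on the double spiral data, a small sigma memorises the spirals faithfully but generalises poorly between the turns, while pumping sigma up smooths the estimate until bumps of the two classes begin to interfere; this is exactly the trade-off over the radius described above.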

Parzen had earlier (1962) suggested essentially the same thing with a variety of different functions replacing the spherical gaussian.

The function fitting aspects of this procedure are worth developing. If, instead of belonging to one of a finite number of categories, each data point is assigned a real number, we can take a gaussian function centred on the point and multiply it by a number just sufficient to make the value of the scaled gaussian equal to the value assigned to the point. If we do this for every point in the data set and also pump up the standard deviation of the gaussians, we may get a reasonable approximation to a smooth curve or function through the data. Fig. 5.20 shows this done for the case where the data is in $\mathbb{R}$; the original values are given by spikes of the right height. We are, of course, under no compulsion to use gaussians, and function fitting by a generalised version of this is well known under the title of radial basis functions.


  
Figure 5.20: Radial Basis Functions, a.k.a. PNNs, fitting a function.
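The one-dimensional recipe just described can be sketched in a few lines of Python. Each data point contributes an unnormalised gaussian bump scaled so that its peak equals the value assigned to that point, and the bumps are summed; the names and the default sigma are mine. A genuine radial basis function fit would solve a linear system for the weights rather than using the raw values, but the simple version below follows the text.

import numpy as np

def bump_fit(x_data, y_data, x_query, sigma=0.5):
    """Sum of gaussian bumps, each peaking at its data point's assigned value."""
    diffs = x_query[:, None] - x_data[None, :]        # (query points) x (data points)
    bumps = np.exp(-diffs ** 2 / (2.0 * sigma ** 2))  # unnormalised gaussian bumps
    return bumps @ y_data                             # weight each bump by its value and sum

With sigma very small the curve collapses to the spikes of Fig. 5.20; pumping sigma up smooths it, at the price of overshooting wherever neighbouring bumps overlap appreciably.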

Putting a gaussian at each point of the data is rather wasteful, and this leads to trying to adapt the centres of the spheres so that they sit somewhere in the middle of clusters of data, and to having rather fewer of them. This is called using Adaptive Probabilistic Neural Nets, which, in tune with modern methods of causing confusion, are abbreviated to APNN. I shall say something about how to accomplish this later.
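As a purely illustrative sketch, and not necessarily the adaptation scheme developed later in the book, one simple way to get fewer centres is to cluster each class separately (by k-means, say) and to put the spheres at the resulting centroids; the code and names below are mine.

import numpy as np

def kmeans_centres(X, k, iters=50, seed=0):
    """Plain k-means: return k cluster centres for the points in X."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                 # nearest centre for each point
        for j in range(k):
            if np.any(labels == j):                # move each centre to the mean of its points
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def apnn_centres(X_train, y_train, k_per_class=10):
    """Reduced set of gaussian centres: k per class, with their class labels."""
    centres, labels = [], []
    for c in np.unique(y_train):
        cc = kmeans_centres(X_train[y_train == c], k_per_class)
        centres.append(cc)
        labels.append(np.full(len(cc), c))
    return np.concatenate(centres), np.concatenate(labels)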

It may be asked what Adaptive Probabilistic Neural Nets have to offer over gaussian mixture modelling, and the answer appears to be largely as given in the preamble to this chapter: there are packages one can use, or algorithms one can copy, without the least understanding of what is actually going on. Function fitting or modelling probability density functions sounds, perhaps, less glamorous than supervised and unsupervised learning by Neural Nets. This is silly and indicates the naivety of much work in the area. Perhaps many writers in the area are obscure because obscurity conceals the poverty of the ideas. Where there are implementation issues, such as making PNNs out of hardware, or the (remote) possibility that they model real wetware neurons, these strictures of a professional grouch may be lifted somewhat; but one cannot help wishing that Mathematicians had done a better job of expounding the ideas of their subject when teaching Engineers.

Probabilistic Neural Nets are not attractive from the point of view of their Rissanen complexity, since you get to send the whole data set plus at least one number specifying the common radius of the gaussians. Adaptive PNNs are more attractive, and there are justifications of them which I shall consider later, in the section on Quadratic Neural Nets, which contain APNNs as a special case. The network diagrams for PNNs do not convey much information except that it is possible to implement such things via parallel algorithms, which may or may not be an advantage. If the data looks like the double spiral data, one might wish that some acknowledgement of the relationships between the bits were possible. But we shall return to such issues when we come to Syntactic Pattern Recognition.

