
Neural Net Methods (Old Style)

Artificial Neural Nets have become very popular with engineers and computer scientists in recent times. Now that there are packages around which you can use without the faintest idea of what they are doing or how they are doing it, it is possible to be seduced by the name `neural nets' into thinking that they must work in something like the way brains do. People who actually know the first thing about real brains and then find out about the theory of the classical neural nets are a little incredulous that anyone should play with them. It is true that the connection with real neurons is tenuous in the extreme, and more attention should be given to the term `artificial', but there are some connections with models of how brains work, and we shall return to this in a later chapter. Recall that in this chapter we are doing this once over briefly, so as to focus on the underlying ideas, and that at present we are concerned with working out how to think about the subject.

I shall discuss other forms of neural net later; here I focus on a particular type of net, the Multilayer Perceptron or MLP, in its simplest avatar.

We start with the single unit perceptron, otherwise a three layer neural net with one unit in the hidden layer. In order to keep the dimensions nice and low for the purposes of visualising what is going on, I shall recycle Fig.1.2. and use x and y for the height and weight values of a human being. I shall also assume that, initially, I have only two people in my data set, Fred who has a height of 200 cm and weighs in at 100 kg, and Gladys who has a height of 150 cm and a weight of 60 kg. We can picture them graphically as in Fig.1.5., or algebraically as

\begin{displaymath}
Fred = \left(\begin{array}{c} 200 \\ 100 \end{array}\right)
\end{displaymath}

\begin{displaymath}
Gladys = \left(\begin{array}{c} 150 \\ 60 \end{array}\right)
\end{displaymath}


 
Figure 1.5: Gladys and Fred, abstracted to points in $\mathbb{R}^2$
[figure: patrecfig5.ps]

The neural net we shall use to classify Fred and Gladys has a diagram as shown in Fig.1.6. The input to the net consists of two numbers, the height and weight, which we call x and y. There is a notional `fixed' input which is always 1, and which exists to represent a so called `threshold'. The square boxes represent the input to the net and are known in some of the Artificial Neural Net (ANN) literature as the first layer. The second layer in this example contains only one unit (believed in some quarters to represent a neuron) and is represented by a circle. The lines joining the first layer to the second layer have numbers attached. These are the weights, popularly supposed to represent the strength of synaptic connections to the neuron in the second layer from the input or sensory layer.


 
Figure 1.6: A very simple neural net in two dimensions
[figure: patrecfig6.ps]

The node simply sums up the weighted inputs, and if the weights are a, b and c, as indicated, then the output is ax+by+c when the input vector is $\left(\begin{array}{c} x \\ y \end{array}\right)$. The next thing that happens is that this is passed through a thresholding operation. This is indicated by the sigmoid shape. There are various forms of thresholder; the so called hard limiter just takes the sign of the output: if ax+by+c is positive the unit outputs 1, and if it is negative or zero it outputs -1. Some people prefer 0 to -1, but this makes no essential difference to the operation of the net. As described, the function applied to ax + by + c is called the sgn function, not to be confused with the sine function, although they sound the same.
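To make the hard limiter concrete, here is a minimal sketch in Python; the function names are mine, invented for the occasion, and the later sketches in this section build on it.

\begin{verbatim}
# The single unit of Fig.1.6. with a hard limiter: form the weighted sum
# of the inputs plus the threshold term, then take the sign.
def sgn(t):
    # +1 if the weighted input is positive, -1 if it is negative or zero
    return 1 if t > 0 else -1

def unit(x, y, a, b, c):
    return sgn(a * x + b * y + c)
\end{verbatim}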

The network is, in some respects, easier to handle if the sigmoid function is smooth. A smooth approximation to the sgn function is easy to construct. The function tanh is sometimes favoured, defined by

\begin{displaymath}
tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \end{displaymath}

If you don't like outputs which are in the range from -1 to 1 and want outputs which are in the range from 0 to 1, all you have to do is to add 1 and divide by 2. In the case of tanh this gives the sigmoid

\begin{displaymath}
sig(x) = \frac{e^x}{e^x + e^{-x}} \end{displaymath}
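If you want to check this, add 1 to tanh(x), put everything over the common denominator, and halve:

\begin{displaymath}
\frac{tanh(x)+1}{2} = \frac{1}{2}\left(\frac{e^x - e^{-x}}{e^x + e^{-x}} + \frac{e^x + e^{-x}}{e^x + e^{-x}}\right) = \frac{e^x}{e^x + e^{-x}}
\end{displaymath}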

These sigmoids are sometimes called `squashing functions' in the neural net literature, presumably because they squash the output into a bounded range. In other books they are called activation functions.

We have, then, that the net of Fig.1.6. is a map from $\mathbb{R}^2$ to $\mathbb{R}$ given by

\begin{displaymath}
\left(\begin{array}{c} x \\ y \end{array}\right) \leadsto sig(ax+by+c)
\end{displaymath}

In the case where sig is just sgn, this map sends half the plane to the number 1 and the other half to the number -1. The changeover occurs along the line ax + by + c = 0. It is common to think of the unit representing a neuron which `fires' if the weighted input ax + by exceeds the threshold -c. This is picturesque, harmless and possibly inspirational. But it is more useful to say that the net divides up the plane $\mathbb{R}^2$ by a line, and that all the points on one side of the line get assigned to +1 while all the points on the other side get sent to -1.

So we can draw a line in the plane for any choice of a, b and c, and attach a sign to each side of the line, + to one side and - to the other. This is done in Fig.1.7, for two choices of the weights a, b and c.

With the choice a = 1, b = -1 and c = 100, we get the line labelled A, while with the choice a = 1, b = 1, and c = -250 we get the line labelled B. This last has the satisfactory property that it allows us to distinguish between Fred and Gladys, since Fred is taken to +1 and Gladys to -1 by the net. If you like, the neuron fires when Fred is input, but not when Gladys is input. We have a neuron which is capable of discrimination on the basis of sex, just like you and me.
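Continuing the little Python sketch from earlier, it takes a moment to check what the two lines of Fig.1.7 do to Fred and Gladys, using the weights just given:

\begin{verbatim}
# Fred = (200, 100), Gladys = (150, 60), as in Fig.1.5.
fred, gladys = (200, 100), (150, 60)

# Line A: a=1, b=-1, c=100.  Line B: a=1, b=1, c=-250.
for label, (a, b, c) in [("A", (1, -1, 100)), ("B", (1, 1, -250))]:
    print(label, unit(*fred, a, b, c), unit(*gladys, a, b, c))
# Line A sends both Fred and Gladys to +1; line B sends Fred to +1
# and Gladys to -1, so only B tells them apart.
\end{verbatim}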


 
Figure 1.7: Two possible states of the same neural net
[figure: patrecfig7.ps]

It should now be clear that a neural net with the simple structure of Fig.1.6. cuts the plane into two halves by a dividing line. If the data set is as simple as { Fred, Gladys }, then the problem is to find the right place to put the dividing line; that is, we have to find a choice of a, b and c that puts all the points of one category on the positive side of the line and all the points of the other category on the other side. Our thinking about this simple case therefore leads us to three issues:

1. Can we find an algorithm for working out where to put the line in the plane?
2. Can we do the same sort of thing in higher dimensions?
3. What if the data set is like the male/female data of Fig.1.2? Can we put in more than one line so as to separate out the sets? It is obvious that a single line won't work.

The answers are gratifyingly positive (or I should not have asked the questions). The algorithm for a single unit, as in the case of Fig.1.6., is the well known Perceptron Convergence Algorithm; it goes back to Rosenblatt and Widrow, among others, and will be described in later chapters. The dimension is largely irrelevant: if we had inputs x, y and z instead of x and y, we would have an extra weight and be looking at a separating surface with equation ax + by + cz + d = 0, which is the equation of a plane. In general, if ${\bf x}$ is a point in $\mathbb{R}^n$, let $\hat{{\bf x}}$ denote the point in $\mathbb{R}^{n+1}$ obtained by writing out the components of ${\bf x}$ and then putting a 1 in the last place. Then an element of the weight space for a single unit net with n inputs, i.e. the list of n+1 weights attached to the arcs going from the inputs (and the threshold 1) to the unit, can be regarded as a vector ${\bf w}$ in $\mathbb{R}^{n+1}$, and the space $\mathbb{R}^n$ is divided into two halves by the hyperplane ${\bf w} \cdot \hat{{\bf x}} = 0$, where $\cdot$ denotes the standard inner or `dot' product. This is standard linear algebra, and should not trouble the well informed reader. If it boggles your mind, then the remedy is to proceed to the Linear Algebra notes from the Master's Preliminary course. You should rejoin the rest of us at this point after mastering the subject to the level of being able to understand Nilsson's book, mentioned in the bibliography at the end of this chapter.
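In n dimensions the computation of the unit reads as follows; this continues the earlier Python sketch (it reuses sgn), and the helper names are again my own:

\begin{verbatim}
# Augment x in R^n to x-hat in R^(n+1) by writing out its components
# and putting a 1 in the last place, then take the sign of the inner
# product with the weight vector w in R^(n+1).
def augment(x):
    return list(x) + [1.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(w, x):
    return sgn(dot(w, augment(x)))

# Line B of Fig.1.7 is the weight vector w = (1, 1, -250); for example
# classify((1, 1, -250), (200, 100)) gives +1, i.e. Fred's side.
\end{verbatim}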

The Perceptron Convergence Algorithm works for any dimension, as we shall see later. It takes some randomly chosen initial hyperplane and operates on it by selecting a data point, usually also at random, and kicking the hyperplane around; this is repeated for new, randomly selected points until the hyperplane moves into the right position. This is called training the net. It isn't immediately obvious that there is such a rule for kicking hyperplanes around, but there is, and it takes only some elementary linear algebra to find it. I shall explain all in a later chapter, but for now it suffices to get the general idea.
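For the impatient, here is a sketch of the kicking rule in the same Python vein, anticipating the later chapter; confining the kicks to points currently on the wrong side, and the cap on the number of updates, are my own choices rather than part of the algorithm proper.

\begin{verbatim}
import random

# Perceptron-style training of a single unit: while some point sits on
# the wrong side of the hyperplane w . x-hat = 0, pick an offending
# point at random and kick w towards (label +1) or away from (label -1)
# the augmented point.  Uses sgn, dot and augment from the sketch above.
def train_perceptron(points, labels, max_updates=100000):
    w = [0.0] * (len(points[0]) + 1)
    for _ in range(max_updates):
        wrong = [i for i in range(len(points))
                 if sgn(dot(w, augment(points[i]))) != labels[i]]
        if not wrong:              # every point is on its correct side
            return w
        i = random.choice(wrong)
        w = [wj + labels[i] * xj for wj, xj in zip(w, augment(points[i]))]
    return w                       # give up after max_updates kicks

# e.g. train_perceptron([(200, 100), (150, 60)], [+1, -1]) finds some
# hyperplane separating Fred from Gladys.
\end{verbatim}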

For more complicated data sets, we may need more than one unit in the second layer. For practical applications, it is also a good idea to have more than one layer; again this will be discussed in a later chapter. Training these more complicated nets so that they put many hyperplanes in reasonable positions is a little harder. This is what the Back Propagation algorithm accomplishes.
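To give a picture of what a bigger net computes (though not yet how it is trained), here is a forward pass in the same sketchy Python, with tanh as the squashing function and reusing dot and augment from above; the particular weights suggested in the comment are arbitrary, not trained ones.

\begin{verbatim}
import math

# One layer of units: each unit takes the inner product of its weight
# list (last entry is the threshold weight) with the augmented input,
# then squashes the result with tanh.
def layer(weight_lists, inputs):
    return [math.tanh(dot(w, augment(inputs))) for w in weight_lists]

# A net with a layer of hidden units followed by a single output unit.
def mlp(hidden_weights, output_weights, x):
    h = layer(hidden_weights, x)           # several hyperplane tests at once
    return layer([output_weights], h)[0]   # the output unit combines them

# e.g. two hidden units (two lines in the plane) and one output unit:
# mlp([[1, 1, -250], [1, -1, 100]], [1.0, -1.0, 0.0], (200, 100))
\end{verbatim}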

The purposes of this section will have been met if the reader understands that what an ANN does is to chop the space up into regions by means of hyperplanes, so that points of the same category are generally in the same regions. The decision as to where to put the dividing hyperplanes is taken by a training algorithm, which usually means selecting data points and using them to operate on a randomly chosen initial placing of the hyperplanes until convergence has occurred or the error measure is small enough.

ANNs can take a long time to train, but they are often quite fast to use as pattern classifiers, because classifying a point requires computing only some number of inner products, a number usually much smaller than the size of the data set. Chopping the space up into regions, each of which is, as we shall see later, convex, can be rationalised by observing that if a point is surrounded by points all of one category, with no points of another category in between, then it surely ought to be of the same category as its surrounding points in any reasonable universe. This is a weaker kind of assumption of reasonableness than the metric assumption, but whether the universe is prepared to go along with it has to be tested on particular data. It is easy to see that the hyperplane (a line in dimension 2) B of Fig.1.7., which does an adequate job of telling Fred from Gladys, is unlikely to keep on doing a good job of telling the guys from the gals as more data comes in. The hyperplane is a kind of theory. It has its opinions about the category of any new point that may be offered. A good theory has to be right when tested on new data, and the theory given by line B does not look promising. Another serious drawback of the ANN described by B is that an object weighing in at 50 kg and having a height of three metres is unequivocally theorised to be a man. Modifying the ANN so that it admits that it has never seen anything like this before, and consequently doesn't have the foggiest idea what class the new point belongs to, is not particularly easy.
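That last complaint is easy to check with the unit function from the earlier sketch: line B has a = 1, b = 1, c = -250, and the strange visitor has a height of 300 cm and a weight of 50 kg.

\begin{verbatim}
# Height 300 cm, weight 50 kg: 300 + 50 - 250 = 100 > 0, so line B
# puts the object firmly on Fred's side and the net declares it a man.
print(unit(300, 50, 1, 1, -250))   # prints 1
\end{verbatim}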


Mike Alder
9/19/1997