I shall discuss other forms of neural net later; here I focus on a particular type of net, the Multilayer Perceptron or MLP, in its simplest avatar.
We start with the single unit perceptron, otherwise known as a three layer neural net with one unit in the hidden layer. In order to keep the dimensions nice and low for the purposes of visualising what is going on, I shall recycle Fig.1.2. and use x and y for the height and weight values of a human being. I shall also assume that, initially, I have only two people in my data set, Fred who has a height of 200 cm and weighs in at 100 kg, and Gladys who has a height of 150 cm and a weight of 60 kg. We can picture them graphically as in Fig.1.5., or algebraically as

Fred = (200, 100),    Gladys = (150, 60).

The neural net we shall use to classify Fred and Gladys has a diagram as shown in Fig.1.6. The input to the net consists of two numbers, the height and weight, which we call x and y. There is a notional `fixed' input which is always 1, and which exists to represent a so called `threshold'. The square boxes represent the input to the net and are known in some of the Artificial Neural Net (ANN) literature as the first layer. The second layer in this example contains only one unit (believed in some quarters to represent a neuron) and is represented by a circle. The lines joining the first layer to the second layer have numbers attached. These are the weights, popularly supposed to represent the strength of synaptic connections to the neuron in the second layer from the input or sensory layer.
The node simply sums up the weighted inputs, and if the weights are a, b and c, as indicated, then the output is ax + by + c when the input vector is (x, y). The next thing that happens is that this is passed through a thresholding operation.
This is indicated by the sigmoid shape. There are various forms of thresholder; the so called hard limiter just takes the sign of the output: if ax + by + c is positive, the unit outputs 1, if negative or zero it outputs -1. Some people prefer 0 to -1, but this makes no essential difference to the operation of the net. As described, the function applied to ax + by + c is called the sgn function, not to be confused with the sine function, although they sound the same.
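To make the arithmetic of the unit concrete, here is a minimal sketch in Python; it is my own illustration rather than part of the original text, and the names sgn and hard_limit_unit and the example numbers are mine.

    def sgn(s):
        """Hard limiter: +1 if the weighted sum is positive, -1 if negative or zero."""
        return 1 if s > 0 else -1

    def hard_limit_unit(x, y, a, b, c):
        """The single unit of Fig.1.6: form ax + by + c and pass it through sgn."""
        return sgn(a * x + b * y + c)

    # Arbitrary illustrative weights: a = 1, b = -1, c = 0.5
    print(hard_limit_unit(2.0, 3.0, 1.0, -1.0, 0.5))   # prints -1, since 2 - 3 + 0.5 < 0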
The network is, in some respects, easier to handle if the sigmoid function is smooth. A smooth approximation to the sgn function is easy to construct. The function tanh is sometimes favoured, defined by

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).

If you don't like outputs which are in the range from -1 to 1 and want outputs which are in the range from 0 to 1, all you have to do is to add 1 and divide by 2. In the case of tanh this gives the sigmoid
sig(x) = (1 + tanh(x)) / 2 = 1 / (1 + e^(-2x)).
These sigmoids are sometimes called `squashing functions' in the neural net literature, presumably because they squash the output into a bounded range. In other books they are called activation functions.
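A short Python sketch of these squashing functions may help; it is again my own illustration and the names are mine. It shows tanh and the rescaled version obtained by adding 1 and dividing by 2.

    import math

    def squash_tanh(s):
        """tanh squashing function: output lies strictly between -1 and 1."""
        return math.tanh(s)

    def squash_unit_interval(s):
        """Rescaled tanh: add 1 and divide by 2, so the output lies between 0 and 1."""
        return (math.tanh(s) + 1.0) / 2.0

    for s in (-2.0, 0.0, 2.0):
        print(s, squash_tanh(s), squash_unit_interval(s))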
We have, then, that the net of Fig.1.6. is a map from R^2 to R given by

(x, y) -> sig(ax + by + c).

In the case where sig is just sgn, this map sends half the plane to the number 1 and the other half to the number -1. The changeover occurs along the line ax + by + c = 0. It is common to think of the unit representing a neuron which `fires' if the weighted input ax + by exceeds the threshold -c. This is picturesque, harmless and possibly inspirational. But it is more useful to say that the net divides up the plane by a line, and one side of the line has points which get assigned to +1 and the other side of the line has all its points sent to -1.
So we can draw a line in the plane for any value of a, b and c, and attach a sign, + to one side of the line and - to the other. This is done in Fig.1.7, for two choices of the weights a, b and c.
With the choice a = 1, b = -1 and c = 100, we get the line labelled A, while with the choice a = 1, b = 1, and c = -250 we get the line labelled B. This last has the satisfactory property that it allows us to distinguish between Fred and Gladys, since Fred is taken to +1 and Gladys to -1 by the net. If you like, the neuron fires when Fred is input, but not when Gladys is input. We have a neuron which is capable of discrimination on the basis of sex, just like you and me.
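The arithmetic behind lines A and B can be checked directly; the following small Python sketch (mine, using the hard limiter convention above) reproduces the classification of Fred and Gladys.

    def classify(point, a, b, c):
        """Return +1 or -1 according to the side of the line ax + by + c = 0."""
        x, y = point
        return 1 if a * x + b * y + c > 0 else -1

    fred, gladys = (200, 100), (150, 60)

    # Line A: a = 1, b = -1, c = 100 puts both Fred and Gladys on the + side.
    print(classify(fred, 1, -1, 100), classify(gladys, 1, -1, 100))    # 1  1
    # Line B: a = 1, b = 1, c = -250 sends Fred to +1 and Gladys to -1.
    print(classify(fred, 1, 1, -250), classify(gladys, 1, 1, -250))    # 1 -1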
It should now be clear that a neural net of the simple structure of Fig.1.6. cuts the plane into two halves by a dividing line. If the data set is as simple as { Fred, Gladys } then the problem is to find out the right place to put the dividing line, that is, we have to find a choice of a, b, c that will put the points of one category on the positive side of the line and the points of the other category all on the other side. Our thinking about this simple case therefore leads us to three issues: is there an automatic procedure for finding a suitable choice of a, b and c, that is, for placing the dividing line; does the same idea work in higher dimensions, when the data have more than two components; and can the approach cope with data sets which no single line will separate?
The answers are gratifyingly positive (or I should not have asked the questions).
The algorithm for a single unit as in the case of Fig. 1.6. is the well known Perceptron Convergence Algorithm; it goes back to Rosenblatt and Widrow, among others, and will be described in later chapters. The dimension is largely irrelevant: if we had inputs x, y and z instead of x and y, we would have an extra weight and be looking at a separating surface with equation ax + by + cz + d = 0, which is the equation of a plane.
In general, if x is a point in R^n, let x' denote the point in R^(n+1) obtained by writing out the components of x and then putting a 1 in the last place. Then an element of the weight space for a single unit net with n inputs, i.e. the list of n+1 weights attached to the arcs going from the inputs (and threshold 1) to the unit, can be regarded as a vector in R^(n+1), and if we call it w, then the space R^n is divided into two halves by the hyperplane

w . x' = 0,

where . denotes the standard inner or `dot' product. This is standard linear algebra, and should not trouble the well informed reader. If it boggles your mind, then the remedy is to proceed to the Linear Algebra notes from the Master's Preliminary course. You should rejoin the rest of us at this point after mastering the subject to the level of being able to understand Nilsson's book, mentioned in the bibliography at the end of this chapter.
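For readers who like to see the bookkeeping spelt out, here is a short sketch of the augmentation and the inner product test; the function names augment and side_of_hyperplane are my own choices for the illustration.

    def augment(x):
        """Write out the components of a point x in R^n and put a 1 in the last place."""
        return list(x) + [1.0]

    def side_of_hyperplane(w, x):
        """Sign of the inner product of the weight vector w with the augmented point x'."""
        s = sum(wi * xi for wi, xi in zip(w, augment(x)))
        return 1 if s > 0 else -1

    # In dimension 2 this recovers ax + by + c = 0, with w = [a, b, c]:
    print(side_of_hyperplane([1.0, 1.0, -250.0], (200, 100)))   # Fred:   +1
    print(side_of_hyperplane([1.0, 1.0, -250.0], (150, 60)))    # Gladys: -1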
The Perceptron Convergence Algorithm works for any dimension, as we shall see later. It takes some randomly chosen initial hyperplane and operates on it by selecting a data point, usually also at random, and kicking the hyperplane around; this is repeated for new, randomly selected points until the hyperplane moves into the right position. This is called training the net. It isn't immediately obvious that there is such a rule for kicking hyperplanes around, but there is, and it takes only some elementary linear algebra to find it. I shall explain all in a later chapter, but for now it suffices to get the general idea.
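The details are deferred to a later chapter, but the flavour of the procedure can be conveyed by a sketch of the standard perceptron update rule: whenever a randomly chosen point is misclassified, add or subtract its augmented vector to the weight vector. The Python below is my own illustration under that assumption, not a quotation of the algorithm as given later.

    import random

    def train_perceptron(points, labels, max_epochs=10_000, seed=0):
        """Sketch of perceptron training with labels of +1 or -1.

        Start from a randomly chosen hyperplane (a weight vector acting on
        augmented points) and kick it whenever a point is misclassified,
        stopping once a whole pass over the data makes no mistakes.
        """
        rng = random.Random(seed)
        n = len(points[0])
        w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]   # random initial hyperplane
        for _ in range(max_epochs):
            order = list(range(len(points)))
            rng.shuffle(order)                               # points presented in random order
            mistakes = 0
            for i in order:
                xa = list(points[i]) + [1.0]                 # augmented point x'
                s = sum(wi * xi for wi, xi in zip(w, xa))
                if (1 if s > 0 else -1) != labels[i]:
                    w = [wi + labels[i] * xi for wi, xi in zip(w, xa)]
                    mistakes += 1
            if mistakes == 0:                                # hyperplane is in the right position
                break
        return w

    # Fred is labelled +1 and Gladys -1, as in Fig.1.7.
    w = train_perceptron([(200.0, 100.0), (150.0, 60.0)], [1, -1])
    print(w)

For data that some hyperplane can separate, such as { Fred, Gladys }, the loop is guaranteed to stop after finitely many mistakes; that is the content of the convergence result the algorithm is named for.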
For more complicated data sets, we may need more than one unit in the second layer. For practical applications, it is also a good idea to have more than one layer; again this will be discussed in a later chapter. Training these more complicated nets so that they put many hyperplanes in reasonable positions is a little harder. This is what the Back Propagation algorithm accomplishes.
The purposes of this section will have been met if the reader understands that what an ANN does is to chop the space up into regions by means of hyperplanes, so that points of the same category are generally in the same regions. The decision as to where to put the dividing hyperplanes is taken by means of a training algorithm which usually means selecting data points and operating with them on a randomly chosen initial placing of the hyperplanes until convergence has occurred or the error measure is small enough.
ANNs can take a long time to train, but they are often quite fast to use as pattern classifiers because we have to compute some number of inner products, a number usually much less than the amount of data. Chopping the space up into regions, each of which is, as we shall see later, convex, can be rationalised by observing that if a point is surrounded by points all of one category, with no points of another category in between, then it surely ought to be of the same category as its surrounding points in any reasonable universe. This is a weaker kind of assumption of reasonableness to make than the metric assumption, but whether the universe is prepared to go along with it has to be tested on particular data. It is easy to see that the hyperplane (line in dimension 2) B of Fig.1.7., which does an adequate job of telling Fred from Gladys, is unlikely to keep on doing a good job of telling the guys from the gals as more data comes in. The hyperplane is a kind of theory. It has its opinions about the category of any new point that may be offered. A good theory has to be right when tested on new data, and the theory given by line B does not look promising. Another serious drawback of the ANN described by B is that an object weighing in at 50 kg and having a height of three metres is unequivocally theorised to be a man. Modifying the ANN so that it admits that it has never seen anything like it before, and consequently doesn't have the foggiest idea what class a new point belongs to, is not particularly easy.
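To see the last point numerically (my own check, reusing the conventions of the earlier sketches): the hypothetical three metre, 50 kg object lands on the same side of line B as Fred.

    def classify(point, a, b, c):
        """Side of the line ax + by + c = 0."""
        x, y = point
        return 1 if a * x + b * y + c > 0 else -1

    # Height 300 cm, weight 50 kg, against line B (a = 1, b = 1, c = -250):
    print(classify((300, 50), 1, 1, -250))   # 1: confidently classified with Fred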