It is simplest to return to the two-dimensional case of Fred and Gladys from chapter one in order to find out what is happening with the single-unit alpha-perceptron. I hear alarm bells ringing; they warn me that doing the solution in the case of dimension two with only two categories, and indeed only two data points, will convince the reader he is not getting his money's worth. That there has to be more profundity than this, or why do all the other books have so much more algebra? To conceal the lack of insight of the author, dear reader. Trust me.
Recall Fig. 1.7, which I have recycled as Fig. 5.3, and the more or less obligatory network diagram, Fig. 1.6, which I have recycled as Fig. 5.4.
Now let us initialise the line ax + by + c = 0, or if you prefer the numbers a, b, c, at random inside some interval. I shall suppose that none of them are zero, as seems reasonable, and if you wish to have each of the numbers between, say, -1 and +1, I should not mind at all, although I have no particular preferences.
If you imagine that line A of Fig. 5.3 is the result of this initialisation, then it will give us a concrete picture of the start of operations.
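For the reader who likes to see such things spelt out as code, here is a minimal sketch of the initialisation in Python; the function name random_line is mine, and nothing hinges on the interval.

```python
import random

def random_line():
    # Choose the line ax + by + c = 0 by choosing the three numbers
    # a, b, c at random in the interval [-1, 1].
    while True:
        a, b, c = (random.uniform(-1.0, 1.0) for _ in range(3))
        if a != 0 and b != 0 and c != 0:   # assume none of them are zero
            return (a, b, c)

line = random_line()   # think of this as line A of Fig. 5.3
```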
It is worth noting that in dimension n, I should have to choose n+1 numbers to specify a hyperplane. We have not yet given a name to the initial line or hyperplane, so I shall remedy this immediately by calling it w. It is not a very romantic name for anything, not even a hyperplane, but it is at least short.
Next I select a point from the data set, also at random. In our application this means picking either Fred or Gladys. In general there would be, one might hope, some rather larger pool of talent. Since I don't wish to prejudice the discussion by leading the reader to believe that it matters which one I pick, I shall call my randomly selected data point x. If you want to ask awkward questions about randomness, I evade them by requiring only that your pseudo-random number generator be reasonably uniform in its selections and not, for example, too flagrantly sexist in the case of Fred and Gladys.
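In code, and with coordinates for Fred and Gladys that are invented purely for illustration, the random selection might look like this.

```python
import random

# Hypothetical data set: Fred and Gladys as points in the plane.
data = {"Fred": (0.5, 1.0), "Gladys": (2.0, 0.5)}

name = random.choice(list(data))   # reasonably uniform, not flagrantly sexist
x = data[name]                     # the randomly selected data point
```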
Next I shall work out on which side of the line w the point x lies. The simplest algebraic way of specifying this is to augment the point x to be a point x̂ in R^(n+1) which has all the components of the vector x with a 1 inserted in the (n+1)th place. If we do this simple trick, then the summation of the inputs x and y weighted by a and b, with c added on, is simply expressed by the dot product w · x̂. Now I threshold this value to take the sign of it, +1 if w · x̂ > 0, -1 if w · x̂ < 0. That is to say, I obtain sgn(w · x̂). The reader will note that although he is probably thinking of w as a line and x̂ as a point to which a 1 has been appended, there is no reference to the dimension in these expressions, so they make sense for points and hyperplanes in R^n for any n whatever.
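As a sketch, the augmentation and the thresholding might be written as follows; the zero case of the threshold is a choice the text leaves open, and here it is lumped in with -1.

```python
def augment(x):
    # Append a 1 as the (n+1)th component of the point.
    return tuple(x) + (1.0,)

def sgn(t):
    # Threshold: +1 on one side, -1 on the other (zero lumped in with -1).
    return 1 if t > 0 else -1

def output(w, x):
    # The neuron's output: the sign of the dot product of w with the
    # augmented point.
    x_hat = augment(x)
    return sgn(sum(wi * xi for wi, xi in zip(w, x_hat)))

# For example, with an invented line w = (a, b, c) and an invented point:
print(output((0.3, -0.7, 0.1), (0.5, 1.0)))
```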
Next I want to think about the category or class of the datum, x. This is either +1 or -1 in my system of representing such things; it is a value which belongs with the point, and so I shall define the map c from the data set of points to S^0 = {+1, -1}.
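Concretely, and with the assignment of labels to Fred and Gladys an arbitrary choice made here only for illustration, c is nothing more than a lookup table.

```python
# The map c from the data set to S^0 = {+1, -1}.
c = {"Fred": 1, "Gladys": -1}
```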
Note that this makes it fairly clear that the point of this business is to produce a map of the form f_w : R^n → S^0 for some neural state w, which takes the point x to the number f_w(x) = sgn(w · x̂). This map will be defined for any x in R^n, so it will tell us the class, 1 or -1, of every point of R^n. It is of course essential that this map agrees with c on the data set and extends it to all other points in R^n. In essence, this is not different in principle from my giving you a dozen x and y values and asking you to draw a smooth curve through them all; you are fitting a function to a set of data. Whereas you usually want to fit a smooth function, perhaps a polynomial, here you have to fit a function which takes values in S^0, and which is given on points in R^n, but you are still function fitting. That is, you have a finite function given you, and you want to extend it to the whole space. The curve fitting aspects may be emphasised by returning briefly to the diagram Fig. 1.11 of chapter one. Instead of fitting a smooth sigmoid as shown there, we would be fitting a step function, but the idea is the same. In dimension two, we should be fitting a step that runs right across the plane. Clearly we can focus on the dividing line, or more generally the separating hyperplane.
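A small sketch makes the step visible: sampling f_w on a grid of points in the plane, with an invented line w, prints a block of + and - signs with the step running right across it.

```python
def output(w, x):
    # The sign of a*x + b*y + c, as in the earlier sketch.
    a, b, c = w
    return 1 if a * x[0] + b * x[1] + c > 0 else -1

w = (0.3, -0.7, 0.1)          # an invented line
for yy in range(3, -4, -1):   # sample f_w on a 7 by 7 grid
    print("".join("+" if output(w, (xx, yy)) > 0 else "-"
                  for xx in range(-3, 4)))
```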
Now either f_w(x) agrees with c(x) or it doesn't. If it does, then we may say that the line w correctly classifies x, or that x is on the right side of the hyperplane w. Under these conditions we do absolutely nothing about changing the line w.

The other possibility is that f_w(x) ≠ c(x). In this case we need to beat w around a bit and change its state. It is a naughty neuron which is getting the wrong answer, so it should be smacked. None of this post modernist Spockery here, no praising the neuron for being right and talking gently to it when it is in error; if it is right we ignore it and if wrong we kick it.
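In code, the test for whether the neuron deserves a smack is short; cls stands for the class c(x), and the rest is the notation of the earlier sketches.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def needs_kicking(w, x, cls):
    # The neuron is wrong when the sign of w . x_hat disagrees with c(x).
    x_hat = tuple(x) + (1.0,)
    return (1 if dot(w, x_hat) > 0 else -1) != cls
```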
To see how to kick it is an interesting exercise in data representation. The AI fraternity, who tell us that the representation makes a big difference, are right.
The idea is to observe that there are a lot of possible lines w, and there has to be a rule for operating on each such w and turning it into a more compliant, better behaved w'. So we look at the space of all such w. In the case of the recalcitrant line, the space is the space of three numbers a, b, c. The actual line is a point in this space. This may be confusing at first, but it is easy enough to see that 2x + 3y + 4 = 0 can be represented in the space of all such lines as the vector (2, 3, 4).
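For instance, as a sketch with an invented point: the line becomes a vector of three numbers, the augmented datum lives in the same three-dimensional space, and the side of the line the point is on falls out of the dot product of the two.

```python
# The line 2x + 3y + 4 = 0, regarded as a point in the space of all lines.
line = (2.0, 3.0, 4.0)

# An invented point, augmented with a trailing 1.
x_hat = (0.5, 1.0, 1.0)

# The dot product: 2*0.5 + 3*1.0 + 4*1.0 = 8.0, so the point lies on
# the positive side of this particular line.
print(sum(l * xi for l, xi in zip(line, x_hat)))
```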
Now the line has been turned into a point in this space of all lines in the plane, and the datum x̂ also defines a point in this space. And the latter point gives the right dot product with half the points in the space, and the wrong dot product with the other half; the turn around occurs where the dot product is zero. In this dual space of lines, otherwise called the weight space, the set where the dot product with x̂ is zero is a plane through the origin. It divides the space of lines into two parts, those that are right and those that are wrong. So it is an error to think of the augmented data point as being a vector in the space of all lines. It is better to think of it as a plane through the origin in that space. I draw a rather unconvincing picture of a bit of it in Fig. 5.5.
Now given that the line w is on the wrong side of the plane through the origin normal to the datum point x̂ (and if you can take that without blinking you are either on top of this or not with us at all), the obvious thing to do is to reflect it through the plane until it is on the right side. And this is the Perceptron convergence algorithm. Moreover, this way of thinking about it, in the weight space, gives some insight into why it works, which beats memorising the recipe.
To do the sums is straightforward and requires only the most elementary linear algebra. Suppose c(x) = +1. Since the neural state w is wrong about the point, we must have that w · x̂ < 0. The projection of w on x̂ is pointing in the direction opposite to x̂ and is the vector ((w · x̂)/(x̂ · x̂)) x̂. We need to subtract twice this vector from w if w is to be reflected through the plane through the origin normal to x̂ and made into a better and purer line which is right about the data point x.
If c(x) = -1, then w · x̂ > 0 since the line is wrong about the point, and so the projection of w on x̂ is positive and shouldn't be. Both are on the same side of the normal plane through the origin. So again we subtract the vector 2 ((w · x̂)/(x̂ · x̂)) x̂ from w. Either way, then, the corrected line is w' = w - 2 ((w · x̂)/(x̂ · x̂)) x̂.
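Putting the pieces together gives the following sketch of the whole procedure in Python. The function names, the fixed number of passes, and the coordinates are all mine, invented for illustration; the update applied is the reflection just derived, and since reflecting flips the sign of w · x̂, the offending point is classified correctly immediately afterwards.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reflect(w, x_hat):
    # Reflect w through the plane through the origin normal to x_hat:
    #   w' = w - 2 ((w . x_hat) / (x_hat . x_hat)) x_hat
    # Note that w' . x_hat = -(w . x_hat), so the sign flips.
    k = 2.0 * dot(w, x_hat) / dot(x_hat, x_hat)
    return tuple(wi - k * xi for wi, xi in zip(w, x_hat))

def train(points, classes, steps=1000):
    # points: a list of (x, y) pairs; classes: the corresponding +1/-1
    # labels.  Start from a random line and kick it whenever it is wrong.
    w = tuple(random.uniform(-1.0, 1.0) for _ in range(3))
    for _ in range(steps):
        i = random.randrange(len(points))
        x_hat = tuple(points[i]) + (1.0,)
        out = 1 if dot(w, x_hat) > 0 else -1
        if out != classes[i]:          # wrong about the point: reflect w
            w = reflect(w, x_hat)
    return w

# For example, with invented coordinates for Fred (class +1) and
# Gladys (class -1):
print(train([(0.5, 1.0), (2.0, 0.5)], [1, -1]))
```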