
Training the Perceptron

It is simplest to return to the two-dimensional case of Fred and Gladys from chapter one in order to find out what is happening with the single-unit alpha-perceptron. I hear alarm bells ringing; they warn me that doing the solution in the case of dimension two with only two categories, and indeed only two data points, will convince the reader he is not getting his money's worth. That there has to be more profundity than this, or why do all the other books have so much more algebra? To conceal the lack of insight of the author, dear reader. Trust me.

Recall Fig. 1.7, which I have recycled as Fig. 5.3, and the more or less obligatory network diagram Fig. 1.6, which I have recycled as Fig. 5.4.


 
Figure 5.3: Two different states, A and B, of a neural net. [Figure omitted; source: patrecfig5.3.ps]


 
Figure 5.4: The neural net of the above diagram, in general. [Figure omitted; source: patrecfig5.4.ps]

Now let us initialise the line ax + by + c = 0, or if you prefer, the numbers a, b, c, at random inside some interval. I shall suppose that none of them is zero, as seems reasonable, and if you choose to have each of the numbers between, say, -1 and +1, I should not mind at all, although I have no particular preferences.

If you imagine that line A of Fig. 5.3 is the result of this initialisation, then it will give us a concrete picture of the start of operations.

It is worth noting that in dimension n, I should have to choose n+1 numbers to specify a hyperplane. We have not yet given a name to the initial line or hyperplane, so I shall remedy this immediately by calling it ${\bf a}$. It is not a very romantic name for anything, not even a hyperplane, but it is at least short.
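For the computationally inclined, here is what this initialisation might look like in a few lines of Python; the language, the function name and the choice of the interval [-1, 1] are merely made for illustration.

\begin{verbatim}
import random

def init_hyperplane(n):
    # Choose the n+1 numbers specifying a hyperplane (a, b and c when n = 2)
    # at random in the interval [-1, 1]; this list is the state called 'a'.
    return [random.uniform(-1.0, 1.0) for _ in range(n + 1)]
\end{verbatim}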

Next I select a point from the data set, also at random. In our application this means picking either Fred or Gladys. In general there would be, one might hope, some rather larger pool of talent. Since I don't wish to prejudice the discussion by leading the reader to believe that it matters which one I pick, I shall call my randomly selected data point ${\bf x}$. If you want to ask awkward questions about randomness, I evade them by requiring only that your pseudo-random number generator be reasonably uniform in its selections and not, for example, too flagrantly sexist in the case of Fred and Gladys.

Next I shall work out on which side of the line ${\bf a}$ the point ${\bf x}$ lies. The simplest algebraic way of specifying this is to augment the point ${\bf x}$ to be a point $\hat{{\bf x}}$ in $\mathbb{R}^{n+1}$ which has all the components of the vector ${\bf x}$ with a 1 inserted in the (n+1)th place. If we do this simple trick, then the summation of the inputs x and y weighted by a and b, with c added on, is simply expressed by the dot product ${\bf a}\cdot \hat{{\bf x}}$. Now I threshold this value to take the sign of it: +1 if ${\bf a}\cdot \hat{{\bf x}} \geq 0$, -1 if ${\bf a}\cdot \hat{{\bf x}} < 0$. That is to say, I obtain $sgn({\bf a}\cdot \hat{{\bf x}})$. The reader will note that although he is probably thinking of ${\bf a}$ as a line and $\hat{{\bf x}}$ as a point to which a 1 has been appended, there is no reference to the dimension in these expressions, so they make sense for points and hyperplanes in $\mathbb{R}^n$ for any n whatever.
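Continuing in the same spirit, a sketch of the augmentation trick and the thresholding (again in Python; the function names are mine):

\begin{verbatim}
def augment(x):
    # Turn the point x in R^n into x-hat in R^(n+1) by appending a 1.
    return list(x) + [1.0]

def dot(u, v):
    # Ordinary dot product of two vectors of the same length.
    return sum(ui * vi for ui, vi in zip(u, v))

def output(a, x):
    # sgn(a . x-hat): +1 if the weighted sum is >= 0, otherwise -1.
    return 1 if dot(a, augment(x)) >= 0 else -1
\end{verbatim}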

Next I want to think about the category or class of the datum ${\bf x}$. This is either +1 or -1 in my system of representing such things; it is a value which belongs with the point, and so I shall define the map from the data set of points ${\cal D}$ to $\pm 1$

\begin{displaymath}
c: {\cal D} \longrightarrow S^0 \end{displaymath}

which assigns to each point ${\bf x}$ its class.
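In the toy example the map c can be written down as a two-entry table; which of Fred and Gladys gets +1 is an arbitrary choice, made here purely for illustration.

\begin{verbatim}
# The class map c on the data set D, written as a lookup table.
c = {"Fred": +1, "Gladys": -1}
\end{verbatim}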

Note that this makes it fairly clear that the point of this business is to produce a map of the form $sgn({\bf b} \cdot -)$ for some neural state ${\bf b}$, which takes the point ${\bf x}$ to the number $sgn({\bf b} \cdot \hat{{\bf x}})$, either +1 or -1. This map will be defined for any ${\bf x}$ in $\mathbb{R}^n$, so it will tell us the class, +1 or -1, of every point ${\bf x}$. It is of course essential that this map agrees with c on the data set and extends it to all other points in $\mathbb{R}^n$. In essence, this is no different in principle from my giving you a dozen x and y values and asking you to draw a smooth curve through them all; you are fitting a function to a set of data. Whereas you usually want to fit a smooth function, perhaps a polynomial, here you have to fit a function which takes values in $S^0$ and which is given on points in $\mathbb{R}^n$, but you are still function fitting. That is, you have a finite function given to you, and you want to extend it to the whole space. The curve fitting aspects may be emphasised by returning briefly to Fig. 1.11 of chapter one. Instead of fitting a smooth sigmoid as shown there, we would be fitting a step function, but the idea is the same. In dimension two, we should be fitting a step that runs right across the plane. Clearly we can focus on the dividing line, or more generally the separating hyperplane.

Now either $c({\bf x})$ equals $sgn({\bf a}\cdot \hat{{\bf x}})$ or it does not. If it does, then we may say that the line ${\bf a}$ correctly classifies ${\bf x}$, or that ${\bf x}$ is on the right side of the hyperplane ${\bf a}$. Under these conditions we do absolutely nothing about changing the line ${\bf a}$.

The other possibility is that $c({\bf x}) = -sgn({\bf a}\cdot \hat{{\bf x}})$. In this case we need to beat ${\bf a}$ around a bit and change its state. It is a naughty neuron which is getting the wrong answer, so it should be smacked. None of this post-modernist Spockery here, no praising the neuron for being right and talking gently to it when it is in error; if it is right we ignore it and if wrong we kick it.
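In code the decision is a one-line comparison; here is a self-contained version of the test (the name right_about is mine, and how to administer the kick is worked out in what follows).

\begin{verbatim}
def right_about(a, x_hat, cx):
    # True when the state a is right about the augmented point x_hat
    # whose class is cx (+1 or -1): compare cx with sgn(a . x_hat).
    s = sum(ai * xi for ai, xi in zip(a, x_hat))
    return cx == (1 if s >= 0 else -1)
\end{verbatim}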

To see how to kick it is an interesting exercise in data representation. The AI fraternity, who tell us that the representation makes a big difference, are right.

The idea is to observe that there are a lot of possible ${\bf a}$ lines, and there has to be a rule for operating on each such ${\bf a}$ and turning it into a more compliant, better behaved ${\bf a}$. So we look at the space of all such ${\bf a}$. In the case of the recalcitrant line, this is the space of the three numbers a, b, c. The actual line ${\bf a}$ is a point in this space. This may be confusing at first, but it is easy enough to see that 2x + 3y + 4 = 0 can be represented in the space of all such lines as the vector $\left( \begin{array}{c} 2 \\ 3 \\ 4 \end{array} \right)$.

Now the line has been turned into a point in this space of all lines in the plane, and the datum $\hat{{\bf x}}$ also defines a point in this space. And the latter point gives a dot product of the right sign with half the points in the space, and of the wrong sign with the other half; the changeover occurs where the dot product is zero. In this dual space of lines, otherwise called the weight space, $\{ {\bf a}: \hat{{\bf x}} \cdot {\bf a} = 0\}$ is a plane through the origin. It divides the space of lines into two parts, those that are right and those that are wrong. So it is a mistake to think of the augmented data point as merely a vector in the space of all lines; it is better to think of it as a plane through the origin in that space. I draw a rather unconvincing picture of a bit of it in Fig. 5.5.


 
Figure 5.5: Into the dual space. [Figure omitted; source: patrecfig5.5.ps]
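To make the dual picture concrete, suppose, purely by way of illustration, that the chosen datum is the point (2,1) of the plane, so that $\hat{{\bf x}} = (2,1,1)$. Then the set of lines giving a zero weighted sum at this point is

\begin{displaymath}
\{ (a,b,c) : 2a + b + c = 0 \} \end{displaymath}

a plane through the origin of the weight space with normal vector (2,1,1); the lines on one side of it assign the point to category +1, those on the other side to category -1.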

Now given that the line ${\bf a}$ is on the wrong side of the plane through the origin normal to the datum point $\hat{{\bf x}}$ (and if you can take that without blinking you are either on top of this or not with us at all), the obvious thing to do is to reflect it through the plane so that it ends up on the right side. And this is the Perceptron convergence algorithm. Moreover, this way of thinking about it, in the weight space, gives some insight into why it works, which beats memorising the recipe.

To do the sums is straightforward and requires only the most elementary linear algebra. Suppose $c({\bf x}) = 1$. Since the neural state ${\bf a}$ is wrong about the point, we must have that $\hat{{\bf x}} \cdot {\bf a} < 0$. The projection of ${\bf a}$ on $\hat{{\bf x}}$ is pointing in the direction opposite to $\hat{{\bf x}}$ and is the vector $\frac{{\bf a}\cdot \hat{{\bf x}}}{\hat{{\bf x}}\cdot \hat{{\bf x}}}\hat{{\bf x}}$. We need to subtract twice this vector from ${\bf a}$ if ${\bf a}$ is to be reflected through the plane through the origin normal to $\hat{{\bf x}}$ and made into a better and purer line which is right about the data point ${\bf x}$.
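It is worth one extra line of algebra to check that this does the job. Writing ${\bf a}'$ for the new state,

\begin{displaymath}
{\bf a}' \cdot \hat{{\bf x}} = \left({\bf a} - 2\frac{{\bf a}\cdot \hat{{\bf x}}}{\hat{{\bf x}}\cdot \hat{{\bf x}}}\hat{{\bf x}}\right)\cdot \hat{{\bf x}} = {\bf a}\cdot \hat{{\bf x}} - 2\,{\bf a}\cdot \hat{{\bf x}} = -({\bf a}\cdot \hat{{\bf x}}) \end{displaymath}

so the dot product changes sign while the length of ${\bf a}$ is unchanged, which is exactly what a reflection through the plane normal to $\hat{{\bf x}}$ ought to do.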

If $c({\bf x}) = -1$, then $\hat{{\bf x}} \cdot {\bf a} > 0$ since the line is wrong about the point, and so the projection of ${\bf a}$ on $\hat{{\bf x}}$ is positive and shouldn't be. Both are on the same side of the normal plane through the origin. So again we subtract the vector

\begin{displaymath}
2\frac{{\bf a}\cdot \hat{{\bf x}}}{ \hat{{\bf x}}\cdot \hat{{\bf x}}}\hat{{\bf x}} \end{displaymath}

from ${\bf a}$. The Perceptron Training rule is then:
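reflect ${\bf a}$ through the plane normal to $\hat{{\bf x}}$ whenever it is wrong about the chosen data point, and leave it alone whenever it is right. The next section states this formally; in the meantime, here is the whole procedure gathered into a minimal sketch, repeating the little helper functions so that the fragment stands on its own. The language, the names and the crude stopping rule of a fixed number of selections are choices made only for illustration; the update itself is the reflection just derived.

\begin{verbatim}
import random

def augment(x):
    return list(x) + [1.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron(points, classes, sweeps=1000):
    # points: a list of points in R^n; classes: a parallel list of +1/-1 labels.
    # Returns the final state a, a list of n+1 numbers.
    n = len(points[0])
    a = [random.uniform(-1.0, 1.0) for _ in range(n + 1)]    # initialise at random
    for _ in range(sweeps):
        i = random.randrange(len(points))                    # select a datum at random
        xh = augment(points[i])
        s = dot(a, xh)
        if classes[i] != (1 if s >= 0 else -1):              # a is wrong about the point
            coeff = 2.0 * s / dot(xh, xh)                    # twice the projection on x-hat
            a = [aj - coeff * xj for aj, xj in zip(a, xh)]   # reflect a through the plane
    return a
\end{verbatim}

Run on the Fred and Gladys data, two points with opposite labels, this should fairly quickly produce a line which is right about both of them; whether it must do so is the convergence question lurking in the name of the algorithm.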



 