
The Perceptron Training Rule

From a random initial state ${\bf a}$ of the unit, and with a randomly selected data point ${\bf x}$ of category $c({\bf x})$, augmented to $\hat{{\bf x}}$ by appending a 1 in the $(n+1)$th place, proceed as follows (a code sketch is given after the steps):

1.
evaluate ${\bf a}\cdot \hat{{\bf x}}$. If $sgn({\bf a}\cdot \hat{{\bf x}}) = c({\bf x})$, select a new data point at random and repeat from the last value of ${\bf a}$.
2.
If $sgn({\bf a}\cdot \hat{{\bf x}}) \neq c({\bf x})$, alter the state ${\bf a}$ to ${\bf a}'$ given by:

\begin{displaymath}
{\bf a}' = {\bf a} - 2\,\frac{{\bf a}\cdot \hat{{\bf x}}}{\hat{{\bf x}}\cdot \hat{{\bf x}}}\,\hat{{\bf x}}
\end{displaymath}

Take ${\bf a}'$ as the new value of ${\bf a}$.

3.
Choose a new data point ${\bf x}$ at random and go to 1.
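
The rule is short enough to write out as a program. Here is a minimal sketch in Python with the NumPy library; the function name train_perceptron, the iteration cap max_iter and the particular random number generator are illustrative choices rather than part of the rule, and the data are assumed to be supplied as an array X of points together with an array c of categories in $\{1,-1\}$.

import numpy as np

def train_perceptron(X, c, max_iter=10000, seed=0):
    """Perceptron Training Rule with the reflection update of step 2.

    X : array of shape (m, n), the data points.
    c : array of shape (m,), the categories, each +1 or -1.
    """
    rng = np.random.default_rng(seed)
    # Augment each data point by appending a 1 in the (n+1)th place.
    X_hat = np.hstack([X, np.ones((len(X), 1))])
    a = rng.standard_normal(X_hat.shape[1])        # random initial state
    for _ in range(max_iter):
        # Stop once every data point is correctly classified.
        if np.all(np.sign(X_hat @ a) == c):
            return a
        i = rng.integers(len(X_hat))               # select a data point at random
        x_hat = X_hat[i]
        if np.sign(a @ x_hat) != c[i]:
            # Step 2: reflect a in the hyperplane orthogonal to x_hat.
            a = a - 2 * (a @ x_hat) / (x_hat @ x_hat) * x_hat
    return a

For a linearly separable data set the loop terminates after finitely many corrections (this is the convergence theorem below); the iteration cap is merely a safeguard against being handed data which are not separable.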

It remains only to prove that if we keep on selecting points from the data set at random, and if the set of points is linearly separable, that is, if it is possible to find a line (more generally, a hyperplane) which correctly classifies the data set, then iterating the above process eventually terminates with every data point correctly classified. This is the Perceptron Convergence Theorem, and it is fairly easy to understand.

Having explained the ideas, we go through the process more formally.

Definition 13342

A data set is a finite subset $\cal D$ of $\mathbb{R}^n$, together with a map

\begin{displaymath}
c: {\cal D} \longrightarrow {\cal C}
\end{displaymath}

into a finite set $\cal C$ of categories. If the set $\cal C$ has only two elements, the data set is said to be a binary data set, and $\cal C$ is written $\{ 1, -1\}$ and called $S^0$.
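
For instance, the three points $(0,0)$, $(1,0)$ and $(3,3)$ in $\mathbb{R}^2$, with $c(0,0) = c(1,0) = -1$ and $c(3,3) = +1$, constitute a small binary data set.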

Definition 13348

A binary data set ${\cal D} \subset \mathbb{R}^n$ is said to be linearly separable iff there is a hyperplane in $\mathbb{R}^n$ such that all points of $\cal D$ of category +1 are on one side of the hyperplane, and all points of category -1 are on the other side.
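
For example, the binary data set $\{(0,0), (0,1), (1,0), (1,1)\}$ in $\mathbb{R}^2$ with $c(0,0) = c(1,1) = -1$ and $c(0,1) = c(1,0) = +1$ (the XOR problem) is not linearly separable: since $(0,0) + (1,1) = (0,1) + (1,0)$, any affine function which is positive at $(0,1)$ and at $(1,0)$ must also be positive at $(0,0)$ or at $(1,1)$.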

Definition 13351

The space of all oriented hyperplanes in $\mathbb{R}^n$ specified by the $n+1$ weights on the arcs of a perceptron is called the weight space.
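
To take the simplest case, a perceptron with inputs from $\mathbb{R}^2$ has three weights, so its weight space may be identified with $\mathbb{R}^3$: the state ${\bf a} = (a_1, a_2, a_3)$ specifies the oriented line $a_1 x_1 + a_2 x_2 + a_3 = 0$ in the plane of the data.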

Theorem 13353 (Perceptron Convergence)

If ${\cal D} \subset \mathbb{R}^n$ is a binary data set which is linearly separable, then the Perceptron Training Rule will terminate in a finite number of moves with probability 1.

Proof. In the weight space, each augmented data point determines a hyperplane through the origin, and also a right side and a wrong side of this hyperplane, the right side consisting of those states of the perceptron which classify the point correctly. The solution space of all states which correctly classify all the data is simply the intersection of all these half spaces.

By hypothesis this is not empty, and except in a probability zero case, the intersection has strictly positive measure.

Take a ball centred on the origin in the weight space, and suppose, without loss of generality, that the initial state lies within this ball. The operation of making a change to the state of the perceptron is then a folding of the ball about a hyperplane, and the inverse image under this folding of the solution space also has positive measure; it is a region of fixed positive measure removed from the set of states in the ball. Provided that every data point acts in this way sufficiently often, the measure of the whole ball must eventually be removed, whereupon every state in the ball has been transformed to a solution state. $\Box$
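
As a one-line check, not in the original proof, that the update of step 2 really is the fold just described, take the inner product of the update formula with $\hat{{\bf x}}$:

\begin{displaymath}
{\bf a}'\cdot \hat{{\bf x}} = {\bf a}\cdot \hat{{\bf x}} - 2\,\frac{{\bf a}\cdot \hat{{\bf x}}}{\hat{{\bf x}}\cdot \hat{{\bf x}}}\,(\hat{{\bf x}}\cdot \hat{{\bf x}}) = -\,{\bf a}\cdot \hat{{\bf x}}
\end{displaymath}

so the update reverses the sign of ${\bf a}\cdot \hat{{\bf x}}$ while leaving the component of ${\bf a}$ orthogonal to $\hat{{\bf x}}$ unchanged: it is the reflection of ${\bf a}$ in the hyperplane of weight space determined by $\hat{{\bf x}}$, and in particular the misclassified point is classified correctly immediately after the move.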

It is easy to devise variants of the Perceptron training rule which are also guaranteed to give a solution from any initial state. For example, instead of the reflection, we could simply take a fixed step in the right direction: in the direction of $\hat{{\bf x}}$ if the category of ${\bf x}$ is +1 and the value of $sgn({\bf a}\cdot \hat{{\bf x}})$ is -1, and in the opposite direction when both signs are reversed. This will not, in general, correct the error in one move, but it will do so if iterated, and the idea of the proof still works: every correction takes a bite out of the space of states, so that some states which were wrong are now right and will never change again, because they are right about all the data. And a finite number of bites of fixed size eats up any bounded set of initial states. This variant of the Perceptron training rule is known as the delta rule; a code sketch follows.
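
Here is a minimal sketch of the fixed-step variant, reusing NumPy and the augmented points of the earlier sketch; the name delta_rule_step and the step size eta are illustrative, since the text does not fix a particular step length.

def delta_rule_step(a, x_hat, label, eta=0.1):
    """One correction of the fixed-step variant.

    a     : current state, a weight vector of length n+1.
    x_hat : an augmented data point.
    label : its category, +1 or -1.
    eta   : the fixed step length (an illustrative value).
    """
    if np.sign(a @ x_hat) != label:
        # Take a step of length eta along +x_hat when the category is +1 and
        # the sign came out -1, and along -x_hat in the reversed case.
        a = a + eta * label * x_hat / np.linalg.norm(x_hat)
    return a

Iterating this over randomly chosen data points, exactly as in train_perceptron above, moves the state only part of the way with each correction, but for linearly separable data the argument sketched in the text still guarantees that every point is eventually classified correctly.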

Note that we have the theorem in any dimension. Generalising it to more than two categories may be done in several ways, and is left as an exercise for the diligent reader.

