From a random initial state $w \in \mathbb{R}^{n+1}$ of the unit, and with a randomly selected data point $x$, of category given by $c(x) \in \{+1, -1\}$, augmented to $\hat{x} \in \mathbb{R}^{n+1}$ by appending a 1 in the $(n+1)$th place, the Perceptron Training Rule leaves $w$ unchanged if $\mathrm{sign}(w \cdot \hat{x}) = c(x)$, and otherwise reflects $w$ in the hyperplane through the origin orthogonal to $\hat{x}$:
$$w \longmapsto w - 2\,\frac{w \cdot \hat{x}}{\hat{x} \cdot \hat{x}}\,\hat{x}.$$
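A minimal Python sketch of one such update may make this concrete (the function names `augment` and `reflect_update` are mine, and the reflection form is assumed from the description above):

```python
import numpy as np

def augment(x):
    """Append a 1 in the (n+1)th place to a data point in R^n."""
    return np.append(np.asarray(x, dtype=float), 1.0)

def reflect_update(w, x, category):
    """One step of the Perceptron Training Rule as described above:
    if the augmented point is misclassified by the weight state w,
    reflect w in the hyperplane through the origin orthogonal to the
    augmented point, which flips the sign of their inner product."""
    x_hat = augment(x)
    if np.sign(w @ x_hat) != category:                       # wrong side: correct it
        w = w - 2.0 * (w @ x_hat) / (x_hat @ x_hat) * x_hat  # reflection
    return w
```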
It remains only to prove that if we keep on selecting
points from the data set at random, and
if the set of points is linearly separable,
that is, if it is possible to find a line
which correctly classifies the data set, then
the iteration of the above process will
eventually terminate with every data point correctly
classified.
This is the Perceptron Convergence Theorem
and it is fairly easy to understand.
Having explained the ideas, we go through the process more formally.
Definition 13342
A data set is a finite subset $\{x_1, x_2, \ldots, x_N\}$ of $\mathbb{R}^n$, together with a map
$$c : \{x_1, x_2, \ldots, x_N\} \longrightarrow C$$
which assigns to each point its category, drawn from a finite set $C$ of categories. When $C = \{+1, -1\}$ we call it a binary data set.
Definition 13348
A binary data set in $\mathbb{R}^n$ is said to be linearly separable iff there is a hyperplane in $\mathbb{R}^n$ such that all points of the data set of category $+1$ lie on one side of the hyperplane, and all points of category $-1$ lie on the other side.
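As a small invented illustration, the four points below form a binary data set separated by the hyperplane $x_1 + x_2 = 0$, so it is linearly separable; the check uses nothing but the definition:

```python
import numpy as np

# Toy binary data set in R^2: the hyperplane x1 + x2 = 0 has all the
# +1 points on one side and all the -1 points on the other.
points     = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
categories = np.array([+1, +1, -1, -1])

normal = np.array([1.0, 1.0])   # normal vector of the candidate separating hyperplane
assert np.all(np.sign(points @ normal) == categories)
```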
Definition 13351
The space of all oriented hyperplanes in $\mathbb{R}^n$ specified by the $n+1$ weights on the arcs of a perceptron is called the weight space; since the weights serve as coordinates, it may be identified with $\mathbb{R}^{n+1}$.
Theorem 13353 (Perceptron Convergence)
If a binary data set is linearly separable, then the Perceptron Training Rule will terminate in a finite number of moves with probability 1.
Proof. In the weight space, each augmented data point determines a hyperplane through the origin, and also a right side and a wrong side for this hyperplane, the right side consisting of the hyperplanes which correctly classify the point. The solution space of all hyperplanes which correctly classify all the data is simply the intersection of all these half spaces.
By hypothesis this is not empty, and except in a probability zero case, the intersection has strictly positive measure.
Take a ball centred on the origin in the weight space, and suppose, without loss of generality, that the initial state lies within this ball. The operation of making a change to the state of the perceptron is then a folding of the ball about a hyperplane, and the inverse image of the solution space under this folding also has positive measure, so each such correction removes a region of fixed positive measure from the set of states in the ball which are not yet solutions. Provided that all the data points operate in this way sufficiently often, the measure of the whole ball must eventually be removed, whereupon every state in the ball has been transformed into a solution state. ∎
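To see the theorem in action, the toy data set above can be fed to the reflection rule sketched earlier; the loop below is an illustration only, reusing `augment`, `reflect_update`, `points` and `categories` from the previous sketches, and keeps picking points at random until every point is correctly classified:

```python
import numpy as np

def train_perceptron(points, categories, max_rounds=10_000, seed=0):
    """Iterate the reflection rule on randomly chosen data points until
    the whole (linearly separable) data set is correctly classified.
    The Convergence Theorem says this takes only finitely many
    corrections, with probability 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(points.shape[1] + 1)   # random initial state of the unit
    for _ in range(max_rounds):
        i = rng.integers(len(points))              # randomly selected data point
        w = reflect_update(w, points[i], categories[i])
        # stop as soon as every data point is on the right side
        if all(np.sign(w @ augment(x)) == c for x, c in zip(points, categories)):
            return w
    raise RuntimeError("no solution found; is the data set linearly separable?")

w = train_perceptron(points, categories)
```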
It is easy to devise variants of the Perceptron
training rule which are also guaranteed to
give a solution from any initial state. For example,
instead of the reflection, we could simply take a fixed step in the right direction, which is in the direction of the augmented point $\hat{x}$ if the category $c(x)$ is $+1$ and the output $\mathrm{sign}(w \cdot \hat{x})$ of the perceptron is $-1$, and is in the opposite direction when both signs are reversed. This
will not correct the error in one move in
general, but it will do so if iterated, and the
idea of the proof still works: every correction takes a bite out of the space of states, so that some states which were wrong are now right and will never change again, because they are right about all the data. And a finite number of bites of
fixed size eats up any bounded set of initial
states. This variant of the Perceptron training
rule
is known as the delta rule.
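A sketch of this fixed-step variant, under the same assumptions as before (the step size `eta` and the name `delta_update` are mine; `augment` is the helper from the earlier sketch); it can be dropped into the training loop above in place of `reflect_update`:

```python
import numpy as np

def delta_update(w, x, category, eta=0.1):
    """One step of the fixed-step variant (the delta rule): if the
    augmented point is misclassified, move the weight state a fixed
    step towards the augmented point when its category is +1, and
    away from it when its category is -1."""
    x_hat = augment(x)
    if np.sign(w @ x_hat) != category:
        w = w + eta * category * x_hat   # fixed step in the right direction
    return w
```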
Note that we have the theorem in any dimension. Generalising it to more than two categories may be done in several ways, and is left as an exercise for the diligent reader.