The approach taken by Anderson, Rumelhart, Hinton and Williams was to replace the sgn function by a smooth approximation to it. The function

sig(x) = tanh(x) = (e^x - e^-x)/(e^x + e^-x)

is favoured for reasons which are not altogether compelling; it is common to use thresholding systems which output values 0 or 1 instead of -1 or +1, in which case the sigmoidal approximation becomes the function:

sig(x) = 1/(1 + e^-2x)
This smooth approach has a considerable advantage from the point of view of the idealist who wants a rule for arbitrary nets. It means that the function implemented by the net is now a differentiable function, not only of the data, but also of the weights. And so he can, when the outcome is wrong, find the direction of change in each weight which will make it less wrong. This is a simple piece of partial differentiation with respect to the weights. It can be done for any geometry of net whatever, but the layered feedforward nets make it somewhat easier to do the sums.
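The "piece of partial differentiation" can be sketched numerically, for any geometry of net whatever, by finite differences; a minimal sketch, in which the function names, the data point and the step size are my own illustrative choices:

```python
import math

def numeric_gradient(f, w, h=1e-6):
    # central-difference estimate of the gradient of f at weight vector w;
    # needs only evaluations of f, so it works for any geometry of net
    g = []
    for i in range(len(w)):
        up = list(w); up[i] += h
        dn = list(w); dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

# example: squared error of a single sigmoidal unit on one data point
x, y, target = 1.0, 2.0, 0.5
def error(w):
    a, b, c = w
    return 0.5 * (math.tanh(a * x + b * y + c) - target) ** 2

w = [0.5, -0.3, 0.1]
g = numeric_gradient(error, w)
# stepping against the gradient makes the outcome less wrong
w2 = [wi - 0.1 * gi for wi, gi in zip(w, g)]
```

The same finite-difference estimate is a useful check on any analytic gradient computed later.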
What happens may be seen in the case of a single unit, where the function implemented is

n(x,y) = sig(ax + by + c)

which is a composite of the function A(x,y) = ax + by + c with the function sig. Now partially differentiating with respect to, say, a, gives

∂n/∂a = sig'(ax + by + c) x
If we choose to adopt the rule whereby we simply take a fixed step in the direction opposite the steepest gradient, the steepest descent rule, then we see that we subtract off some constant times the vector

sig'(ax + by + c) (x, y, 1)
which is the step rule variant of the perceptron
convergence rule, multiplied by the
derivative of the sigmoid evaluated at the dot
product of the weight vector with the augmented
data vector. A quick check of what you get if
you differentiate the sigmoid
function reveals that this function has a value
of 1 at the origin and decreases monotonically
towards zero. In other words, it simply implements the differential movement rule suggested for committees: the further away you are, the less you move. If we use

sig(x) = tanh(x)

as our sigmoid, the derivative is

sig'(x) = 1 - tanh(x)^2 = 4/(e^x + e^-x)^2

which for large x is approximately 4e^-2x.
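The pieces so far fit together as follows; a minimal sketch of the single unit, its gradient and one steepest-descent step, where the names (unit, gradient, descend) and the step size eta are mine:

```python
import math

def unit(w, p):
    # w = (a, b, c), p = (x, y): the unit computes sig(ax + by + c), sig = tanh
    a, b, c = w
    return math.tanh(a * p[0] + b * p[1] + c)

def dsig(s):
    # derivative of tanh: 1 - tanh(s)^2; equals 1 at the origin and
    # decays like 4e^-2s for large s, so far-away points move less
    return 1.0 - math.tanh(s) ** 2

def gradient(w, p):
    # the derivative of the sigmoid, evaluated at the dot product of the
    # weight vector with the augmented data vector, times that vector (x, y, 1)
    a, b, c = w
    d = dsig(a * p[0] + b * p[1] + c)
    return (d * p[0], d * p[1], d)

def descend(w, p, eta):
    # steepest descent: subtract a fixed multiple of the gradient
    g = gradient(w, p)
    return tuple(wi - eta * gi for wi, gi in zip(w, g))
```

The constant multiplying the gradient absorbs the sign of the error for the data point, as in the step rule variant of the perceptron convergence rule.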
In the case of the three layer net, what I have called (following Nilsson) a committee net, the function implemented is

n(x,y) = sig( sig(a1x + b1y + c1) + sig(a2x + b2y + c2) + sig(a3x + b3y + c3) )

If we partially differentiate this with respect to, say, a1, we get

∂n/∂a1 = sig'( sig(a1x + b1y + c1) + sig(a2x + b2y + c2) + sig(a3x + b3y + c3) ) sig'(a1x + b1y + c1) x
This is easily computed from the output backwards. If we use tanh as the sigmoid, we have the useful result that the derivative is 1 - sig^2, and this means that no further calculations of transcendental functions are required than were needed for the evaluation.
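The backward computation for the three layer net can be checked against finite differences; a sketch for a committee of three units, with the parameter layout following the text and the function names my own:

```python
import math

def committee(params, x, y):
    # params: [(a1, b1, c1), (a2, b2, c2), (a3, b3, c3)];
    # the net computes sig(sig(a1x + b1y + c1) + ... + sig(a3x + b3y + c3))
    return math.tanh(sum(math.tanh(a * x + b * y + c) for a, b, c in params))

def d_da1(params, x, y):
    # chain rule, computed from the output backwards:
    # sig'(inner sum) * sig'(a1 x + b1 y + c1) * x, reusing sig' = 1 - sig^2
    s = sum(math.tanh(a * x + b * y + c) for a, b, c in params)
    a1, b1, c1 = params[0]
    return (1.0 - math.tanh(s) ** 2) * (1.0 - math.tanh(a1 * x + b1 * y + c1) ** 2) * x
```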
The generalisation to feedforward nets with more layers is straightforward. Turning the idea
of gradient descent (pointwise, for each data
point) into an algorithm is left as an easy
exercise for the diligent student. Alternatively,
this rule is the Back-Propagation Algorithm
and explicit formulae for it may be found in the
references. More to the point, a program implementing
it can be found on the accompanying disk.
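For readers without the disk, one back-propagation step for a three layer net can be sketched as below. This is a minimal sketch, not the program from the disk: the names, the squared error on a single data point, and the weight layout are my own choices; tanh is the sigmoid, and its derivative is reused from the forward evaluation.

```python
import math

def train_step(W1, b1, W2, b2, x, target, eta):
    # forward pass: hidden outputs h_i = tanh(W1[i] . x + b1[i]),
    # net output o = tanh(W2 . h + b2)
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    o = math.tanh(sum(w * hi for w, hi in zip(W2, h)) + b2)
    # backward pass for the squared error E = (o - target)^2 / 2,
    # using tanh'(s) = 1 - tanh(s)^2 so no new transcendental evaluations
    delta_o = (o - target) * (1.0 - o * o)
    delta_h = [delta_o * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
    # steepest descent: subtract eta times each partial derivative
    W2 = [w - eta * delta_o * hi for w, hi in zip(W2, h)]
    b2 = b2 - eta * delta_o
    W1 = [[w - eta * dh * xj for w, xj in zip(row, x)]
          for row, dh in zip(W1, delta_h)]
    b1 = [bi - eta * dh for bi, dh in zip(b1, delta_h)]
    return W1, b1, W2, b2, o
```

Repeated application drives the output towards the target on that data point; a full algorithm cycles over the whole training set.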
It is noticeable that the actual sigmoid function does not occur in any very essential way in the back-propagation algorithm. In the expression for the change in a1 for the three layer net,

sig'( sig(a1x + b1y + c1) + sig(a2x + b2y + c2) + sig(a3x + b3y + c3) ) sig'(a1x + b1y + c1) x

(where we have to subtract some constant times this) the value of sig(a1x + b1y + c1) lies between -1 and +1, and if the sigmoid is a steep
one, that is to say if it looks like the sgn
function, then the difference between sig and
sgn will not be large. The derivative however
does have an effect; the closer sig gets to
sgn, the closer the differential committee
net
algorithm based on the derivative of sig gets
to just moving the closest unit.
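To see this numerically, replace sig(z) by tanh(kz) and let k grow: the per-unit factor sig'(k z_i) in the update concentrates on the unit whose signed distance z_i from the data point is smallest in magnitude. The distances below are illustrative numbers of my own:

```python
import math

def dsig(z, k):
    # derivative of tanh(k z) with respect to z: k * (1 - tanh(k z)^2)
    return k * (1.0 - math.tanh(k * z) ** 2)

# signed distances of three committee units from one data point (made up)
z = [0.1, 0.8, -1.5]
for k in (1.0, 5.0, 20.0):
    factors = [dsig(zi, k) for zi in z]
    # share of the total movement taken by the closest unit
    print(k, factors[0] / sum(factors))
```

As k increases the closest unit's share of the movement approaches 1, which is the "just move the closest unit" limit.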