
Back-Propagation

The suggestion that I offered above for modifying the state of a committee by adapting all the wrong units by different amounts, more if they are less wrong, has some common sense justification, and also empirical justification: it works. Those of an idealistic cast of mind find such justifications unsatisfying; they wish to derive practice from more fundamental principles. Always nice, provided it gives the right answers, says the engineer to himself.

The approach taken by Anderson, Rumelhart, Hinton and Williams was to replace the sgn function by a smooth approximation to it. The function $\tanh$ is favoured for reasons which are not altogether compelling; it is common to use thresholding systems which output the values 0 or 1 instead of -1 or +1, in which case the sigmoidal approximation becomes the function:

\begin{displaymath}
sig(x) = \frac{e^x}{e^x + e^{-x}} = \frac{1 + \tanh(x)}{2}
\end{displaymath}

It should be noted that it doesn't make the slightest difference whether you prefer 0 to -1 for representing the other state or not. If you have a predilection for 1 and 0, add one to my numbers and divide by two.
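As a numerical sanity check, here is a minimal Python sketch (the name sig is taken from the text; everything else is mine) verifying both the identity above and the recipe for converting between the two conventions:

\begin{verbatim}
import math

def sig(x):
    """The 0/1-valued sigmoid: e^x / (e^x + e^-x)."""
    return math.exp(x) / (math.exp(x) + math.exp(-x))

# Check the identity sig(x) = (1 + tanh(x))/2 at a few points:
# add one to the -1/+1 value tanh(x) and divide by two.
for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(sig(x) - (1.0 + math.tanh(x)) / 2.0) < 1e-12
\end{verbatim}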

This smooth approach has a considerable advantage from the point of view of the idealist who wants a rule for arbitrary nets. It means that the function implemented by the net is now a differentiable function, not only of the data, but also of the weights. And so he can, when the outcome is wrong, find the direction of change in each weight which will make it less wrong. This is a simple piece of partial differentiation with respect to the weights. It can be done for any geometry of net whatever, but the layered feedforward nets make it somewhat easier to do the sums.

What happens may be seen in the case of a single unit, where the function implemented is

\begin{displaymath}
n(x,y) = sig(ax + by + c)
\end{displaymath}

which is a composite of the affine function $A(x,y) = ax + by + c$ with the function

\begin{displaymath}
sig(t) = \tanh(t)
\end{displaymath}

Now partially differentiating with respect to, say, $a$, gives

\begin{displaymath}
\frac{\partial n}{\partial a} = x\,sig'(ax+by+c)
\end{displaymath}

and similarly for the other weights.
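Written out in full, the ``similarly'' comes to

\begin{displaymath}
\frac{\partial n}{\partial b} = y\,sig'(ax+by+c) \qquad \mbox{and} \qquad \frac{\partial n}{\partial c} = sig'(ax+by+c)
\end{displaymath}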

If we choose to adopt the rule whereby we simply take a fixed step in the direction opposite to the gradient, the steepest descent rule, then we see that we subtract off some constant times the vector

\begin{displaymath}
sig'(ax+by+c) \left( \begin{array}{c} x \\ y \\ 1 \end{array} \right)
\end{displaymath}

from the weight vector $\left( \begin{array}{c} a \\ b \\ c \end{array} \right)$

which is the step rule variant of the perceptron convergence rule, multiplied by the derivative of the sigmoid evaluated at the dot product of the weight vector with the augmented data vector. A quick check of what you get if you differentiate the sigmoid function reveals that this derivative has a value of 1 at the origin and decreases monotonically towards zero. In other words, it simply implements the differential movement rule suggested for committees: the further away you are, the less you move. If we use $\tanh$ as our sigmoid, the derivative is $\frac{4}{(e^x+e^{-x})^2}$, which for large $x$ is approximately $4e^{-2x}$.
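The whole single-unit update fits in a few lines of Python. The following is a sketch only: the learning rate, and the convention of moving the output towards a $\pm 1$ target (which fixes the sign of the constant), are my assumptions, not the text's.

\begin{verbatim}
import math

def dtanh(t):
    """Derivative of tanh: 1 - tanh(t)^2 = 4/(e^t + e^-t)^2."""
    return 1.0 - math.tanh(t) ** 2

def unit_step(w, x, y, target, rate=0.1):
    """One steepest-descent step for a single unit n(x,y) = tanh(ax+by+c).

    w = (a, b, c); target is the desired output, +1 or -1.  The update is
    the perceptron step rule scaled by sig' at the activation, so points
    far from the separating line barely move the weights.
    """
    a, b, c = w
    act = a * x + b * y + c            # weight vector dotted with (x, y, 1)
    step = rate * target * dtanh(act)  # constant times the sigmoid's derivative
    return (a + step * x, b + step * y, c + step)

# Nudge the unit towards outputting +1 at the point (1, 2).
w = unit_step((0.3, -0.2, 0.0), 1.0, 2.0, +1)
\end{verbatim}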

In the case of the three layer net, what I have called (following Nilsson) a committee net, the function implemented is

\begin{displaymath}
n(x,y) = sig(u*sig({\bf a}_1 \cdot \hat{{\bf x}}) + v*sig({\bf a}_2 \cdot \hat{{\bf x}}) + w*sig({\bf a}_3 \cdot \hat{{\bf x}}))
\end{displaymath}

where sig is the sigmoidal approximation to the sgn function, ${\bf a}_i, i=1,2,3$ are the three weight vectors, and $\hat{{\bf x}}$ is the augmented data vector; $({\bf a}_1 \cdot \hat{{\bf x}})$ is just $(a_1 x + b_1 y + c_1)$.

If we partially differentiate this with respect to, say, $a_1$, we get

\begin{displaymath}
\frac{\partial n}{\partial a_1} = sig'(u*sig({\bf a}_1 \cdot \hat{{\bf x}}) + v*sig({\bf a}_2 \cdot \hat{{\bf x}}) + w*sig({\bf a}_3 \cdot \hat{{\bf x}}))*u*sig'({\bf a}_1 \cdot \hat{{\bf x}})\,x
\end{displaymath}

and similar results hold for the other parameters. Taking a step against the gradient means decreasing $a_1$ by some constant times this number, and the other parameters by corresponding amounts.
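Similarly, the partial derivative with respect to the second layer weight $u$, which is where the backwards computation begins, is

\begin{displaymath}
\frac{\partial n}{\partial u} = sig'(u*sig({\bf a}_1 \cdot \hat{{\bf x}}) + v*sig({\bf a}_2 \cdot \hat{{\bf x}}) + w*sig({\bf a}_3 \cdot \hat{{\bf x}}))*sig({\bf a}_1 \cdot \hat{{\bf x}})
\end{displaymath}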

This is easily computed from the output backwards. If we use $\tanh$ as the sigmoid, we have the useful result that the derivative is $1 - \tanh^2$, and this means that no further calculations of transcendental functions are required beyond those needed for the evaluation. The generalisation to feedforward nets with more layers is straightforward. Turning the idea of gradient descent (pointwise, for each data point) into an algorithm is left as an easy exercise for the diligent student. Alternatively, this rule is the Back-Propagation Algorithm, and explicit formulae for it may be found in the references. More to the point, a program implementing it can be found on the accompanying disk.
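For the diligent student who would like something to check their answer against, here is a minimal Python sketch of one back-propagation step for the three layer committee net. The squared-error criterion and all the names are my assumptions; the text leaves these details open.

\begin{verbatim}
import math

def forward(A, u, xhat):
    """n = tanh(sum_i u[i]*tanh(A[i] . xhat)) for the three layer net.

    A holds the three first-layer weight vectors, u the three second
    layer weights, xhat the augmented data vector (x, y, 1).
    """
    dots = [sum(aj * xj for aj, xj in zip(a, xhat)) for a in A]
    hidden = [math.tanh(d) for d in dots]
    s = sum(ui * hi for ui, hi in zip(u, hidden))
    return math.tanh(s), dots, hidden, s

def backprop_step(A, u, xhat, target, rate=0.1):
    """One gradient step on the squared error (n - target)^2 / 2."""
    n, dots, hidden, s = forward(A, u, xhat)
    # Start at the output and work backwards; tanh' = 1 - tanh^2,
    # so no further transcendental functions are needed.
    d_out = (n - target) * (1.0 - n ** 2)
    new_u = [ui - rate * d_out * hi for ui, hi in zip(u, hidden)]
    new_A = []
    for a, ui, d in zip(A, u, dots):
        # Chain rule: sig'(outer sum) * u_i * sig'(a_i . xhat) * xhat.
        d_i = d_out * ui * (1.0 - math.tanh(d) ** 2)
        new_A.append([aj - rate * d_i * xj for aj, xj in zip(a, xhat)])
    return new_A, new_u

# One step towards target +1 at the data point (x, y) = (1, 2).
A = [[0.1, 0.2, 0.0], [-0.3, 0.1, 0.5], [0.2, -0.2, 0.1]]
A, u = backprop_step(A, [0.5, -0.4, 0.3], [1.0, 2.0, 1.0], +1.0)
\end{verbatim}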

It is noticeable that the actual sigmoid function does not occur in any very essential way in the back-propagation algorithm. In the expression for the change in $a_1$ for the three layer net,

\begin{displaymath}
\frac{\partial n}{\partial a_1} = sig'(u*sig({\bf a}_1 \cdot \hat{{\bf x}}) + v*sig({\bf a}_2 \cdot \hat{{\bf x}}) + w*sig({\bf a}_3 \cdot \hat{{\bf x}}))*u*sig'({\bf a}_1 \cdot \hat{{\bf x}})\,x
\end{displaymath}

(where we have to subtract some constant times this) the value of $sig(a_1 x + b_1 y + c_1)$ lies between $\pm 1$, and if the sigmoid is a steep one, that is to say if it looks like the sgn function, then the difference between sig and sgn will not be large. The derivative, however, does have an effect: the closer sig gets to sgn, the closer the differential committee net algorithm based on the derivative of sig gets to just moving the closest unit.
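The effect is easy to see numerically. In the following sketch the steepness parameter k is my own device for making the sigmoid look like the sgn function; as k grows, the derivative-based movement concentrates on the unit whose separating line is closest to the data point.

\begin{verbatim}
import math

def dsig(t, k):
    """Derivative (with respect to t) of the steepened sigmoid tanh(k*t)."""
    return k * (1.0 - math.tanh(k * t) ** 2)

# Three units whose activations put the data point at increasing
# distances from their separating lines.
for k in (1.0, 10.0):
    print(k, [round(dsig(d, k), 6) for d in (0.1, 1.0, 2.0)])
# k = 1.0  -> [0.990066, 0.419974, 0.070651]  (all three units move)
# k = 10.0 -> [4.199743, 0.0, 0.0]            (only the closest unit moves)
\end{verbatim}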

