
Parameters of the Learning Functions

The following learning parameters (from left to right) are used by the learning functions that are already built into SNNS:

Std_Backpropagation ("Vanilla" Backpropagation),
BackpropBatch and
TimeDelayBackprop
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ are $0.1 \ldots 1.0$. Some small examples actually train even faster with values above 1, such as 2.0.
2. $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and the output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. If values above 0.9 should be regarded as 1 and values below 0.1 as 0, then $d_{max}$ should be set to 0.1. This prevents overtraining of the network.
  Typical values of $d_{max}$ are 0, 0.1 or 0.2.
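The effect of the tolerance parameter can be illustrated with a minimal sketch. The function name `tolerated_error` is hypothetical and not part of SNNS; the sketch assumes that a difference whose magnitude does not exceed the tolerance is replaced by zero before it is propagated back.

```python
def tolerated_error(teaching, output, d_max):
    """Return the difference that is propagated back for one output unit.

    Assumption: differences with a magnitude of at most d_max count as
    "tolerated" and are replaced by 0.
    """
    d = teaching - output
    return 0.0 if abs(d) <= d_max else d
```

With a tolerance of 0.1, an output of 0.95 for a teaching value of 1 produces no error signal, while an output of 0.6 is still corrected.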
BackpropMomentum (Backpropagation with momentum term and flat spot elimination):
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ are $0.1 \ldots 1.0$. Some small examples actually train even faster with values above 1, such as 2.0.
2. $\mu$: momentum term, specifies the amount of the old weight change (relative to 1) which is added to the current change.
  Typical values of $\mu$ are $0 \ldots 1.0$.
3. c: flat spot elimination value, a constant value which is added to the derivative of the activation function to enable the network to pass flat spots of the error surface.
  Typical values of c are $0 \ldots 0.25$; most often 0.1 is used.
4. $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and the output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. See above.
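The general formula for Backpropagation used here is
  $\Delta w_{ij}(t+1) = \eta \, \delta_j \, o_i + \mu \, \Delta w_{ij}(t)$
with the error signal
  $\delta_j = (f'_j(net_j) + c)(t_j - o_j)$ if unit j is an output unit,
  $\delta_j = (f'_j(net_j) + c) \sum_k \delta_k w_{jk}$ if unit j is a hidden unit,
where $\eta$ is the learning parameter, $\mu$ the momentum term and c the flat spot elimination value.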
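How the learning parameter, the momentum term and the flat spot elimination value interact can be sketched as follows. This is an illustration only, not the SNNS implementation: the function names are hypothetical, and a logistic activation function is assumed.

```python
import math


def logistic(net):
    """Logistic activation function (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-net))


def output_delta(net, teaching, c):
    """Error signal of an output unit; c is added to the derivative."""
    o = logistic(net)
    derivative = o * (1.0 - o)  # derivative of the logistic function
    return (derivative + c) * (teaching - o)


def momentum_step(eta, mu, delta_j, o_i, previous_change):
    """New weight change: gradient step plus a fraction of the old change."""
    return eta * delta_j * o_i + mu * previous_change
```

Because c is added to the derivative, the error signal does not vanish when the unit is saturated and the derivative is close to zero.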
BackpropWeightDecay (Backpropagation with Weight Decay)
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ are $0.1 \ldots 1.0$. Some small examples actually train even faster with values above 1, such as 2.0.
2. d: weight decay term, specifies how much of the old weight value is subtracted after learning. Try values between 0.005 and 0.3.
3. $d_{min}$: the minimum weight that is tolerated for a link. All links with a smaller weight will be pruned.
4. $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and the output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. See above.
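The decay and pruning parameters above can be pictured with a small sketch. The helper `decay_and_prune` is hypothetical, and two details are assumptions rather than statements about SNNS: the decay subtracts the fraction d of each weight, and pruning compares the absolute weight value against the minimum weight.

```python
def decay_and_prune(weights, d, d_min):
    """Shrink every weight by the fraction d, then drop weak links.

    Assumptions: decay is w - d * w, and a link is pruned when the
    absolute value of its decayed weight is below d_min.
    """
    decayed = [w - d * w for w in weights]
    return [w for w in decayed if abs(w) >= d_min]
```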
BackpropThroughTime (BPTT),
BatchBackpropThroughTime (BBPTT):
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ for BPTT and BBPTT are $0.005 \ldots 0.1$.
2. $\mu$: momentum term, specifies the amount of the old weight change (relative to 1) which is added to the current change.
  Typical values of $\mu$ are $0.0 \ldots 1.0$.
3. backstep: the number of backprop steps back in time. BPTT stores a sequence of all unit activations while input patterns are applied. The activations are stored in a first-in-first-out queue for each unit. The largest backstep value supported is 10.
Quickprop:
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ for Quickprop are $0.1 \ldots 0.3$.
2. $\mu$: maximum growth parameter, specifies the maximum amount of weight change (relative to 1) which is added to the current change.
  Typical values of $\mu$ are $1.75 \ldots 2.25$.
3. $\nu$: weight decay term to shrink the weights.
  A typical value of $\nu$ is 0.0001. Quickprop is rather sensitive to this parameter. It should not be set too large.
4. $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and the output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. See above.
QuickpropThroughTime (QPTT):
1. $\eta$: learning parameter, specifies the step width of the gradient descent.
  Typical values of $\eta$ for QPTT are $0.005 \ldots 0.1$.
2. $\mu$: maximum growth parameter, specifies the maximum amount of weight change (relative to 1) which is added to the current change.
  Typical values of $\mu$ are $1.2 \ldots 1.75$.
3. $\nu$: weight decay term to shrink the weights.
  A typical value of $\nu$ is 0.0005.
4. backstep: the number of quickprop steps back in time. QPTT stores a sequence of all unit activations while input patterns are applied. The activations are stored in a first-in-first-out queue for each unit.
  The largest backstep value supported is 10.
Counterpropagation:
1. $\alpha$: learning parameter of the Kohonen layer.
  Typical values of $\alpha$ for Counterpropagation are $0.1 \ldots 0.7$.
2. $\beta$: learning parameter of the Grossberg layer.
  Typical values of $\beta$ are $0 \ldots 1.0$.
3. $\theta$: threshold of a unit.
  We often use a value of 0.
Backpercolation 1:
1. $\lambda$: global error magnification. This is the factor in the formula $\epsilon = \lambda (t - o)$, where $\epsilon$ is the internal activation error of a unit, t is the teaching input and o the output of a unit.
  A typical value of $\lambda$ is 1. Bigger values (up to 10) may also be used here.
2. : If the error value drops below this threshold value, the adaption according to the Backpercolation algorithm begins. is defined as:
3. $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and the output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. See above.
Dynamic Learning Vector Quantization (DLVQ):
1. $\eta^+$: learning rate, specifies the step width of the mean vector $\mu_A$, which is nearest to a pattern $x_A$, towards this pattern. Remember that $\mu_A$ is moved only if $x_A$ is not assigned to the correct class $w_A$. A typical value is 0.03.
2. $\eta^-$: learning rate, specifies the step width of a mean vector $\mu_B$, to which a pattern of class $w_A$ is falsely assigned, away from this pattern. A typical value is 0.03. Best results can be achieved if the condition $\eta^+ = \eta^-$ is satisfied.
3. Number of cycles you want to train the net before additive mean vectors are calculated.
RadialBasisLearning:
1. centers: determines the learning rate used for the modification of center vectors.
2. bias (p): determines the learning rate used for the modification of the parameters p of the base function. p is stored as bias of the hidden units.
3. weights: influences the training of all link weights that are leading to the output layer as well as the training of the bias of all output neurons.
4. delta max.: If the actual error is smaller than the maximum allowed error (delta max.), the corresponding weights are not changed.
5. momentum: influences the amount of the momentum term during training.
RadialBasisLearning with Dynamic Decay Adjustment:
1. $\theta^+$: positive threshold. To commit a new prototype, none of the existing RBFs of the correct class may have an activation above $\theta^+$.
2. $\theta^-$: negative threshold. During shrinking no RBF unit of a conflicting class is allowed to have an activation above $\theta^-$.
3. n: the maximum number of RBF units to be displayed in one row. This item allows the user to control the appearance of the network on the screen and has no influence on the performance.
ART1
1. $\rho$: vigilance parameter. If the quotient of active $F_1$ units divided by the number of active $F_0$ units is below $\rho$, an ART reset is performed.
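The reset condition can be written as a one-line check. The function `art1_reset` is a hypothetical helper, not an SNNS call; the sketch assumes that the quotient compares the number of active units in the comparison layer with the number of active units in the input layer, and that at least one input unit is active.

```python
def art1_reset(active_f1, active_f0, rho):
    """Return True if the vigilance test fails and a reset is due.

    Assumption: active_f1 and active_f0 are counts of active units,
    with active_f0 greater than zero.
    """
    return active_f1 / active_f0 < rho
```

A higher vigilance value makes resets more likely, so the network forms more, and more specific, categories.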
ART2
1. $\rho$: vigilance parameter. Specifies the minimal length of the error vector r (units $r_i$).
2. a: Strength of the influence of the lower level in $F_1$ by the middle level.
3. b: Strength of the influence of the middle level in $F_1$ by the upper level.
4. c: Part of the length of vector p (units $p_i$) used to compute the error.
5. $\theta$: Threshold for output function f of units $x_i$ and $q_i$.
ARTMAP
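1. $\rho^a$: vigilance parameter for the $ART^a$ subnet. (quotient $|F_1^a| / |F_0^a|$)
2. $\rho^b$: vigilance parameter for the $ART^b$ subnet. (quotient $|F_1^b| / |F_0^b|$)
3. $\rho$: vigilance parameter for inter ART reset control. (quotient $|F^{ab}| / |F_2^b|$)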
RPROP (resilient propagation)
1. $\Delta_0$: starting values for all $\Delta_{ij}$. Default value is 0.1.
2. $\Delta_{max}$: the upper limit for the update values $\Delta_{ij}$. The default value of $\Delta_{max}$ is 50.0.
3. $\alpha$: the weight decay determines the relationship between the output error and the reduction in the size of the weights. Important: Please note that the weight decay parameter $\alpha$ denotes the exponent, to allow comfortable input of very small weight decay values. A choice of the third learning parameter $\alpha = 4$ corresponds to a ratio of weight decay term to output error of 1:10000 ($1:10^4$).
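Because the third RPROP parameter is entered as an exponent, it can help to convert it into the actual ratio. The sketch below assumes a base of 10 with a negated exponent; `weight_decay_ratio` is a hypothetical helper, not an SNNS function.

```python
def weight_decay_ratio(alpha):
    """Ratio of weight decay term to output error for exponent alpha.

    Assumption: the ratio is 10 to the power of -alpha.
    """
    return 10.0 ** (-alpha)
```

Under this assumption, entering 4 gives a ratio of one to ten thousand, and each increment of the parameter shrinks the weight decay term by a further factor of ten.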
Cascade Correlation (CC) and
Recurrent Cascade Correlation (RCC)
CC and RCC are not learning functions themselves. They are meta algorithms to build and train optimal networks. However, they have a set of standard learning functions embedded. Here these functions require modified parameters. The embedded learning functions are:
- Backpropagation (in CC or RCC):
  1. $\eta_1$: learning parameter, specifies the step width of gradient descent minimizing the net error.
  2. $\mu_1$: momentum term, specifies the amount of the old weight change which is added to the current change.
  3. c: flat spot elimination value, a constant value which is added to the derivative of the activation function to enable the network to pass flat spots on the error surface.
  4. $\eta_2$: learning parameter, specifies the step width of gradient ascent maximizing the covariance.
  5. $\mu_2$: momentum term, specifies the amount of the old weight change which is added to the current change.
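  The general formula for this learning function is:
  
  $\Delta w_{ij}(t) = -\eta \, S(t) + \mu \, \Delta w_{ij}(t-1)$
  
  The slopes $\partial E / \partial w_{ij}$ and $-\partial C / \partial w_{ij}$ are abbreviated by S. This abbreviation is valid for all embedded functions. By changing the sign of the gradient value $\partial C / \partial w_{ij}$, the same learning function can be used to maximize the covariance and to minimize the error.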
- Rprop (in CC or RCC):
  1. $\eta_1^-$: decreasing factor, specifies the factor by which the update value is to be decreased when minimizing the net error. A typical value is 0.5.
  2. $\eta_1^+$: increasing factor, specifies the factor by which the update value is to be increased when minimizing the net error. A typical value is 1.2.
  3. not used.
  4. $\eta_2^-$: decreasing factor, specifies the factor by which the update value is to be decreased when maximizing the covariance. A typical value is 0.5.
  5. $\eta_2^+$: increasing factor, specifies the factor by which the update value is to be increased when maximizing the covariance. A typical value is 1.2.
  The weight change is computed by:
  
  where is defined as follows: . Furthermore, the condition should not be violated.
- Quickprop (in CC or RCC):
  1. $\eta_1$: learning parameter, specifies the step width of the gradient descent when minimizing the net error. A typical value is 0.0001.
  2. $\mu_1$: maximum growth parameter, realizes a kind of dynamic momentum term. A typical value is 2.0.
  3. $\nu$: weight decay term to shrink the weights. A typical value is 0.0001.
  4. $\eta_2$: learning parameter, specifies the step width of the gradient ascent when maximizing the covariance. A typical value is 0.0007.
  5. $\mu_2$: maximum growth parameter, realizes a kind of dynamic momentum term. A typical value is 2.0.
  The formula used is:
Kohonen
1. h(0): Adaptation height. The initial adaptation height can vary between 0 and 1. It determines the overall adaptation strength.
2. r(0): Adaptation radius. The initial adaptation radius is the radius of the neighborhood of the winning unit. All units within this radius are adapted. Values should range between 1 and the size of the map.
3. mult_H: Decrease factor. The adaptation height decreases monotonically after the presentation of every learning pattern. This decrease is controlled by the decrease factor mult_H: $h(t+1) := h(t) \cdot \mathrm{mult\_H}$
4. mult_R: Decrease factor. The adaptation radius also decreases monotonically after the presentation of every learning pattern. This second decrease is controlled by the decrease factor mult_R: $r(t+1) := r(t) \cdot \mathrm{mult\_R}$
5. h: Horizontal size. Since the internal representation of a network does not allow the 2-dimensional layout of the grid to be determined, the horizontal size in units must be provided for the learning function. It is the same value as used for the creation of the network.
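The two decrease factors can be pictured as a simple schedule. The sketch below assumes that both the adaptation height and the adaptation radius are multiplied by their decrease factor after every presented pattern; `kohonen_schedule` is a hypothetical helper, not an SNNS function.

```python
def kohonen_schedule(h0, r0, mult_h, mult_r, n_patterns):
    """List the (height, radius) pair in effect for each presented pattern.

    Assumption: both quantities shrink multiplicatively after each
    pattern presentation.
    """
    h, r = h0, r0
    schedule = []
    for _ in range(n_patterns):
        schedule.append((h, r))
        h *= mult_h
        r *= mult_r
    return schedule
```

Decrease factors close to 1 give a slow decay, which matters because the factors are applied once per pattern rather than once per epoch.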
RM_delta (Rumelhart and McClelland's delta rule)
1. n: learning parameter, specifies the step width of the gradient descent. In [RM86] Rumelhart and McClelland use 0.01, although values less than 0.03 are generally acceptable.
2. Ncycles: number of update cycles, specifies how many times a pattern is propagated through the network before the learning rule is applied. This parameter must be large enough so that the network is relatively stable after the set number of propagations. A value of 50 is recommended as a baseline. Increasing the value of this parameter increases the accuracy of the network but at a cost of processing time. Larger networks will probably require a higher setting of Ncycles.
NOTE: This learning rule requires the update function RM_Synchronous, which takes the number of iterations as its update parameter.
Hebbian Learning
1. n: learning parameter, specifies the step width of the gradient descent. Values less than (1 / number of nodes) are recommended.
2. Wmax: maximum weight strength, specifies the maximum absolute value of weight allowed in the network. A value of 1.0 is recommended, although this should be lowered if the network experiences explosive growth in the weights and activations. Larger networks will require lower values of Wmax.
3. count: number of times the network is updated before calculating the error.
NOTE: This learning rule requires the update function RM_Synchronous, which takes the number of iterations as its update parameter.
Monte-Carlo:
1. Min: lower limit of weights and biases. Typical values are .
2. Max: upper limit of weights and biases. Typical values are .
Simulated_Annealing_SS_error,
Simulated_Annealing_WTA_error and
Simulated_Annealing_WWTA_error:
1. Min: lower limit of weights and biases. Typical values are .
2. Max: upper limit of weights and biases. Typical values are .
3. $T_0$: learning parameter, specifies the Simulated Annealing start temperature. Typical values of $T_0$ are $1 \ldots 10$.
4. deg: degradation term of the temperature: $T_{new} = T_{old} \cdot deg$. Typical values of deg are $0.99 \ldots 0.99999$.
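A minimal sketch of how a start temperature and a degradation term are typically used in simulated annealing follows. It is a textbook scheme under stated assumptions, not the SNNS implementation: the temperature is assumed to be multiplied by deg after every step, and worse solutions are assumed to be accepted with a Metropolis-style probability. Both function names are hypothetical.

```python
import math


def temperatures(t0, deg, steps):
    """Yield the temperature for each step, assuming geometric cooling."""
    t = t0
    for _ in range(steps):
        yield t
        t *= deg


def accept_probability(error_increase, temperature):
    """Metropolis-style acceptance probability (an assumed rule)."""
    if error_increase <= 0:
        return 1.0  # improvements are always accepted
    return math.exp(-error_increase / temperature)
```

With a degradation term just below 1 the temperature falls slowly, so error increases remain acceptable for many steps before the search settles.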
Scaled Conjugate Gradient (SCG)
All of the following parameters are non-critical, i.e. they influence only the speed of convergence, not whether there will be success or not.
1. $\sigma_1$. Should satisfy $0 < \sigma_1 \le 10^{-4}$. If 0, will be set to $10^{-4}$;
2. $\lambda_1$. Should satisfy $0 < \lambda_1 \le 10^{-6}$. If 0, will be set to $10^{-6}$;
3. $\Delta_{max}$. See standard backpropagation. Can be set to 0 if you don't know what to do with it;
4. $\epsilon_1$. Depends on the floating-point precision. Should be set to $10^{-8}$ (single precision) or to $10^{-16}$ (double precision). If 0, will be set to $10^{-8}$.
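The "if 0, will be set to" rule for the SCG parameters can be sketched as a small helper. The function `scg_parameters`, its argument names and the fallback values in the code are assumptions of this sketch, not an SNNS interface.

```python
def scg_parameters(sigma1=0.0, lambda1=0.0, d_max=0.0, epsilon1=0.0):
    """Replace zero entries by fallback values; other values pass through.

    The fallback values (1e-4, 1e-6, 1e-8) are assumptions of this sketch.
    """
    return {
        "sigma1": sigma1 if sigma1 != 0 else 1e-4,
        "lambda1": lambda1 if lambda1 != 0 else 1e-6,
        "d_max": d_max,  # 0 is a legitimate setting, so it is kept
        "epsilon1": epsilon1 if epsilon1 != 0 else 1e-8,
    }
```

Passing all zeros therefore yields a usable configuration, which fits the remark that these parameters only influence the speed of convergence.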


Niels.Mache@informatik.uni-stuttgart.de
Tue Nov 28 10:30:44 MET 1995