Adaptive critic for sigma-pi networks

Richard Stuart Neville, Thomas John Stonham
1996, Neural Networks
This article presents an investigation of how training of sigma-pi networks under the associative reward-penalty (A_R-P) regime may be enhanced by using two networks in parallel. The technique uses what has been termed an unsupervised "adaptive critic element" (ACE) to give critical advice to the supervised sigma-pi network. We utilise the conventions of the sigma-pi neuron model (i.e., quantisation of variables) to obtain an implementation we term the "quantised adaptive critic", which is hardware realisable. The associative reward-penalty training regime either rewards, r = 1, the neural network by incrementing the weights of the net by a delta term times a learning rate, α, or penalises, r = 0, the network by decrementing the weights by an inverse delta term times the product of the learning rate and a penalty coefficient, α × λ_rp. Our initial research, utilising a "bounded" reward signal, r* ∈ {0, ..., 1}, found that the critic provides advisory information to the sigma-pi net which augments its training efficiency. This led us to develop an extension of the adaptive critic and associative reward-penalty methodologies, utilising an "unbounded" reward signal, r* ∈ {-1, ..., 2}, which permits penalisation of a net even when the penalty coefficient is set to zero, λ_rp = 0. One should note that with the standard associative reward-penalty methodology the net is normally only penalised if the penalty coefficient is non-zero (i.e., 0 < λ_rp ≤ 1). One of the enigmas of associative reward-penalty (A_R-P) training is that it broadcasts sparse information, in the form of an instantaneous binary reward signal that depends only on the present output error. Here we put forward ACE and A_R-P methodologies for sigma-pi nets which are based on tracing the frequency of "stimuli" occurrence and then using this to derive a prediction of the reinforcement. The predictions are then used to derive a reinforcement signal which uses temporal information. Hence one may use more precise information to enable more efficient training.
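The reward and penalty updates described in the abstract can be sketched as follows. This is a minimal illustration assuming the standard Barto-Anandan form of the A_R-P rule for a single stochastic binary unit; the names `rho`, `lam`, `arp_update`, and the toy identity task are illustrative assumptions, not taken from the paper, and the sketch omits the sigma-pi structure, quantisation, and the ACE itself.

```python
import numpy as np

rng = np.random.default_rng(0)


def arp_update(w, x, y, p, r, rho=0.5, lam=0.01):
    """One associative reward-penalty (A_R-P) weight update (sketch).

    w: weight vector; x: input vector; y: stochastic binary output;
    p: firing probability; r: binary reinforcement (1 = reward, 0 = penalty).
    rho plays the role of the learning rate (alpha in the abstract) and
    lam the penalty coefficient (lambda_rp). Reward moves w toward the
    action taken (the "delta term"); penalty moves it toward the opposite
    action (the "inverse delta term"), scaled down by lam.
    """
    if r == 1:
        return w + rho * (y - p) * x                # reward: delta term times rate
    return w + rho * lam * ((1 - y) - p) * x        # penalty: inverse delta term


# Toy usage: learn the identity mapping y = x for a single input bit.
w = np.zeros(2)  # one input weight plus a bias weight
for _ in range(3000):
    xb = rng.integers(0, 2)
    x = np.array([float(xb), 1.0])                  # input with bias term
    p = 1.0 / (1.0 + np.exp(-(w @ x)))              # firing probability
    y = int(rng.random() < p)                       # stochastic binary output
    r = 1 if y == xb else 0                         # reward when output matches target
    w = arp_update(w, x, y, p, r)
```

Note that with `lam = 0` this sketch never changes the weights on a penalty step, which is exactly the limitation the paper's "unbounded" reward signal, r* ∈ {-1, ..., 2}, is introduced to overcome.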
doi:10.1016/0893-6080(96)00015-9 fatcat:dzhmwaspezfnjj3cpdmkq4kf3m