Before we introduce and explore the deep learner N, we need to establish our closed loop system. The configuration depicted in Fig. 1A is the architecture of this learning paradigm, which provides a closed loop platform for autonomous learning. It consists of an inner reflex loop and an outer predictive loop that contains the learning unit. In the absence of any learning, the reflex loop receives a delayed disturbance via the reflex environment, which leads to the actual state. Given the desired state, the closed loop Error is generated as the difference between the desired state and the actual state. This drives the agent to take an appropriate reflex action so as to recover the desired state and force the Error to zero. However, the reflex mechanism can only react to the disturbance D after it has perturbed the system.
Hence, the aim of the learning loop is to fend off D before it has disturbed the state of the robot. To that end, this loop receives D via the predictive environment in advance of the reflex loop. This provides the learning unit with predictive signals from which, given its internal parameters, it generates a predictive action. During the learning process, the predictive action enters the reflex loop together with the delayed disturbance and the closed loop Error is generated. This error signal provides the deep learner N with a minimal instructive feedback. Upon learning, the predictive action fully combats D on its arrival at the reflex loop; hence the reflex mechanism is no longer evoked and the closed loop Error is kept at zero.
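The interplay of the two loops can be illustrated with a deliberately small toy model. The sketch below is not the deep learner N: it assumes a scalar state, a desired state of zero, a proportional reflex acting on the previous error, and a learner with a single weight whose predictor is assumed to line up exactly with the arrival of the disturbance. The names `run_trial`, `reflex_gain`, `delay` and `period` are hypothetical and chosen only for this illustration.

```python
import math

def run_trial(learning_rate, steps=2000, delay=5, period=20, reflex_gain=0.5):
    """Toy closed loop: proportional reflex plus one learned predictive weight."""
    weight = 0.0        # internal parameter of the single-weight learner (assumption)
    prev_error = 0.0    # the reflex can only react to an error that already occurred
    errors = []
    for t in range(steps):
        source = t - delay
        # Unit disturbance pulse, reaching the reflex loop `delay` steps after
        # the predictive loop has seen it.
        d = 1.0 if source >= 0 and source % period == 0 else 0.0
        predictor = d                      # assumed: filtered predictor peaks on arrival
        predictive_action = weight * predictor
        reflex_action = reflex_gain * prev_error
        state = d + predictive_action + reflex_action
        error = 0.0 - state                # desired state is zero
        # Correlate the closed loop error with the predictor to adjust the weight.
        weight += learning_rate * error * predictor
        errors.append(error)
        prev_error = error
    return errors, weight

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

reflex_errors, _ = run_trial(learning_rate=0.0)            # reflex only
learned_errors, learned_weight = run_trial(learning_rate=0.1)
print(rms(reflex_errors[-200:]), rms(learned_errors[-200:]), learned_weight)
```

With learning switched off the error recurs at every pulse, whereas with learning the weight approaches the value at which the predictive action cancels the disturbance on arrival, so the reflex is no longer evoked.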
Figure 1: A) The closed loop platform, consisting of an inner reflex loop (solid lines) and an outer learning loop (dashed lines). The learning unit generates a forward model of the environment. Given the predictive inputs (the filtered predictive signals), it generates an action that combats the disturbance D on its arrival at the reflex loop. Finally, the closed loop error gives an instructive feedback to the learning unit on how well the predictive action protected the system from D. B) The computational unit, showing the forward propagation of the inputs and the back-propagation of the error to the deeper layers. The dotted line marks the correlation of the closed loop error with the internal parameters of the neuron, highlighting the update rule, in which the transfer function of the reflex loop appears.
The aim of the learning is to keep the closed loop Error at zero. Referring to Fig. 1A, this signal is derived from the signals in the reflex loop (equations 1 and 2). In mathematical terms, learning entails adjusting the internal parameters of the learning unit so that the closed loop Error is kept at zero. To that end, the closed loop Cost-Function is defined as the square of the absolute value of the closed loop Error. Introducing this Cost-Function translates the learning goal into adjusting the internal parameters so that the Cost-Function is minimised; this in turn ensures that the closed loop Error is kept at zero.
The behaviour of the gradient of the closed loop Cost-Function with respect to the internal parameters is best explained by separating it into two partial derivatives, one for the closed loop and one for the learner (equation 4). The former partial derivative, termed the closed loop Gradient, relates solely to the dynamics of the closed loop platform; it is derived from equations 1 and 2 (equation 5).
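The separation of gradients described here amounts to a chain rule through the predictive action. The following is a sketch only: the symbols $C$ (closed loop Cost-Function), $E$ (closed loop Error), $A$ (predictive action) and $\omega$ (internal parameters of the learning unit) are placeholder names assumed for this illustration and need not match the notation of the original equations.

```latex
\begin{align}
  C &= |E|^{2}, \\
  \frac{\partial C}{\partial \omega}
    &= \underbrace{\frac{\partial C}{\partial A}}_{\text{closed loop Gradient}}
       \cdot
       \underbrace{\frac{\partial A}{\partial \omega}}_{\text{Network Gradient}} .
\end{align}
```

The first factor depends only on the closed loop platform, while the second depends only on the inner configuration of the learning unit, which is why the two can be treated separately.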
To link open-loop backpropagation to our closed loop learning paradigm, we need to relate our closed loop error to the standard open-loop error of backpropagation. In conventional open-loop implementations, the open-loop Cost-Function and the open-loop Error are defined at the action output of the network, with respect to the desired predictive action (equation 6). Minimisation of the open-loop Cost-Function with respect to the internal parameters of the learning unit again separates into two partial derivatives (equation 7). The former partial derivative is termed the open-loop Gradient and follows from equation 6.
Now we relate the open-loop parameters to their closed loop counterparts by expanding equation 1. Given that the transfer function relating the two errors is non-zero, the open-loop Error is kept at zero if and only if the closed loop Error is kept at zero.
Having now established how the error can be fed into an error backpropagation framework we are now able to present the inner workings of the learning unit.
Having explored the dynamics of the closed loop, we now focus on the inner workings of the learning unit. The latter partial derivative in equations 4 and 7, termed the Network Gradient, depends only on the inner configuration of the learning unit, which in this work is a Deep Neural Network (DNN) with backpropagation (BP). Given that the network is situated in the closed loop platform, its dynamics are expressed in z-space. The Forward-Propagation (FP) entails feeding the predictive inputs into the network and generating the predictive action. This is shown in Fig. 1B with solid lines and is expressed in equation 11 in terms of the activations of the neurons and their weights, which are analogous to weights in the time domain. Using equation 11, differentiating with respect to specific weights yields a partial derivative that is termed the Internal Gradient G and is calculated using backpropagation (equation 13). The Internal Error of a neuron, which measures the sensitivity of the closed loop Cost-Function with respect to the neuron's activation, then follows from equations 5 and 13 (equation 14). The update rule can be expressed as the correlation of the Internal Error of the neuron with the activation of the previous neuron (equation 15).
The small learning rate ensures that the time-dependent weight change is small. Note that Eq. 15 results in a weight change in the time domain which is calculated in z-space; for that reason we call this learning scheme Inter-Domain Learning (IDL).
The gradient of the closed loop Cost-Function with respect to an arbitrary weight follows from equations 5, 11, and 14. It shows that the change in the Cost-Function with respect to an arbitrary weight depends on the weighted Internal Error introduced in the adjacent deeper layer. This is the propagation of the closed loop error into the deeper layers and shows the backpropagation in the z-domain.
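The forward propagation, the propagation of the error as a weighted sum of internal errors, and the correlation-based update can be sketched in plain time-domain code. This is a minimal illustration and not the authors' C++ library: the tanh activation, the uniform weight initialisation, the sign convention (a positive error calls for a larger action) and the function names `make_network`, `forward` and `learn_step` are all assumptions, and the transfer function of the reflex loop is taken as unity.

```python
import math
import random

def make_network(sizes, seed=0):
    """Small random weights; weights[l][j][i] links neuron i of layer l to neuron j of layer l+1."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(weights, inputs):
    """Forward propagation; returns the activations of every layer, inputs first."""
    activations = [list(inputs)]
    for layer in weights:
        previous = activations[-1]
        activations.append([math.tanh(sum(w * a for w, a in zip(row, previous)))
                            for row in layer])
    return activations

def learn_step(weights, activations, closed_loop_error, learning_rate):
    """Back-propagate the closed loop error and update all weights in place."""
    # Internal error of the output neurons (reflex transfer function assumed unity).
    internal = [closed_loop_error * (1.0 - a * a) for a in activations[-1]]
    for index in range(len(weights) - 1, -1, -1):
        layer, previous = weights[index], activations[index]
        if index > 0:
            # Weighted sum of internal errors, handed to the adjacent deeper layer.
            deeper = [sum(layer[j][i] * internal[j] for j in range(len(layer)))
                      * (1.0 - previous[i] * previous[i])
                      for i in range(len(previous))]
        # Update: internal error of a neuron times the activation feeding it.
        for j, row in enumerate(layer):
            for i in range(len(row)):
                row[i] += learning_rate * internal[j] * previous[i]
        if index > 0:
            internal = deeper

network = make_network([40, 12, 6, 1])   # 40 inputs, hidden layers of 12 and 6, 1 output
inputs = [0.5] * 40
before = forward(network, inputs)
learn_step(network, before, closed_loop_error=1.0, learning_rate=0.05)
after = forward(network, inputs)
print(before[-1][0], after[-1][0])
```

Under these assumptions a positive error nudges the output upwards on the next forward pass, and a zero error leaves every weight untouched.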
The performance of our Inter-Domain Learning (IDL) paradigm is tested using a line follower, both in simulation and through experiments with a real robot. The learning paradigm was implemented as a bespoke low-level C++ external library [Dar, 2019]. The transfer function of the reflex loop, derived in equation 5, is set to unity for the following results.
6.1 Simulations
Fig. 2A shows the configuration of the robot and its environment for simulations. The closed loop error is calculated from the right and left ground sensors.
For prediction, 8 predictive signals are generated using an array of 16 ground sensors placed ahead of the robot as shown in the left-hand side of Fig. 2A.
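As stated in the caption of Fig. 2, the predictors are formed as differences of symmetrical sensor pairs, so 16 sensors yield 8 predictive signals. A minimal sketch of that step follows; the pairing order (outermost pair first), the sign of the difference and the name `predictors_from_sensors` are assumptions made for illustration.

```python
def predictors_from_sensors(sensors):
    """Difference of each mirrored sensor pair, outermost pair first (assumed order)."""
    if len(sensors) % 2 != 0:
        raise ValueError("expected an even number of sensors")
    half = len(sensors) // 2
    # Pair sensor i (counted from the left) with its mirror image on the right.
    return [sensors[i] - sensors[-1 - i] for i in range(half)]

readings = [0.0] * 16
readings[2] = 1.0          # the line is seen by the third sensor from the left
predictors = predictors_from_sensors(readings)
print(predictors)
```

A line seen on the left produces a positive predictor and the mirrored situation a negative one, so the sign carries the required steering direction.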
These are then filtered using a bank of 5 second-order low-pass filters, with a damping coefficient of Q = 0.51 and impulse responses lasting between 3 and 10 iterations, so as to cause the correct delay for the correlation of the predictors with the error signal. This results in 40 predictive inputs to the network, which is configured with 12 and 6 neurons in the first two hidden layers and 1 output neuron in the final layer. The steering of the robot is facilitated through adjustments of the left and right wheel velocities (Fig. 2A), using experimentally tuned parameters.
The simulation environment is shown in Fig. 2B, where the robot follows the track from the start point and in a loop for 1000 iterations. A set of simulations was carried out to contrast the reflex and the predictive behaviours; each scenario was repeated 10 times for reproducibility and statistical analysis. Fig. 3A and B show the average closed loop error over 10 trials for the reflex and learning behaviours respectively. A comparison of these results shows very fast learning, with the error signal forced to zero. The top and bottom sections of Fig. 2B show the trajectory of the robot over the course of one trial for the reflex and learning behaviours respectively; in the presence of learning the steering is anticipatory and the trajectory is smooth, whereas in the absence of learning the steering is reactive and hence abrupt.
Fig. 3D shows the normalised Euclidean distance of the weights in each layer from their random initialisation. The distance increases gradually from zero to its maximum over the course of one simulation. Since the error signal is propagated as a weighted sum of the internal errors, all layers show a similar rate of change in their weight distance. Moreover, Fig. 3C shows the final distribution of the first layer's weights, upon completion of the learning, as a normalised greyscale map. The weights show an organised distribution, with larger weights associated with the outer predictors and smaller weights associated with the inner predictors; see Fig. 2A for the position of the predictors. This facilitates sharper steering for the outer predictors, ensuring a smooth trajectory, as shown in the bottom section of Fig. 2B.
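The weight distance plotted per layer can be computed as sketched below. The text does not state how the distance is normalised, so dividing by the largest distance reached during the trial is an assumption, as are the function names and the flat-list representation of a layer's weights.

```python
import math

def weight_distance(weights, initial_weights):
    """Euclidean distance between a layer's weights and their initial values."""
    return math.sqrt(sum((w - w0) ** 2 for w, w0 in zip(weights, initial_weights)))

def normalised_distances(snapshots):
    """snapshots: flat weight lists of one layer over time, initialisation first."""
    raw = [weight_distance(s, snapshots[0]) for s in snapshots]
    peak = max(raw)
    # Assumed normalisation: scale by the largest distance reached in the trial.
    return [r / peak if peak > 0 else 0.0 for r in raw]

history = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = normalised_distances(history)
print(distances)
```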
Another set of simulations was carried out with learning rates spanning five orders of magnitude; each scenario was repeated 10 times. Fig. 4C shows the Root Mean Square (RMS) of the error signal for each learning trial, as well as that of the reflex trials for comparison. All learning scenarios show a significantly smaller RMS error than the reflex behaviour (for example, RMS = 0.05 for reflex only against RMS = 0.01 with learning for the trials in Fig. 3), and this value decreases gradually as the learning rate is increased. A smaller RMS error indicates a reduction in both the amplitude and the recurrence of the error signal.
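The RMS measure used for this comparison is straightforward to compute; the short sketch below also illustrates why it captures both the amplitude and the recurrence of the error. The sample signals are invented purely for illustration.

```python
import math

def rms(signal):
    """Root mean square of a sampled error signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

persistent = [0.05, -0.05, 0.05, -0.05]   # error recurs at every step
sparse = [0.05, 0.0, 0.0, 0.0]            # same amplitude, but it occurs only once
print(rms(persistent), rms(sparse))
```

The second signal has the same peak amplitude as the first but half its RMS, because the error occurs less often.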
Figure 2: A) Schematic of the robot and its environment: the robot is composed of a body with two wheels, each with its own speed, and two ground sensors from which the closed loop error is generated. The robot is placed on a track and has vision of the path ahead. In simulations this is an array of 16 symmetrical ground sensors (left-hand side), and in real experiments it is the camera view with 6×16 pixels (right-hand side). From this the predictors are generated as the differences of symmetrical pixel pairs. B) and C) The track and the robot's trajectory for simulations and experiments respectively. In both cases the top section shows the trajectory of the robot during a reflex trial, with a poor, uneven trace, whereas the bottom section shows the trajectory for a learning trial, with a smooth and even trace.
6.2 Experiments
The experiments with a real robot were carried out using a Parallax SumoBot as a mechanical test-bed, a Raspberry Pi 3B+ for computation and an Arduino Nano as the motor controller. For predictive learning, a camera was mounted on the robot, providing vision of the path ahead (see right-hand side of Fig. 2A) as a matrix of pixels from which the predictive signals are extracted.
With 6 columns and 8 rows, 48 predictive signals were extracted and filtered using 5 second-order low-pass filters with Q = 0.51 and impulse responses lasting from 5 to 10 time steps to cause the correct delay. The error signal was defined as a weighted sum of 3 light sensors for a smoother and more informative error signal. The deep neural network was configured with 11 hidden layers each with 11 neurons, as well as an output layer with 3 neurons to facilitate slow, medium and sharp steering.
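Passing each of the 48 predictive signals through 5 filters gives, by the same counting as in the simulations, 240 inputs to the network. The sketch below shows that fan-out with a stand-in filter: two cascaded first-order smoothers, which form a critically damped second-order low-pass (Q = 0.5 rather than the Q = 0.51 quoted above). The class `TwoPoleLowPass`, the mapping from impulse-response length to smoothing factor, and the evenly spaced lengths between 5 and 10 steps are all assumptions and not the authors' filter design.

```python
class TwoPoleLowPass:
    """Stand-in second-order low-pass: two identical first-order smoothers in cascade."""

    def __init__(self, length):
        self.alpha = 2.0 / length   # assumed mapping from response length to smoothing factor
        self.stage1 = 0.0
        self.stage2 = 0.0

    def step(self, x):
        self.stage1 += self.alpha * (x - self.stage1)
        self.stage2 += self.alpha * (self.stage1 - self.stage2)
        return self.stage2

def make_filter_bank(n_signals, lengths):
    """One independent set of filters per predictive signal."""
    return [[TwoPoleLowPass(length) for length in lengths] for _ in range(n_signals)]

def filter_step(bank, signals):
    """Advance every filter by one sample and return the flattened network inputs."""
    return [f.step(x) for filters, x in zip(bank, signals) for f in filters]

lengths = [5.0, 6.25, 7.5, 8.75, 10.0]           # assumed spacing between 5 and 10 steps
bank = make_filter_bank(48, lengths)
network_inputs = filter_step(bank, [1.0] + [0.0] * 47)   # impulse on the first predictor
print(len(network_inputs))
```

Each predictor is thereby presented to the network at several delays, which lets the learner pick the delay that correlates best with the later error signal.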
Fig. 5A shows the closed-loop error in the absence of learning, when the robot navigates using its reflex system only. The error signal is very persistent in this case, as the robot can only generate an appropriate steering command retrospectively, after an error has occurred. This sets a benchmark for the evaluation of the deep learner. Fig. 5B shows the error signal in the presence of the deep learner with a learning rate of η = 2e-1. It shows a strong reduction of the error signal over the first 1500 steps; the learning is achieved rapidly using the closed-loop error, which acts as a minimal instructive feedback for the deep learner. Fig. 5C shows the final distribution of the weights in the first layer, which associates different strengths with different pixel locations of the predictors. From the gradient it can be seen that the farther the predictor is from the centre line, the greater the steering action; this is also illustrated in Fig. 2A. Fig. 5D shows the weight change in each layer, as explained in the simulation results. The weight distance changes noticeably over the first 1500 steps, dictated by the closed-loop error, but comes to a stable plateau as the error signal remains at zero.
Figure 3: Simulation results. A) Shows the robot's closed-loop error signal when navigating by the reflex mechanism only; note the high amplitude (RMS = 0.05) and frequent occurrence of the error whilst the learning is off. B) Shows the robot's closed-loop error signal when navigating by both the reflex and the learning mechanisms; note that both the amplitude (RMS = 0.01) and the occurrence of the error are reduced significantly compared with reflex only. C) Shows a greyscale map (scale 0.0 to 1.0) of the final weight distribution in the first layer after the learning is completed (η = 1e-1); note that the weight distribution closely follows the location of the predictors with which the weights are associated, with weights associated with the outermost predictors having high values and weights associated with the innermost predictors having small values, to allow for abrupt and subtle steering of the robot respectively. For the position of the predictors refer to Fig. 2A. D) Shows the normalised Euclidean distance of the weights in each layer during learning; note the gradual increase of the weight distance, which stabilises towards the end of learning where the error is kept at zero.
Fig. 2C shows the robot's track and compares the trajectory of the robot for a reflex and a learning trial. The top section of the figure shows that, when the learning is off, the trace of the robot almost always remains outside the track, with a few crossover points indicated by a star, whereas the bottom section shows that with learning the trace of the robot is aligned with the track.
Figure 4: A) Shows the number of steps taken until the success condition is met for 5 different random seeds for weight initialisation. Note that the random initialisation of the weights plays no significant role in the learning and success time. B) Shows the effect of the learning rate on the number of steps taken until the success condition is met. The data show a significant exponential decrease in the time taken before successful learning is achieved; in other words, the learning is significantly faster for higher learning rates. C) Shows the effect of the learning rate on the RMS value of the closed-loop error. Note the significant reduction of the closed-loop error in the presence of learning compared with reflex only, as well as the gradual improvement of learning (faster learning) with an exponential increase of the learning rate. Examples of these trials are shown in Fig. 3A and B for reflex and learning respectively.
The experiment with the deep learner was repeated with 5 different random weight initialisations, using the random seeds srand(i) where i = {0, 1, 2, 3, 4}. In the presence of learning, "success" refers to a condition where the closed-loop error shows a minimum of 75 percent reduction from its average value during reflex-only trials, for 100 consecutive steps. Fig. 4A shows that different random initialisations of the weights make no significant difference to the time that it takes for the learner to meet the success condition.
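The success condition can be turned into a small detector that returns the step at which it is first met. This is a sketch under stated assumptions: the "average value" during reflex-only trials is taken to be the mean absolute error, the 75 percent reduction is checked sample by sample, and the name `steps_to_success` is hypothetical.

```python
def steps_to_success(errors, reflex_average, reduction=0.75, window=100):
    """First step at which |error| has stayed at or below (1 - reduction) times
    the reflex-only average for `window` steps in a row; None if never met."""
    threshold = (1.0 - reduction) * reflex_average
    run = 0
    for step, error in enumerate(errors):
        run = run + 1 if abs(error) <= threshold else 0
        if run >= window:
            return step
    return None

trial = [1.0] * 50 + [0.1] * 150     # invented error trace: large, then small
print(steps_to_success(trial, reflex_average=1.0))
```

Requiring a full window of consecutive steps prevents a brief lull in the error from being counted as successful learning.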
The experiment was repeated with 5 different learning rates; each experiment was repeated 5 times for reproducibility. Fig. 4B shows the time taken to reach success for these trials. The data show an exponential decay of the success time as the learning rate is increased.
Fig. 6 shows another example of a learning trial, similar to that in Fig. 5 but with a smaller learning rate (η = 2e-3). Fig. 6A shows the contribution of the deep learner to the resultant differential speed of the robot. This quantity is small and inaccurate at the start of the trial, where the reflex mechanism governs the navigation of the robot; however, the contribution of the learner grows larger and more precise over time as the learner begins to dominate the navigation. This transition from reflex to learned navigation is also seen in Fig. 6B, where the error signal decreases gradually as successful learning is approached. Fig. 6C shows the final distribution of the weights in the first layer, with gradients similar to but cruder than those of Fig. 5C. Fig. 6D shows the weight change in each layer during learning, which is more gradual than in the trial with the higher learning rate shown in Fig. 5D.
Figure 5: Experimental results for a learning rate of η = 2e-1. A) Shows the closed-loop error when the robot navigates with the reflex mechanism only. This sets a benchmark for evaluating the performance of the learning; note the high amplitude and persistence of this signal. B) Shows the closed-loop error when the learning mechanism governs the navigation of the robot. Note the significant reduction of the error signal compared with the reflex data, showing fast learning. C) Shows a greyscale map (scale 0.0 to 1.0) of the final weight distribution in the first layer after the learning has been completed. Note that the weight distribution closely follows the location of the predictors with which the weights are associated, with weights associated with the outermost predictors having high values and weights associated with the innermost predictors having small values, to allow for abrupt and subtle steering of the robot respectively. This greyscale mapping is also illustrated in Fig. 2A. D) Shows the normalised Euclidean distance of the weights in each layer during learning.
In this paper we have presented a learning algorithm which creates a forward model of a reflex employing a multi-layered network. Previous work in this area used shallow [Kulvicius et al., 2007], usually single-layer, networks to learn a forward model [Nakanishi and Schaal, 2004, Porr and Wörgötter, 2006], and it was not possible to employ deeper structures. On the other hand, model-free RL has been using more complex network structures such as deep learning, by combining it with Q-learning where the network learns to estimate an expected reward [Guo et al., 2014, Bansal et al., 2016]. At first sight these look like two competing approaches because they both use deep networks with error backpropagation. However, they serve different purposes, as discussed in Dolan and Dayan [2013] and Botvinick and Weinstein [2014], which led to the idea of hierarchical RL where RL provides a prediction error for an actor which can then develop forward models.
Both in deep RL [Guo et al., 2014] and in our algorithm, error backpropagation is employed, which is a mathematical trick where an error/cost function is expanded with the help of partial derivatives [Rumelhart et al., 1986]. This approach is appropriate for open-loop scenarios, but for closed-loop scenarios one needs to take into account the endless recursion caused by the closed loop. In order to solve this problem we have switched to the z-domain, in which the recursion turns into simple algebra. A different approach has been taken by LSTM networks, where the recursion is unrolled and backpropagation through time is used to calculate the weights [Hochreiter and Schmidhuber, 1997]; this is done offline, whereas in our algorithm it is done while the agent acts in its environment.
Figure 6: Experimental results for a learning rate of η = 2e-3. A) Shows the predictive action of the network. This is the contribution of the learning to the steering of the robot in anticipation of a disturbance (a turn in the road). Note that as the learning improves, the amplitude of the steering increases and becomes more precise. B) Shows the closed-loop error when the learning mechanism governs the navigation of the robot. Note that the error is continuously reduced over time as the learning progresses. C) Shows a greyscale map (scale 0.0 to 1.0) of the final weight distribution in the first layer after the learning is completed. Note that the weight distribution closely follows the location of the predictors with which the weights are associated, with weights associated with the outermost predictors having high values and weights associated with the innermost predictors having small values, to allow for abrupt and subtle steering of the robot respectively. This greyscale mapping is also illustrated in Fig. 2A. D) Shows the normalised Euclidean distance of the weights in each layer during learning.
Deep learning is generally a slow learning algorithm, and deep RL tends to be even slower because of the sparsity of the discrete rewards. On the other hand, purely continuous or sampled continuous systems can be very fast because they have continuous error feedback, so that in terms of behaviour nearly one-shot learning can be achieved [Porr and Wörgötter, 2006]. However, this comes at a price, namely that forward models are learned from simple reflex behaviours and no sophisticated planning can be achieved. For that reason it has been suggested to combine model-free deep RL with model-based learning to have a slow and a fast system [Botvinick et al., 2019].
Forward models play an important role in robotic and biological motor control [Wolpert and Kawato, 1998, Wolpert et al., 2001, Haruno et al., 2001, Nakanishi and Schaal, 2004], where forward models guarantee an optimal trajectory after learning. Our approach offers opportunities to learn more complex forward models with the help of deep networks and then to combine them with traditional Q-learning to plan those movements.
Inter domain learning (IDL) source code. 10.5281/zenodo.3203391, 2019.
S. Bansal, A. K. Akametalu, F. J. Jiang, F. Laine, and C. J. Tomlin. Learning quadrotor dynamics using neural network for flight control. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4653–4660, Dec 2016. doi: 10.1109/CDC.2016.7798978.
Matthew Botvinick, Sam Ritter, Jane X Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 2019.
Matthew Botvinick and Ari Weinstein. Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1655):20130480, 2014. doi: 10.1098/rstb.2013.0480.
Ray J Dolan and Peter Dayan. Goals and habits in the brain. Neuron, 80(2):312–325, 2013.
Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3338–3346. Curran Associates, Inc., 2014.
Masahiko Haruno, Daniel M. Wolpert, and Mitsuo Kawato. Mosaic model for sensorimotor learning and control. Neural Computation, 13:2201–2220, 2001.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735.
A. Harry Klopf. A drive-reinforcement model of single neuron function. In John S. Denker, editor, Neural Networks for Computing: Snowbird, Utah, volume 151 of AIP conference proceedings, New York, 1986. American Institute of Physics.
Tomas Kulvicius, Bernd Porr, and Florentin Wörgötter. Chained learning architectures in a simple closed-loop behavioural context. Biological Cybernetics, 97:363–378, 2007.
Jun Nakanishi and Stefan Schaal. Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10):1453–1465, 2004.
Charles L. Phillips. Feedback control systems. Prentice-Hall International (UK), London, 2000.
Bernd Porr and Florentin Wörgötter. Strongly improved stability and faster convergence of temporal sequence learning by utilising input correlations only. Neural Computation, 18(6):1380–1412, 2006.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, page 533, 1986.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Books, MIT Press, Cambridge, MA, 2002 edition, 1998.
P.F.M.J Verschure and A.C.C. Coolen. Adaptive fields: Distributed representations of classically conditioned associations. Network, 2:189–206, 1991.
Christofer J.C.H Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
Daniel M. Wolpert, Zoubin Ghahramani, and J. Randall Flanagan. Perspectives and problems in motor learning. TRENDS in Cognitive Sciences, 5(11), 2001.
D.M Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11:1317–1329, 1998.