An information-theoretic on-line update principle for perception-action coupling

2018·Arxiv

Abstract

Inspired by findings of sensorimotor coupling in humans and animals, there has recently been a growing interest in the interaction between action and perception in robotic systems [1]. Here we consider perception and action as two serial information channels with limited information-processing capacity. We follow [2] and formulate a constrained optimization problem that maximizes utility under limited information-processing capacity in the two channels. As a solution we obtain an optimal perceptual channel and an optimal action channel that are coupled such that perceptual information is optimized with respect to downstream processing in the action module. The main novelty of this study is that we propose an online optimization procedure to find bounded-optimal perception and action channels in parameterized serial perception-action systems. In particular, we implement the perceptual channel as a multi-layer neural network and the action channel as a multinomial distribution. We illustrate our method in a NAO robot simulator with a simplified cup lifting task.

I. INTRODUCTION

In robotic systems perception and action have often been studied in isolation in the past without an overarching principle of how to put the two processes together. Yet, there is compelling evidence that the two processes are interdependent in humans and animals [3]. In the robotics literature, the direct coupling between action and perception has been especially emphasized in behavior-based robotics [4] and by proponents of embodied cognition [5], but more recently also approaches applying machine learning to sensorimotor processing have focused on the interactive nature of perception; see [1] for a review. The main insight of interactive perception is that sensory processing can be enhanced when manipulating or interacting with the environment. This can be achieved by creating novel signals through movement [6], [7] or by exploiting action-perception regularities that are generated when the same action is performed repeatedly in the same environment [8], [9]. For example, object segmentation could be improved when separating different objects by movement, or some object properties like inertia or weight could be estimated through interaction. In such cases action directly subserves the perceptual process. In other cases, however, interactive perception has the primary objective to achieve a manipulation goal. Defining the objective is therefore critical in determining what kind of sensorimotor coupling can arise.

∗This study was supported by the ERC, Starting Grant BRISC 678082 and DFG, Emmy Noether grant BR4164/1-1. 1 Max Planck Institute for Biological Cybernetics, Tübingen, Germany 2 Max Planck Institute for Intelligent Systems, Tübingen, Germany 3 Graduate Training Centre of Neuroscience, Tübingen, Germany 4 Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, Germany 5 PROWLER.io, Cambridge, United Kingdom 6 Institute for Neural Information Processing, University of Ulm, Ulm, Germany † Correspondence: Zhen Peng, Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

The formal framework that deals with adaptive systems optimizing arbitrary objective functions under uncertainty is decision theory. A rational agent has to decide which action to take from the action set according to the desirability of the action quantified by a utility function. A fundamental problem of such perfect rationality models [10], [11], [12] is that they ignore computational costs that arise when searching for the maximum utility action. As such costs can be prohibitive, decision-making with limited information-processing resources has recently been studied extensively in psychology, economics, cognitive science, computer science, and artificial intelligence research [13], [14], [15], [16], [17], [18]. In the following we argue that such resource limitations are crucial for the emergence of sensorimotor coupling.

A. An Information-Theoretic Principle for Bounded Rational Decision-Making with Context-dependence

In this study, we use an information-theoretic model of bounded rational decision-making [19], [20], [21], [22], [23]. In a decision-making task with context, an agent is presented with a world-state $w$ and has to find an optimal action $a^*_w$ from a set of admissible actions $\mathcal{A}$. The desirability of an action under a particular world-state is quantified by the utility function $U(w,a)$. The objective of the decision-maker is to maximize the utility depending on the context:

$$a^*_w = \arg\max_{a \in \mathcal{A}} U(w,a). \tag{1}$$

For an agent with limited computational resources that has to react within a certain time-limit, searching for the best action can become intractable, especially when the number of possible actions is enormous. Thus, a bounded rational agent tries to find a good enough yet tractable solution. In multiple contexts, bounded rational decision-making requires computing multiple strategies under limited computational resources, which can be expressed as a set of probability distributions $p(a|w)$ over actions given the different world-states. Mathematically, the informational cost can be measured in terms of an "information distance", namely the Kullback-Leibler divergence $D_{\mathrm{KL}}(p(a|w)\,\|\,p_0(a))$ from a prior behavior $p_0(a)$ to the posterior strategies $p(a|w)$. This information-processing cost can be motivated on axiomatic grounds [24], [22] and has been used previously in the robotics and control literature [25], [26], [27], [28]. An upper bound on the Kullback-Leibler divergence, $D_{\mathrm{KL}}(p(a|w)\,\|\,p_0(a)) \le B$ with $B \ge 0$, constrains the decision-maker to spend a maximum number of bits $B$ to adapt its behavior. The resulting optimization problem can be formalized as

$$p^*(a|w) = \arg\max_{p(a|w)} \; \mathbb{E}_{p(a|w)}\big[U(w,a)\big] - \frac{1}{\beta}\, D_{\mathrm{KL}}\big(p(a|w)\,\|\,p_0(a)\big), \tag{2}$$

where the inverse temperature $\beta \ge 0$ governs the trade-off between expected utility and information cost. For $\beta \to \infty$ classic decision theory is recovered, whereas for $\beta \to 0$ the decision-maker has no access to computational resources at all and thus acts according to the prior. When optimizing the expected free energy over all possible contexts, it can readily be shown that the optimal prior $p_0^*(a)$ is given by the marginal distribution $p_0^*(a) = p(a) = \sum_w p(w)\,p(a|w)$ [see Section 2.1.1 in [29] or [30]]. Plugging in the marginal $p(a)$ as the optimal prior $p_0^*(a)$ yields the following variational principle for bounded rational decision-making

$$\arg\max_{p(a|w)} \; \sum_{w,a} p(w)\,p(a|w)\,U(w,a) - \frac{1}{\beta}\, I(W;A), \tag{3}$$

where $I(W;A)$ is the mutual information between actions and world-states and measures the reduction of uncertainty about the action $a$ after observing $w$, or vice versa. This problem formulation is commonly known as the rate-distortion problem from information theory [31]. The solution of (3) is given by a set of two self-consistent equations:

$$p^*(a|w) = \frac{1}{Z(w)}\, p(a)\, \exp\big(\beta\, U(w,a)\big), \tag{4}$$

$$p(a) = \sum_w p(w)\, p^*(a|w), \tag{5}$$

where $Z(w) = \sum_a p(a) \exp(\beta\, U(w,a))$ is the partition sum. In practice, the solution can be computed using the Blahut-Arimoto algorithm [32], [33], [34] by starting with an initial distribution $p_{\mathrm{init}}(a)$ and then iterating through both equations (4) and (5) until the distributions converge. The iteration is guaranteed to converge to a global optimum provided that $p_{\mathrm{init}}(a)$ has the same support as $p(a)$ [29], [30], [35].
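The alternating iteration of Equations (4) and (5) can be sketched in a few lines for a small discrete problem. The following is a minimal illustration and not the authors' implementation; the function name `blahut_arimoto`, the table layout `U[w][a]`, the uniform initial distribution and the fixed iteration count are assumptions made for this example.

```python
import math


def blahut_arimoto(p_w, U, beta, n_iter=200):
    """Iterate Equations (4) and (5) for a discrete problem.

    p_w  : list with the world-state probabilities p(w)
    U    : utility table, U[w][a]
    beta : inverse temperature (beta >= 0)
    """
    n_w, n_a = len(U), len(U[0])
    p_a = [1.0 / n_a] * n_a  # assumed p_init(a): uniform, hence full support
    p_a_given_w = [p_a[:] for _ in range(n_w)]
    for _ in range(n_iter):
        # Equation (4): p(a|w) is proportional to p(a) * exp(beta * U(w, a)).
        p_a_given_w = []
        for w in range(n_w):
            u_max = max(U[w])  # subtracting the maximum cancels in the ratio
            weights = [p_a[a] * math.exp(beta * (U[w][a] - u_max))
                       for a in range(n_a)]
            z = sum(weights)
            p_a_given_w.append([x / z for x in weights])
        # Equation (5): p(a) = sum_w p(w) p(a|w).
        p_a = [sum(p_w[w] * p_a_given_w[w][a] for w in range(n_w))
               for a in range(n_a)]
    return p_a_given_w, p_a


# Two world-states, two actions, each world-state rewarding a different action.
policy, marginal = blahut_arimoto([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], beta=5.0)
```

With `beta=0.0` the returned policy equals the prior for every world-state, and for large `beta` it concentrates on the utility-maximizing action, which mirrors the two limits discussed above.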

B. An Information-Theoretic Principle for Perception-Action Coupling

We follow the work of [2], where the authors extend the rate-distortion framework to systems with multiple information-processing nodes. The serial perception-action system consists of two stages: a perceptual stage $p(x|w)$ that maps world-states $w$ to observations $x$ and an action stage $p(a|x)$ that maps observations $x$ to actions $a$. The three random variables for world-state, observation and action form a serial chain of two channels, which is expressed by the graphical model $W \rightarrow X \rightarrow A$, and implies the following conditional independence

$$p(a|x,w) = p(a|x), \quad\text{i.e.}\quad p(w,x,a) = p(w)\,p(x|w)\,p(a|x).$$

We assume that the utility function $U(w,a)$ depends only on the world-state and the action; the internal percept $x$ does not influence the utility. The information-processing price in the perceptual channel, $\beta_1$, can be different from the price of information processing in the action channel, $\beta_2$. Formally, we set up the following variational problem:

$$\arg\max_{p(x|w),\,p(a|x)} J, \qquad J = \mathbb{E}_{p(w,x,a)}\big[U(w,a)\big] - \frac{1}{\beta_1}\, I(W;X) - \frac{1}{\beta_2}\, I(X;A). \tag{8}$$

Here we define J as an overall objective function. Note that in this problem statement both the perceptual channel and the action channel are optimized. This is in contrast to traditional problem statements where a likelihood model is assumed to be given and the decision rule is built given this model. However, the coupling between action and perception falls out naturally by extending the rate distortion problem to Equation (8).

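Similar to the rate-distortion case of Equation (3), the solution is given by the following set of four analytic self-consistent equations [2]:

$$p^*(x|w) = \frac{1}{Z(w)}\, p(x)\, \exp\big(\beta_1\, \Delta F_{\mathrm{ser}}(w,x)\big), \tag{9}$$

$$p(x) = \sum_w p(w)\, p^*(x|w), \tag{10}$$

$$p^*(a|x) = \frac{1}{Z(x)}\, p(a)\, \exp\Big(\beta_2 \sum_w p(w|x)\, U(w,a)\Big), \tag{11}$$

$$p(a) = \sum_{w,x} p(w)\, p^*(x|w)\, p^*(a|x), \tag{12}$$

where $Z(w)$ and $Z(x)$ denote the corresponding partition sums of world-state $w$ and internal perception $x$. The conditional probability $p(w|x)$ is given by Bayes' rule

$$p(w|x) = \frac{p(w)\, p^*(x|w)}{p(x)}$$

and

$$\Delta F_{\mathrm{ser}}(w,x) = \mathbb{E}_{p^*(a|x)}\big[U(w,a)\big] - \frac{1}{\beta_2}\, D_{\mathrm{KL}}\big(p^*(a|x)\,\|\,p(a)\big)$$

is the free energy difference of the action stage. Equations (9) to (12) can be computed by starting with arbitrary initial distributions $q(x)$, $q(a)$ and $q(a|x)$ in lieu of $p(x)$, $p(a)$ and $p(a|x)$ and then iterating (9) to (12) until convergence. As the iterations involve evaluations of the utility function over all possible actions and world-states, the computation can become very costly. Another drawback of such Blahut-Arimoto-style algorithms is that they cannot be applied straightforwardly to continuous problems, since closed-form analytic solutions exist only for special cases. Here we propose an alternative online optimization method to solve this problem.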
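For small discrete problems, the iteration of the four self-consistent equations can be sketched as follows. This is an illustrative sketch only: the function names, the slightly asymmetric initial channels, the update order and the fixed number of iterations are assumptions of this example, and (as for any such iteration without a convergence proof) the result may depend on the initialization.

```python
import math


def normalize(v):
    z = sum(v)
    return [x / z for x in v]


def serial_iteration(p_w, U, n_x, beta1, beta2, n_iter=300):
    """Iterate Equations (9)-(12) for discrete w, x, a; U[w][a] is the utility."""
    ws, xs, acts = range(len(U)), range(n_x), range(len(U[0]))
    n_a = len(U[0])
    # Assumed full-support initial channels, slightly asymmetric so that the
    # percepts are not interchangeable at the start.
    p_x_w = [normalize([1.0 + 0.5 * ((w + x) % n_x == 0) for x in xs]) for w in ws]
    p_a_x = [normalize([1.0 + 0.5 * ((x + a) % n_a == 0) for a in acts]) for x in xs]
    p_x = [sum(p_w[w] * p_x_w[w][x] for w in ws) for x in xs]
    p_a = [sum(p_x[x] * p_a_x[x][a] for x in xs) for a in acts]
    for _ in range(n_iter):
        # Free energy difference of the action stage, Delta F_ser(w, x).
        dF = [[sum(p_a_x[x][a] * (U[w][a] - math.log(p_a_x[x][a] / p_a[a]) / beta2)
                   for a in acts if p_a_x[x][a] > 0.0 and p_a[a] > 0.0)
               for x in xs] for w in ws]
        # Equation (9): p(x|w) proportional to p(x) exp(beta1 * dF(w, x)).
        p_x_w = [normalize([p_x[x] * math.exp(beta1 * (dF[w][x] - max(dF[w])))
                            for x in xs]) for w in ws]
        # Equation (10): p(x) = sum_w p(w) p(x|w).
        p_x = [sum(p_w[w] * p_x_w[w][x] for w in ws) for x in xs]
        # Equation (11): p(a|x) proportional to p(a) exp(beta2 * E_{p(w|x)}[U]).
        for x in xs:
            if p_x[x] <= 0.0:
                continue  # percept never used; keep its old action distribution
            post = [p_w[w] * p_x_w[w][x] / p_x[x] for w in ws]  # Bayes' rule
            eu = [sum(post[w] * U[w][a] for w in ws) for a in acts]
            p_a_x[x] = normalize([p_a[a] * math.exp(beta2 * (eu[a] - max(eu)))
                                  for a in acts])
        # Equation (12): p(a) = sum_x p(x) p(a|x).
        p_a = [sum(p_x[x] * p_a_x[x][a] for x in xs) for a in acts]
    return p_x_w, p_a_x


perception, action = serial_iteration([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]],
                                      n_x=2, beta1=8.0, beta2=10.0)
```

Every pass evaluates the utility for all world-state and action pairs, which is the cost that motivates the online scheme proposed in the next section.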

II. THEORETICAL RESULTS

In this study, we present an algorithm to update an agent's perceptual module and its behavioural policy, expressed through two separate parametric models, in a joint fashion under constrained information-processing resources according to the framework of information-theoretic bounded rationality.

a) Implementation of the perceptual channel $p(x|w)$: We consider neural networks of the type depicted in Figure 1 as parameterized model to represent the perceptual distribution $p_{V,W}(x|w)$. The network possesses one input layer $\vec{w}$, one hidden layer $l$ and one output layer $x$. $\vec{w}$ is a real-valued column vector representing the world-state $w$. The synaptic weights between the input and the hidden layer are expressed as a real matrix $V$ and the weights between the hidden and the output layer are expressed as another real matrix $W$. The activation function in the hidden layer is a hyperbolic tangent function, $l = \tanh(V^\dagger \vec{w})$. We apply a soft-max activation function in the output layer to compute the perceptual distribution

$$p_{V,W}(x_i|w) = \frac{\exp(W_i^\dagger l)}{\sum_j \exp(W_j^\dagger l)},$$

where $W_i$ denotes the $i$-th column of $W$. Accordingly, the gradients of the output distribution with respect to parameter matrices $V$ and $W$ are given by:

$$\frac{\partial p_{V,W}(x_i|w)}{\partial V} = p_{V,W}(x_i|w)\; \vec{w}\, \Big[(1 - l \circ l) \circ \Big(W_i - \sum_j p_{V,W}(x_j|w)\, W_j\Big)\Big]^\dagger, \tag{13}$$

$$\frac{\partial p_{V,W}(x_i|w)}{\partial W_j} = p_{V,W}(x_i|w)\, \big(\delta_{ij} - p_{V,W}(x_j|w)\big)\, l, \tag{14}$$

where $\circ$ denotes the element-wise product and $\delta_{ij}$ the Kronecker delta. A derivation of (13) and (14) can be found in the Appendix.

Fig. 1: Illustration of the neural network model. The network possesses one input layer $\vec{w}$, one hidden layer $l$ and one output layer $x$, and is parameterized by two real weight matrices $V$ and $W$. Note that it is not a prerequisite that each layer contains three neurons; the figure serves to illustrate the structure and the notation of the network, and the number of neurons in each layer varies according to the problem setup.

b) Implementation of the action channel $p(a|x)$: Due to the abstract nature of the observation and action space, we assume here for simplicity discrete action choices. We therefore parameterize the action channel as a multinomial distribution:

$$p_\theta(a_i|x) = \exp\big(\theta_{i,x} - A(\theta_x)\big), \quad i = 1, \dots, n, \tag{15}$$

with a parameter vector $\theta_x$ of dimensionality $n \times 1$ and an auxiliary function $A(\theta_x) = \log\big(1 + \sum_{i=1}^{n} \exp(\theta_{i,x})\big)$. Note that the parameter $\theta_x$ is conditioned on observations $x$. In our implementation $\theta$ is expressed and updated as a real-valued matrix with one $n$-dimensional column $\theta_x$ per observation $x$. We represent actions as a binary-valued vector in one-hot encoding having the form $\vec{a}_i = [a_0, \dots, a_i, \dots, a_n]$ with $a_i = 1$ and $a_j = 0$ for all $j \neq i$, where $0 \le i \le n$. The conventional constraint of a conditional distribution, $\sum_a p_\theta(a|x) = 1$, is satisfied by defining $p_\theta(a_0|x) = 1 - \sum_{i=1}^{n} p_\theta(a_i|x)$. Thus, the gradient of the action distribution with respect to the parameter $\theta$ is given by

$$\frac{\partial p_\theta(a_i|x)}{\partial \theta_{j,x}} = p_\theta(a_i|x)\, \big(\delta_{ij} - p_\theta(a_j|x)\big), \quad j = 1, \dots, n.$$
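The multinomial action channel can be illustrated with a short sketch. It assumes the natural-parameter form in which $n$ parameters describe $n+1$ actions and the action $a_0$ implicitly carries parameter zero; the helper names `action_probs` and `grad_action_prob` are hypothetical and chosen for this example only.

```python
import math


def action_probs(theta_x):
    """Return [p(a_0|x), ..., p(a_n|x)] for the n parameters in theta_x."""
    m = max(0.0, max(theta_x))            # shift for numerical stability
    e0 = math.exp(-m)                     # a_0 has the implicit parameter 0
    es = [math.exp(t - m) for t in theta_x]
    z = e0 + sum(es)                      # equals exp(A(theta_x) - m)
    return [e0 / z] + [e / z for e in es]


def grad_action_prob(theta_x, i):
    """Gradient of p(a_i|x) with respect to theta_{j,x} for j = 1..n."""
    p = action_probs(theta_x)
    return [p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(1, len(p))]


# All-zero parameters give equal probabilities for all n + 1 = 4 actions.
uniform = action_probs([0.0, 0.0, 0.0])
```

Initializing all parameters to zero is one simple way to obtain an action channel in which all actions are equally probable.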

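c) Parameter updates: In the course of the simulation, the bounded rational decision-maker constantly updates the parameters representing the perceptual and the action channel. The overall objective in (8) is expressed as a parametric function of $V$, $W$ and $\theta$:

$$J(V,W,\theta) = \sum_{w,x,a} p(w)\, p_{V,W}(x|w)\, p_\theta(a|x) \left[ U(w,a) - \frac{1}{\beta_1} \log \frac{p_{V,W}(x|w)}{p_{V,W}(x)} - \frac{1}{\beta_2} \log \frac{p_\theta(a|x)}{p_{V,W,\theta}(a)} \right].$$

By defining an auxiliary term

$$j(w,x,a) := U(w,a) - \frac{1}{\beta_1} \log \frac{p_{V,W}(x|w)}{p_{V,W}(x)} - \frac{1}{\beta_2} \log \frac{p_\theta(a|x)}{p_{V,W,\theta}(a)},$$

the objective can be rewritten as $J = \mathbb{E}_{p(w,x,a)}[j(w,x,a)]$. Here we apply the log-trick to transform the derivative of $J$ into an expected value by noticing that for any parametric distribution $p_\xi$ the equation $\partial_\xi\, p_\xi = p_\xi\, \partial_\xi \log p_\xi$ is valid. This trick allows us to rewrite the derivative of the overall objective as follows:

$$\frac{\partial J}{\partial \xi} = \mathbb{E}_{p(w,x,a)} \left[ j(w,x,a)\, \frac{\partial}{\partial \xi} \log \big( p_{V,W}(x|w)\, p_\theta(a|x) \big) \right], \qquad \xi \in \{V, W, \theta\},$$

where the expected derivative of $j$ itself vanishes because every distribution involved is normalized.

The expectation value can be approximated by drawing $N$ sample triplets $(\hat{w}, \hat{x}, \hat{a})$ from the joint distribution $p(w)\, p_{V,W}(x|w)\, p_\theta(a|x)$. The number of samples $N$ governs the accuracy of the approximation. A large $N$ provides high accuracy but demands vast computational resources. Setting the batch size to 1 economizes the computational cost at every iteration by avoiding the expensive evaluation of the summand function, thus leading to an effective online rule for parameter updates as is done in stochastic gradient ascent; see [36] for a similar method. We apply a soft update rule to optimize parameters in an online fashion by introducing the learning rate $\alpha_\xi > 0$:

$$\xi \leftarrow \xi + \alpha_\xi\; j(\hat{w}, \hat{x}, \hat{a})\, \frac{\partial}{\partial \xi} \log \big( p_{V,W}(\hat{x}|\hat{w})\, p_\theta(\hat{a}|\hat{x}) \big) \tag{21}$$

for each parameter $\xi \in \{V, W, \theta\}$. Note that stochastic gradient ascent does not always converge to a global maximum. The global maximum is attained only when the objective function is concave and the learning rates decrease at an appropriate rate; otherwise a local maximum might be attained [37]. Therefore, the online rule for the problem described is not guaranteed to converge to globally optimal solutions and should be treated carefully by using small learning rates $\alpha_\xi$.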
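The single-sample update can be sketched for a toy problem as follows. To keep the example short, both channels are tabular soft-max distributions (`phi[w]` are logits for the percepts, `theta[x]` are logits for the actions) rather than the neural network and the multinomial parameterization used in this paper, and the marginals $p(x)$ and $p(a)$ inside $j$ are computed exactly. All names and these simplifications are assumptions of this sketch.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def channels(phi, theta):
    """Tabular stand-ins: p(x|w) = softmax(phi[w]), p(a|x) = softmax(theta[x])."""
    return [softmax(r) for r in phi], [softmax(r) for r in theta]


def j_term(p_w, U, b1, b2, p_xw, p_ax):
    """Return j(w, x, a) as a function, using exact marginals p(x) and p(a)."""
    n_w, n_x, n_a = len(p_xw), len(p_ax), len(p_ax[0])
    p_x = [sum(p_w[w] * p_xw[w][x] for w in range(n_w)) for x in range(n_x)]
    p_a = [sum(p_x[x] * p_ax[x][a] for x in range(n_x)) for a in range(n_a)]
    return lambda w, x, a: (U[w][a] - math.log(p_xw[w][x] / p_x[x]) / b1
                            - math.log(p_ax[x][a] / p_a[a]) / b2)


def objective(p_w, U, b1, b2, phi, theta):
    """Exact value of J, obtained by enumerating all (w, x, a) triplets."""
    p_xw, p_ax = channels(phi, theta)
    j = j_term(p_w, U, b1, b2, p_xw, p_ax)
    return sum(p_w[w] * p_xw[w][x] * p_ax[x][a] * j(w, x, a)
               for w in range(len(phi)) for x in range(len(theta))
               for a in range(len(theta[0])))


def online_step(p_w, U, b1, b2, phi, theta, lr_phi, lr_theta, rng):
    """One roll-out and one update xi <- xi + lr * j * d log p / d xi (in place)."""
    p_xw, p_ax = channels(phi, theta)
    j = j_term(p_w, U, b1, b2, p_xw, p_ax)
    w = rng.choices(range(len(p_w)), weights=p_w)[0]
    x = rng.choices(range(len(p_xw[w])), weights=p_xw[w])[0]
    a = rng.choices(range(len(p_ax[x])), weights=p_ax[x])[0]
    jv = j(w, x, a)
    # For soft-max logits z: d log softmax(z)[k] / d z[m] = [k == m] - softmax(z)[m].
    for m in range(len(phi[w])):
        phi[w][m] += lr_phi * jv * ((m == x) - p_xw[w][m])
    for m in range(len(theta[x])):
        theta[x][m] += lr_theta * jv * ((m == a) - p_ax[x][m])
    return w, x, a, jv


rng = random.Random(0)
phi = [[0.2, -0.1], [0.0, 0.3]]
theta = [[0.1, 0.4], [-0.2, 0.0]]
for _ in range(100):
    online_step([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], 2.0, 3.0,
                phi, theta, 0.01, 0.01, rng)
```

Separate learning rates `lr_phi` and `lr_theta` are kept on purpose, since the two channels may need different step sizes.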

III. EXPERIMENTAL RESULTS

Iterating the analytical solutions (Equations (9)-(12)) requires the evaluation of the utility function $U(w,a)$ for all $(w,a)$ pairs in each iteration step. A major advantage of the gradient-update scheme derived in the previous section is that it is suitable for on-line updates, i.e. one iteration can be performed after every interaction of the robot with the environment. In such a scheme world-states are generated (randomly) by the environment, $\hat{w} \sim p(w)$. Each $\hat{w}$ is processed by the perceptual stage of the agent, which samples a percept $\hat{x} \sim p_{V,W}(x|\hat{w})$ (in our case, drawing a sample from the distribution that results from the softmax output of the neural network). Similarly, the agent samples an action $\hat{a} \sim p_\theta(a|\hat{x})$. This leads to a roll-out $(\hat{w}, \hat{x}, \hat{a})$, which allows us to evaluate $U(\hat{w},\hat{a})$ and to perform one stochastic gradient update step. In the following, we first compare this on-line stochastic gradient update scheme against solutions obtained from iterating the analytical solution equations. Afterwards we demonstrate the scheme on a more challenging task in a simulated robot environment.

A. Comparison with baseline

To empirically verify the convergence of our gradient update scheme and the correctness of the resulting solution, we compare against the (analytical) solution obtained by iterating the set of self-consistent equations as given in [2]. To this end, we use the "predator-prey" example from [2]. In the example, a fictional animal encounters other animals, which can either be prey that should be hunted, or predators that should be avoided. To decide which action $a$ to take, the animal has a perceptual sensor to determine the size of the encountered animal. The example is described by the utility function $U(w,a)$ shown in Figure 2A. Animals belong to one of three groups: small, medium-sized and large animals. All large animals are predators, thus the only action that yields non-zero utility is "flee" (regardless of the particular size of the large animal). For each of the small animals, a specific hunting-action yields the highest utility, therefore it is relevant to distinguish between the individual animals of the small group. In contrast, for the animals of the medium-sized group the specific hunting-actions yield the same utility as a generic hunting-action that works equally well for all medium-sized animals. The example clearly illustrates the importance of coupling perception with the downstream action-part of the agent. The distinction between the individual animals of the medium- and large-sized groups is irrelevant for acting. Thus, spending (computational) capacity on the perceptual channel for this distinction is wasteful and should be avoided, particularly if the capacity of the perceptual channel is limited.

The original example in [2] used categorical distributions for perception $p(x|w)$ and action $p(a|x)$. Here, we use a neural network with one hidden layer for perception $p_{V,W}(x|w)$ (with $\vec{w}$ being a binary encoding of $w$) and a multinomial distribution (parameterized as given by Equation (15)) for the action-stage $p_\theta(a|x)$ (with $\vec{a}$ being a one-hot encoding of the action $a$). The neural network consisted of four input neurons, 20 hidden and 13 output neurons, initialized with Glorot's scheme (also known as Xavier-initialization) [38]. The parameters $\theta$ were initialized such that all actions were equally probable. We found that convergence of the gradient-update scheme crucially depends on using different learning rates for the perceptual channel ($V$, $W$) and the action channel ($\theta$). Figure 2B shows the evolution of the objective value (Equation (8)) during gradient-update iterations of $V$, $W$ and $\theta$ (Equation (21)) using $\beta_1 = 8$ (corresponding to large perceptual capacity) and $\beta_2 = 10$ (high-capacity action channel). We used a learning rate of 006 for the perceptual channel and 014 for the action channel. The dashed red line in the panel indicates the baseline, that is the value of the objective function obtained by iterating the set of analytical solutions (Equations (9)-(12)) until convergence as in [2]. As shown in the figure, the gradient update scheme converges to a solution with the same objective-value as the analytical baseline method (after roughly 50000 iterations). Panel C of Figure 2 shows the corresponding behavior $p(a|w) = \sum_x p_{V,W}(x|w)\, p_\theta(a|x)$ (omitting binary/one-hot encoding from the notation for simplicity) after 100000 gradient-update iterations. Comparing panel C against Figure 6D in [2] shows that the solution obtained from the gradient update scheme is qualitatively identical to the solution obtained by iterating the set of self-consistent equations. Importantly, the solution reflects the intuition that an information-optimal agent does not distinguish between the individual animals of the medium and large groups, even if the computational capacity of the perceptual channel would in principle allow for such a distinction.

Fig. 2: Predator-prey example with a neural network for the perceptual stage and a multinomial distribution for the action-stage. (A) Utility function $U(w,a)$, see [2] for a detailed description. (B) Black line: evolution of the objective (Equation (8)) during gradient-update iterations (Equation (21)). Red dashed line: baseline objective-value achieved by iterating the analytical solutions (Equations (9)-(12)) until convergence. (C) Final "behavior" $p(a|w)$ after 100000 iterations; compare Figure 6D in [2].

We have performed the same comparison for the other settings of $\beta_1$ and $\beta_2$ in [2], corresponding to either low capacity of the perceptual or the action channel, and found that the gradient scheme converges to solutions that are qualitatively identical (same objective value, same behavior $p(a|w)$) to solutions obtained from iterating the self-consistent equations. We conclude that the gradient update scheme in conjunction with a neural network for the perceptual stage and a multinomial distribution for the action-stage successfully matches the analytical baseline. Due to the length limit of the manuscript, we have omitted further plots of the empirical baseline comparison.

B. Robot Simulation

We also test our method in a simulated robotic environment to illustrate the usage of the proposed update principle for sensorimotor coupling with parametric perception and action modules. To this end, we designed a simplified grasping task with a simulated Nao robot. In the simulation, the robot is positioned next to a table with 4 mugs on it (see Figure 3). The mugs differ in the number and orientation of the handles. One mug (m0) has no handle at all, two mugs have one handle (positioned such that the handle is either to the left (mL) or to the right (mR)) and one mug has handles on both sides (m2), allowing a direct grasp with either the left or the right hand of the robot. The Nao robot has two cameras; in this simulation we make use of the chest-camera, which shows the area on the table directly in front of the robot (see the bottom left inset in Figure 3). Based on this camera input, the (bounded-rational) agent has to decide how to grasp the mug appropriately. We defined 4 possible actions: lift the mug with both hands (a2), lift the mug with the left hand (aL), lift with the right hand (aR) or execute no lift (a0). Additionally, we defined the utility function shown in Figure 4A: each mug has one preferred action that yields the highest utility, and grasping a single-handle mug with both hands yields a slightly lower utility.

Fig. 3: Task setup in Nao simulation environment.

As in the previous example, the perceptual stage of the robot is implemented by a neural network with one hidden layer (192 input neurons, four hidden and four output neurons, Xavier-initialization), and the action stage is implemented with a multinomial distribution (parameterized according to Equation (15), initialized to have uniform probabilities over actions). The perceptual and action stage are then learned through interaction with the environment, using the stochastic gradient update scheme proposed in this paper. At the beginning of each trial, one mug is randomly selected according to $p(w)$ (uniform distribution in our case) and placed in front of the agent. A 16 × 12 image from the chest camera of the robot is then fed into the neural network for perceptual processing (the 2D image is flattened into a 1D vector). Accordingly, an observation $\hat{x}$ and an action $\hat{a}$ are sampled by the agent. After evaluating the utility $U(\hat{w},\hat{a})$, the model parameters $V$, $W$ and $\theta$ are updated with a gradient ascent step as described in the previous section.
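The perceptual forward pass for this setup can be sketched as follows: a 16 × 12 image is flattened into 192 inputs, passed through four tanh hidden units and a four-way soft-max, and a percept is sampled from the result. The uniform variant of the Glorot initialization, the random pixel values standing in for a camera image, the row-major image layout and all function names are assumptions of this sketch.

```python
import math
import random


def glorot_uniform(n_in, n_out, rng):
    """Glorot/Xavier uniform initialization of an n_in x n_out weight matrix."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def percept_distribution(image, V, W):
    """Return p(x|w) for a 2D image: flatten, tanh hidden layer, soft-max output."""
    w_vec = [pixel for row in image for pixel in row]          # 2D -> 1D
    hidden = [math.tanh(sum(V[m][k] * w_vec[m] for m in range(len(w_vec))))
              for k in range(len(V[0]))]                       # l = tanh(V^T w)
    logits = [sum(W[k][i] * hidden[k] for k in range(len(hidden)))
              for i in range(len(W[0]))]                       # W_i^T l
    top = max(logits)
    e = [math.exp(z - top) for z in logits]
    s = sum(e)
    return [v / s for v in e]


rng = random.Random(0)
V = glorot_uniform(192, 4, rng)   # input -> hidden weights
W = glorot_uniform(4, 4, rng)     # hidden -> output weights
image = [[rng.random() for _ in range(16)] for _ in range(12)]  # stand-in image
p_x = percept_distribution(image, V, W)
x_hat = rng.choices(range(len(p_x)), weights=p_x)[0]  # sampled percept index
```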

Figure 4 shows three experiments with different computational capacities of the perception and action channels: high-capacity channels (2 and 3, 0.035, 7), high perceptual capacity combined with low action capacity (2 and 001, 34), and low-capacity channels (5, 004, 028). Panel B shows how the objective value evolves during training. In all cases, the on-line update scheme converges to a stable solution. The dashed lines show the optimal solution. Panels C and D show the evolution of the mutual information of the perceptual channel $I(W;X)$ and the action channel $I(X;A)$. The mutual information in the perceptual and action stages is high with high computational capacity. Lowering the capacity of the action channel leads to a reduction of mutual information in both channels. Note that for the second case only the information-processing price on the action stage is changed; the perceptual channel adjusts accordingly. Under (very) low computational capacity, the agent develops a single-action strategy that requires no computation at all (the channel capacity for both perception and action is effectively zero).
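The mutual information of a discrete channel, as tracked in panels C and D, can be computed directly from an input distribution and a conditional distribution. The sketch below is illustrative; the function name and the choice of bits (base-2 logarithm) as the unit are assumptions of this example.

```python
import math


def mutual_information(p_in, channel):
    """I(In;Out) in bits for input distribution p_in and channel[i][o] = p(o|i)."""
    n_out = len(channel[0])
    p_out = [sum(p_in[i] * channel[i][o] for i in range(len(p_in)))
             for o in range(n_out)]
    total = 0.0
    for i, p_i in enumerate(p_in):
        for o in range(n_out):
            p_joint = p_i * channel[i][o]
            if p_joint > 0.0:  # terms with zero probability contribute nothing
                total += p_joint * math.log2(channel[i][o] / p_out[o])
    return total


# Four equally likely mugs that are perfectly discriminated: 2 bits.
identity = [[1.0 if x == w else 0.0 for x in range(4)] for w in range(4)]
bits = mutual_information([0.25] * 4, identity)
```

For the perceptual channel one would pass $p(w)$ and $p(x|w)$; for the action channel, $p(x)$ and $p(a|x)$. A channel that ignores its input, such as the single-action strategy in the low-capacity case, yields zero.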

Figure 5 shows the behaviour of the robot after convergence. With high computational capacity, the robot learns to associate the camera images with the best possible action for each mug (Panel A & B). If the action channel does not have sufficient capacity, the agent is not able to apply specific actions to specific contexts; its policy therefore collapses into two modes: lift with both hands or do nothing at all (Panel C). Accordingly, the agent spends less information processing in the perceptual channel such that it only discriminates between mugs with handle(s) and mugs without handle (Panel D). Under (very) low computational capacity, the agent always chooses to lift the mug with both hands, which requires no computation (Panel E & F).

IV. DISCUSSION

In this study we propose a novel online optimization rule to find bounded-optimal perception-action coupling in serial perception-action systems. The perceptual channel is implemented as a multi-layer neural network while the action channel is represented by a parametric distribution, which was a multinomial in our case. Our method is illustrated with a NAO robot simulator.

The proposed algorithm can be improved in several ways. In the case of the rate-distortion equation (3), the Blahut-Arimoto algorithm is guaranteed to converge to a unique maximum [see Section 2.1.1 in [29]]. For the extension of bounded rational problems to systems with multiple information-processing stages, however, there is no convergence proof, so it cannot be ruled out that the solutions obtained by iterating the self-consistent equations until numerical convergence are only local optima. One future improvement would therefore be to better understand the convergence properties of the extended rate-distortion problem theoretically.

Fig. 4: (A) The utility function $U(w,a)$. m0 indicates the mug without any handle, mL and mR represent the mug with one handle (rotated correspondingly) and m2 symbolizes the mug having handles on both sides. For each mug there is one "most suitable" action that yields the highest utility (in case of no handles, executing no lift is defined to be optimal). Additionally, lifting the one-handle cup with both hands leads to a non-zero utility, corresponding to a successful lift, but using more effort than necessary. (B) Evolution of the objective-value during on-line gradient-update training. Three different computational limitations are compared here, dashed lines: baseline objective-value achieved by iterating the analytical solutions. (C) & (D) Evolution of the mutual information of the perceptual channel $I(W;X)$ and the action channel $I(X;A)$.

Another issue is that the learning rates for the parameter updates (see Equation (21)) have a significant impact on the development of the bounded-optimal behaviors. Inappropriate learning rates easily lead to an optimization failure in which the bounded-rational decision-maker is unable to find the bounded-optimal solutions. We chose the learning rates by grid search. A future improvement would require a better understanding of the relationship between the learning rates in the perceptual and action modules, and the study of better optimization procedures for choosing the learning rates accordingly.

In information-theoretic bounded rationality, the main assumption is that the decision-maker's behavioral policy may not deviate too much from some prior policy. Deviations from the prior policy are costly and modeled through the Kullback-Leibler divergence (Section I-A). This is similar to other proposed regularization techniques from robotics such as Trust Region Policy Optimization (TRPO, [39]) and Relative Entropy Policy Search (REPS, [25]). The main difference lies in the choice of the prior behavioral policy. In both TRPO and REPS, the prior policy represents the agent's behavior at a previous iteration of the optimization procedure or a set of initial expert trajectories (in imitation learning). In our approach, on the other hand, the agent's prior policy is optimal w.r.t. the information-theoretic constraints (Eq. (3)), encouraging the agent to ignore irrelevant sensory information that has little impact on the reward.

Fig. 5: Comparing the final behaviour of a bounded-optimal agent after 100000 iterations under different computational constraints. (A) & (B) results of the simulation with high computational capacity (3). (C) & (D) results with low action channel capacity (5). (E) & (F) results with (very) low computational capacity (0.5).

Conceptually, bounded rationality has its most obvious application when the information-processing capacity is limited by physical constraints like time or space constraints. In our present model this was not the case, which raises the question why one would restrict the system to a smaller capacity than it might naturally have. Apart from possible effects on learning speed that we did not investigate here, restricting the channel capacity creates a bottleneck that filters out irrelevant information and creates abstractions that are useful for generalization [29]. For instance, in panel D of Figure 5, two abstract percepts emerge: "mug with handle(s)" (x2) and "mug without handles" (x4).

In summary, our information-theoretic principle (8) for perception-action coupling provides a novel, generic and principled method that could be applied to combine any parameterized perception and action modules. Compared to the existing literature, this approach is most similar in spirit to approaches that learn particular perceptual features that are most useful to solve a particular task [40], [41], [42], [43]. Here this feature search is integrated in a single bounded rational optimization problem.

REFERENCES

[1] J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic and S. Schaal, ”Interactive Perception: Leveraging Action in Perception and Perception in Action”, arXiv:1604.03670v2

[2] T. Genewein, F. Leibfried, J. Grau-Moya and D. A. Braun, ”Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle”, Frontiers in Robotics and AI, vol. 2(27), 2015, pp.1-24.

[3] A. Noë, ”Action in Perception”, Bradford book, 2004

[4] R. C. Arkin, ”Behavior-based Robotics”, MIT Press, 1998

[5] R. Pfeifer and J. C. Bongard, ”How the Body Shapes the Way We Think: A New View of Intelligence”, MIT Press, 2006

[6] P. Fitzpatrick and G. Metta, ”Towards manipulation-driven vision”, IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, 2002, pp.43–48

[7] H. van Hoof, O. Kroemer and J. Peters, ”Probabilistic segmentation and targeted exploration of objects in cluttered environments”, IEEE Transactions on Robotics, vol. 30, no. 5, 2014, pp. 1198–1209

[8] M. Gupta and G. S. Sukhatme, ”Using manipulation primitives for brick sorting in clutter”, International Conference on Robotics and Automation, 2012

[9] L. Y. Chang, J. R. Smith and D. Fox, ”Interactive singulation of objects from a pile”, International Conference on Robotics and Automation, 2012

[10] F. P. Ramsey, ”Truth and probability”, in The Foundations of Mathematics and Other Logical Essays, ed. R. B. Braithwaite (New York, NY: Harcourt, Brace and Co), 1931, pp. 156-198.

[11] J. Von Neumann and O. Morgenstern, ”Theory of Games and Economic Behavior”, Princeton: Princeton University Press, 1944.

[12] L. J. Savage, ”The Foundations of Statistics”, New York: Wiley, 1954.

[13] G. Gigerenzer and P.M. Todd. ”Simple Heuristics That Make Us Smart”, Oxford: Oxford University Press, 1999.

[14] D. Kahneman, ”Maps of bounded rationality: psychology for behavioral economics”, Am. Econ. Rev., vol. 93, 2003, pp.1449-1475.

[15] A. Howes, R. L. Lewis and A. Vera, ”Rational adaptation under task and processing constraints: implications for testing theories of cognition and action”, Psychol. Rev., vol. 116, 2009, pp. 717-751.

[16] S. Russell, ”Rationality and intelligence”, in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, ed. C. Mellish (San Francisco, CA: Morgan Kaufmann), 1995, pp. 950-957.

[17] S. Russell and P. Norvig, ”Artificial Intelligence: A Modern Approach”, Upper Saddle River: Prentice Hall, 2002.

[18] R. L. Lewis, A. Howes and S. Singh, ”Computational rationality: linking mechanism and behavior through bounded utility maximization”, Top. Cogn. Sci., vol 6, 2014, pp. 279-311.

[19] D.A. Braun, P.A. Ortega,E. Theodorou, and S. Schaal, ”Path integral control and bounded rationality”, in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (Piscataway: IEEE), 2011, pp. 202-209.

[20] D. A. Braun, and P. A. Ortega, ”Information-theoretic bounded rationality and epsilon-optimality”, Entropy 16, 2014, pp. 4662-4676.

[21] P. A. Ortega and D. A. Braun, ”Free energy and the generalized optimality equations for sequential decision making”, Workshop and Conference, 2012

[22] P.A. Ortega and D. A. Braun, ”Thermodynamics as a theory of decision-making with information-processing costs”, Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 469(2153), 2013.

[23] P. A. Ortega, D. A. Braun and N. Tishby, ”Monte Carlo methods for exact & efficient solution of the generalized optimality equations”, in Proceedings of IEEE International Conference on Robotics and Automation, Hong Kong, 2014.

[24] P. A. Ortega and D. A. Braun, ”A conversion between utility and information”, in Third Conference on Artificial General Intelligence (AGI 2010) (Lugano: Atlantis Press), 2010, pp. 115-120.

[25] J. Peters, K. Mülling, Y. Altün, ”Relative Entropy Policy Search”, Twenty-Fourth National Conference on Artificial Intelligence (AAAI-10), 2010, pp. 1607–1612

[26] E. Theodorou, J. Buchli and S. Schaal, ”A Generalized Path Integral Control Approach to Reinforcement Learning”, J. Mach. Learn. Res., vol. 11, 2010, pp. 3137–3181

[27] E. Todorov, ”Linearly-solvable Markov decision problems”, Advances in neural information processing systems, 2006, pp. 1369–1376

[28] H. J. Kappen ”Linear theory for control of nonlinear stochastic systems”, Physical review letters, vol. 95, no. 20, 2005, p. 200201

[29] N. Tishby, F. C. Pereira, and W. Bialek, ”The information bottleneck method”, in The 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.

[30] I. Csiszar, and G. Tusnady, ”Information geometry and alternating minimization procedures”, Stat. Decis. vol. 1, 1984, pp. 205-237.

[31] C. E. Shannon, ”Coding Theorems for a Discrete Source With a Fidelity Criterion”, Institute of Radio Engineers, International Convention Record, vol. 7, 1959, pp. 142-163.

[32] R. Blahut, ”Computation of channel capacity and rate-distortion functions”, IEEE Trans. Inf. Theory 18, 1972, pp. 460-473.

[33] S. Arimoto, ”An algorithm for computing the capacity of arbitrary discrete memoryless channels”, IEEE Trans. Inf. Theory 18, 1972, pp. 14-20.

[34] R.W. Yeung, ”Information Theory and Network Coding”, New York: Springer, 2008.

[35] T. M. Cover, and J. A. Thomas, ”Elements of Information Theory”, Hoboken: John Wiley & Sons, 1991.

[36] F. Leibfried and D. A. Braun, ”Bounded Rational Decision-Making in Feedforward Neural Networks”, Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016

[37] L. Bottou, ”Online Algorithms and Stochastic Approximations”. Online Learning and Neural Networks. Cambridge University Press,1998.

[38] X. Glorot and Y. Bengio, ”Understanding the difficulty of training deep feedforward neural networks”, In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS10), 2010

[39] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel, ”Trust Region Policy Optimization”, International Conference on Machine Learning (ICML), 2015

[40] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, ”Learning to poke by poking: Experiential learning of intuitive physics”, in Advances in Neural Information Processing Systems, 2016

[41] R. Jonschkowski and O. Brock, ”Learning state representations with robotic priors”, Autonomous Robots, vol. 39, no. 3, pp. 407–428, 2015

[42] S. Levine, C. Finn, T. Darrell and P. Abbeel, ”End-to-end training of deep visuomotor policies”, Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016

[43] J. Piater, S. Jodogne, R. Detry, D. Kraft, N. Krueger, O. Kroemer and J. Peters, ”Learning visual representations for perception-action systems”, International Journal of Robotics Research, 2011

A. Partial derivative of the perceptual distribution with respect to V

B. Partial derivative of the perceptual distribution with respect to W
