Consider environments represented by a Markov Decision Process M = (S, A, T , R) with discrete states S and actions A. Deep Reinforcement Learning has been successful in tasks with moderate state and action space sizes where cheap and perfect environment simulators are available, or where the agent can even be placed in the real environment M, as in many games. Agents can then be trained by interacting many times with the (simulated) environment.
Unfortunately, in many important tasks of real life interest, this framework cannot be easily applied. First, interactions with the real environment can be very expensive, and even potentially dangerous. Obviously, it would be prohibitive to let an agent initially drive randomly on public roads, or to perform unrestricted scientific wet-lab experiments, which can cost 1000s of pounds (Dulac-Arnold et al., 2019).
It has been recognised early that it is possible to train models as simulators of the environment dynamics based
1BenevolentAI, London, UK 2Westf¨alische WilhelmsUniversit¨at M¨unster, DE. <marwin.segler@wwu.de>.
Generative Modeling and Model-Based Reasoning for Robotics and AI Workshop, ICML 2019. Copyright 2019 by the author(s).
on limited interaction data, for example recorded trajectories or experiments which have been performed in the past (Schmidhuber, 1990; Ha & Schmidhuber, 2018). However, canonical model-based RL still requires that the action space A is known and can be provided easily by the programmer.
More generally, the learner does not have information available about the possible actions they can take, since it can be too complex to specify what A actually is, similar to common sense knowledge in expert systems. The learner then first has to learn the nature of the elements of the set A such that they are compliant with observations subsequent states.
Second, in contrast to board games or image representations which are of fixed size, in a more general setting the state space S is compositional, constructed of multiple entities and their relations, which means it can potentially have countably infinite cardinality , which requires the learner to exhibit an appropriate inductive bias. Agents with representations based on graphs can naturally tackle compositionality in environments, and have been successful in chemical (Segler et al., 2017; 2018) and physical (Hamrick et al., 2018; Bapst et al., 2019) reasoning and planning tasks.
In the frontiers section of their book Sutton & Barto (2018) succinctly state that “we still need scalable methods for planning with learned environment models. Planning methods have proven extremely effective in applications such as AlphaGo Zero and computer chess in which the model of the environment is known from the rules of the game or can otherwise be supplied by human designers. But cases of full model-based reinforcement learning, in which the environment model is learned from data and then used for planning, are rare.”
Here, we describe a general formalism to address this issue for learning and planning in compositional state spaces with a priori unknown actions, as merger of three components: 1) The induction of the action space as a graph-based world program from a small set of state-state transitions, 2) learning of neural network policies and dynamics models, and 3) planning using these networks and the world program. We further highlight its successful application, and future challenges for the RL community.
Consider the problem of near-optimal decision making in in-finite, discrete, compositional state and action spaces, where a simulator of the environment dynamics T is not available. Furthermore, consider that the learner does not know the elements of the action space A a priori. The only clue are limited amounts of batch data in the form of state-state pairs from (previous) observations of the environment. In the framework of Markov Decision Processes (Sutton & Barto, 2018) we then have
where
learned. For the sake of simplicity, we assume
2.1. Graph-Based Environments
We consider compositional environments, where states are composed of multisets of graphs.2 Graphs G = (V, E) in turn are composed of vertices (nodes) V and edges E, which model the relationship between the vertices. Graphs may be vertex- or edge-labeled with arbitrarily structured labels, and can be directional. This subsumes for example grids (lattice graphs, images), text (linear chain graphs), computer code (abstract syntax trees), molecules, or physical systems. Let G be the set of all possible graphs G, which are defined by an unrestricted graph grammar (Rozenberg, 1997). The state space , where denotes all sub-multisets formed by the elements of
Actions, which transform states into new states, can then be seen as graph rewriting rules , which means a graph L is matched in the graphs in a state s via (sub)graph isomorphism, cut out, and a different graph R is glued in in this position. (Rozenberg, 1997; Andersen et al., 2013). This gives the learner a handle to induce the elements of the action space.
2.2. World Programs
Program induction is the task of generating a program consistent with a specification provided by input-output examples (Gulwani et al., 2017). A plethora of algorithms has been proposed for this task. Purely symbolic approaches are very data efficient, and can learn from very few examples, however, they tend to struggle in the face of noisy data. In contrast, approaches based on low-bias neural networks such as seq2seq-RNNs or transformers turn out to be quite robust to noise. However, without the adequate inductive biases they require large amounts of data to train, and do not generalise well beyond the training distribution. Also, several models combining neural and symbolic components have been proposed.(Gaunt et al., 2017; Boˇsnjak et al., 2017; Evans & Grefenstette, 2018; Bunel et al., 2018)
In our formalism, we define the triple (A, A(s), T ) as the world program, which the learner needs to learn from observations O in the form of pairs. We can then treat each graph-rewriting action as a subroutine, and A(s) as a map from a state to sets of subroutines . Furthermore, we can leverage the induced actions A to learn the transition model by sampling actions , and applying them to O to generate training examples consistent and inconsistent with the observations.
Figure 1. Given subsequent state-state input-output pairs, the learner first has to induce the actions which gave rise to the transition, in the form of graph rewriting rules or subroutines of atomistic edit operations.
The concrete implementation of the program induction step is a hyperparameter. It can be realized by many different algorithms, using everything from purely symbolic to graph neural networks or purely distributed representations (Segler & Waller, 2017a;b; Liu et al., 2017; Bradshaw et al., 2018; Yin et al., 2018; Bradshaw et al., 2019).
2.3. Transition Functions and Policies
Graph neural networks (GNNs) have been employed in chemistry since decades (Kireev, 1995; Baskin et al., 1997; Merkwirth & Lengauer, 2005), and can be traced back to algorithms for graph canonicalisation developed for chemical databases (Morgan, 1965; Weisfeiler & Lehman, 1968) and path-based graph featurisation (Wiener, 1947). GNNs have recently been shown to be useful as well in many other domains where relations between objects can be modelled as graphs, e.g. in physics and source code (Niepert et al., 2016; Kipf & Welling, 2017; Brockschmidt et al., 2018; Bapst et al., 2019). Here, to represent the states, graph neural networks provide the adequate inductive bias for the task.
2.4. Planning with World Programs
After the learner has induced the world program (A, A(s), T ), it can be employed in a simulator (S, A, T , R) in combination with any reinforcement learning algorithm to train an agent or perform planning. The discrete nature of the problem and the availability of the model lends itself immediately to Monte Carlo Tree Search (MCTS) (Kocsis & Szepesv´ari, 2006; Coulom, 2006) and its extensions.
Figure 2. States are multisets of graphs. The actions rewrite the graphs within a state by changing the vertices and edges. In this example, we consider the task of planning the construction of a complex object from a set of building blocks goal state as , and the branch from the solution/plan.
The formalism described above was applied for the problem of chemical synthesis planning for small organic molecules, which are of central importance for human well-being as medicines or materials (Todd, 2005). Here, starting with the target molecule we wish to make, the task is to perform a backward search by recursively deconstructing the molecule into increasingly simpler parts using formally reversed chemical reactions (graph rewriting rules), see Figure 2. The problem is solved iff the state is a subset of a predefined set of building blocks. More formally, given a set of building block graphs B, a state is solved iff , that is all graphs in are isomorphic to a building block graph . The results are reproduced from (Segler et al., 2018), where this framework was first applied, however, formulated for a non-ML audience.
3.1. Methods
The graphs in this task are small organic molecules, consisting of atoms as vertices and bonds as edges. To encode these graphs, a special case of Graph Neural Networks (GNN) was used, where the trainable weights are replaced by hash functions, which drastically speeds up inference while performing well (Rogers & Hahn, 2010). To induce the world program, we first take chemical reactions reported in the literature as our () observations, with the reagents and the product of the reactions as the states. We then first employ a symbolic search procedure to delete all nodes and edges from the molecular graphs that do not get changed in the course of the reaction, to form A, with actions.4 To learn A(s), we train a neural network to predict the probability over all rules given s, and restrict the available actions to the top-k rules with a cumulative probability of > 99%.
The transition function T is a binary classifier which takes the same GNN as an input for states, and takes the difference of the embeddings of and as the representation of the action . For more details, we refer the reader to (Segler et al., 2018).
3.2. Quantitative Results
Table 1 shows experimental results for several models in the task of predicting synthesis plans for 497 different molecules (Segler et al., 2018). In PUCT-MCTS, the prior probability from the network representing A(s) is used to bias the exploration term. Monte Carlo Search (MCS) (sampling from the prior), UCT-MCTS, where the exploration term does not have a predicted probability contribution, and two Best First Search (BFS) variants all perform worse than PUCT-MCTS.5
Table 1. Experimental Results
Time budget 3300 s and 100,000 iterations for MC(T)S or 3300 s and 100,000 expansions for BFS, per molecule (3 restarts were allowed). Std dev is given for the solved ratio. Results reproduced from (Segler et al., 2018)
3.3. Qualitative Results
A notorious problem with model-based RL and planning is that strong learners can exploit imperfections in the simulator, leading to plans that do not translate to the real environment. Thus, to test the quality of the plans, we conducted two AB tests, with 45 expert organic chemists, who had to choose one of two plans in a double blind setup.
First, the subjects were presented with plans previously reported in the literature, and a plan generated by the PUCTMCTS algorithm for the same target molecule. The null hypothesis is that experts rate the computed plans as inferior (expected preference: computer 0% vs literature 100%). Surprisingly, we found that the experts perceived the computed routes (57.0%) as equivalent to the literature routes (43.0%).
In the second test, the experts had to report their preferences for plans found by PUCT-MCTS or plans generated by a baseline system based on heuristic BFS. The subjects sig-nificantly preferred the routes generated by MCTS (68.2%) over the baseline system (31.8%).
World program learning for discrete states and actions is related to the problem of inverse dynamics in continuous state and action spaces, e.g. for use in robotics, where the torques to produce a movement are induced from observed trajectories (Nguyen-Tuong & Peters, 2010; Meier et al., 2016; Christiano et al., 2016). How to extend the presented framework here to the continuous setting remains a topic for future work.
Furthermore, there is a relation between world program induction and hierarchical reinforcement learning/option discovery (Sutton & Barto, 2018), which are predominantly used to extract spatiotemporal abstractions on multiple scales by joining sequences of (micro-)actions into more abstract macro actions, where both micro- and macro-actions having the same type or unit. In contrast, in a world program, the full action and its atomistic components do not
have the same type.
Here, we have proposed a formalism to perform model-based planning in complex compositional environments, where the learner a priori does not have knowledge of the action space. As graph neural networks have progressed from chemical modeling problems to all kinds of relational prediction problems, we hope that the machine learning community might also take inspiration to build on techniques developed for chemical planning problems to other relational planning problems, such as automated theorem proving, code optimization/rewriting, or physical construction and deconstruction problems.
To conclude, we would also like to pose a small challenge: Given only a small set of historically played Go and Chess games as grid graphs, can you first learn the rules of the games to learn a simulator, and then compete with an AlphaZero-like agent (Silver et al., 2018) using an agent trained in the learned simulator?
The author thanks M. Ahmed, T. Edlich, and B. Fabian for feedback on the manuscript, and the anonymous reviewers for helpful suggestions.
Andersen, J. L., Flamm, C., Merkle, D., and Stadler, P. F. Generic strategies for chemical space exploration. arXiv preprint arXiv:1302.4006, 2013.
Bapst, V., Sanchez-Gonzalez, A., Doersch, C., Stachenfeld, K. L., Kohli, P., Battaglia, P. W., and Hamrick, J. B. Structured agents for physical construction. arXiv preprint arXiv:1904.03177, 2019.
Baskin, I. I., Palyulin, V. A., and Zefirov, N. S. A neural device for searching direct correlations between structures and properties of chemical compounds. Journal of chemical information and computer sciences, 37(4): 715–721, 1997.
Boˇsnjak, M., Rockt¨aschel, T., Naradowsky, J., and Riedel, S. Programming with a differentiable forth interpreter. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 547–556. JMLR. org, 2017.
Bradshaw, J., Kusner, M. J., Paige, B., Segler, M. H., and Hern´andez-Lobato, J. M. A generative model for electron paths. arXiv preprint arXiv:1805.10970, 2018.
Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H., and Hern´andez-Lobato, J. M. A model to search for synthesizable molecules. arXiv preprint arXiv:1906.05221, 2019.
Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polo- zov, O. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490, 2018.
Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.
Christiano, P., Shah, Z., Mordatch, I., Schneider, J., Black- well, T., Tobin, J., Abbeel, P., and Zaremba, W. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518, 2016.
Coulom, R. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72–83. Springer, 2006.
Dulac-Arnold, G., Mankowitz, D., and Hester, T. Chal- lenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
Gaunt, A. L., Brockschmidt, M., Kushman, N., and Tar- low, D. Differentiable programs with neural libraries. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1213–1222. JMLR. org, 2017.
Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends in Programming Languages, 4 (1-2):1–119, 2017.
Ha, D. and Schmidhuber, J. Recurrent world models facili- tate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
Hamrick, J. B., Allen, K. R., Bapst, V., Zhu, T., McKee, K. R., Tenenbaum, J. B., and Battaglia, P. W. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.
Kipf, T. N. and Welling, M. Semi-supervised classifica- tion with graph convolutional networks. International Conference on Learning Representations, 2017.
Kireev, D. B. Chemnet: A novel neural network based method for graph/property mapping. Journal of chemical information and computer sciences, 35(2):175–180, 1995.
Kocsis, L. and Szepesv´ari, C. Bandit based monte-carlo planning. In European conference on machine learning, pp. 282–293. Springer, 2006.
Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Luu Nguyen, Q., Ho, S., Sloane, J., Wender, P., and Pande, V. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci., 3(10): 1103–1113, 2017.
Meier, F., Kappler, D., Ratliff, N., and Schaal, S. Towards ro- bust online inverse dynamics learning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4034–4039. IEEE, 2016.
Merkwirth, C. and Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. Journal of chemical information and modeling, 45(5):1159–1168, 2005.
Morgan, H. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5 (2):107–113, 1965.
Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pp. 2, 2000.
Nguyen-Tuong, D. and Peters, J. Using model knowledge for learning inverse dynamics. In 2010 IEEE International Conference on Robotics and Automation, pp. 2677– 2682. IEEE, 2010.
Niepert, M., Ahmed, M., and Kutzkov, K. Learning con- volutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.
Rogers, D. and Hahn, M. Extended-connectivity finger- prints. Journal of chemical information and modeling, 50 (5):742–754, 2010.
Rozenberg, G. Handbook of Graph Grammars and Comp., volume 1. World scientific, 1997.
Schmidhuber, J. Making the world differentiable: On us- ing supervised learning fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technische Universit¨at M¨unchen Tech. Report: FKI-126-90, 1990.
Segler, M., Preuß, M., and Waller, M. P. Towards ”Al- phachem”: Chemical synthesis planning with tree search and deep neural network policies. ICLR Workshop track, preprint arXiv:1702.00020, 2017.
Segler, M. H. and Waller, M. P. Modelling Chemical Rea- soning to Predict and Invent Reactions. Chem. Eur. J., 23(25):6118–6128, 2017a. ISSN 15213765. doi: 10.1002/chem.201604556.
Segler, M. H. and Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J., 23(25):5966–5971, 2017b.
Segler, M. H., Preuss, M., and Waller, M. P. Planning chem- ical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 2nd edition, 2018.
Todd, M. H. Computer-aided organic synthesis. Chemical Society Reviews, 34(3):247–266, 2005.
Weisfeiler, B. and Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9): 12–16, 1968.
Wiener, H. Structural determination of paraffin boiling points. Journal of the American Chemical Society, 69(1): 17–20, 1947.
Yin, P., Neubig, G., Allamanis, M., Brockschmidt, M., and Gaunt, A. L. Learning to represent edits. arXiv preprint arXiv:1810.13337, 2018.