Automated Deep Abstractions for Stochastic Chemical Reaction Networks

2020·Arxiv

Abstract

Abstract

Predicting stochastic cellular dynamics as emerging from the mechanistic models of molecular interactions is a long-standing challenge in systems biology: low-level chemical reaction network (CRN) models give raise to a highly-dimensional continuous-time Markov chain (CTMC) which is computationally demanding and often prohibitive to analyse in practice. A recently proposed abstraction method uses deep learning to replace this CTMC with a discrete-time continuous-space process, by training a mixture density deep neural network with traces sampled at regular time intervals (which can obtained either by simulating a given CRN or as time-series data from experiment). The major advantage of such abstraction is that it produces a computational model that is dramatically cheaper to execute, while preserving the statistical features of the training data. In general, the abstraction accuracy improves with the amount of training data. However, depending on a CRN, the overall quality of the method – the efficiency gain and abstraction accuracy – will also depend on the choice of neural network architecture given by hyper-parameters such as the layer types and connections between them. As a consequence, in practice, the modeller would have to take care of finding the suitable architecture manually, for each given CRN, through a tedious and time-consuming trial-and-error cycle.

In this paper, we propose to further automatise deep abstractions for stochastic CRNs, through learning the optimal neural network architecture along with learning the transition kernel of the abstract process. Automated search of the architecture makes the method applicable directly to any given CRN, which is time-saving for deep learning experts and crucial for non-specialists. We implement the method and demonstrate its performance on a number of representative CRNs with multi-modal emergent phenotypes.

1 Introduction

Understanding how the dynamics of complex biological systems such as a biological cell or a group of interacting cells emerges from the dynamics of low-level molecular interactions is one of the key challenges of systems and synthetic biology. Low-level mechanisms of molecular interactions are usually hypothesised in form of chemical reaction networks. Each reaction fires with a corresponding rate. A reaction network induces a stochastic dynamical system - continuous-time Markov chain (CTMC), describing how the state - a vector enumerating multiplicities of each of the species - changes over time upon firing of reactions. Computationally predicting the distribution of species over time from such CTMC is generally challenging, due to a huge number of reachable states, due to stochasticity and events happening at multiple time-scales. Two major approaches are used to analyse the CTMC. The first approach focuses on computing the transient evolution of the probability related to each state of the CTMC numerically. The transient distribution evolves according to the Kolmogorov forward equation (chemical master equation in the chemistry literature), and it is typically very difficult to solve the forward equations except for the simplest systems. The second approach is based on a statistical estimation of trace distribution and event probabilities of the CTMC by generating many sample traces [18], often referred to as stochastic simulation Gillespie algorithm (SSA). While this method generally allows to trade-off computational tractability with loosing precision, even simulating a single trace can still take considerable processor time, especially when some reactions fire at very fast time-scales relative to the global time horizon of interest. At the same time, we are often not interested in predictions at such small time-scales or transient distributions for each of the species. For all these reasons, it is desirable to develop model reduction techniques for stochastic reaction networks, which allow for efficient simulation, yet faithfully abstract the context-relevant emerging features of the hypothesised mechanism.

Example 1 (running example). For example, the following set of reactions constitutes a reaction network with three species , and six reactions:

where 0 1. This fast-and-slow network ([27], Eq. 16) may be interpreted as a gene slowly switching between two different expression modes. In Fig.1, we show a sample trajectory for Ex.(1), where one can see notable differences in species’ abundance and reaction time-scales. Moreover, we may be interested only in abstractions reproducing the distribution of protein at several interesting time points, e.g. four at time points shown in Fig.2.

Deep abstractions, introduced in [8], propose to use available simulation algorithms to generate a suitable number of simulated system traces, and then learn an abstract model from such data. The task of learning a transition kernel for a Markov process that defines the abstract model is solved as a supervised learning problem: the transition kernel for this Markov process is modelled as a probability mixture with parameters depending on the system state, and a deep neural network is trained on simulated data to parameterise this probability mixture. Such abstract model preserves the statistics of the original network dynamics, but runs on a discrete time-scale representing equally distributed time intervals, abstracting away all intermediate transitions, which can lead to significant computational savings.

Contributions. The performance of any deep learning application largely depends on the choice of the neural network architecture, usually constructed by the user through a trial-and-error process. In context of applying deep abstractions proposed in [8], this means that the modeller would have to take care of finding the suitable architecture manually, for each given CRN. The main contribution of this paper is a framework for deep abstractions where the neural network architecture search is automated: in parallel to learning the kernel of the stochastic process, we learn a neural network architecture, by employing the recent advances on this topic in the deep learning community [22,9]. We implement our technique as a Python library StochNetV2 available on GitHub and we illustrate the quality of model reduction on different reaction network case studies.

Related Works. Different techniques on reducing stochastic CRNs have been proposed in literature and practice over the years. Classical limit approximations of deterministic limit, moments or mean-field approximation [4,11] can provide significant computational savings, but they do not apply to general stochastic CRNs, especially when species distributions emerging over time are non-Gaussian, as for example is the case shown in Ex.(1). Moreover, principled model reduction techniques have been proposed in several aggregation [12,26,16,17,15,28,13] and time-scale separation frameworks [21,5,20,24]. These techniques are generally based on detecting species, reactions or states which are behaviourally indistinguishable or similar. In these methods, the space of considered abstractions is typically discrete and as a consequence, it does not allow smooth tuning of abstracted entities, or control of abstraction accuracy. In other words, the achieved abstraction may or may not be significant, and once the method is applied, it is difficult to further improve it, both in terms of abstraction size, and accuracy. This is different to deep abstractions, where the abstraction accuracy can be improved by increasing the model size and/or adding more training data, and increasing time discretisation interval improves abstraction efficiency. Abstractions based on statistical analysis of traces, closest to the idea of deep abstractions in [8], include [25], who proposed to construct abstractions using information theory to discretise the state space and select a subset of all original variables and their mutual dependencies and a Dynamic Bayesian Network is constructed to produce state transitions, as well as a statistical approach to approximate dynamics of fast-and-slow models was developed by in [23], where Gaussian Processes are used to predict the state of the fast

Fig. 1. Sample trajectory of Ex.(1) network.

Fig. 2. Distribution (histogram) of the protein at times 20, 50, 100, and 200 for Ex.(1) network.

equilibrating internal process as a function of the environmental state. It is worth noting that all the mentioned reduction techniques, except from the exact frameworks based on syntactic criteria, such as in [17,12], do not guarantee error bounds a priori.

2 Backgound and Preliminaries

Neural Networks. In the last decade, deep neural networks (DNNs, NNs) gathered a lot of attention from researchers as well as from industry, bringing breakthroughs in various application areas, such as computer vision, time-series analysis, speech recognition, machine translation, etc. Neural networks are well known as a powerful and versatile framework for high-dimensional learning tasks. The key feature of neural networks is that they can represent an arbitrary function, mapping a set of input variables () to a set of output variables (). Further we denote them as x and y for simplicity.

Neural networks are typically formed by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions connected in a chain to form f(x) = ))). In this case, is called the first layer of the network, is called the second layer, and so on. The outputs of each layer are called features, or latent representation of the data. By

Fig. 3. Mixture Density Network structure. Given x, neural network outputs values and that define m Gaussian distributions. Weighted by mixing coeffi-cients , they form a mixture density - conditional probability density p(y|x).

adding more layers and more units within a layer, a deep network can represent functions of increasing complexity [19].

In particular, each layer usually computes a linear transformation Wx + b where W is a weight matrix and b is bias vector. Additional nonlinearities are inserted in between the layers which allow the network to represent arbitrary nonlinear functions (see the illustration in Fig.8.

NNs are trained on a set of training examples {(x, y)} with the aim not to memorize the data, but rather to learn the underlying dependencies within the variables, so that the output y can be predicted for unseen values of x. During training, the weights in a network are adjusted to minimize the learning objective - loss function - on training data. The quality of a trained model is usually measured on unseen (test) data, which is addressed as the model’s generalization ability.

Mixture Density Networks. However, conventional neural networks perform poorly on a class of problems involving the prediction of continuous variables, for which the target data is multi-valued. Minimising a sum-of-squares error encourages a network to approximate the conditional average of the target data, which does not capture the information on the distribution of output values.

Mixture Density Networks (MDN), proposed in [6], is a class of models which overcome these limitations by combining a conventional neural network with a mixture density model, illustrated in Fig 3. This neural network provides the parameters for multiple distributions, which are then mixed by the weights (also provided by the network). Therefore Mixture Density Networks can in principle represent arbitrary conditional distributions, in the same way that a conventional neural network can represent arbitrary non-linear functions.

To construct an abstract model defined by a Markov process, we need to learn its transition kernel. In other words, given a vector describing the system state at time t, we wish to predict the state , which follows a distribution, conditioned on . Therefore we can use Mixture Density Networks to learn the conditional distribution ).

3 Deep Abstractions

Here we present the main steps of the abstraction technique originally proposed in [8]. For more details we refer to the original paper.

Abstract Model. Let be the CTMC describing a CRN with the state space . As mentioned earlier, with the abstract model we wish to reproduce the dynamics of the original process at a fixed temporal resolution. Let be a discrete-time stochastic process such that

with fixed time interval and initial time . The resulting process ˜is a timehomogeneous discrete-time Markov chain (DTMC), with a transition kernel

Further, two following approximations take place:

1. The state space is embedded into the continuous space ˜. The abstract model takes values in ˜X.

2. The kernel is approximated by a new kernel ) operating in the continuous space ˜X. The kernel ) is modelled by a MDN.

To evaluate the abstract model, we introduce a time-bounded reward function r that monitors the properties we wish to preserve in the reduced model. This function, therefore, maps from the space of discrete-time trajectories to an arbitrary space T (here M is an upper bound on the length of discrete-time trajectories, and ˜denotes time-bounded trajectories). For example, it can be a projection, counting the populations of a subset of species, or it can take Boolean values corresponding to some linear temporal logic properties. Note that ) is a probability distribution on T.

As an error metric we use the distance d between the abstract distribution and ), which is evaluated statistically, as the distance among histograms, h [10]. In our experiments as a distance d we use metric:

or Intersection over Union (IoU) distance:

Here is the formal definition of model abstraction [8]:

Definition 1. Let be a discrete time stochastic process over an arbitrary state space S, with a time horizon, and let be the associated reward function. An abstraction of () is a tuple ( ¯) where:

Let 0. ¯is said to be -close to with respect to d if, for almost any ,

The simplest choice for the abstraction function could be an identity mapping. Alternatively, one can follow [25] to identify a subset of species having the most influence on the reward function. Inequality (6) is typically experimentally verified simulating a sufficiently high number of trajectories from both the original system ˜and the abstraction ¯starting from a common initial setting. As there is no way to ensure that the inequality holds for almost any in S, we evaluate it for many different initial settings that the model did not see during training. Evaluation examples are presented in supplementary material.

The abstract model for our example shown in (1) as follows:

– The abstract state space ¯S is , i.e. the continuous approximation of ;

– The abstraction function p is the identity function that maps each point of S into its continuous embedding in ¯S;

– The reward function r is the projection on the protein ; – The discrete time stochastic process ¯is a DTMC with transition kernel represented by an MDN trained on simulation data.

Dataset Generation. We build our datasets as a sets of pairs D := {(x, y)} where each y is a sample from the distribution ), i.e. each pair corresponds to a single transition in discrete-time trajectories ˜. For this, we simulate trajectories starting from random initial settings from to , and take the consecutive states (as (x, y) pairs

Example 3 (running example: dataset). For the example network 1, we simulate 100 trajectories for each of 100 random initial settings. We run simulations up to 200 time units, and fix the time step to 20 time units for both training and evaluation (histogram) dataset. Therefore the time horizon M for evaluation is 10 steps.

Model Training. Let M be a parameterized family of mixture distributions and be an MDN, where are network weights. Then, for every input vector x, . During training, weights are optimized to maximize the likelihood of samples y w.r.t. the parameterized distribution ), so that the kernel is approximated by K:

where := ˜is the infinity norm ball with radius centered in s, needed to properly compare approximating continuous distribution with the original discrete distribution. Though model training is a relatively time-consuming task, once we have a trained model, sampling states is extremely fast, especially with the use of highly parallelized GPU computations.

Abstract Model Simulation and Evaluation. With a trained MDN , for an initial state , the distribution of the system state at the next time instant is given by ), and the values of the next state ¯can be sampled from this distribution. This values then can be used to produce the next state, and so on:

Every iteration in this procedure has a fixed computational cost, and therefore choosing equal to the timescale of interest, we can simulate arbitrarily long trajectories at the needed level of time resolution, without wasting computational resources.

To evaluate the abstract model, we chose a number of random initial settings and for every setting we simulate (sufficiently many) trajectories from both the original and the abstract model up to the time horizon , evaluating the distance (6). Note that we train the MDN to approximate the kernel for a given , i.e. it approximates model dynamics step-wise. However, we can evaluate the abstract model using the reward function on a time span longer than . As noted in [8], the idea is that a good local approximation of the kernel should result in a good global approximation on longer time scales.

Example 4 (running example: evaluation). For the histogram dataset, we simulate 10000 with the stochastic simulation Gillespie algorithm (SSA) trajectories up to time 200 for 25 random initial settings, and extract state values of the discrete-time process ˜with fixed to 20 time units.

With the trained MDN, we simulate 10000 traces starting from the same initial settings for 10 consecutive steps and therefore obtain the values ¯.

Finally, we can evaluate the inequality (6), and estimate the average histogram distance. For every time-step i in range from 1 to M = 10, we average the distance )) over 25 initial settings. This displays the performance of the abstract model in predicting for many time steps in future, see Fig. 4. Note that, to draw the next state ¯, the model uses its own prediction from the previous step.

Fig. 4. Ex.(1): Mean histogram distance (IoU) for different time-lags (left) and sample histograms of protein concentration after 1 time step (middle) and 9 time steps (right).

4 Automated Architecture Search

The performance of machine learning algorithms depends heavily on the latent representation, i.e. a set of features. In contrast to simple machine learning algorithms, neural networks learn not only a mapping from representation to output but also the representation itself. As mentioned above, neural networks form the desired complex function from many small transformations represented by different layers. Each layer produces a new representation of the input features, which then can be used as an input to the following layers. The final representation, therefore, depends on the layer types used across the network, as well as on the graph describing connections between these layers.

Usually, neural networks are manually engineered by the experts via the trial-and-error procedure, which is a very time-consuming and error-prone process. Complex tasks require models of large size, which makes model design even more challenging.

Convolutional neural networks is a good example of a gain that comes from introducing incremental improvements in the network architecture. Step by step, in a series of publications, better design patterns were developed, improving the quality of models and reducing the computational demands. Even though a new model outperforms previous approaches, one could never argue that it is optimal. This raises an interest in the automated architecture search procedure that leads to the optimal model configuration given a task.

One of the first successes in this field was achieved in [29] where reinforcement learning was applied to discover novel architectures that outperformed humaninvented models on a set of tasks such as image classification, object detection, and semantic segmentation. It is worth to mention that it took 2000 GPU days of training to achieve this result. Later publications [22,9] introduced a gradient-based approach which significantly reduced required computational powers and allowed to achieve compatible results within one to two days using a single GPU.

In this work, we propose the algorithm inspired by [22,9] for the automated architecture design of MDN. Given a dataset and only a few hyper-parameters, it learns the architecture that best fits the data.

4.1 Our framework for automated neural network search

Broadly speaking, all NAS methods vary within three main aspects: search space, search policy, and evaluation policy.

Search space defines which architectures can be represented in principle. Therefore, to define a search space we fix a set of possible operation/layer candidates, a set of rules to connect them, and the architecture size (number of connections/layers).

Search policy describes a strategy of exploring the search space, e.g. random search, Bayesian optimization, evolutionary methods, reinforcement learning (RL), or gradient-based methods.

Evaluation policy includes the set of metrics of interest, such as accuracy on test data, number of parameters, latency, etc.

Search Space Similarly to DARTS architecture search method, proposed in [22], we consider a network that consists of several computational blocks or cells. A cell is a directed acyclic graph consisting of an ordered sequence of nodes. Each node is a hidden state (latent representation) and each directed edge (i, j) is associated with some operation that transforms .

Each cell has two input nodes and a single output node. The input nodes are the outputs of the previous two cells (or the model inputs if there are no previous cells). The output node is obtained by applying an aggregating operation (e.g. sum or mean) on the intermediate nodes. Each intermediate node is computed based on all the predecessors:

A special zero operation is also included to indicate a lack of connection between two nodes.

To allow expanding the feature space within a network, we define two kinds of cells: normal cell preserving the number of neurons(features) received at inputs, and expanding cell that produces d times more activations, where d is an expansion multiplier parameter, see Fig. 5 for illustration. To serve this purpose, additional expanding operations (e.g. Dense layer) are applied to the inputs of a cell to produce the first two (input) nodes of the desired size. The very first cell is usually an expanding cell, and the rest are normal.

Therefore, our model is defined by the number of cells , cell size , expansion multiplier d, and the set of operations on the edges. We consider , and d to be hyper-parameters defining the model backbone, and fix the set of operation candidates. The task of architecture search thus reduces to learning the operations on the edges of each cell.

Search Strategy The discrete search space we constructed leads to a challenging combinatorial optimisation problem, especially if we search for a model that is deep enough. As a neural network performs a chain of operations adjusted to each other, replacing even a single one requires a complete re-training. Therefore, each configuration in exponentially large search space should be trained separately. Gradient-based methods tackle this issue by introducing a continuous relaxation for the search space so that we can leverage gradients for effective optimization.

Let be the set of N candidate operations (e.g. dense, identity, zero, etc.). To represent any architecture in the search space, we build an over-parameterized network, where each unknown edge is set to be a mixed operation with N parallel paths.

First, we define weights for the edges as a softmax over N real-valued architecture parameters (note that outputs of softmax operation are positive and sum up to one, therefore we can treat weights as probabilities):

For each , only one operation (path) is sampled according to the probabilities to produce the output. Path binarization process defined in [9] is described by:

where are binary gates:

In this way, the task of learning the architecture reduces to learning a set of parameters within every cell. The final network is obtained by replacing each mixed operation by the operation having the largest weight: = arg max.

Optimization After building an over-parameterised network, our goal is to jointly optimise the architecture parameters and the weights w within all mixed operations. As discussed in [22], the best model generalisation is achieved by reformulating our objective as a bi-level optimisation problem. We minimise the validation loss ) w.r.t. , where the weights are obtained by minimising the training loss ). In other words, training is performed by altering two separate stages for several epochs each, see Fig. 6.

When training weight parameters, we first freeze the architecture parameters . Then for every example in the training dataset, we sample binary gates according to (11) and update the weight parameters of active paths via standard gradient descent.

Fig. 5. Learning a computational cell. a) : Operations connecting internal states are unknown and set to a mixture of candidate operations (colored edges). Every state is connected to all its predecessors. b) : During training, the weights for candidates are adjusted to prioritize the most important operations. c) : Operations with the highest weights are selected for every edge. Further, only two edges with the highest scores are selected to be inputs for each state.

When training architecture parameters, we freeze the weight parameters and update the architecture parameters on validation data. For every batch, binary gates are sampled w.r.t updated architecture parameters.

However, due to the nature of the binarization procedure, the paths probabilities are not directly involved in the computational graph, which means that we can not directly compute gradients

to update using the gradient descent. As it was proposed in [9,14], we update the architecture parameters using the gradient w.r.t. its corresponding binary gate , i.e. using instead of in (12).

Example 5 (running example: architecture search). We search for the network consisting of 2 cells each of size 2, the first cell is an expanding cell. We train for 100 epochs in total: first 20 epochs only networks weights are updated, and the following 80 epochs training is performed as on Fig. 6. See Fig. 7 for the example of learned architecture.

5 Implementation

The library has implementations for all parts of the workflow:

– defining a custom CRN model or importing SBML/Kappa file, – producing CRN trajectories and creating datasets for training and evaluation,

Fig. 6. Optimization stages. left: Weight parameters w are updated on training data while parameters are frozen. right: Architecture parameters are updated on validation data while w parameters are frozen.

Fig. 7. Ex.(1): learned network structure. Gray rectangles represent input nodes, blue - intermediate nodes, Green and yellow - output nodes (or cell inputs). Intermediate nodes compute the sum of values on incoming edges, output nodes - the average.

– training custom models and automated architecture search, – model evaluation, – generating traces with trained model, – various visualisations (traces, histograms, mixture parameters, architecture design, model benchmarking etc.).

To simulate traces more effectively, we provide scripts that run many simulations in parallel using multi-threading routines.

We use luigi [1] package as a workflow manager, which allows creating complex pipelines and managing complex dependencies. Neural networks, random variables, and optimisation algorithms are implemented in TensorFlow [2] and deep learning framework.

CRN Models and Simulation Data. CRN models are handled by gillespy python library [3], which allows to define a custom class for model or import a model in SBML format. Note that not all models can be imported correctly, due to high variability of the SBML format. In those cases one can use a custom class with some pre-processing of the imported model.

Simulated trajectories we split into training examples (x, y) := (˜). As neural networks show better convergence when the training data is standardised so that it varies in a reasonable range (e.g. [-1, 1] or [0, 1]) or has zero mean and variance, we apply a preprocess step to the input data such that it is scaled to a desired range. Then it is split into training (80%) and validation (20%) parts. The training dataset is used to optimise network weights, and validation dataset is used to optimise architecture parameters.

To increase generalisation capabilities of the model, it is important to build the dataset that covers the most variability of the original process. Although having more training data is always beneficial for a model, it increases training time. Therefore, depending on the variation of trajectories starting from the same initial conditions, we might prefer to run a few simulations for many initial conditions or more simulations for fewer initial conditions. When generating the dataset for evaluation, to make histograms more consistent, we usually simulate much more trajectories (from 1000 to 10000) for several initial settings.

Network Structure and Computational Cells. In our experiments, we learn the network typically constructed from two to three computational cells each of size 2 to 4. The first cell is expanding with a multiplier in a range from 10 to 20, and other cells are normal cells. Having multiple cells is not necessary, so it may consist of only one (larger) cell.

A computational cell described in the previous sections may also be altered. First, the expanding operations in the beginning of a cell can be represented either by Dense layers or by identity operations with tiling. For instance, for a multiplier 2 it transforms a vector () into a vector (). Second, we can vary the number of intermediate nodes of a cell being aggregated to produce the output (e.g. all intermediate nodes or the last one, two, etc.), as well as the aggregating operation itself (e.g. sum or mean). Smaller number of aggregated nodes may lead to smaller cells after removing redundant connections at the final stage of architecture selection. If all edges connecting some of the intermediate nodes with the output are pruned.

Random Variables and MDN Training. Our implementation has various random variables from Gaussian family: Normal Diagonal, Normal Triangular, Log-Normal Diagonal. Different combinations of these variables can be selected as the components for the mixture distribution, as long as their samples have the same dimensionality. In our experiments, we usually use from 5 to 8 components of Normal Diagonal variables. Replacing 2-3 Normal Diagonal components with the same number of Normal Triangular components sometimes improves model quality, though slows down both training and simulation times.

When training the architecture search, we have three main stages:

– heat-up stage, when weight parameters are trained for 10-20 epochs without updating the architecture parameters,

– architecture search stage, when weight parameters and architecture parameters are updated as described in NAS section (see Fig. 6), for 50 to 100 epochs in total, each turn of updating weight/architecture parameters typically lasts 3 to 5 epochs,

– fine-tuning stage, when the final architecture is fine-tuned for 10 to 20 epochs.

Table 1. Execution time required to complete each step of the abstraction pipeline for different models. The last two rows display the difference in simulation times between the Gillespie SSA algorithm and the MDN abstract model. All simulations here performed in similar conditions: for every model we simulate the same number of trajectories up to the same time horizon (typically 10-20 time steps ), using the same number of CPU cores.

6 Conclusions

In this paper, we proposed how to automatise deep abstractions for stochastic CRNs, through learning the optimal neural network architecture along with learning the transition kernel of the abstract process. Automated search of the architecture makes the method applicable directly to any given CRN, which is time-saving for deep learning experts and crucial for non-specialists. Contrary to the manual approach where the user has to create a neural network by hand, test it for his use-case, and adopt it accordingly, our method allows to find a solution with minimal efforts within a reasonable amount of time. We implement the method and demonstrated its performance on three representative CRNs, two of which exhibit multi-modal emergent phenotypes. Compared to the plain stochastic simulation, our method is significantly faster in almost all use-cases, see Table 1.

The proposed methodology, especially automated architecture search, enables fast simulation of computationally expensive Markov processes. As such, it opens up possibilities for efficiently simulating interactions between many individual entities, each described by a complex reaction network. Although our method is generic with respect to the model, it is sensitive to model population size. In short-term future work, we would like to relax this limitation and develop a strategy that is agnostic to the size of the system being modelled.

References

1. https://github.com/spotify/luigi

2. https://www.tensorflow.org/

3. https://github.com/johnabel/gillespy

4. Anderson, D., Kurtz, T.G.: Continuous-time Markov chain models for chemical reaction networks. Tech. rep., University of Wisconsin - Madison (Jul 2010)

5. Beica, A., Guet, C., Petrov, T.: Efficient reduction of kappa models by static inspection of the rule-set. In: Lecture Notes in Computer Science. pp. 173–191 (2015)

6. Bishop, C.M.: Mixture density networks (1994), http://publications.aston.ac. uk/id/eprint/373/

7. Bodei, C., Bortolussi, L., Chiarugi, D., Guerriero, M.L., Policriti, A., Ro- manel, A.: On the impact of discreteness and abstractions on modelling noise in gene regulatory networks. Computational Biology and Chemistry 56, 98 – 108 (2015). https://doi.org/https://doi.org/10.1016/j.compbiolchem.2015.04.004, http://www.sciencedirect.com/science/article/pii/S1476927115000547

8. Bortolussi, L., Palmieri, L.: Deep abstractions of chemical reaction networks. In: ˇCeˇska, M., ˇSafr´anek, D. (eds.) Computational Methods in Systems Biology. pp. 21–38. Springer International Publishing, Cham (2018)

9. Cai, H., Zhu, L., Han, S.: Proxylessnas: Direct neural architecture search on tar- get task and hardware. CoRR abs/1812.00332 (2018), http://arxiv.org/abs/ 1812.00332

10. Cao, Y., Petzold, L.: Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems. Journal of Computational Physics 212(1), 6 – 24 (2006). https://doi.org/https://doi.org/10.1016/j.jcp.2005.06.012, http://www. sciencedirect.com/science/article/pii/S0021999105003074

11. Cardelli, L., Kwiatkowska, M., Laurenti, L.: Stochastic analysis of chemical reac- tion networks using linear noise approximation. Biosystems 149, 26–33 (2016)

12. Cardelli, L., Tribastone, M., Tschaikowski, M., Vandin, A.: Syntactic markovian bisimulation for chemical reaction networks. In: Models, Algorithms, Logics and Tools, pp. 466–483. Springer (2017)

13. Conzelmann, H., Saez-Rodriguez, J., Sauter, T., Kholodenko, B.N., Gilles, E.D.: A domain-oriented approach to the reduction of combinatorial complexity in signal transduction networks. BMC Bioinformatics 7, 34 (2006)

14. Courbariaux, M., Bengio, Y., David, J.: Binaryconnect: Training deep neural net- works with binary weights during propagations. CoRR abs/1511.00363 (2015), http://arxiv.org/abs/1511.00363

15. Feret, J., Henzinger, T., Koeppl, H., Petrov, T.: Lumpability abstractions of rule- based systems. Theoretical Computer Science 431, 137–164 (2012)

16. Feret, J., Koeppl, H., Petrov, T.: Stochastic fragments: A framework for the exact reduction of the stochastic semantics of rule-based models. International Journal of Software and Informatics to appear (2013)

17. Ganguly, A., Petrov, T., Koeppl, H.: Markov chain aggregation and its applica- tions to combinatorial reaction networks. Journal of mathematical biology pp. 1–31 (2013)

18. Gillespie, D.: Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 81, 2340–2361 (1977)

19. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http: //www.deeplearningbook.org

20. Gunawardena, J.: A linear framework for time-scale separation in nonlinear bio- chemical systems. PloS one 7(5), e36321 (2012)

21. Henzinger, T.A., Mateescu, M., Mikeev, L., Wolf, V.: Hybrid numerical solution of the chemical master equation. CoRR (2010)

22. Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. CoRR abs/1806.09055 (2018), http://arxiv.org/abs/1806.09055

23. Michaelides, M., Hillston, J., Sanguinetti, G.: Statistical abstraction for multi- scale spatio-temporal systems. Lecture Notes in Computer Science p. 243258 (2017). https://doi.org/10.1007/978-3-319-66335-7-15, http://dx.doi.org/10. 1007/978-3-319-66335-7-15

24. Pahlajani, C.D., Atzberger, P.J., Khammash, M.: Stochastic reduction method for biological chemical kinetics using time-scale separation. Journal of theoretical biology 272(1), 96–112 (2011)

25. Palaniappan, S.K., Bertaux, F., Pichen, M., Fabre, E., Batt, G., Genest, B.: Abstracting the dynamics of biological pathways using information theory: a case study of apoptosis pathway. Bioinformatics 33(13), 1980–1986 (02 2017). https://doi.org/10.1093/bioinformatics/btx095, https://doi.org/10. 1093/bioinformatics/btx095

26. Petrov, T., Koeppl, H.: Approximate reductions of rule-based models. In: European Control Conference (ECC). pp. 4172–4177 (2013)

27. Plesa, T., Erban, R., Othmer, H.G.: Noise-induced mixing and multimodality in re- action networks. European Journal of Applied Mathematics 30(5), 887–911 (2019)

28. Tribastone, M., Vandin, A.: Speeding up stochastic and deterministic simulation by aggregation: an advanced tutorial. In: Proceedings of the 2018 Winter Simulation Conference. pp. 336–350. IEEE Press (2018)

29. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. CoRR abs/1611.01578 (2016), http://arxiv.org/abs/1611.01578

7 Supplementary Material

7.1 Background and Preliminaries

Fig. 8. A single layer of a neural network. Outputs are computed as a linear transfor- mation Wx + b followed by a non-linear activation function

7.2 EGFR

Epidermal growth-factor receptor (EGFR) reaction model of cellular signal trans- duction, with 25 reactions and 23 different molecular species:

Fig. 9. Three sample trajectories of EGFR network starting from same initial state for 50 time steps

Fig. 10. EGFR: traces simulated by Gillespie algorithm (dashed lines) and MDN (full lines) for 50 consecutive time steps,

Fig. 11. Simulation times for EGFR model. Left: MDN model, right: Gillespie simulation. Times are measured to simulate traces of length 5 for different combinations of number of initial settings and number of traces for each setting.

Fig. 12. EGFR: learned network structure. Gray rectangles represent input nodes, blue - intermediate nodes, Green and yellow - output nodes (or cell inputs). Intermediate nodes compute the sum of values on incoming edges, output nodes - the average.

7.3 Gene

Self-regulated gene network [7,8]: a single gene G is transcribed to produce copies of a mRNA signal molecule M, which are in turn translated into copies of a protein P; P acts as a repressor with respect to G - it binds to a DNA-silencer region, inhibiting gene transcription.

7.4 X16

The following fast-slow network ([27]) displays interesting dynamics with multi-modal species distribution changing through time, as well as for different initial settings:

Network (15) may be interpreted as describing a gene slowly switching between two expressions Gand G. When in state G, the gene produces and degrades protein P, while when in state G, it only produces P, but generally at a different rate than when it is in state G. Furthermore, Pmay also spontaneously degrade.

Fig. 13. Sample trajectory of gene regulatory network.

Fig. 14. Gene: mean histogram distance (intersection over union) averaged over 25 different initial settings.

Fig. 15. Gene: histograms of protein P concentration after 5 time steps (left) and 25 time steps (right).

Fig. 16. Gene: simulation times for MDN model (left) and Gillespie simulation (right). Times are measured to simulate traces of length 5 for different combinations of number of initial settings and number of traces for each setting.

Fig. 17. Gene: learned network structure. Gray rectangles represent input nodes, blue - intermediate nodes, Green and yellow - output nodes (or cell inputs). Intermediate nodes compute the sum of values on incoming edges, output nodes - the average.

Fig. 18. Sample trajectory of X16 network.

Fig. 19. X16: mean histogram distance (intersection over union) averaged over 25 different initial settings.

Fig. 20. X16: histograms of protein concentration after 1 time step (left) and 9 time steps (right).

Fig. 21. X16: simulation times for MDN model (left) and Gillespie simulation (right). Times are measured to simulate traces of length 5 for different combinations of number of initial settings and number of traces for each setting.

Fig. 22. X16: learned network structure. Gray rectangles represent input nodes, blue - intermediate nodes, Green and yellow - output nodes (or cell inputs). Intermediate nodes compute the sum of values on incoming edges, output nodes - the average.