Bayesian Verification under Model Uncertainty

Machine learning enables systems to build and update domain models based on runtime observations. In this paper, we study statistical model checking and runtime verification for systems with this ability. Two challenges arise: (1) Models built from limited runtime data yield uncertainty to be dealt with. (2) There is no definition of satisfaction w.r.t. uncertain hypotheses. We propose such a definition of subjective satisfaction based on recently introduced satisfaction functions. We also propose the BV algorithm as a Bayesian solution to runtime verification of subjective satisfaction under model uncertainty. BV provides user-definable stochastic bounds for type I and II errors. We discuss empirical results of a toy experiment.

Statistical approaches to model checking and runtime verification exploit a domain model in order to evaluate system properties at design time and runtime [1]. The system simulates potential traces based on the domain model in order to establish some statistical guarantees about properties of interest.

Statistical verification is often based on a single domain model [2], [3], [4]. Machine learning enables systems to build and adapt models of their application domains based on runtime observations (see e.g. [5], [6]). In particular, Bayesian statistics makes it possible to infer and reason about infinitely many models [7]: Bayesian approaches quantify the plausibility of a particular model, given prior beliefs and observed data. A system that verifies itself at runtime has to cope with model uncertainty to establish reliable verification results: Which hypothesis should it assume when assessing system properties?

Model uncertainty induced by learning from limited runtime information also raises another issue: What exactly does it mean for a system to satisfy a particular property, given many model hypotheses and their respective plausibilities?

In this paper, we study statistical verification for systems that are able to build and update their models based on runtime information. The paper’s contributions are twofold:

We propose a definition of subjective satisfaction for systems that perform runtime verification based on limited information and possibly infinite hypothesis spaces.

We also propose a Bayesian verification algorithm, BV, that enables learning systems to decide on satisfaction or violation of required subjective satisfaction. BV provides user-definable stochastic bounds for type I and II errors (false negatives/positives). By construction, these error bounds are independent of the number of observations a system made about its environment.

We empirically establish the validity of BV’s error bounds on a toy example. The paper is structured as follows. Section II recaps Bayesian model checking and satisfaction functions for systems with parametrized models. In Section III we introduce our definition of subjective satisfaction. In Section IV we discuss the Bayesian treatment of verification under model uncertainty with the BV algorithm. In Section V, we describe the setup and results of an empirical evaluation of BV. We discuss related work in Section VI. Section VII concludes and discusses avenues for further research.

This section recalls Bayesian model checking and satisfaction functions for parametrized models.

A. Bayesian Model Checking

Bayesian model checking (BMC) is based on Bayesian sequential hypothesis testing, and aims to infer the posterior distribution of the probability that a system satisfies its requirements [3], [4]. In contrast to point estimation (e.g. maximum likelihood), the Bayesian posterior captures the uncertainty about the true probability that arises from only performing a finite number of system assessments.

Requirements may be formally specified in a suitable probabilistic temporal logic [8], [9].

BMC treats a bounded simulation run of a system with a particular configuration as a Bernoulli experiment: The run may either satisfy or violate requirements. As the simulation captures probabilistic domain dynamics, the result of a simulation run is Bernoulli distributed with a probability p ∈ [0, 1].

BMC infers a posterior distribution over p based on the observed simulation results and a prior assumption about the distribution of p. In general, the posterior is proportional to the likelihood of the observed data D (i.e. the results of simulation), multiplied by the prior distribution P(θ) over the parameters of interest, with θ = p in the case of BMC (Equation 1).

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \tag{1}$$

BMC models the uncertainty about p by a Beta distribution, the conjugate prior of the Bernoulli distribution. This choice ensures that the posterior has the same form as the prior, and thus enables efficient sequential updating of the distribution. The Beta distribution has two parameters a, b ∈ R+. In the case of BMC, a and b are determined by the numbers of successes and failures of the simulation runs. Given s successes and f failures, and assuming a uniform prior over p, the posterior (for θ = p) is given by Equation 2.

$$P(p \mid s, f) = \mathrm{Beta}(p;\, s + 1,\, f + 1) \tag{2}$$

Termination can be determined by assessing whether the probability mass above or below the required value preq meets a particular confidence requirement creq ∈ (0, 1). For alternative termination criteria, we refer to [4].
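The sequential update and the termination check can be sketched as follows. This is a minimal illustration rather than the implementation of [3], [4]: the function names `mass_above` and `bmc`, the `max_runs` cap and the coin-flip simulator are assumptions made for the example. For integer counts, the tail mass of the Beta posterior is computed exactly through its identity with a binomial sum, so only the standard library is needed.

```python
import math
import random


def mass_above(s, f, p_req):
    """Posterior mass P(p > p_req) under Beta(s + 1, f + 1).

    For integer parameters the Beta tail equals a binomial sum:
    P(p > x) = sum_{j=0}^{s} C(n, j) x^j (1 - x)^(n - j), with n = s + f + 1.
    """
    n = s + f + 1
    return sum(math.comb(n, j) * p_req**j * (1.0 - p_req) ** (n - j)
               for j in range(s + 1))


def bmc(simulate, p_req, c_req, max_runs=10_000):
    """Sequential Bayesian test of 'p > p_req' (hypothetical helper).

    `simulate()` returns True if one bounded run satisfies the requirement.
    Returns (decision, s, f); decision is None if max_runs is exhausted.
    """
    s = f = 0  # uniform prior Beta(1, 1): no successes or failures yet
    for _ in range(max_runs):
        if simulate():
            s += 1
        else:
            f += 1
        c = mass_above(s, f, p_req)
        if c >= c_req:          # enough mass above p_req: accept
            return True, s, f
        if 1.0 - c >= c_req:    # enough mass below p_req: reject
            return False, s, f
    return None, s, f


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy system: each run satisfies the requirement with probability 0.95.
    print(bmc(lambda: rng.random() < 0.95, p_req=0.8, c_req=0.95))
```

The closer the true p lies to preq, the more runs are needed before either tail reaches creq, which is why the sketch caps the number of runs.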

B. Satisfaction Functions

Many modern systems operate with models of the environment that are stochastic and parametrized, e.g. models built by machine learning. Classical statistical model checking algorithms, including classical BMC, assess requirement satisfaction for a single parametrization of the model. Recently, the satisfaction function was introduced as a concept that allows efficient, regression-based assessment of requirement satisfaction for parametrized models with potentially infinitely many parameter values [10]. At its core, the satisfaction function is defined as follows.

$$\theta \mapsto P(\mathit{sat} \mid \theta) \tag{3}$$

Here, sat denotes a boolean variable indicating requirement satisfaction or violation, and θ are the model parameters. The satisfaction probability depends on the particular parametrization of the model. Note, however, that the definition of the satisfaction function makes no assumptions about the distribution of the parameters themselves. We now combine estimates of the parameters with the satisfaction function in order to define what we call subjective satisfaction.

Consider a system that was able to make a limited number of observations about the dynamics of its environment. For example, consider a mobile agent whose moves may fail to have an effect with Bernoulli probability pfail ∈ [0, 1]. The agent may observe whether its moves are effective or not. Consider a situation where the agent observed its moves 10 times, out of which two had no effect. The following questions naturally arise:

What is pfail?

How confident can the agent be in its estimate of pfail?

With these two questions in mind, consider now a situation in which the agent finds itself in a grid world with obstacles at particular positions. The agent also has a sequence of movements to execute in order to fulfill some given task, e.g. computed by a planning component. Suppose there is a requirement that the agent may hit only a limited number of obstacles (e.g. 2), and that this must hold with at least a specified probability preq ∈ [0, 1]. Another question arises: What is the probability psat that the sequence of movements will satisfy the requirement, given the limited observations about pfail?
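For the example of 10 observed moves with 2 failures, the first two questions have a direct Bayesian answer. The sketch below assumes a uniform Beta(1, 1) prior over pfail (the same prior as in Section II-A; the text does not fix one at this point), and the helper names `beta_posterior` and `beta_mean_std` are hypothetical.

```python
import math


def beta_posterior(failures, successes, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior over pfail (uniform prior by default)."""
    return prior_a + failures, prior_b + successes


def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


a, b = beta_posterior(failures=2, successes=8)
mean, std = beta_mean_std(a, b)
mle = 2 / 10  # point estimate that ignores the uncertainty
print(f"posterior Beta({a:g}, {b:g}): mean={mean:.3f}, std={std:.3f}, MLE={mle:.3f}")
# prints: posterior Beta(3, 9): mean=0.250, std=0.120, MLE=0.200
```

Under this assumption the estimate of pfail is about 0.25, and the standard deviation of roughly 0.12 quantifies how little ten observations constrain it.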

In this setup, an agent has to cope with various uncertainties:

1) Domain uncertainty is inherent to the environment, in our example given by pfail. It is aleatoric, therefore irreducible, and originates from the physical setup of the domain (e.g. sensory abstraction, laws of physics, etc.). Note that domain uncertainty in combination with requirements uniquely defines a satisfaction function (cf. Section II-B).

2) Model uncertainty is the epistemic uncertainty about the aleatoric domain uncertainty. It arises from the limited number of observations that the agent is able to collect from its environment. Note that model uncertainty does not only arise from models learned at runtime. All empirically assessed models carry this kind of uncertainty, in particular all models built with machine learning approaches, regardless of their position in a system’s development lifecycle.

3) Subjective satisfaction, the uncertainty about a plan satisfying (or violating) a requirement in a particular situation, is also epistemic. It is a consequence of domain uncertainty, model uncertainty, and the given system requirements.

The relation of domain and model uncertainty can be modeled in a Bayesian way. This is a widely adopted view, and a vast body of literature and techniques exists for estimating model uncertainty P(θ|D) based on available domain observations D [5], [7]. For readability, we write P(θ) for P(θ|D) in the remainder of the paper. We now combine model uncertainty with the satisfaction function to define subjective satisfaction psat.

$$p_{\mathit{sat}} = \int_{\theta} P(\mathit{sat} \mid \theta)\, P(\theta)\, d\theta \tag{4}$$

Subjective satisfaction psat can be interpreted as the parameter of a Bernoulli distribution that models uncertainty about satisfaction of the requirements. Intuitively, Eq. 4 weights the satisfaction probability for given parameters by the probability that these parameters represent the ground truth. Subjective satisfaction thus considers all hypothetical domain parametrizations at once, and weights their respective satisfaction probabilities according to their plausibility (which is based on domain observations).
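A small numerical illustration of this weighting follows. It rests on assumptions that are not part of the paper: a toy satisfaction function P(sat|θ) = (1 − θ)^k (a plan of k moves satisfies the requirement only if no move fails) and a Beta(3, 9) model uncertainty. For this particular choice the integral has a closed form, which the Monte Carlo estimate can be checked against.

```python
import random


def sat_prob(theta, k):
    """Assumed toy satisfaction function P(sat | theta) = (1 - theta)^k."""
    return (1.0 - theta) ** k


def subjective_sat_mc(a, b, k, n_samples, rng):
    """Monte Carlo estimate of psat = integral of P(sat | theta) P(theta) d theta."""
    hits = 0
    for _ in range(n_samples):
        theta = rng.betavariate(a, b)               # theta ~ P(theta)
        hits += rng.random() < sat_prob(theta, k)   # sat ~ P(sat | theta)
    return hits / n_samples


def subjective_sat_exact(a, b, k):
    """Closed form E[(1 - theta)^k] for theta ~ Beta(a, b)."""
    result = 1.0
    for i in range(k):
        result *= (b + i) / (a + b + i)
    return result


rng = random.Random(42)
exact = subjective_sat_exact(3, 9, 3)
estimate = subjective_sat_mc(3, 9, 3, 100_000, rng)
print(f"exact={exact:.4f}, Monte Carlo={estimate:.4f}")
```

In this toy case the exact value is 990/2184 ≈ 0.453, whereas plugging the posterior mean 0.25 into the satisfaction function gives 0.75³ ≈ 0.422. Averaging over hypotheses and evaluating a single hypothesis are therefore different quantities in general.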

We now define Bayesian Verification (BV), an algorithm for estimating subjective satisfaction by Monte Carlo simulation. By taking a Bayesian stance, we also obtain a confidence measure for this estimate. In fact, because satisfaction is assessed with a limited number of simulations, an additional source of uncertainty arises: the uncertainty about the estimate of psat. BV establishes and updates a probability distribution P(psat) to quantify this uncertainty, and uses it to decide on termination. BV takes the following inputs.

The current system state s.

P(θ), the system’s model uncertainty.

A probabilistic simulation model of the domain dynamics M, parametrized by θ. M takes a state, a plan, a requirement and a parametrization, and yields a boolean variable indicating requirement satisfaction. I.e. this model implicitly provides the satisfaction function P(sat|θ).

The system’s plan π to be assessed.

A system requirement φ, e.g. a temporal logic formula.

A required probability preq of satisfying φ.

A required confidence creq in the estimate of psat.

BV is shown in Algorithm 1. BV first initializes its estimate of psat. As satisfaction of a requirement in a stochastic domain can be interpreted as a Bernoulli random variable, we model the belief about psat with a Beta distribution and start from a uniform prior, Beta(1, 1) (line 2).

We define the confidence that psat lies above the required satisfaction probability preq as the probability mass of P(psat) above preq.

$$P(p_{\mathit{sat}} > p_{\mathit{req}}) = \int_{p_{\mathit{req}}}^{1} P(p_{\mathit{sat}})\, dp_{\mathit{sat}} \tag{5}$$

BV updates its estimate P(psat) and uses it to decide whether satisfaction (or violation) can be asserted with at least the required confidence (cf. Equation 5). To this end, it repeats the following steps.

1) A sample parametrization is drawn from the model uncertainty P(θ) (line 4).

2) A simulation run is performed w.r.t. state, plan, requirement and parameters (line 5). Note that the distribution of the simulation result accounts for both model uncertainty and the satisfaction function, as the parametrization has been sampled from the model uncertainty beforehand. That is, sat ∼ P(sat|θ)P(θ) at this point.

3) The simulation result is used to update the belief distribution about psat (line 6).

4) The probability mass of the belief distribution is used to determine whether satisfaction or violation has been assessed with at least the required confidence (Eq. 5). If so, the algorithm terminates accordingly (lines 7 and 8).
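The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `bv`, `mass_above`, `sample_theta` and `simulate` are hypothetical, the `max_iter` cap is an added safeguard, and the toy simulator in the usage part is invented for the example. The Beta tail mass is computed with the exact binomial-sum identity for integer counts.

```python
import math
import random


def mass_above(s, f, p_req):
    """P(psat > p_req) under Beta(s + 1, f + 1), via the exact binomial-sum identity."""
    n = s + f + 1
    return sum(math.comb(n, j) * p_req**j * (1.0 - p_req) ** (n - j)
               for j in range(s + 1))


def bv(state, sample_theta, simulate, plan, phi, p_req, c_req, max_iter=100_000):
    """Sketch of BV. Returns True (satisfied), False (violated), or None
    if `max_iter` (an added safeguard, not part of the description) is hit."""
    s = f = 0                                     # Beta(1, 1) prior over psat
    for _ in range(max_iter):
        theta = sample_theta()                    # 1) theta ~ P(theta)
        sat = simulate(state, plan, phi, theta)   # 2) sat ~ P(sat | theta)
        if sat:                                   # 3) update the belief over psat
            s += 1
        else:
            f += 1
        c = mass_above(s, f, p_req)               # 4) confidence as in Eq. 5
        if c >= c_req:
            return True
        if 1.0 - c >= c_req:
            return False
    return None


rng = random.Random(1)


def toy_simulate(state, plan, phi, theta):
    # Assumed toy model: the plan satisfies phi iff none of its moves fails.
    return all(rng.random() >= theta for _ in plan)


# Assumed model uncertainty after observing 2 failures in 100 moves: Beta(3, 99).
result = bv(None, lambda: rng.betavariate(3, 99), toy_simulate,
            plan=["up", "right"], phi=None, p_req=0.5, c_req=0.95)
print(result)
```

Passing the sampler and the simulator as callables keeps the loop independent of how P(θ) is represented, so a sampler backed by MCMC or variational inference could be substituted without changing `bv`.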


We empirically assessed BV on a toy example. While we modeled a very simple example, it may be worth noting that in general the Bayesian approach to model uncertainty scales up to much larger models. There exist varied and powerful tools for sampling from complex, high-dimensional posteriors P(θ), such as Markov Chain Monte Carlo (see e.g. [11] for a very interesting read), or variational inference (e.g. [12]).

A. Setup

The state s consists of a 10 × 10 grid world, with the agent at position (0, 0). Obstacles are randomly positioned, at an obstacle-to-free-position ratio of 0.2. The agent is presented with a plan π (an action sequence) of 10 movements (up, down, left, right, with the obvious semantics). The agent has a Bernoulli action failure probability pfail, sampled uniformly from [0, 1]. Action failure results in the inverse movement (e.g. a failed up yields down). The agent is presented with a number of observations about its failure probability before running BV. We build model uncertainty P(θ) about θ = pfail with a Beta distribution (cf. Eq. 2 and Section IV).

In our setting, φ is the requirement to hit fewer than three obstacles while executing the plan. We set preq = 0.9. This means the agent may classify a plan as satisfying the requirement if it hits fewer than three obstacles in at least ninety percent of executions. We use a confidence requirement of creq = 0.95.
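A simulation model for this setup might look like the sketch below. Several details are not specified in the text and are assumptions of the example: a move that would leave the grid is ignored, an attempt to enter an obstacle cell counts as one hit and leaves the agent in place, up increases the second coordinate, and the obstacle set is hand-picked rather than randomly generated. The function names are hypothetical.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
INVERSE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def simulate(obstacles, plan, p_fail, rng, size=10, start=(0, 0), max_hits=2):
    """One stochastic execution of `plan`; True iff at most `max_hits` obstacles are hit."""
    x, y = start
    hits = 0
    for action in plan:
        if rng.random() < p_fail:       # action failure yields the inverse movement
            action = INVERSE[action]
        dx, dy = MOVES[action]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < size and 0 <= ny < size):
            continue                    # assumption: moves off the grid are ignored
        if (nx, ny) in obstacles:
            hits += 1                   # assumption: a hit leaves the agent in place
            continue
        x, y = nx, ny
    return hits <= max_hits


def estimate_sat(obstacles, plan, p_fail, rng, runs=10_000):
    """Monte Carlo estimate of P(sat | theta = p_fail) for one plan."""
    return sum(simulate(obstacles, plan, p_fail, rng) for _ in range(runs)) / runs


rng = random.Random(7)
obstacles = {(2, 0), (1, 1), (3, 2)}                 # hand-picked for the example
plan = ["right", "up", "right", "up", "right"] * 2   # 10 movements
print(estimate_sat(obstacles, plan, p_fail=0.2, rng=rng))
```

With `max_hits=2` the return value encodes the requirement of hitting fewer than three obstacles; averaging it over many runs approximates the satisfaction function at a single value of pfail.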

We approximate the ground truth satisfaction probability of a plan π by the maximum likelihood estimate of the satisfaction probability based on 10,000 simulation runs. We assessed two error types.

A type I error is an incorrect rejection. This occurs if a plan π satisfies φ with at least probability preq and is falsely rejected.

A type II error is a false accept. This occurs if a plan π violates the requirements (i.e. φ is not satisfied with at least probability preq) and is falsely accepted.

We also assessed a variant of BV that does not explicitly build model uncertainty from observations, but rather uses the corresponding maximum likelihood estimate (p̂fail = observed failures / number of observations). Line 4 of Algorithm 1 is changed accordingly to θ ← p̂fail.
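The difference between the two variants lies only in how θ is obtained in each iteration, which the following sketch makes explicit. The factory names `make_bayes_sampler` and `make_mle_sampler` are hypothetical, and the Bayesian sampler assumes a uniform Beta(1, 1) prior over pfail.

```python
import random


def make_bayes_sampler(failures, observations, rng):
    """theta ~ Beta(failures + 1, successes + 1): keeps the model uncertainty."""
    successes = observations - failures
    return lambda: rng.betavariate(failures + 1, successes + 1)


def make_mle_sampler(failures, observations):
    """theta <- p_hat = failures / observations: a point estimate, no uncertainty."""
    p_hat = failures / observations
    return lambda: p_hat


rng = random.Random(3)
bayes = make_bayes_sampler(failures=2, observations=10, rng=rng)
mle = make_mle_sampler(failures=2, observations=10)
print([round(bayes(), 3) for _ in range(5)])  # varies from draw to draw
print([mle() for _ in range(5)])              # always 0.2
```

With few observations the Bayesian sampler spreads its draws over many plausible failure probabilities, while the maximum likelihood variant simulates every run under the same single hypothesis.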

Our implementation of the setup and BV is available at https://github.com/jazzbob/bv.

B. Results

Results are recorded for 10 randomly sampled observations about the action failure probability. An exemplary result of our experiments is shown in Figure 1. The left plot shows accumulated type I errors over the course of different setups (i.e. randomly generated environments paired with random plans), the right plot shows accumulated type II errors. The dashed line shows the required statistical error bound (0.05 for creq = 0.95).

In particular, BV is able to establish the required statistical error bounds for both error types, while the MLE approach that is not explicitly using model uncertainty for inference fails to do so for type II errors. We observed this behavior for various numbers of observations presented to the system.


Fig. 1. Type I (left) and II (right) errors. X-axis shows number of tested situations (s, π). Vertical axis shows accumulated number of type I and II errors.

BV is an instance of statistical model checking in general [2], and of Bayesian statistical model checking in particular [3], [4]. Typically, these approaches assume that a perfect model is available, and do not deal with explicitly quantified epistemic model uncertainty. One of the starting points of the current article is the work on smoothed model checking (SMC) [10]. SMC approximates a satisfaction function w.r.t. uncertain model parameters by Gaussian process regression. However, SMC does not incorporate distributions over model parameters for system assessment. Our definition of subjective satisfaction is a direct consequence of combining quantified model uncertainty with SMC’s satisfaction function. Parametrized Bayesian model checking for dynamic Bayesian networks (DBNs) [13] does deal with quantified model uncertainty. However, the author does not exploit the posterior for bounding or estimating errors: the algorithm terminates when the posterior variance “is less than some user-specified threshold”, which does not yield statistical error estimates or bounds. We argue that in the context of software engineering, quantifiable error guarantees or estimates play a key role for system assessment. A quite different approach to quantitative system assessment under model uncertainty is formal verification with confidence intervals (FACT) [14]. It is based on (exhaustive) probabilistic model checking, and therefore allows a more thorough analysis than BV, which is approximate and (temporally) bounded. However, for the same reason, FACT suffers from state space explosion. FACT models uncertainty in terms of frequentist confidence intervals, in contrast to BV’s Bayesian modeling approach.

We have presented a Bayesian approach to statistical model checking under model uncertainty. We introduced the notion of subjective satisfaction as a result of combining recently introduced satisfaction functions with model uncertainty. We also presented Bayesian Verification (BV), an approximate Monte Carlo style algorithm for assessing subjective satisfaction based on simulation. BV allows for user-specified confidence bounds, and thus makes it possible to statistically bound verification errors. We empirically evaluated BV on a toy example with positive results.

There are some limitations to the BV algorithm. When psat is close to preq, BV may take many iterations to establish the required confidence. Note that this property is independent of the absolute value of psat. Similar to Bayesian model checking based on a fixed model, BV scales well with required satisfaction probabilities close to one (see e.g. [4]). BV’s error bounds are statistical: they do not provide a hard upper bound. That is, the bound may be surpassed temporarily when operating BV (e.g. an error may occur even when running BV only once, yielding an error rate of one). Also, while we observed empirically that the error bound was not severely violated for our toy problem, there may be an intimate connection to the choice of prior for P(θ). Studying the connection between prior and error bound would probably yield interesting directions for further research. Another limitation of BV is its boundedness in terms of search depth. To address this, it would be interesting to increase the quality of satisfaction estimates, for example by adding global, previously trained satisfaction estimators to BV.

The authors would like to thank Martin Wirsing and Matthias Hölzl for many inspiring discussions that led us into the direction of research presented in this paper.

[1] M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” in Proc. 23rd International Conference on Computer Aided Verification (CAV’11), ser. LNCS, G. Gopalakrishnan and S. Qadeer, Eds., vol. 6806. Springer, 2011, pp. 585–591.

[2] A. Legay, B. Delahaye, and S. Bensalem, “Statistical model checking: An overview,” in International Conference on Runtime Verification. Springer, 2010, pp. 122–135.

[3] S. K. Jha, E. M. Clarke, C. J. Langmead, A. Legay, A. Platzer, and P. Zuliani, “A bayesian approach to model checking biological systems,” in International Conference on Computational Methods in Systems Biology. Springer, 2009, pp. 218–234.

[4] P. Zuliani, A. Platzer, and E. M. Clarke, “Bayesian statistical model checking with application to simulink/stateflow verification,” in Proceedings of the 13th ACM international conference on Hybrid systems: computation and control. ACM, 2010, pp. 243–252.

[5] D. J. MacKay, Information theory, inference and learning algorithms. Cambridge university press, 2003.

[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[7] E. T. Jaynes, Probability theory: The logic of science. Cambridge university press, 2003.

[8] A. Pnueli, “The temporal logic of programs,” in Foundations of Computer Science, 1977., 18th Annual Symposium on. IEEE, 1977, pp. 46–57.

[9] C. Baier, J.-P. Katoen, and K. G. Larsen, Principles of model checking. MIT press, 2008.

[10] L. Bortolussi, D. Milios, and G. Sanguinetti, “Smoothed model checking for uncertain continuous-time Markov chains,” Information and Computation, vol. 247, pp. 235–253, 2016.

[11] P. Diaconis, “The Markov chain Monte Carlo revolution,” Bulletin of the American Mathematical Society, vol. 46, no. 2, pp. 179–205, 2009.

[12] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.

[13] C. J. Langmead, “Generalized queries and bayesian statistical model checking in dynamic bayesian networks: Application to personalized medicine,” 2009.

[14] R. Calinescu, C. Ghezzi, K. Johnson, M. Pezzè, Y. Rafiq, and G. Tamburrelli, “Formal verification with confidence intervals to establish quality of service properties of software systems,” IEEE Transactions on Reliability, vol. 65, no. 1, pp. 107–125, 2016.
