Bayesian Learning of Causal Relationships for System Reliability

2020·arXiv

Abstract

1 Introduction

The Bayesian Network (BN) has proved to be one of several efficient machine learning graphical tools for modelling conditional independencies and making inference about relationships between many variables. Concurrently, advances in causal modelling have led to a better understanding of how to predict complex systems when these are subjected to control. A major breakthrough occurred about 20 years ago when it was discovered that causal and graphical modelling could be combined. This paper reports on how the combination of these two technologies is being applied to give more insight into causal hypotheses that relate specifically to reliability and system safety.

Of course causal ideas have been embedded within reliability theory for a long time, both to explore the reasons behind a failure and to estimate the efficacy of various interventions in the system that might ameliorate adverse outcomes. However, the graphical frameworks around which these ideas appear have been tree-like – for example fault trees or event chains (Leveson 2011) – rather than the most common graphical framework for analyzing causation: the BN. Only recently have methodologies and algorithms emerged for causal discovery and causal reasoning on such tree structures. The primary such graph is the chain event graph (CEG), see Thwaites, Smith and Riccomagno (2010), Barclay, Hutton and Smith (2013) and Collazo et al. (2018). This class of probabilistic graphical model enables us to search for and then hypothesize causal relations between events and to estimate effects of external interventions via a Bayesian hierarchical learning process, transferring technologies originally designed for BNs. Here we harness this CEG technology transfer to develop new causal methodologies that shed new light on causal analysis within reliability theory.

In this paper we import the concepts of “root causes” and “remedial work”, which are currently central ideas within reliability, into a graphically based causal model and relate these to engineers’ written explanations. A “root cause” of a failure is the initial contributing factor that leads to this defect, and the aim of ideal “remedial work” after a failure is to fix this root cause. In reliability theory, a root cause analysis is usually supported by a fault tree analysis where “bottom level” nodes in the tree depict root causes (Gano 2011). In contrast, the model framework we develop here is described by trees that respect the chronological order in which events might happen and so, in particular, the causal story behind the fault. This alternative representation then enables us to capture various causal hypotheses as fragments of these CEGs. They provide the vehicle through which reported engineering expert judgements can be systematically embedded into a statistical analysis of failure data. These causal fragments are first extracted from the natural texts obtained from a maintenance log. We then demonstrate how a Bayesian learning algorithm constructed from these extracted explanations can provide automatically generated inferential support about the root causes of failures. We argue that, if properly implemented, such methods could provide powerful decision tools for an automated causal fault analysis which, most importantly, can be scaled up. Expert judgements are captured both by the hypothesized framework depicting how faults might occur and by prior probabilities about the likelihood of each step in these paths to failure.

A standard BN causal model would classify “remedial work” as a particular type of external intervention (Pearl 2000, Iung et al. 2005). Here instead we use the more recently developed CEG causal algebra (Thwaites, Smith and Riccomagno 2010) to express this intervention. In order to do this, a remedial intervention needs to be carefully defined. In particular, we need to define a new “do” operation that makes it possible for a graphically supported causal analysis to apply to an intervened system in reliability studies.

In Section 2 we review the concept of a CEG. Then in Section 3 we give a detailed explanation of how we define a remedial intervention, show how to express such an intervention mathematically in terms of causal algebras, and show how to systematically reconstruct a CEG after a remedial intervention. This is followed by a simulation study in Section 4 demonstrating the efficacy of adding causal information into a reliability analysis.

2 Chain Event Graph and Causal Concepts

Here we develop a methodology to automatically accommodate all information embedded within maintenance logs of plants into the analysis of failure data. Such logs consist of numeric and categorical data, such as time and failure mode, but are also supplemented by natural texts written by engineers explaining what remedial actions they have chosen and why. These texts can be extremely informative about the causes, symptoms and defects that have occurred (Iung et al. 2005). A graphical framework is a useful tool to accommodate this information. We argue here that a tree-based framework such as a CEG, rather than a BN, should be the preferred such graph. So we next illustrate how to represent the explanations embedded within these texts to construct hypothesized chains of events that might have led to a failure within a CEG.

Figure 1. Example of a staged tree for security system is shown on the left, the event tree structure is

So let an event tree have an associated probability vector at each of its non-leaf vertices, where each edge emanating from a vertex is annotated by a component of that vertex's probability vector. A staged tree is a colored tree that embellishes an underlying event tree. Two vertices are said to be in the same stage, and are given the same color, if their probability vectors are equal up to a permutation of their components (Collazo et al. 2018). Edges out of two vertices in the same stage are then colored the same if they correspond to events assigned the same probability. A new graph, the CEG, is then defined as follows.

Definition 1 (Chain Event Graph) A Chain Event Graph (CEG) is a graph whose vertex set is the set of positions in the underlying staged tree. A position is the set of stages whose event subtrees have the same colored graph and the same associated sets of edge labels. Every position inherits its color from the staged tree. The edge set of the CEG is a set of edges between these vertices with the following properties. If there exist two edges in the staged tree whose source vertices are in the same position, then there exist corresponding edges in the CEG. If the vertices that these two edges point to are also in the same position, then the corresponding edges in the CEG coincide. The labels of edges in the new graph are inherited from the corresponding edges in the staged tree. The resulting graph is called a Chain Event Graph (Barclay et al. 2015, Collazo et al. 2018).
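The grouping of vertices into stages and then positions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the tree, its labels and its probabilities are hypothetical, stages are identified purely by comparing sorted probability vectors, and edge labels out of any one vertex are assumed to be distinct.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical staged tree: vertex -> list of (edge label, probability, child).
# Vertices that do not appear as keys are leaves.
tree = {
    "v0": [("power loss", F(1, 2), "v1"), ("sensor fault", F(3, 10), "v2"),
           ("wear", F(1, 5), "v5")],
    "v1": [("fail", F(1, 10), "v3"), ("recover", F(9, 10), "l1")],
    "v2": [("fail", F(1, 10), "v4"), ("recover", F(9, 10), "l2")],
    "v5": [("fail", F(1, 10), "v6"), ("recover", F(9, 10), "l3")],
    "v3": [("alarm", F(4, 5), "l4"), ("silent", F(1, 5), "l5")],
    "v4": [("alarm", F(4, 5), "l6"), ("silent", F(1, 5), "l7")],
    "v6": [("alarm", F(1, 2), "l8"), ("silent", F(1, 2), "l9")],
}


def stage_key(vertex, tree):
    """Probability vector of a vertex, sorted so that permutations coincide."""
    return tuple(sorted(p for _, p, _ in tree[vertex]))


def signature(vertex, tree):
    """Recursive description of the colored subtree rooted at a vertex."""
    if vertex not in tree:
        return ("leaf",)
    # Assumes distinct edge labels per vertex, so sorting only compares labels.
    branches = sorted((label, signature(child, tree))
                      for label, _, child in tree[vertex])
    return (stage_key(vertex, tree), tuple(branches))


def group_by(key_fn, tree):
    """Partition the non-leaf vertices by the value of key_fn."""
    groups = defaultdict(list)
    for vertex in tree:
        groups[key_fn(vertex, tree)].append(vertex)
    return sorted(sorted(g) for g in groups.values())


stages = group_by(stage_key, tree)
positions = group_by(signature, tree)
print("stages:   ", stages)
print("positions:", positions)
```

In this toy tree v1, v2 and v5 share a stage, but v5 sits in its own position because the subtree below it is colored differently. This is the distinction that Definition 1 relies on.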

A causal CEG orders events along its root-to-sink paths to be consistent with any hypothesized temporal ordering of events. This is especially useful when events that might happen after a control – here a remedial action – express various causal hypotheses. We will see that a remedial action will only be able to affect the train of events that happen after the intervention, i.e. downstream from the root of the CEG. An example of such a CEG is given in Figure 1, which starts with root causes: {oil supply, lightning}. The CEG of a reliability system starts with partitions of components and failure modes followed by paths describing putative causes, whilst symptoms are represented towards the sink of the CEG. An indicator variable defined at the end of each path indicates whether that part has failed.
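To make the role of root-to-sink paths concrete, the sketch below enumerates the paths of a small CEG and sums the probabilities of those ending in failure. The graph, its edge labels beyond the two root causes named above, the sink names and all probabilities are hypothetical values chosen only for illustration.

```python
# Hypothetical CEG: position -> list of (edge label, probability, next position).
# Positions that do not appear as keys are sinks.
ceg = {
    "w0": [("oil supply", 0.25, "w1"), ("lightning", 0.75, "w2")],
    "w1": [("leak", 0.5, "failed"), ("no leak", 0.5, "working")],
    "w2": [("surge", 0.25, "failed"), ("no surge", 0.75, "working")],
}


def root_to_sink_paths(ceg, vertex="w0", labels=(), prob=1.0):
    """Yield (edge labels, path probability, sink) for every root-to-sink path."""
    if vertex not in ceg:  # reached a sink
        yield labels, prob, vertex
        return
    for label, p, nxt in ceg[vertex]:
        yield from root_to_sink_paths(ceg, nxt, labels + (label,), prob * p)


def failure_probability(ceg, failed_sink="failed"):
    """Sum the probabilities of all paths whose failure indicator is set."""
    return sum(prob for _, prob, sink in root_to_sink_paths(ceg)
               if sink == failed_sink)


for labels, prob, sink in root_to_sink_paths(ceg):
    print(" -> ".join(labels), "|", sink, "|", prob)
print("P(failure) =", failure_probability(ceg))
```

Each path reads as a chronological story starting from a root cause, and the sink plays the role of the failure indicator at the end of the path.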

3 External Intervention

We next devise a bespoke intervention calculus analogous to the calculus of Pearl (2000) for BNs. This embellishes the tree-based CEG calculus in Collazo et al. (2018) to make it suitable for the domain of reliability. Here for brevity we will focus our discussion on expressing graphically the central concept of root causes and their potential remedies. This provides us with a framework within which natural language explanations in maintenance logs describing faults and remedial actions taken can be systematically embedded into particular families of probabilistic models subjected to control. For a discussion of the practicalities of how the expert judgements we refer to below are elicited, how data can be used to refine these estimates in non-vanilla examples and how NLP methods can be used to extract the causal fragments, see Yu et al. (2020).

3.1 A classification and new semantics for remedies

Previous work (Iung et al. 2005, Borgia et al. 2009, Cai et al. 2013) classified maintenance into “perfect/minimal/imperfect maintenance”. Here we use terminology that closely parallels the previous ones. If an intervention successfully addresses the root cause of a fault, it will be called a “remedy”. Under this intervention we will assume that all subcomponents associated with the root cause of the failure are “renewed” and the status of the remedied part is “as good as new” (“AGAN”) (Bedford and Cooke 2001, Andrzejczak 2015). Consider the result of a remedial intervention as observed from maintenance logs. The space of such results is partitioned so that the results are classified into three types.

Perfect remedy: the root cause is fully addressed and corrected by the intervention. A graphical illustration for this type is shown in Figure 2(a). The dashed directed edge from the last symptom to the root node represents the status of the corresponding piece of machinery being returned to just before any defect occurs: an “AGAN” status, coinciding with “perfect maintenance” as defined by Iung et al. (2005).

Imperfect remedy: when faults are remedied but the root cause is not, the result is termed an imperfect remedy. The condition of the defective equipment is returned to its status after the occurrence of the root cause, which is not “AGAN”; see Figure 2(b), where dashed directed lines point from the sink node to the vertices on the path downstream of the edge labelled as the root cause. Here, to remedy the root cause, additional maintenance will be required. This is represented by the grey vertex in Figure 2(b).

Uncertain remedy: when the result recorded in the logs is uncertain immediately after the intervention, it is classified as an uncertain remedy. Diagnostic information is not yet available, so the root cause of the failure cannot yet be determined. As yet unknown and undescribed follow-up maintenance might be needed: see Figure 2(c).

To define this graphical construction more generically, we next define a new variable to indicate whether a root cause is targeted by a remedial intervention. This is a binary vector used to indicate which root causes are targeted by the remedial action.

Figure 2. Different types of remedies

3.2 Causal algebra for remedial interventions

An adaptation of Pearl’s “do”-operation for use with a CEG is given in Collazo et al. (2018). Retain the notation of Section 2 used for a CEG in an “idle” system – i.e. one not under any control. Now for a manipulated CEG, where we have forced an event labelled on a particular edge to happen, the conditional probability of that edge is one, and all other edges emanating from the same vertex have conditional probability zero. Here we now need to introduce a new intervention on the CEG, designed specifically for
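The basic "do"-operation on a CEG described above, in which the forced edge receives probability one and its siblings receive probability zero, can be sketched as follows. The data structure, the function name `do` and the numbers are hypothetical, and the sketch covers only this standard manipulation, not the new remedial intervention.

```python
# Hypothetical idle CEG: position -> {edge label: conditional probability}.
idle = {
    "w0": {"oil supply": 0.3, "lightning": 0.7},
    "w1": {"leak": 0.4, "no leak": 0.6},
}


def do(edge_probs, vertex, forced_label):
    """Return manipulated edge probabilities with one edge forced to occur.

    The forced edge gets probability 1, every other edge out of the same
    vertex gets probability 0, and all other vertices are left unchanged.
    """
    if forced_label not in edge_probs[vertex]:
        raise KeyError(f"{forced_label!r} is not an edge out of {vertex!r}")
    manipulated = {v: dict(probs) for v, probs in edge_probs.items()}
    manipulated[vertex] = {
        label: 1.0 if label == forced_label else 0.0
        for label in edge_probs[vertex]
    }
    return manipulated


manipulated = do(idle, "w0", "oil supply")
print(manipulated["w0"])
print(manipulated["w1"])
```

The function returns a copy, so the idle system stays available for comparison with the manipulated one.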

The indicator is the tool that delivers the effect of intervention. Depending on the type of intervention, the indicator could be directly observed or have a distribution. Thus:

(i) Under perfect intervention:

(ii) Under imperfect intervention:

$$p(\boldsymbol{I}(r^*) \mid do(r^*), T) = \int \sum_{x_a} p(\boldsymbol{I}(r^*) \mid \boldsymbol{\gamma})\, f(\boldsymbol{\gamma} \mid x_a, r^*)\, p(x_a \mid do(r^*))\, d\boldsymbol{\gamma} \qquad (5)$$

(iii) Under uncertain intervention:

3.3 Causal algebras and the analysis of the impact of interventions in a CEG

A semi-Markov process defined on the CEG is now used to model the failures and maintenance activities described within a log. The state defined in the semi-Markov process is analogous to the state of machines represented on the tree. So we represent the renewal kernel in terms of transition probabilities and holding time distributions.
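For orientation, in a generic semi-Markov process the renewal kernel factorises into a transition probability and a conditional holding time distribution. The symbols $Q$, $p$, $F$, $X_n$ and $T_n$ below are assumed notation for this standard textbook form and are not necessarily those used in this paper.

```latex
Q_{ij}(t)
  = \Pr\left(X_{n+1} = j,\; T_{n+1} - T_n \le t \mid X_n = i\right)
  = p_{ij}\, F_{ij}(t),
\qquad \sum_{j} p_{ij} = 1 .
```

Here $p_{ij}$ plays the role of the transition probability along an edge and $F_{ij}$ the distribution of the holding time before that transition occurs.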

The effect of an intervention is to assign new probability distributions to a subset of root causes. Here both the transition probabilities and the holding time distributions associated with edges of the CEG can be a function of the chosen manipulation. Once the root cause labelled on an edge is remedied, it is less likely that the next failure is still caused by it. The transition probability along this edge typically reduces whilst its associated holding time is expected to increase.
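The qualitative effect described here can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: the Weibull holding times, the parameter values and the two root-cause labels are all hypothetical, and the "remedied" parameters are simply set by hand rather than derived from any update map.

```python
import random


def simulate(edges, n, seed=0):
    """Simulate n failures; return {label: (frequency, mean holding time)}.

    `edges` maps a root-cause label to its transition probability and the
    scale and shape of an assumed Weibull holding time distribution.
    """
    rng = random.Random(seed)
    labels = list(edges)
    weights = [edges[label]["prob"] for label in labels]
    counts = {label: 0 for label in labels}
    total_time = {label: 0.0 for label in labels}
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        counts[label] += 1
        # random.weibullvariate takes (scale, shape) in that order.
        total_time[label] += rng.weibullvariate(
            edges[label]["scale"], edges[label]["shape"])
    return {
        label: (counts[label] / n,
                total_time[label] / counts[label] if counts[label] else float("nan"))
        for label in labels
    }


# Hypothetical parameters before and after remedying the "oil supply" cause.
idle = {
    "oil supply": {"prob": 0.6, "scale": 10.0, "shape": 1.5},
    "lightning": {"prob": 0.4, "scale": 30.0, "shape": 1.5},
}
remedied = {
    "oil supply": {"prob": 0.2, "scale": 25.0, "shape": 1.5},
    "lightning": {"prob": 0.8, "scale": 30.0, "shape": 1.5},
}

idle_stats = simulate(idle, 20000, seed=1)
remedied_stats = simulate(remedied, 20000, seed=1)
print("idle:    ", idle_stats["oil supply"])
print("remedied:", remedied_stats["oil supply"])
```

After the remedy, the simulated share of failures attributed to the remedied root cause falls and its average holding time rises, which mirrors the direction of change stated in the text.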

To express this mathematically we next define explicit maps to update the distribution after control. Consider the new and old hyperparameter vectors, after and before manipulation respectively, for a transition probability vector. Define the map, labelled (8), that updates the old hyperparameter vector to the new one. Here a parameter controls the effect of the intervention on the hyperparameters. Assume this parameter is either a known value, set by experts, or has a known distribution with hyperparameters whose values are again set by experts. We assign a probability distribution to the intervention indicator: if the intervention is not perfect we learn this through a Bayesian model. The distributions and the associated parameters can be determined by domain experts. We allow flexibility in the choice of the transformation given different circumstances. The exact form of the map depends on expert judgements and any relevant past data.
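Since the exact form of the map is left to expert judgement, the sketch below shows only one hypothetical choice, to fix ideas. It assumes Dirichlet hyperparameters on the edges out of a vertex, a binary indicator vector marking the targeted root causes, and a strength parameter `kappa` in [0, 1) that shrinks the hyperparameters of targeted edges. None of these names or the multiplicative form come from the paper.

```python
def update_hyperparameters(alpha, indicator, kappa):
    """Hypothetical update map: shrink hyperparameters of targeted edges.

    alpha     -- old Dirichlet hyperparameters, one per edge out of a vertex
    indicator -- binary vector, 1 where the edge's root cause is targeted
    kappa     -- assumed intervention strength, 0 <= kappa < 1
    """
    if not 0.0 <= kappa < 1.0:
        raise ValueError("kappa must lie in [0, 1) to keep hyperparameters positive")
    if len(alpha) != len(indicator):
        raise ValueError("alpha and indicator must have the same length")
    return [a * (1.0 - kappa * i) for a, i in zip(alpha, indicator)]


def dirichlet_mean(alpha):
    """Prior mean transition probabilities implied by Dirichlet hyperparameters."""
    total = sum(alpha)
    return [a / total for a in alpha]


old_alpha = [6.0, 4.0]   # hypothetical prior for two root causes
targeted = [1, 0]        # only the first root cause is remedied
new_alpha = update_hyperparameters(old_alpha, targeted, kappa=0.5)

print("old mean:", dirichlet_mean(old_alpha))
print("new mean:", dirichlet_mean(new_alpha))
```

Under this choice the prior mean transition probability of the remedied root cause falls while the untargeted edge keeps its hyperparameter, so the direction of the change matches the behaviour described for remedied root causes.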

Let a further parameter represent the strength of an intervention on the holding time distributions for edges emanating from a given vertex. This is set by users or has a known distribution. Similarly, we define a function that maps the vector of hyperparameters for the holding time distributions over all edges emanating from that vertex before manipulation to the corresponding vector after manipulation. The methodology for choosing the precise form of these functions is beyond the scope of this paper. However, detailed explanations and examples of how to choose these functions in a variety of circumstances based on natural language texts are given in Yu et al. (2020).

4 Experiment

In this section we use a simulation study to show the potential additional value of embedding causal elements extracted from natural language texts into fault analysis. In particular, we demonstrate how estimation and prediction improve when such information is not ignored.

Here we use our revised framework to compare the estimation of parameters on a CEG that formally takes account of the applied intervention (for details see supplementary materials) with a model that does not, using an artificial dataset for which we have a perfect description of the system. This helps bound the advantages we can expect from utilizing natural language texts. Assume the ground truth tree shown in Figure 3(a), and that the stages and positions are known. Nodes in the same stage are colored in orange: {}, green: {} and pink: {} respectively. There are clusters of edges that share the same holding time distributions: {}, {}, {, }, {}, {}. For the data generating process, we set up ground truth transition probabilities and holding times for edges in the tree without specifying them separately for the intervened and non-intervened system. These parameters over edges can be estimated via the algorithm for either of the two regimes. We generated 10 groups of machines’ data, where the machines in each group are assumed to have the same maintenance history and share the same parameters for conditional probabilities and holding time distributions. In total, we generated 5 different datasets with sizes 500, 1000, 3000, 5000 and 10000 respectively.

Assume transition probabilities are and their holding time is drawn from . Prior elicitation for is discussed in Yu et.al (2020). An MCMC-within-Gibbs algorithm enabled parameters to be estimated. Agglomerative Hierarchical Clustering then selected a CEG representation for the data. Three levels of stages could logically be merged: . We ran 100 simulations for different
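Agglomerative Hierarchical Clustering for CEGs is commonly driven by a Bayes factor that compares a merged stage against the two separate stages. The sketch below assumes Dirichlet priors on transition probabilities and the usual convention of summing hyperparameters and counts when two situations are merged. It covers transition counts only, ignores holding times, and is not necessarily the scoring used in this paper; the counts are invented.

```python
from math import lgamma


def log_marginal(alpha, counts):
    """Log marginal likelihood of edge counts under a Dirichlet(alpha) prior.

    The multinomial coefficient is omitted because it cancels in comparisons.
    """
    return (lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
            + sum(lgamma(a + n) - lgamma(a) for a, n in zip(alpha, counts)))


def log_bayes_factor_merge(alpha1, counts1, alpha2, counts2):
    """Log Bayes factor of merging two situations into one stage.

    Positive values favour the merged stage, negative values favour
    keeping the two situations in separate stages.
    """
    merged_alpha = [a + b for a, b in zip(alpha1, alpha2)]
    merged_counts = [a + b for a, b in zip(counts1, counts2)]
    return (log_marginal(merged_alpha, merged_counts)
            - log_marginal(alpha1, counts1)
            - log_marginal(alpha2, counts2))


prior = [1.0, 1.0]  # hypothetical uniform prior on two outgoing edges
similar = log_bayes_factor_merge(prior, [30, 70], prior, [28, 72])
different = log_bayes_factor_merge(prior, [30, 70], prior, [80, 20])
print("similar situations:  ", similar)
print("different situations:", different)
```

A greedy clustering would repeatedly merge the pair of stages with the largest positive score and stop once no merge improves the marginal likelihood.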

Figure 3. (a) Event tree used for simulation study; (b) Comparison of models with and without

Figure 4. Plots of errors. The three plots on the top are errors for the intervened model; the three plots on the bottom are errors (Shenvi & Smith 2018) for the non-intervened model; the left two plots show the situational errors; the two plots in the middle show the cluster errors; the two plots on the right are average numbers of phantom units for each sample size.

Figure 3(b) presents the proportions of the simulated results that merge two stages in the same level or two edges in the same cluster. We can observe that the intervened model agrees with the ground truth in terms of the merging of stages. On the other hand, the non-intervened model cannot estimate the CEG structure well, especially for some stages, even with a large sample size. It is also evident that with a sample size greater than 3000, the merging of edges in terms of the holding time distributions can be well estimated by the algorithm when our intervention is acknowledged. The advantages of incorporating causal factors stand out especially for larger datasets when comparing the estimates with ground truth: see Figure 4. For example, for 5000 samples, the mean cluster error is 0.31 for the intervened model, 0.116 smaller than for the non-intervened model. We also compared the posterior estimates of mean holding times with the real ones. The intervened model has mean error 0.28 while the non-intervened model gives 0.32. The prediction in terms of transition probabilities is also improved by 0.16.

Our bespoke causal algebras vastly improve a statistical analysis, especially when informed by large samples. Estimates of conditional probabilities and mean holding times, and failure predictions, are all then much more accurate once textual information is used.

5 Discussion

This short paper outlines some of our recent work in creating causal algebras for remedial intervention. A full exposition, where we explain in detail the hierarchical Bayesian model for structure learning, the causal inference and our elicitation methodology, will be reported soon. Our next task is to scale up this methodology so that it becomes a feasibly implementable tool for larger logs where robust automatic natural language extraction is needed. There is much yet to do. However, we hope that we have demonstrated the promise of harnessing free text log data for a reliability analysis by devising a new graphically supported bespoke causal analysis.

Acknowledgement

This project is funded by the Engineering and Physical Sciences Research Council (EPSRC) and the statistics department of the University of Warwick. Professor Jim Q. Smith is supported by the Alan Turing Institute and EPSRC with grant number EP/K03 9628/.

References

Andrzejczak, K., Stochastic modelling of the repairable system, Journal of KONBiN, 35(1), 5-14, 2015.

Barclay, L.M., Hutton, J.L., & Smith, J.Q., Refining a Bayesian network using a chain event graph, International Journal of Approximate Reasoning, 54(9), 1300-1309, 2013.

Barclay, L.M., Collazo, R.A., Smith, J.Q., Thwaites, P.A. and Nicholson, A.E., The dynamic chain event graph, Electronic Journal of Statistics, 9(2), 2130-2169, 2015.

Bedford, T. and Cooke, R., Probabilistic Risk Analysis, Cambridge University Press, 2001.

Borgia, O., De Carlo, F., Peccianti, M., & Tucci, M., The use of dynamic object oriented Bayesian networks in reliability assessment: a case study, Recent advances in maintenance and infrastructure management. London: Springer-Verlag, 153-170, 2009.

Cai, B., Liu, Y., Zhang, Y., Fan, Q., & Yu, S., Dynamic Bayesian networks based performance evaluation of subsea blowout preventers in presence of imperfect repair, Expert Systems with Applications, 40(18), 7544-7554, 2013.

Collazo, R. A., Görgen, C., and Smith, J. Q., Chain event graphs, CRC Press, 2018.

Collazo, R. A., & Smith, J. Q., A new family of non-local priors for chain event graph model selection. Bayesian Analysis, 11(4), 1165-1201, 2016.

Eaton, D. and Murphy, K., Exact Bayesian structure learning from uncertain interventions, Artificial Intelligence and Statistics, 2007.

Gano, D. L., Reality Charting: Seven Steps to Effective Problem-solving and Strategies for Personal Success, Apollonian Publications, 2011.

Iung, B., Veron, M., Suhner, M. C., & Muller, A., Integration of maintenance strategies into prognosis process to decision-making aid on system operation, CIRP annals, 54(1), 5-8, 2005.

Leveson, N., Engineering a safer world: Systems thinking applied to safety, MIT Press, 2011.

Pearl, J., Causality: models, reasoning and inference, Vol. 29, MIT Press, Cambridge, 2000.

Shenvi, A., & Smith, J. Q., The Reduced Dynamic Chain Event Graph, arXiv preprint arXiv:1811.08872, 2018.

Thwaites, P., Smith, J.Q., and Riccomagno, E., Causal analysis with chain event graphs, Artificial Intelligence, 174(12-13), 889-909, 2010.

Yu, X., Smith, J.Q. & Nichols, L., Probabilistic Learning of Causal Reasoning in Reliability, 2020 (in preparation).
