Visual Summary of Value-level Feature Attribution in Prediction Classes with Recurrent Neural Networks
2020·arXiv
Abstract

Deep Recurrent Neural Networks (RNNs) are increasingly used in decision-making with temporal event sequences. However, understanding how RNN models produce final predictions remains a major challenge. Existing work on interpreting RNN models for sequence predictions often focuses on explaining predictions for individual data instances (e.g., patients or students). Because state-of-the-art RNN models are formed with millions of parameters optimized over millions of instances, explaining predictions for single data instances can easily miss the bigger picture. Besides, RNN models often use multi-hot encoding to represent the presence/absence of features, so feature attribution cannot be interpreted at the level of numeric values. We present ViSFA, an interactive system that visually summarizes feature attribution over time for different feature values. ViSFA scales to large data such as the MIMIC dataset containing the electronic health records of 1.2 million high-dimensional temporal events. We demonstrate that ViSFA can help us reason about RNN predictions and uncover insights from data by distilling complex attribution into compact and easy-to-interpret visualizations.


Deep Recurrent Neural Networks (RNNs) are pervasively used in reasoning and decision-making tasks for sequential data analysis. Owing to their remarkable performance and broad applicability, RNNs have helped solve problems in domains ranging from fundamental research such as natural language processing (NLP) [44] and video analysis to domain-specific research such as electronic health records (EHR) analysis [5], customer behavior analysis [42], and stock prediction [2].

Despite RNNs’ popularity and remarkable performance, end-users’ demands for model trust, fairness, and reliability limit their use in many critical real-world decision-making scenarios. One would be critical of a reliability-oriented paper that only cites accuracy statistics [8]. Even deliberately collected data can easily be biased, and models built with such data can be unreliable. For example, Ribeiro et al. [32] show a case where the snow backgrounds in the training images, instead of real morphological features, distinguish “huskies” from “wolves”. Such problems also occur in sequential data analysis, and they could cause fatal issues in applications such as health care, autonomous vehicles, or legal matters. How do we make sure a model’s decision is not based on biased facts? How do we provide trusted prediction systems that humans are confident to use? How do we guarantee that the decisions do not discriminate against a particular group? Solving these problems is urgent if we are to use machine learning models transparently and safely.

To address this subject, existing work has focused on visualizing RNN predictions for domain-specific tasks such as NLP [24, 39] by revealing RNNs’ inner mechanisms. However, RNN models are used more broadly in scientific research fields such as DNA sequence analysis [29], electronic health records (EHR) analysis [5], customer purchase intent analysis [35], and stock prediction [2]. Domain-specific explanations cannot fulfill the requirements of such diverse scenarios. For example, word-to-vector embeddings in the NLP domain are fundamentally different from the one-hot or multi-hot embeddings used in most applications, where each dimension of the embedding vector has a physical meaning. Owing to this difference, visual analysis designed for the NLP domain can differ markedly from general visual analysis of RNN predictions. Besides, interpreting inner-model behavior can be fragmentary in real practice and can potentially cause catastrophic harm to society [34]. Furthermore, to understand and trust a model’s predictions, humans need to know how multi-dimensional features and their temporal changes across entire prediction classes contribute to the prediction. Current visual interpretations often focus on explaining predictions for individual instances [18, 26]. Because deep learning models are fitted to the distribution of a large population of instances, explanations for individual instance predictions can easily miss the bigger picture of the insights learned by a model. Last, value-level feature attribution has not previously been addressed for model interpretation, yet explaining which values of a feature contribute to a particular class would be more meaningful than simply showing whether a feature has a high contribution. For example, knowing the importance of the feature “customer visits” provides less guidance than knowing how many visits are effective in retaining customers. Therefore, it’s often desirable to explain models by visualizing temporal feature value attribution for entire prediction classes in a domain-independent manner.

Our work is inspired by these unfulfilled requirements and attempts to resolve the following challenges.

Data complexity: datasets for RNN modeling can be seen as tensors that are composed of time, multi-dimensional features, and instances. In the setting of RNN modeling, each instance is a sequence and is labeled with a class. For example, in patient mortality analysis, a patient’s temporal medical history is associated with a mortality label (dead/alive). At each time-step, a temporal event is composed of multi-dimensional features representing values from medications, lab test results, etc. Beyond the complexity in data rank and dimension, analyzing such practical data often requires handling properties such as sparsity and noisiness.

The dilemma of generality versus visual complexity: complex visualization systems are often tailor-made for specific data characteristics or analytic tasks, and high generality requirements often limit the possible visualization designs. We aim to handle the complex computations at the backend in exchange for easy-to-comprehend visualization designs that give a model-agnostic, straightforward interpretation of data insights.

We introduce ViSFA, a visual analytics system that summarizes feature attribution in temporal sequences with RNNs. Our contributions are:

• A scalable and domain-independent visual analytics approach that summarizes feature attribution in entire prediction classes. We designed a series of modular algorithms based on stratified sampling and gap statistic clustering, which facilitate a fair comparison of contributing patterns in temporal value change between classes.

• An interactive visualization system for users to remove irrelevant noise and discover major contributing sequential patterns.

A deep neural network is typically trained with a large number of instances, each of which is annotated with a class label. A well-trained deep learning model can learn useful knowledge from the instances and form a complex network of neurons that transforms the instances into a probability for class prediction. However, how deep neural networks learn this knowledge and achieve their superior predictions remains unclear. Researchers from both the Machine Learning and Data Visualization fields have investigated the interpretability of deep learning models. However, there is no single interpretation method that can be applied to every deep learning model [8].

2.1 RNN Model Interpretation

Current approaches to interpreting RNN models fall into two major groups, depending on whether the RNN model is extended with extra structures for the purpose of interpretation.

To interpret the dynamics of deep neural networks, a few studies apply visual approaches to explain convolutional neural networks [3, 9, 13, 31, 38]. For understanding RNNs, model-generating factors, like hidden states, need to be explored. LSTMVis [39] visualizes the hidden state dynamics of RNNs with a parallel coordinates plot. To match similar top patterns, LSTMVis allows users to filter the input range and check hidden state vectors from heatmap matrices. For natural language processing tasks, Ming et al. [24] also employ a matrix design. Combining hidden state clusters and word clusters, Ming et al. [24] design a co-clustering layout, which links cluster matrices and word clouds.

In addition, extending RNNs with extra structures, like neural attention mechanisms, makes interpretation easier. The attention mechanism has become popular recently because the added attention layers allow the interpretation of a particular aspect of the input [1, 11, 43]. Besides, the attention mechanism has been shown to improve the performance of deep learning models. Hermann et al. [11] develop an attention-based method for deep neural nets for comprehending documents. Because it is easier for end-users to interpret, this group of research is often applied to real-world analysis in a variety of domains. However, only a few works focus on the visual interpretation of deep learning with an attention mechanism. RetainVis [18] is a rare example: an interpretable and interactive visual analytics tool for EHR data.

2.2 Additive Feature Attribution Methods

The highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret. To simplify model understanding, a branch of research uses additive feature attribution methods that treat deep learning models as black boxes and use simpler explanation models as approximations of the original model. Model-specific approximations such as DeepLIFT [36, 37] compare the activation of each neuron to its “reference activation” and assign contribution scores according to the difference. There are also model-agnostic methods such as LIME [32] and SHAP [20]. LIME interprets individual model predictions by locally approximating the model around a given prediction. SHAP uses Shapley values as a measure of each feature variable’s contribution to the model’s output. Simpler explanation models improve the computational performance of interpretation. Similar in spirit, Manifold [45] provides a visual analysis framework to support interpretation, debugging, and comparison of machine learning models. These approaches differ from our work in that they approximate RNN models with simpler explanation models, whereas we attempt to interpret the behavior of the original models.

2.3 RNN Application Fields

RNNs became popular for their high performance in the linguistic domain. The most well-known works are from the NLP field, such as machine translation and sentiment analysis [14]. Due to the popularity of NLP tasks, the majority of RNN interpretation work has sought to explain RNN models in the NLP context, such as [7, 19, 24, 39]. However, the visual analysis of NLP often requires special visualization techniques because language composed of individual words is a unique type of data. The architecture of RNNs is distinctive because connections between nodes form a directed graph along a temporal sequence. Therefore, RNNs can be applied to a broad range of sequence analysis applications, such as DNA sequence analysis [29], Electronic Health Records (EHR) analysis [5], customer purchase intent analysis [35], stock prediction [2], and so on. As discussed in the introduction, interpretability is critical to these application domains with regard to model fairness, reliability, and trust. However, little attention has been paid to the visual interpretation of RNNs from a more generic perspective, except for ProSeNet [25], which proposes an interpretable and steerable deep sequence model.

This paper attempts to bridge this gap and proposes a visual analysis method for summarizing contributing sequence patterns in predictions that can be applied to a broader range of analysis scenarios. Because end-users’ understanding is an urgent need, this work focuses on visualizing distilled patterns that map directly to the original data, based on an RNN model enhanced with an attention mechanism, which we choose because it is easy to interpret. Besides, to the best of our knowledge, no work explains how RNN models correlate contributing numeric/categorical values with the prediction except AttentionHeatmap [42]. This work extends AttentionHeatmap in the analysis flow and visualization designs.

We make a few observations about RNN models before visualizing feature attribution with them [24, 39, 42]. First, if an RNN model is successfully trained, the instances within a prediction class in the dataset are expected to share some common characteristics that can be captured by the RNN model. In other words, if the instances do not share any common pattern within any prediction class, it’s impractical to learn a convergent or high-performance RNN model. Second, the characteristics of different prediction classes are different. If the commonalities of one class cannot distinguish it from another, the model training cannot succeed. Third, the attention weight of an event reflects the importance of the event in making such a distinction. However, because state-of-the-art RNN models trained with real-world datasets can hardly achieve 100% accuracy, the learned attention weights are often not completely accurate. For example, LSTMVis [39] notices interpretable patterns but also significant noise when studying RNN models. Likewise, AttentionHeatmap [42] reveals that the filtered events contain noise that does not follow the major pattern no matter what attention range is selected by the user.

Based on the above observations, we derive the following design goals (DGs):

DG1: Facilitate the attribution analysis for tensors that are composed of dimensions including time, instance, feature, and feature value. Given multi-dimensional tensors, how do we synthesize meaningful visualizations for interpreting a model’s prediction? Knowing how entire classes share common patterns is not enough because these patterns do not necessarily contribute to the model formation, that is, to distinguishing different classes. Therefore, the system should help to distill complex data to find the contributing subset and visualize the patterns in the subset.

DG2: Highlight major patterns across the instances within each prediction class. As mentioned earlier, significant noise is observed when studying deep learning models, and the learned attention weights are noisy too. The visualization should remove or minimize the influence of noise and highlight major patterns in the data. Besides, the visualization should highlight common patterns shared among the instances in a prediction class. Those common patterns are the keys for users to find insights from each class.

DG3: Contrast differences between prediction classes. The visualization design should facilitate easy and fair comparison between different classes. For example, a visual comparison based on imbalanced class sizes can be unfair if the visualization results are affected by class sizes. The visualization design should guarantee that the visualized pattern is a true reflection of the class it belongs to rather than an effect of class size.

DG4: Be able to scale to large datasets. Because predictions based on state-of-the-art models are formed with millions of weights optimized over millions of data instances, explaining predictions for single data instances can often miss the bigger picture. Understanding how entire classes contribute to a model is important for trusting a model’s prediction and deciphering what a model has learned. Therefore, the design should build representations of entire classes regardless of the class size.

DG5: Be generic for different applications. RNNs have become widely used across different domains because of their generality. Even vanilla RNN models can be adapted to multiple disciplines, such as stock price forecasting in finance [2] and purchase intent prediction in customer analysis [35]. It’s challenging but meaningful to build a domain-independent visualization system. For users from different domains, we should provide easy-to-interpret interaction and visualization designs.

This approach leverages LSTM attention neural networks. The attention model computes a value that represents the importance of a particular temporal event at every time-step for all instances. Existing approaches, such as [5, 14], build end-to-end models for RNN interpretation. These methods are not sufficiently developed for many real-world applications. They often use multi-hot vector feature encoding, where 0/1 represents the presence/absence of a feature without any further detail. For example, in EHR data analysis, each temporal event is a patient visit. During a patient visit, multi-hot encoding records whether a treatment is performed or a type of medicine is applied. However, whether a treatment or medication is used at all is largely a matter of standard procedure. It’s more meaningful to study how much treatment or what dose of medication is more effective for a patient group.

Our work solves the problem by using numeric feature encoding so that the analysis result can help understand the importance of different feature values. Numeric encoding carries information at a finer granularity than multi-hot encoding. For example, knowing that a patient takes antibiotics every day is less informative than knowing the patient’s antibiotic dose every day. It is more difficult to tune a model with data of high granularity due to its complexity. However, by including actual values in the learning process, the analysis can be more practical for studies. In the above example of EHR data analysis, our approach keeps track of the contribution of each feature value. The subsequent contribution analysis is more likely to provide insights that can help doctors make future decisions, such as what treatment to apply to patients.
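The difference between the two encodings can be illustrated with a minimal Python sketch. The feature names, the example values, and the use of `None` for an absent feature are assumptions made for illustration only and are not taken from the paper.

```python
# Hypothetical feature names, chosen only for illustration.
FEATURES = ["antibiotic_mg", "insulin_units", "heart_rate"]


def multi_hot(visit):
    """Return 1 if a feature was recorded during the visit, else 0."""
    return [1 if name in visit else 0 for name in FEATURES]


def numeric(visit):
    """Return the recorded value of each feature, or None when absent."""
    return [visit.get(name) for name in FEATURES]


# Two visits that differ only in the antibiotic dose.
visit_a = {"antibiotic_mg": 250.0, "heart_rate": 72.0}
visit_b = {"antibiotic_mg": 1000.0, "heart_rate": 72.0}

# Multi-hot encoding cannot tell the two visits apart ...
same_multi_hot = multi_hot(visit_a) == multi_hot(visit_b)
# ... while numeric encoding preserves the difference in dose.
same_numeric = numeric(visit_a) == numeric(visit_b)
```

In this toy case both visits map to the same multi-hot vector, so any attribution computed on that encoding can only say that the antibiotic feature matters, not which dose does.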

AttentionHeatmap [42] presents a visual analytics interface for RNN feature attribution analysis, which is composed of the matrix grid view for visualizing time-folded feature attribution and the k-partite graph view for displaying attribution change over time. The interface of AttentionHeatmap is designed for user groups such as RNN model trainers and data scientists. For ultimate end-users such as domain analysts who have no machine learning knowledge, we create ViSFA to help them focus on the major patterns distilled from data. For time-folded feature attribution analysis, ViSFA improves on AttentionHeatmap by integrating a ranking algorithm that sorts features in descending order of contribution. ViSFA improves the visual analytics procedures by providing comparable visual summaries of temporal feature attribution and lets end-users interactively remove noise (DG1). We introduce more details in the following sections.

5.1 Data & Control Flow

We pre-process the data so that the time-steps of all instances are aligned reasonably. Each instance $x_i$ in the tensor space $\mathbb{R}^{T \times F \times I}$ is a temporal sequence $\{x_i^1, \dots, x_i^{t-1}, x_i^t, x_i^{t+1}, \dots, x_i^T\}$ that is of length $T$ and labeled with a class $y_i$ in the training dataset $D = \{x_i, y_i\}$, where $t \in \{1, \dots, T\}$ and $y \in \{1, \dots, L\}$ is one of $L$ categorical labels. $f \in \{1, \dots, F\}$ indexes one of the $F$ feature dimensions, whose value range is $[v^f_{\min}, v^f_{\max}]$, and each event $x_i^t$ can be further expanded to a multidimensional expression $\{x_i^{t,1}, \dots, x_i^{t,f-1}, x_i^{t,f}, x_i^{t,f+1}, \dots, x_i^{t,F}\}$ where $x_i^{t,f} \in [v^f_{\min}, v^f_{\max}]$. Therefore, the research problem becomes calculating a contribution score $s_f$ for each feature and visualizing the temporal change in the values of highly contributing features. But how do we calculate the contribution of each feature?

Fortunately, the attention mechanism for RNN models is designed to solve this problem. The RNN model training process uses the dataset $D$ to form a complex model $\hat{y}_i = g(x_i)$, where $g$ transforms input data $x$ to $y$ using millions of parameters. During the training process, the parameters are determined by minimizing a chosen loss function $\sum_{t=1}^{T} L_t(y_i^t, \hat{y}_i^t)$. The attention mechanism plugs an attention network into the original RNN model and learns a set of parameters $\{a_i^1, \dots, a_i^{t-1}, a_i^t, a_i^{t+1}, \dots, a_i^T\}$, where $a_i^t \in [0, 1]$ represents the importance of each event in the corresponding instance’s temporal sequence. The two-level attention model further computes a set of parameters $\{a_i^{t,1}, \dots, a_i^{t,f-1}, a_i^{t,f}, a_i^{t,f+1}, \dots, a_i^{t,F}\}$ at each temporal event (with event-level weight $a_i^t$), one for each corresponding feature.

Figure 1 illustrates the high-level workflow of ViSFA. The visual summary module is composed of (part 1) the time-folded summary module and (part 2) the module for visually summarizing contributing temporal patterns. We apply the matrix grid view of AttentionHeatmap [42] in the time-folded summary module. Our work enhances the matrix grid view with a ranking algorithm that comprehensively computes the contribution of each feature. ViSFA focuses on the temporal pattern analysis, and Figure 1 illustrates part 2.

First, the tensor of multidimensional temporal sequences is fed into an RNN model to train a classifier that can predict the class for all sequences. The RNN model must be validated and have a user-verified prediction performance. The attention mechanism then outputs the attention sequence $a^T$. Thereafter, two sets of data are the input of the visualization module: the original data containing instances and their labels $D = \{x_i, y_i\}$ (B) and the attention sequences $a^T$ (C), which are associated with each temporal sequence and reflect the importance of each temporal event in the sequences. C works like a mask applied to B. A user can filter the original temporal events with an attention range of interest (AOI). Specifically, after the user selects an AOI, the feature values whose corresponding attention values are outside the AOI become “NULL” in the vector representation (D). In deep learning, empty events are usually padded with zeros for easy matrix computation. However, in our visual analysis step, padding the unselected events with zeros would be ambiguous because a zero could represent either an event of no interest or a contributing value. Therefore, the value range of feature $f$ becomes $\{\emptyset \cup [v^f_{\min}, v^f_{\max}]\}$, where $\emptyset$ denotes NULL. In particular, zero values of a feature can be important if their contribution is high.
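The masking step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the system's implementation: the function name `apply_aoi` is hypothetical, the AOI is treated as a closed interval, `None` stands in for NULL, and only the event-level attention weight is used.

```python
def apply_aoi(sequence, attention, aoi_low, aoi_high):
    """Mask the events whose attention weight lies outside the AOI.

    sequence  : list of T events, each a list of F feature values
    attention : list of T attention weights in [0, 1]

    Events outside [aoi_low, aoi_high] are replaced by None in every
    feature dimension, so a genuine 0.0 value stays distinguishable
    from an unselected event.
    """
    if len(sequence) != len(attention):
        raise ValueError("sequence and attention must have equal length")
    masked = []
    for event, weight in zip(sequence, attention):
        if aoi_low <= weight <= aoi_high:
            masked.append(list(event))           # keep the original values
        else:
            masked.append([None] * len(event))   # NULL, not zero
    return masked


# Toy sequence with T = 3 events and F = 2 features.
sequence = [[0.0, 5.0], [1.0, 2.0], [3.0, 0.0]]
attention = [0.9, 0.1, 0.6]
filtered = apply_aoi(sequence, attention, 0.5, 1.0)
```

The zero values in the first and third events survive the filter as real values, while the second event is entirely NULL, which is the distinction that zero padding would lose.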

ViSFA then transforms the filtered data with a sequence of operations such as data sampling and noise reduction (E), as shown in Figure 1. To show the common temporal patterns distilled by these operations, ViSFA clusters the temporal sequences and automatically estimates the number of clusters (F). To fulfill the design goals, we tested several operational and visualization design alternatives and settled on a series of algorithms that keep users in the loop. We explain the details in the following sections.

5.2 Ranking Feature Contribution

Two factors are taken into consideration when determining the overall contribution score $s_f$ for feature $f$: the average frequency of all feature values $C$ and the variance $V$. If the values of $f$ are divided into $M$ bins $b^f_m$ ($m = 1, \dots, M$), the definition of $s_f$ is

image

where

image

and $V(x^f)$ represents the variance of array $x^f$. $|x|_c$ represents the cardinality of $x$. Term $C$ determines whether contributing feature values between classes are profoundly different on average. Term $V$ measures how differently the bin values contribute to classification. Therefore, the score $s_f$ combines the contribution of feature values and their variations. Because $C > 0$ and $V > 0$, we have $s_f > 0$.

image

Fig. 1: Workflow of ViSFA.

AttentionHeatmap summarizes time-folded feature attributions with matrix grids [42]. We extend the matrix grid visualization with a ranking algorithm so that the feature with the highest score is at the top/left and the one with the lowest score is at the bottom/right. With the ranking function, users can instantly locate features of interest (FOIs) instead of searching for FOIs by visually comparing matrices in the matrix grid. Users can also interactively select the number of features shown in order to remove the features with low contributions.
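The ordering and truncation step can be sketched as below. The sketch assumes the contribution scores have already been computed; the function name `rank_features`, the `top_k` parameter, and the example scores are hypothetical.

```python
def rank_features(scores, top_k=None):
    """Order feature names by descending contribution score.

    scores : dict mapping a feature name to its (precomputed) score
    top_k  : optional number of top-ranked features to keep
    """
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up scores for three hypothetical features.
example_scores = {"heart_rate": 0.2, "antibiotic_mg": 0.9, "insulin_units": 0.5}
ranking = rank_features(example_scores)
top_two = rank_features(example_scores, top_k=2)
```

Keeping the truncation as a parameter mirrors the interaction described above, where the user chooses how many features remain in the matrix grid.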

After users locate FOIs, a primary contribution of ViSFA is helping them compare the contributing sequential patterns between different classes. Such visual comparisons face challenges, such as different numbers of instances in the two classes and significant noise in the computed feature attribution. We design a series of procedures to solve these problems. We present these procedures, marked D, E, and F in Figure 1, in the following sections. We provide a summary of these procedures in Algorithm 1.

5.3 Between-class Comparison – Stratified Sampling

Fair comparisons are a prerequisite (DG3) when comparing the contributing sequences between entire prediction classes. However, sequences from two prediction classes can be fundamentally different in size, distribution, and contributing values. Visually comparing them by simply aggregating the sequences within each class can be misleading. For instance, a between-class comparison showing the aggregated values for two classes with imbalanced numbers of instances is ambiguous because an overall difference can be caused by either the values or the class sizes. Besides, the analysis should scale to summarize temporal patterns from large datasets (DG4). Even if a training dataset contains millions of instances or more, the analysis should still be able to visualize patterns for the different classes in the dataset. For a large dataset, some degree of aggregation or summarization is essential. Otherwise, the visualization can suffer from a limited canvas size with a juxtaposed design or from visual occlusion with an overlapped design. Additionally, the summary of contributing sequences should be a true reflection of the class it belongs to.

To meet these requirements, we propose a method based on stratified sampling after testing a few alternatives, such as random sampling. Stratified sampling is a probability sampling method; here it draws samples of equal size from all classes and selects the most representative instances from each class. In statistics, stratification is the process of dividing members of the population into homogeneous subgroups before sampling. We choose stratified sampling to sample instances from the data’s subpopulations because the sampled instances cover all possible subpopulations and are a substantive reflection of the original data population.

However, it’s non-trivial to find homogeneous subpopulations from a set of sequences $x^t$ belonging to a particular class, because these sequences are often high-dimensional (in time steps) and noisy. Determining the number of clusters $S$ is difficult for an unknown data distribution. Fortunately, the parameter $S$ is usually large for sampling tasks. For example, in our experiments $S$ is on the scale of 3K for a balanced training dataset in which each class contains 10K instances. That is, around 30% of the instance population is sampled. We leverage Hierarchical Agglomerative Clustering (HAC), which gradually calculates the increasing distances between instances and newly formed dendrogram nodes in a bottom-up direction. To save computation time, the HAC iteration stops when the number of nodes reaches $S$. As illustrated by a naive example in Figure 2d left, the iteration stops at the cut location (yellow line), and only the nodes below the cut are computed. The algorithm then randomly samples an instance from each sub-cluster to form $S$ samples in total. In Algorithm 1, the sampling procedure includes a computationally efficient HAC algorithm in lines 3-20, except that line 10 iterates through $N - S$ merges instead of $N - 1$ as in the original HAC algorithm. If $S$ is 30% of $N$, for instance, the loop runs about $1 - 0.3 = 0.7$ times as many iterations as looping through all $N$. Specifically, the algorithm uses extra memory to store the next-best-merge array NBM to improve the complexity [22]. The Cluster() function in line 21 then converts the node linkage table A1 to a tree structure and returns the clusters CS, which stores the instance indices for each cluster. Line 24 implements the sampling algorithm explained above. We calculate a transformed L2-norm distance between instances $a^T_i$ and $a^T_j$, which have NULL entries, as

image

where the numerator is the shortest L2-norm distance between two hyper-planes formed by the non-null dimensions and the denominator penalizes non-null dimensions.
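The early-stopping idea behind HAC-based stratified sampling can be sketched in Python as below. This is a simplified illustration under stated assumptions, not Algorithm 1: it uses a naive search for the closest pair instead of the next-best-merge array, average linkage with plain Euclidean distance instead of the NULL-aware distance above, and hypothetical function names.

```python
import math
import random


def euclidean(p, q):
    """Plain Euclidean distance (a stand-in for the paper's distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac_until(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Stopping early means only N - S merges are performed instead of
    the N - 1 merges of a full dendrogram.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def stratified_sample(points, n_samples, seed=0):
    """Pick one random instance index from each of n_samples clusters."""
    rng = random.Random(seed)
    clusters = hac_until(points, n_samples)
    return sorted(rng.choice(cluster) for cluster in clusters)


# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
sample = stratified_sample(points, 3)
```

Because one instance is drawn from every sub-cluster, the isolated point is always represented in the sample, which plain random sampling would not guarantee.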

HAC-based stratified sampling benefits the attribution analysis through several properties. First, it offers better precision and a better depiction of the population than other sampling methods. Second, the sampling running time decreases as the sample size S grows, so it is faster than sampling with other clustering algorithms such as k-means when the number of samples is large. Compared to k-means, HAC-based sampling methods do not have a convergence problem either. Third, the algorithm scales to the visual summary of a large dataset. Additionally, because it has no data prerequisites such as particular distribution assumptions or high dimensional density, the method generalizes to different applications (DG5). Last but not least, stratified sampling is closely aligned with the within-class temporal pattern summarization that follows.

5.4 Within-Class Summary with NoRCE

In accordance with DG2, ViSFA aims to provide sequence summaries for users to recognize temporally contributing patterns in each class. We introduce NoRCE, a systematic algorithm for noise reduction and cluster number estimation.

A few signature data properties make summarizing sequence patterns challenging. First, RNN training produces significant noise, and eliminating its influence is essential before further operations. Second, high-dimensional data is often sparse. Many algorithms are designed for dense datasets and produce artifacts when applied to sparse data. We will show an example later. Third, determining the number of clusters K is difficult for an unknown data distribution. Although dimension reduction techniques such as t-SNE are often used to visualize data distribution, the visualized distribution is only an approximation. Visualizing dimension-reduced data can be misleading, especially when the feature dimension is high, due to the curse of dimensionality. Determining the optimal K for such a dataset is even more challenging. Because deep neural networks are still “black boxes,” it’s unclear whether sequences from each class would homogeneously form more than one group. NoRCE is designed to resolve these difficulties.

(a) Example data set that homogeneously forms two clusters and an outlier (17).

(b) [Plot; axis label: Iteration.]

(c) A dendrogram illustration of HAC-based outlier removal. Left: original data, locating outlier 17 with dendrogram cutting. Right: new dendrogram forms after removing outlier. Numbers represent node IDs in sequential order. [Axis label: Instance.]

Fig. 2: Illustration of NoRCE with a simple example.

NoRCE first performs noise reduction. We present this procedure in lines 27-35 of Algorithm 1. Specifically, HAC() is the HAC calculation function that returns the nodes’ linkage table A2 created in the process. Figure 2 illustrates the concept of noise reduction using a simple 2D example, whose instances are shown in Figure 2a. Instance 17 (I17) is the outlier in the dataset, which otherwise forms two homogeneous clusters on the bottom left and top right. During the HAC dendrogram building process, newly formed nodes have lower similarities to nearby nodes than earlier nodes, as shown in Figure 2c, where node IDs are in progressive order so that nodes with smaller IDs form earlier. Therefore, the dendrogram iteration vs. distance curve is monotonically increasing, as Figure 2b shows for the example dataset. As shown in Figure 2c left, the distance suddenly increases when I17 joins the bottom-up building process of the dendrogram. Meanwhile, the iteration-distance curve exhibits an elbow point, a sudden jump in distance at iteration 36, which cuts through I17 as shown by the horizontal blue line. This phenomenon is not a coincidence, as outliers are distant from other nodes. We leverage this property and use the elbow point as a threshold to detect outliers, which are the instances covered by the dendrogram layers to the right of the elbow point. In Algorithm 1, the Distances() function on line 29 computes the iteration-distance curve from A2. NoRCE then computes the elbow point by smoothing the curve and finding the maximum absolute second derivative, as shown on line 30. As illustrated in Figure 2c left, the algorithm detects I17 as an outlier because it is the single instance whose closest distance to other instances is greater than the distance at the elbow point.
Users can also adjust the elbow point interactively, as explained in Section 5.6.
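The elbow detection described above can be sketched in a few lines. This is a minimal illustration, not the implementation in Algorithm 1: the function names, the centered moving-average smoother, and the window size are assumptions, and the input is assumed to be the list of merge distances per HAC iteration (the iteration-distance curve).

```python
def moving_average(values, window=3):
    """Smooth a curve with a centered moving average (shorter window at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def elbow_index(distances, window=3):
    """Iteration at which the smoothed curve has the largest absolute second derivative."""
    if len(distances) < 3:
        raise ValueError("need at least three merge distances")
    s = moving_average(distances, window)
    # Central second difference, defined for interior points only.
    curvature = [abs(s[i - 1] - 2 * s[i] + s[i + 1]) for i in range(1, len(s) - 1)]
    return 1 + max(range(len(curvature)), key=curvature.__getitem__)


def merges_after_elbow(distances, window=3):
    """Merge iterations to the right of the elbow, i.e. candidate outlier joins."""
    return list(range(elbow_index(distances, window) + 1, len(distances)))


# Hypothetical curve: seven ordinary merges, then one distant instance joins.
curve = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 9.0]
print(elbow_index(curve, window=1), merges_after_elbow(curve, window=1))
```

Note that a wider smoothing window can move the detected elbow an iteration or so earlier than the raw jump, which is one reason an interactive adjustment of the threshold is useful.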

image

After removing the outliers, NoRCE introduces an Adaptive Gap Statistic (AGS) algorithm to estimate the number of clusters. Gap Statistic (GS) [41] is a method for determining the number of clusters in a set of data by comparing the within-cluster dispersion to its expectation under an appropriate reference distribution of data. The GS method has been shown to outperform other methods for estimating the number of clusters, such as the average silhouette method [33] and the elbow method [15]. Figure 2d shows the results of the gap statistic before (red) and after (black) removing the outlier. The estimated cluster numbers correspond to the most significant gap values (yellow). After removing the outlier, the algorithm automatically detects two optimal clusters with the cut that splits the largest gap in the dendrogram, as shown in Figure 2d right.

We show the AGS algorithm in lines 36-56 of Algorithm 1. AGS takes two parameters, the number of references Nref and the maximum cluster number Kmax, and returns the estimated number of clusters OptK. We set Nref = 3 and Kmax = 10 in our experiments. The classic gap statistic algorithm creates randomized reference data within the original data range. However, for high-dimensional sparse data, such references often differ greatly from the original data distribution and lead to unreasonable cluster number estimates. Imagine a dataset where all instances are individual points in a 3D space. Sparse data means that most points in this space lie on the planes spanned by two axes or on one of the axes. Randomly generated data, however, would be evenly distributed in the 3D space, which differs from the sparse data distribution. The question is then how to generate reference data that has a similar distribution. Fortunately, NoRCE’s input Y is sampled from the original training data X. As shown in line 44 of Algorithm 1, the function AdaptiveReferencing() randomly samples the same number of instances as Y from the residual data X − Y as references. AGS then computes the gap values through the for loops and returns the optimal cluster number OptK, which corresponds to the largest gap value. Finally, line 56 computes the estimated clusters for later visualization. Users can also adjust the number of clusters interactively; we give more details in Section 5.6.
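The idea of adaptive referencing can be sketched as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the data are one-dimensional, single-linkage clustering (cutting the sorted values at the largest gaps) stands in for the HAC step, dispersion is the pooled within-cluster sum of squares, and all function names are hypothetical. What it keeps from the text is that references are drawn from the residual data X − Y rather than from a uniform distribution, and that OptK is the k with the largest gap value.

```python
import math
import random


def single_linkage_1d(values, k):
    """Split sorted 1-D values into k clusters by cutting at the k-1 largest gaps."""
    xs = sorted(values)
    by_gap = sorted(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1], reverse=True)
    bounds = [0] + sorted(by_gap[: k - 1]) + [len(xs)]
    return [xs[a:b] for a, b in zip(bounds, bounds[1:])]


def log_dispersion(values, k, eps=1e-12):
    """log W_k, with W_k the pooled within-cluster sum of squared deviations."""
    total = 0.0
    for cluster in single_linkage_1d(values, k):
        mean = sum(cluster) / len(cluster)
        total += sum((v - mean) ** 2 for v in cluster)
    return math.log(total + eps)  # eps guards against log(0) for singleton clusters


def adaptive_gap_statistic(y, residual, n_ref=3, k_max=10, seed=0):
    """Return (OptK, gap values); references are sampled from residual = X - Y."""
    if len(residual) < len(y):
        raise ValueError("residual data must contain at least len(y) instances")
    rng = random.Random(seed)
    k_max = min(k_max, len(y))
    refs = [rng.sample(residual, len(y)) for _ in range(n_ref)]
    gaps = []
    for k in range(1, k_max + 1):
        ref_mean = sum(log_dispersion(r, k) for r in refs) / n_ref
        gaps.append(ref_mean - log_dispersion(y, k))
    opt_k = 1 + max(range(len(gaps)), key=gaps.__getitem__)
    return opt_k, gaps


# Hypothetical data: Y has two tight groups, the residual is evenly spread.
y = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
residual = [0, 2, 4, 6, 8, 10]
opt_k, gaps = adaptive_gap_statistic(y, residual, n_ref=3, k_max=3)
print(opt_k)
```

The defaults n_ref=3 and k_max=10 mirror the settings reported above; everything else in the sketch is illustrative.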

5.5 Visualization Design

In this subsection, we focus on the visualization of contributing temporal pattern summaries and their comparison between prediction classes, based on the algorithm presented in the previous subsection.

5.5.1 Dashboard

We divide the dashboard on the left of Figure 3 into two parts. We visualize the statistical distributions of instance-level attributes on the top (a) and the event-level attention distribution on the bottom (b).

To provide users with an overview, the projection view embeds the high-dimensional temporal feature sequences into a two-dimensional canvas via t-SNE [21] (see Figure 3). The dimension reduction result reflects the similarity of instances in a bird’s-eye view. Besides, the distributions of instance attributes are indispensable for users to review the data in partitions. For example, analysts of student e-learning behavior often want to understand the distributions of students’ gender and education level. ViSFA lists bar charts in the attribute view (see Figure 3a) for all instance attributes. Specifically, we employ a contrasting color pair, purple and green, to encode the bars for the two classes. This color scheme is consistent across the entire ViSFA system.

At the temporal event level, the previous version of ViSFA uses a slider bar to let users filter events by their contribution. Based on users’ feedback, the improved ViSFA provides a histogram of normalized contribution, where the horizontal axis represents contribution ranges from 0 to 1 in steps of 0.1. Because the distributions of attribution scores can vary widely across RNN models, ViSFA provides another mode to easily filter top-contributing events. In this mode, the x-axis represents the percentiles, and the y-axis represents the corresponding ranges of the attribution score, as shown in Figure 3(b). Users click the switch at the top of this view to toggle between the two modes. The histogram shows the contribution distribution of temporal events, while the percentile mode illustrates what attention range is covered by each percentile.
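The two filtering modes can be illustrated with a short sketch. The function names, the band boundaries, and the top-percent helper are assumptions made for illustration; the text only specifies a histogram over normalized contribution with a 0.1 step and a percentile view that maps percentiles to attribution ranges.

```python
def contribution_histogram(scores, n_bins=10):
    """Histogram mode: count normalized scores (0..1) in equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        counts[idx] += 1
    return counts


def percentile_ranges(scores, n_bands=10):
    """Percentile mode: (min, max) attribution score covered by each percentile band."""
    ordered = sorted(scores)
    n = len(ordered)
    ranges = []
    for b in range(n_bands):
        lo = b * n // n_bands
        hi = max(lo + 1, (b + 1) * n // n_bands)
        band = ordered[lo:hi]
        ranges.append((band[0], band[-1]))
    return ranges


def top_contributing(scores, top_percent=10):
    """Indices of the events whose scores fall in the top `top_percent` percent."""
    k = max(1, len(scores) * top_percent // 100)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical normalized attribution scores for five temporal events.
scores = [0.05, 0.15, 0.15, 0.95, 1.0]
print(contribution_histogram(scores))
print(percentile_ranges(scores, n_bands=5))
```

The percentile mode stays usable when most scores are squeezed into a narrow range, which is where a fixed-width histogram collapses into one or two bars.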

5.5.2 Temporal Pattern View

ViSFA illustrates how feature values change over time in the temporal pattern view, as shown in Figure 3c. AttentionHeatmap [42] overlays contributing temporal events in a time-value plane. Such a design has perception issues; for example, line crossings hinder users from noticing jump-over edges. In this work, the temporal pattern view visualizes the NoRCE clustering results using stacked area charts.

As mentioned earlier, ViSFA provides noise reduction for the analysis using the elbow method. In the temporal pattern view, ViSFA should provide an interface for users to remove noise based on the elbow method. Dendrograms in the elbow method (Figure 2b) illustrate the clustering loss of each iteration by distance on the vertical axis. However, it is unnecessary for users to distinguish each individual by ID on the horizontal axis. Besides, we design ViSFA for domain analysts who may not have a computer science background. Thus, we simplify the view and use a slider bar to let users select different noise reduction levels, as shown in Figure 3c3. We compute the noise reduction level as the number of instances at the elbow point over the total number of instances.

We visually summarize the denoised instances from two classes in a juxtaposition layout for easy comparison [10]. We sort the clusters belonging to each class in a size descending order from top to bottom. In each cluster, the temporal pattern summary (Figure 3d) is composed of a horizontal bar chart on the right, an area chart on the bottom-left, and a bar chart on the top.

The horizontal bar chart shows the cluster size. For easy comparison of cluster sizes in different ranges from all clusters, the horizontal axis is log-scaled. We design the area chart to visually summarize the over-time changes in the feature values across all the instances in the cluster. The design of the area chart is inspired by boxplots that statistically depict the quantiles of a group of values. The horizontal axis represents time and the vertical axis represents the value of the selected feature. The top edge of each area connects the same quantiles horizontally for the ability to handle a large number of time steps compared to the boxplot. And the area chart is also more flexible in increasing the number of quantiles. Five area layers represent a set of predefined quantiles in Figure 3. The middle quantile is colored with the darkest intensity and the intensities decrease while the quantiles go further away from the middle (towards top and bottom). The bar chart on the top is a contribution indicator that shares the horizontal axis with the area chart. The log-scaled vertical axis denotes the number of instances contributing to the pattern in the bottom area chart.
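The data behind the area chart and the contribution indicator can be sketched as below. This is an illustrative computation only: the quantile levels, the linear interpolation rule, and the function names are assumptions, since the text states only that five area layers represent a set of predefined quantiles.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile q in [0, 1] of pre-sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def quantile_bands(sequences, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """bands[q][t] is quantile q of the feature value across instances at time step t."""
    n_steps = len(sequences[0])
    bands = {q: [] for q in quantiles}
    for t in range(n_steps):
        column = sorted(seq[t] for seq in sequences)
        for q in quantiles:
            bands[q].append(quantile(column, q))
    return bands


def contribution_counts(masks):
    """Number of instances that have a contributing event at each time step."""
    return [sum(column) for column in zip(*masks)]


# Hypothetical cluster of five instances observed over two time steps.
cluster = [[0, 10], [1, 20], [2, 30], [3, 40], [4, 50]]
print(quantile_bands(cluster, quantiles=(0, 0.5, 1)))
print(contribution_counts([[1, 0], [1, 1], [0, 1]]))
```

Connecting each quantile's values across time steps gives the top edge of one area layer, and adding more quantile levels only adds more layers, which is the flexibility the text attributes to the area chart over a row of boxplots.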

To show temporal patterns involving two features, AttentionHeatmap [42] applies a two-level hierarchical axis, which makes it hard to read values or compare trends over time. ViSFA improves the design by placing the axes for the two features symmetrically, as shown in Figure 1 in the supplementary material. The vertically symmetrical design makes it convenient to discover relationships between the two features.

5.6 Interactions

ViSFA provides a variety of interactions for users to explore data and discover insights at both the instance level and the temporal event level. For instance exploration, users can interact with the t-SNE projection view by lasso-selecting a group of instances to review their attribute distribution. Users can also click any bar(s) in the bar charts to filter the instances with the corresponding value. For example, selecting the high qualifications in the Education bar chart in Figure 3(a) results in changes in other views (e.g., a better pass/fail rate). For the exploration of temporal events, users can:

Filter by attention. The neural attention models calculate the attention of events based on their contribution to the prediction. Taking advantage of this, ViSFA lets users filter the events by their contribution. As mentioned earlier, ViSFA provides two modes for filtering the temporal events by their contribution value. In both modes, each bar is associated with a contribution range. Users can click a bar to use the corresponding contribution range to filter the temporal events in both classes. Multiple selections are enabled, and clicking a bar again deselects it. The filtering interaction results in an update in the matrix grid view and the temporal pattern view. The filter range can also be set by users via a slider bar.

image

Fig. 3: The interface of ViSFA is composed of a) an instance property view including an instance 2D projection view and an attribute view, b) a feature attribution chart and c) a summary view, which is initialized as a matrix grid view that visualizes the attribution of feature values. Clicking a matrix in the matrix grid triggers the temporal pattern view, as shown in the “contributing sequence patterns” window.

Check temporal patterns. The temporal pattern view can be triggered by clicking a cell in the matrix grid view. The cells on the diagonal correspond to temporal patterns with a single feature, and the other cells correspond to temporal patterns with two features. The temporal pattern view provides multiple interactions for users to easily compare temporal patterns in different classes/clusters. Users can hover over any component in the temporal pattern summary, and all corresponding components in all clusters and classes will respond. For example, hovering over any horizontal bar chart enables the comparison of cluster sizes among all clusters, and hovering over any time-step in the area chart triggers the tooltips in all clusters, each of which shows the feature values of all percentiles at the hovered time-step.

On the bottom of Figure 3d, we design a slider bar for the interaction. Users can adjust either side and the position of the slider bar to locate a temporal focus. The position of the slider bar relative to the total slider length indicates the selected temporal range relative to all time-steps.

Use concisely designed UI to operate complex background algorithms. ViSFA provides a few widgets in the interaction panel (Figure 3e). Noise reduction. It is challenging for automatic models to decide which records are noise across various datasets. Different decisions may lead to distinct results in the later analysis [30]. Thus, we allow users to set different levels of noise reduction. In an earlier design, we let users adjust the noise reduction level by dragging a dashed line in the line chart illustrating HAC. Based on user feedback, a slider bar indicating the level of noise reduction is preferable, as shown in Figure 3 right. The interaction panel also provides users with a button to apply the automatically estimated number of clusters. Meanwhile, the model-defined cluster number may be unsatisfactory for distinct application scenarios [16]. Users may want to explore clusters freely, so the interaction panel provides a slider bar in the middle for adjusting the number of clusters.

We verified the effectiveness of our approach with two open datasets. We trained two RNN attention models, LSTM and bidirectional LSTM (BiLSTM), for the analysis of both datasets. All models are cross-validated. The first dataset is from the education domain for studying students’ learning activities. The second dataset is from the medical domain for studying patients’ mortality. In the following subsections, we showcase the visual analytics results that help understand the RNN models’ behavior. The discovered insights also guide future decision-making towards achieving desired outcomes.

6.1 Open University Learning Analytics (OULAD)

In this use case, we validate ViSFA with the OULAD dataset [17]. We train RNN models to predict online course evaluation results: either fail or pass. The LSTM model reaches 75% accuracy, and the BiLSTM 88%. We use 180 days of interaction records from more than 1,500 students, with each day as a temporal event. Each temporal event contains 16 features. Each feature is a webpage in the online course system, and the feature values are the number of visits in one day. The webpages include the homepage, course contents, assignments, quiz, glossary, forum, supplementary contents, etc.

The feature contribution ranking results from interpreting the two RNN models indicate that “assessment questions,” “non-scored quiz,” and “forum” have the most significant contributions among all webpages. In contrast, pages such as “glossary” and “additional data” contribute the least. As one might expect, the assessments and quizzes are important in examining students’ learning achievement. The amount of a student’s interaction with the related webpages is expected to be significantly correlated with the course’s final performance. In addition, high participation in the forum is effective in achieving satisfying course evaluations. Studies show that peer learning can enhance learning outcomes such as increased motivation and engagement in the learning task, deeper levels of understanding, increased metacognition, the development of higher-order thinking skills and divergent thinking [4,40]. Meanwhile, pages like “glossary” and “additional data” are less directly related to a course evaluation and thus rank lower.

image

Fig. 4: Temporal pattern summaries for feature “assessment questions.”

Knowing which features have the highest contribution is meaningful in tasks such as feature selection, but less critical in guiding future decision-making. The temporal pattern summary helps to discover contributing temporal patterns, and the contribution bar chart on the top quantifies the contribution for each time-step. We pick the feature “assessment questions” as an example to showcase the temporal pattern analysis results. Figure 4 illustrates the effectiveness of the designed algorithm and visualization in revealing the underlying contributing temporal patterns using two RNN models.

Figure 4a shows the temporal pattern summary for the raw data. The real-world data is sparse and noisy. We can hardly see any patterns within either class due to the data sparsity: zeros “occupy” so many quantiles that only the top quantile is visible. Theoretically, if a group of students has similar learning behavior over time, the area chart is expected to show similar fluctuations across all percentiles. Figure 4a therefore indicates no shared pattern among the students in the raw data. In addition, the data noise introduces many spikes on the top edge of the area chart. Because raw data covers all events, the contribution indicators simply show the number of all instances for all time steps.

The visual summaries of temporal patterns learned by the LSTM and BiLSTM models are shown in the bottom four sub-figures of Figure 4. The visualization results for both models indicate a single dominant cluster; the other clusters contain only one or two instances when the number of clusters is increased manually. Consistently, the cluster number estimation outputs “one” for both models. This indicates that the RNN models recognize a mutual pattern from all instances for both the fail and pass classes.

image

Fig. 5: Gradual changes during the training process, from epoch 01 to epoch 24 (best performance).

Figure 4b and Figure 4d show the visual summaries of the temporal events that have the top 10 percent contribution to the prediction. Figure 4c and Figure 4e are the final results produced by the NoRCE algorithm. Comparing the results before and after applying NoRCE, we can see that the NoRCE algorithm helps clean noisy results produced by the RNN attention models. Specifically, the noise reduction mainly removes the events of low contribution (short bars in the bar charts). The remaining temporal patterns and contribution patterns are smoother, which makes it easier for users to discover insights.

Figure 4c shows a decreasing contribution over time for both classes, which means the temporal sequence patterns in the early time steps have higher contributions. Both area charts show relatively consistent over-time fluctuations among the percentiles. As demonstrated in the video, a user further explores this result in more detail. She adjusts the temporal focus to the first month and slides it rightwards to review the patterns in the next three months. The results in the pass class show no significant change in the values for all percentiles. In contrast, the values of the fail class decrease significantly after the second month. These signals indicate that a student’s interaction with the assessment questions in the first three months has a higher contribution to their course evaluation results, and that persistent interaction is important in achieving the desired course performance.

Figure 4e shows the temporal pattern summary for the BiLSTM model. The area chart and the contribution indicator show that the BiLSTM model successfully captures more common patterns among the students. Especially for the fail class, the contribution indicators show that the number of contributing instances is close to the total number of instances. The pattern in the area chart is all zeros at all time-steps, which indicates that failing to interact with assessment questions from day 30 to day 80 is strongly correlated with poor course performance. Interestingly, the results learned by BiLSTM are more contrastive between the two classes. The fail class shows an extremely certain pattern due to the high contributions over a long time period, while the pass class does not show highly contributing temporal patterns, as the contributions in most time-steps are less than ten.

Model behavior in the learning process. Figure 5 shows how the temporal summary of the BiLSTM model changes over training epochs, with the accuracy shown in parentheses. As in Figure 4e, the area charts from epoch 03 to epoch 24 show zeros for all time-steps and are thus omitted. We can see that the model gradually searches for important time-steps among all time-steps during the learning process for both classes. For the fail class in the left column, the top edges of the bars in the contribution indicators connect into smooth line shapes. The contribution indicator at the first epoch shows a more evenly distributed pattern over time, as the learning has just started. The contribution indicators become more concentrated in certain time periods later in the process. The pass class in the right column also shows an evenly distributed pattern at the first epoch. At epochs 03 and 09, the pass class shows very similar patterns to the fail class. However, the pattern in the pass class becomes very different from the fail class as the learning continues in the later stages. This illustrates that the model captures more common patterns from the instances in the fail class to distinguish the two classes at the final step.

6.2 EHR Data

The development of ViSFA was first motivated by the needs of ICU patients’ mortality analysis. We validate ViSFA with the MIMIC [12] dataset. We train the RNN models using four categories of patients’ medical records: ICU stay, inputs, procedures, and lab results. The records contain 37 temporal features, as shown in Table 1 in the supplemental material. The models use patients’ records in the first 48 hours after admission to predict their mortality after 48 hours. The analysis task is to find which features and values contribute most to mortality predictions and how the temporal patterns in contributing feature values differ between mortality prediction classes. The preprocessed dataset contains 14K patients (instances) with 1.2 million temporal events, where the alive/dead patient ratio is 10:1. We train RNN models using the oversampling technique so that the interpretation is based on balanced data. Then 10% of the data is sampled in the stratified sampling procedure.
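The two data preparation steps mentioned above can be sketched as follows. This is an illustration under assumptions rather than the pipeline used in the study: random duplication is assumed as the oversampling technique, the prediction class is assumed to be the stratification variable, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def oversample_minority(instances, labels, seed=0):
    """Randomly duplicate instances of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst, lab in zip(instances, labels):
        by_class[lab].append(inst)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for lab, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((inst, lab) for inst in items + extra)
    return balanced


def stratified_sample(pairs, fraction=0.1, seed=0):
    """Draw the same fraction from each class so class proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst, lab in pairs:
        by_class[lab].append(inst)
    sample = []
    for lab, items in by_class.items():
        k = max(1, round(len(items) * fraction))
        sample.extend((inst, lab) for inst in rng.sample(items, k))
    return sample


# Hypothetical instance IDs with the 10:1 alive/dead ratio described in the text.
ids = list(range(110))
labels = ["alive"] * 100 + ["dead"] * 10
balanced = oversample_minority(ids, labels)
subset = stratified_sample(balanced, fraction=0.1)
print(len(balanced), len(subset))
```

Sampling per class keeps both prediction classes represented in the 10% subset, which the later side-by-side comparison of classes relies on.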

The visualization results illustrate that most top-ranked features are “inputs,” i.e., any fluids that have been administered to the patients. The results indicate that “inputs” are more important in keeping patients alive than “ICU events,” “procedures,” and “lab tests.” According to the domain scientist, the result is reasonable, as fluids affect the cardiovascular, renal, gastrointestinal, and immune systems in critical illness treatment [23].

The features with the highest contribution are “drips,” “antibiotics (Non IV),” and “prophylaxis.” Taking “antibiotics (Non IV)” as an example, the raw data does not show any common pattern among the data quantiles. However, the contributing temporal pattern summary in ViSFA (Figure 2 in the supplemental material) for “antibiotics (Non IV)” shows contrastive temporal patterns for the “dead” and “alive” classes. From the visualization results, we notice several phenomena. First, for the LSTM model, the contributing temporal events for the “dead” class show higher antibiotic levels throughout the entire time period compared to the alive class. The BiLSTM model shows a consistent result, except that it suggests only the last two time-steps have an extremely large contribution. According to the domain scientist, combining antibiotics is a strategy often used by clinicians, but recent research shows that such a strategy can lead to resistance [6]. There is also research demonstrating that deploying synergistic antibiotics can, in practice, be the worst strategy if bacterial clearance is not achieved after the first treatment phase [28], which may explain the phenomenon. Second, the antibiotic dosage for both classes fluctuates over time, but their temporal patterns show significant differences. For instance, the alive class features a relatively steadier fluctuation than the dead class, which shows more significant peaks and bumps. Specifically, the first large drop in value happens at 4h for the alive class but at 8h for the dead class. Both classes show fluctuating patterns around 24h (±2h). However, the alive class shows a steadier change afterward, while the dead class shows two significant peaks in the next six hours before the dosage becomes steady. The domain scientist who participated in the case study indicates that she is highly interested in the visualized result.
She considers the comparative temporal pattern visualization informative in helping to understand what evidence the AI uses to make predictions. Although she cannot verify the cause of antibiotic dosage changes at every time point without further experiments, she states that the visual summaries could potentially reveal critical time points in the patient treatment process. Besides, she verifies that the suddenly increased values in the last two hours for the dead group are highly reasonable because the signal indicates that previous antibiotics were not effective enough.

During the exploration of feature attribution for two fundamentally different datasets, we found a common phenomenon: there is always a dominant, concentrated cluster. Increasing the cluster number hardly changes the domination, and only small clusters appear at the bottom, as shown in Figure 3. The NoRCE computation results also confirm this phenomenon, as the estimated cluster number is one in most experiments. We consider the mutual temporal patterns in the dominant cluster to be the important facts that contribute to distinguishing the two prediction classes. These facts can be complex, as illustrated in the EHR data analysis, where the domain expert suggests that more longitudinal experiments are required to verify their reliability. The facts can also be straightforward to understand and explain, as shown in the OULAD data analysis. However, we cannot conclude that RNN models always learn one dominant pattern for each feature without abundant experiments in various application domains. Therefore, we design the temporal view so that it can also visualize multiple large clusters.

Closed-loop training. In the OULAD example, we initially trained an LSTM model using all 285 days. The contribution visualization shows that only the first 180 days have a high contribution to distinguishing the two prediction classes. We then retrained the model with only the first 180 days of the data, which improved the accuracy by 4 percent. This demonstrates that the visualization results derived from ViSFA can help develop closed-loop RNN model training.

Causality vs correlation. Although our approach can visualize features and their over-time patterns contributing to class predictions, it is not justified to treat top-ranked features and their over-time patterns as the cause of the predicted class. The learned attribution reflects which features/values in the training dataset are significantly correlated with the prediction, whereas causation indicates that the labeled class is the result of the features, which is a stronger statement to prove and beyond the scope of our discussion. However, although it is insufficient for causal analysis, feature attribution analysis provides evidence for further causal analysis. For example, students’ behavior learned by the RNN cannot be proved to cause their course evaluation results without further controlled experiments [27]. But our approach uncovers important pages and students’ interactive temporal patterns that are correlated with the course evaluation results. Similarly, the antibiotics (Non-IV) doses applied to the patients who died after 48 hours are not necessarily the cause of their death. But the analysis provides statistically supportive evidence that different doses of antibiotics (Non-IV) are correlated with patients’ mortality.

The ViSFA system uses a modular design where the stratified sampling module supports scalable and comparative analysis, and the NoRCE module performs noise reduction and temporal pattern summarization. ViSFA also provides a systematic framework containing these modules to motivate more discussion on probing RNN classifiers’ behavior at the value level. As discussed in the related work section, the proposed analysis can be applied to a broad range of analysis scenarios, and the noise removal and pattern summarization modules are two requisite steps in the analysis.

As deep learning is pervasively used in decision-making tasks for multi-dimensional sequential data analysis, it is essential to understand contributing features and temporal patterns for predictions. In this work, we present ViSFA, the first visual analytics system that scalably summarizes value-level feature attributions with recurrent neural attention networks. We test ViSFA with two real-world datasets, each using two RNN models. The case study results demonstrate that ViSFA can 1) help distill contributing patterns for different RNN models of different prediction performances, 2) reveal gradual changes in the RNN model learning process, and 3) help effectively reason about value-level feature attribution for different application domains. The visual summaries of temporal patterns in feature attribution also provide guidelines for making future decisions. We hope our work will motivate further research in developing domain-user-oriented analysis systems with deep learning.

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

[2] W. Bao, J. Yue, and Y. Rao. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE, 12(7):e0180944, 2017.

[3] A. Bilal, A. Jourabloo, M. Ye, X. Liu, and L. Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2018.

[4] P. C. Blumenfeld, R. W. Marx, E. Soloway, and J. Krajcik. Learning with peers: From small group cooperation to collaborative communities. Educational Researcher, 25(8):37–39, 1996.

[5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Proceedings of Annual Conference on Neural Information Processing Systems, pp. 3504–3512. Curran Associates, Inc., 2016.

[6] G. D. Wright. Antibiotic adjuvants: Rescuing antibiotics from resistance. Trends in Microbiology, 24(11):862–871, 2016.

[7] Y. Ding, Y. Liu, H. Luan, and M. Sun. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1150–1159, 2017.

[8] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. In eprint arXiv:1702.08608, 2017.

[9] P. E. Rauber, S. Fadel, A. Falco, and A. Telea. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics, 23:101–110, 2016.

[10] M. Gleicher, D. Albers, R. Walker, I. Jusufi, C. D. Hansen, and J. C. Roberts. Visual comparison for information visualization. Information Visualization, 10(4):289–309, 2011.

[11] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 2015.

[12] A. E. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[13] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. Chau. Activis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2017.

[14] A. Karpathy, J. Johnson, and F. Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.

[15] D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6):441–458, 1996.

[16] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee. A cluster validation index for gk cluster analysis based on relative degree of sharing. Information Sciences, 168(1–4):225–242, 2004.

[17] J. Kuzilek, M. Hlosta, and Z. Zdrahal. Open university learning analytics dataset. Scientific Data, 4(170171), 2017.

[18] B. C. Kwon, M. Choi, J. T. Kim, E. Choi, Y. B. Kim, S. Kwon, J. Sun, and J. Choo. RetainVis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Transactions on Visualization and Computer Graphics, 25(1):299–309, 2018.

[19] J. Li, X. Chen, E. Hovy, and D. Jurafsky. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the NAACL: Human Language Technologies, pp. 681–691. Association for Computational Linguistics, San Diego, California, 2016.

[20] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Proceedings of Annual Conference on Neural Information Processing Systems, pp. 4765–4774, 2017.

[21] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[22] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[23] C. Martin, A. Cortegiani, C. Gregoretti, I. Martin-Loeches, C. Ichai, M. Leone, G. Marx, and S. Einav. Choice of fluids in critically ill patients. BMC Anesthesiology, 18(1):200, 2018.

[24] Y. Ming, S. Cao, R. Zhang, Y. Li, Y. Chen, Y. Song, and H. Qu. Understanding hidden memories of recurrent neural networks. In Proceedings of 2017 IEEE Conference on Visual Analytics Science and Technology, pp. 13–24, 2017.

[25] Y. Ming, P. Xu, H. Qu, and L. Ren. Interpretable and steerable sequence learning via prototypes. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[26] C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2(11):e7, 2017.

[27] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd ed., 2009.

[28] R. Pena-Miller, D. Laehnemann, G. Jansen, A. Fuentes-Hernandez, P. Rosenstiel, H. Schulenburg, and R. Beardmore. When the most potent combination of antibiotics selects for the greatest bacterial load: The smile-frown transition. PLoS Biology, 11(4):e1001540, 2013.

[29] D. Quang and X. Xie. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research, 44(11):226, 2016.

[30] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[31] P. E. Rauber, S. G. Fadel, A. X. Falcão, and A. C. Telea. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics, 23(1):101–110, 2017.

[32] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.

[33] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53–65, 1987.

[34] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

[35] H. Sheil, O. Rana, and R. G. Reilly. Predicting purchasing intent: Automatic feature learning using recurrent neural networks. In Proceedings of the SIGIR 2018 Workshop On eCommerce co-located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018.

[36] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3145–3153. Proceedings of Machine Learning Research, 2017.

[37] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv, 2016.

[38] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

[39] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24:667–676, 2016.

[40] M. Thomas. Learning within incoherent structures: The space of online discussion forums. Journal of Computer Assisted Learning, 18(3):351–366, 2002.

[41] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistical Society, 63(2):411–423, 2000.

[42] C. Wang, T. Onishi, K. Nemoto, and K. Ma. Visual reasoning of feature attribution with deep recurrent neural networks. In IEEE International Conference on Big Data, Big Data 2018, pp. 1661–1668, 2018.

[43] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 2048–2057. JMLR, 2015.

[44] T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.

[45] J. Zhang, Y. Wang, P. Molino, L. Li, and D. S. Ebert. Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 25(1):364–373, 2019.

