In recent years, AI methods, especially machine learning with various directions and algorithms [25, 26], have become more and more successful in a wide range of areas like computer vision, natural language processing, and robotics, among others. Consider e.g. AlphaZero surpassing human-level performance in playing chess and Go. During its self-play training process, AlphaZero discovered a remarkable level of Go knowledge. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional human Go knowledge [27], exemplifying the potential of these methods to discover strategies previously unknown even to experts of the domain. However, studies from various applications such as [28, 29, 30, 3] have revealed that learning machines can also result in “Clever Hans”-like moments, i.e., human-undesired strategies where the machine exploits artifacts in the dataset.
To “un-Hans” machines, we introduced the novel learning setting of “explanatory interactive learning” (XIL) and illustrated its benefits. XIL adds the scientist into the training loop. She interactively revises the original model by providing feedback on its explanations, which is used to automatically augment the training data with counterexamples or to modify the model using rrr. Our experimental results demonstrate that users care strongly about “Clever Hans”-like moments in machine learning and that XIL can indeed help avoid them.
There are several possible avenues for future work to overcome the current limitations of XIL. Acquiring annotations, especially of explanations, can be time-consuming. The number of interactions required to reach an acceptable state is an open issue [16]. Hence, one should work on optimal query strategies for XIL that aim at minimizing the interaction effort. Adapting regret bounds from coactive learning [11] might be an interesting alternative. Moreover, the data at hand may not always allow XIL to fully alleviate wrong reasons without decreasing the network’s predictive performance. One should develop ways of keeping this drop as small as possible. Furthermore, XIL relies on two assumptions, namely, (a) faithful explanations can be computed, and (b) the user feedback is faithful, too. Assumption (a) is still subject to very active research, particularly for deep learning methods [31] (see the supplement). One should improve the quality and robustness of XAI methods and also explore XIL for interpretable models [32]. If the user is rather confident about the right reasons, learning-to-explain methods such as hint provide an interesting avenue for future work. Our initial results (see the supplement) are encouraging. However, even scientific experts do not always know the reasons for predictions. Therefore, one should strive to better understand the effects of wrong feedback and even adversarial attacks [33] on XIL. Additionally, one should turn other interactive learning settings such as coactive [11], active imitation [18], mixed-initiative interactive [19] and guided probabilistic learning [34] into explanatory ones. Lastly, because it is not yet clear what makes explanations good for humans [35], one should extend explanatory interactions towards using alternative explanations, multiple modalities and counterfactuals [36, 37].
In any case, interacting with explanations of machine learning models is an enabler for scientific discoveries for humans and machines in cooperation.
Active learning. The active learning paradigm targets scenarios where obtaining supervision has a non-negligible cost. Here we cover the basics of pool-based active learning, and refer the reader to two excellent surveys [38, 39] for more details. Let $\mathcal{X}$ be the space of instances and $\mathcal{Y}$ be the set of labels (e.g. $\mathcal{Y} = \{\pm 1\}$). Initially, the learner has access to a small set of labeled examples $L \subseteq \mathcal{X} \times \mathcal{Y}$ and a large pool of unlabeled instances $U \subseteq \mathcal{X}$. The learner is allowed to query the label of unlabeled instances (by paying a certain cost) to a user functioning as an annotator, often a human expert. Once acquired, the labeled examples are added to $L$ and used to update the model. The overall goal is to maximize the model quality while keeping the number of queries or the total cost at a minimum. To this end, the query instances are chosen to be as informative as possible, typically by maximizing some informativeness criterion, such as the expected model improvement [40] or practical approximations thereof. By carefully selecting the instances to be labeled, active learning can enjoy much better sample complexity than passive learning [41, 42]. Prototypical active learners include max-margin [43] and Bayesian approaches [44]; recently, deep variants have been proposed [45]. However, active learning (showing query data points) and even coactive learning (additionally showing the prediction for the query data point) do not establish trust: informative selection strategies just pick instances where the model is uncertain and likely wrong. There is a trade-off between query informativeness and user “satisfaction”, as noticed and explored in [46]. To properly modulate trust in the model, we argue it is essential to present explanations, e.g., visual ones as shown in Fig. 6.
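The pool-based protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-centroid model, the margin-based uncertainty score, and the `oracle` function standing in for the human annotator are hypothetical choices, not the learners used in our experiments.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy model (assumption): one centroid per class, labels are -1 / +1."""
    return {label: X[y == label].mean(axis=0) for label in (-1, 1)}

def margin(centroids, X):
    """Signed score; values near zero mean the model is uncertain."""
    d_neg = np.linalg.norm(X - centroids[-1], axis=1)
    d_pos = np.linalg.norm(X - centroids[1], axis=1)
    return d_neg - d_pos

def active_learning(X_lab, y_lab, X_pool, oracle, n_queries):
    """Pool-based loop: query the most uncertain pool instance each round."""
    X_lab, y_lab, X_pool = X_lab.copy(), y_lab.copy(), X_pool.copy()
    for _ in range(n_queries):
        model = fit_centroids(X_lab, y_lab)
        idx = int(np.argmin(np.abs(margin(model, X_pool))))  # least confident
        x = X_pool[idx]
        X_lab = np.vstack([X_lab, x])
        y_lab = np.append(y_lab, oracle(x))      # pay the labeling cost
        X_pool = np.delete(X_pool, idx, axis=0)  # remove from the pool
    return fit_centroids(X_lab, y_lab), X_lab, y_lab, X_pool

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 2))
oracle = lambda x: 1 if x[0] > 0 else -1  # hypothetical annotator
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([-1, 1])
model, X_l, y_l, X_u = active_learning(X_lab, y_lab, X_pool, oracle, n_queries=10)
```

Note that the queried instances are exactly those the model is least sure about, which is why showing them alone does little to establish trust.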
Local explainers. There are two main strategies for explaining machine learning models. Global approaches aim to explain the model by converting it as a whole to a more interpretable format [7, 47]. Local explainers instead focus on the arguably more approachable task of explaining individual predictions [9]. While explanatory interactive learning can accommodate any local explainer, in our implementations we used either lime [8] or grad-Cam [20], both described next.
Figure 6: grad-Cams of a hyperspectral sample with spatial and spectral explanations of a corrected network. The leftmost image shows the sample, followed by the corresponding spatial activation maps mapped to four different hyperspectral areas. The areas are 380-537 nm, 538-695 nm, 696-853 nm and 854-1010 nm.
The idea of lime (Local Interpretable Model-agnostic Explanations) is simple: even though a classifier may rely on many uninterpretable features, its decision surface around any given instance can be locally approximated by a simple, interpretable local model. In lime, the local model is defined in terms of simple features encoding the presence or absence of basic components, such as words in a document or objects in a picture. While not all problems admit explanations in terms of elementary components, many of them do [8]; in this case, lime assumes these to be provided in advance. An explanation can be readily extracted from such a model by reading off the contributions of the various components to the target prediction and translating them into an interpretable visual artifact. For instance, in document classification one may highlight the words that support (or contradict) the predicted class.
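To make the idea of a local surrogate concrete, the following sketch fits a weighted linear model on random presence/absence masks of the components. It is a simplification and not the lime implementation: the kernel width, the distance measure and the hypothetical `black_box` function are assumptions made for illustration, and no sparse feature selection is performed.

```python
import numpy as np

def local_surrogate(predict, n_components, n_samples=500, seed=0):
    """Fit a weighted linear model around the instance with all components present."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, n_components))  # presence/absence masks
    Z[0] = 1                                                # the unperturbed instance
    y = np.array([predict(z) for z in Z], dtype=float)
    # Perturbations that keep more components are closer and get a larger weight.
    distance = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(distance ** 2) / 0.25)               # assumed kernel width
    # Weighted least squares with an intercept column.
    Zb = np.hstack([Z, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(Zb * sw, y * sw[:, 0], rcond=None)
    return coef[:-1]                                        # per-component contributions

# Hypothetical black box in which only components 0 and 3 matter.
black_box = lambda z: 2.0 * z[0] - 1.0 * z[3]
contrib = local_surrogate(black_box, n_components=5)
top = np.argsort(-np.abs(contrib))[:2]  # the two most influential components
```

Reading off `contrib` corresponds to the step of extracting the contributions of the components; highlighting the entries in `top` would be the visual artifact shown to the user.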
grad-Cams are a generalization of Class Activation Maps (introduced by [48]) and take advantage of two facts: first, deeper layers of a CNN capture higher-level visual constructs and, second, convolutional features retain spatial information. As such, the last convolutional layer represents a trade-off between high-level visual representation and spatial information. Specifically, a grad-Cam is computed by forward passing an image through the network and then backpropagating a one-hot vector that specifies the class label of interest up to the last convolutional layer. The resulting gradients of each channel are global average pooled, multiplied with the corresponding feature maps, summed and finally passed through a ReLU activation function. In this way, the final feature maps of the convolutional feature extractor are weighted by the importance of these features. The resulting two-dimensional heatmap can finally be interpolated to the original input size for visualization. If a 3D convolutional network is used to classify hyperspectral data, the resulting heatmap is three-dimensional and also shows activations along the spectral dimension of the data, cf. Fig. 6.
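The pooling, weighting and ReLU steps can be written down directly once the feature maps and their gradients are available. In the sketch below, `feature_maps` and `gradients` are hypothetical arrays of shape (channels, height, width) that would in practice be extracted from the last convolutional layer by a deep learning framework; the nearest-neighbour upsampling is a crude stand-in for the interpolation mentioned above.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Combine last-layer feature maps (K, H, W) with their gradients (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # global average pooling per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels -> (H, W)
    return np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only

def upsample(cam, factor):
    """Nearest-neighbour upsampling (assumption; any interpolation could be used)."""
    return np.kron(cam, np.ones((factor, factor)))

# Toy example with two channels on a 2 x 2 grid.
feature_maps = np.zeros((2, 2, 2))
feature_maps[0, 0, 0] = 1.0   # channel 0 fires in the top-left cell
feature_maps[1, 1, 1] = 1.0   # channel 1 fires in the bottom-right cell
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])

cam = grad_cam(feature_maps, gradients)
heat = upsample(cam, 2)
```

In the toy example, channel 1 has a negative pooled gradient, so its activation is removed by the ReLU and only the top-left cell remains highlighted.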
Explanatory Interactive Learning with counterexamples. Why is this data augmentation a sensible idea? To see this, consider the case of linear max-margin classifiers. Let $f(x) = \langle w, x \rangle + b$ be a linear classifier over two features, $\phi_1$ and $\phi_2$, of which only the first is relevant. Fig. 7 shows that $f(x)$ (red line) uses $\phi_2$ to correctly classify a positive example $x_i$. In order to obtain a better model (e.g. the green line), the simplest solution would be to enforce an orthogonality constraint $\langle w, (0, 1)^\top \rangle = 0$. In the separable case, the counterexamples $\{\bar{x}_{i\ell}\}_{\ell=1}^{c}$ amount to additional max-margin constraints [49] of the form $y_i (\langle w, \bar{x}_{i\ell} \rangle + b) \ge 1$. The only ones that influence the model are those on the margin, for which strict equality holds. For all pairs of such counterexamples $\ell, \ell'$ it holds that $\langle w, \bar{x}_{i\ell} \rangle = \langle w, \bar{x}_{i\ell'} \rangle$, or equivalently $\langle w, \delta_{i\ell} - \delta_{i\ell'} \rangle = 0$ (where $\delta_{i\ell} = \bar{x}_{i\ell} - x_i$). In other words, the counterexamples encourage orthogonality between $w$ and the correction vectors $\delta_{i\ell} - \delta_{i\ell'}$, thus approximating the orthogonality constraint above. Most importantly, this data augmentation procedure is model-agnostic, although alternatives indeed exist: (manually) adding a discovered data artifact to samples of other classes [50], contrastive examples [51], feature ranking [52] for SVMs and constraints on the input gradients for differentiable models [4].
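The argument can be checked numerically on a two-feature toy problem. The sketch below generates counterexamples by randomizing the irrelevant feature while keeping the label; the sampling range and the two weight vectors are hypothetical choices for illustration only.

```python
import numpy as np

def make_counterexamples(x, y, irrelevant, c, rng, low=-1.0, high=1.0):
    """Copy x c times, randomize the irrelevant features, keep the label."""
    X_bar = np.tile(np.asarray(x, dtype=float), (c, 1))
    X_bar[:, irrelevant] = rng.uniform(low, high, size=(c, len(irrelevant)))
    return X_bar, np.full(c, y)

rng = np.random.default_rng(0)
x_i, y_i = np.array([1.0, 0.8]), 1  # positive example; feature 1 is irrelevant
X_bar, y_bar = make_counterexamples(x_i, y_i, irrelevant=[1], c=5, rng=rng)

w_green = np.array([1.0, 0.0])  # ignores the irrelevant feature
w_red = np.array([0.2, 1.0])    # relies on the irrelevant feature

scores_green = X_bar @ w_green
scores_red = X_bar @ w_red
spread_green = scores_green.max() - scores_green.min()
spread_red = scores_red.max() - scores_red.min()
```

A weight vector orthogonal to the randomized direction scores all counterexamples identically (`spread_green` is zero), whereas a model that uses the irrelevant feature scores them inconsistently and is therefore penalized by the additional constraints.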
We note that due to sampling, lime may output different explanations for the same prediction. To reduce the variance of the experiments with ce of Tab. 1, we ran it 10 times and retained the k components identified most often as relevant by lime.
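The aggregation over repeated runs amounts to a simple frequency count. In the sketch below, `explain_once` is a hypothetical callable that returns the components flagged as relevant in one run of the explainer; the fake explainer is only there to make the example self-contained.

```python
from collections import Counter

def stable_top_k(explain_once, k, n_runs=10):
    """Run a stochastic explainer n_runs times and keep the k most frequent components."""
    counts = Counter()
    for run in range(n_runs):
        counts.update(explain_once(run))
    return [component for component, _ in counts.most_common(k)]

# Hypothetical explainer: "a" is always relevant, "b" usually, "c" sometimes.
def fake_explainer(run):
    return ["a", "c"] if run % 3 == 0 else ["a", "b"]

top = stable_top_k(fake_explainer, k=2)
```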
fashion-MNIST dataset. The fashion-MNIST dataset, a fashion product recognition dataset, includes 70,000 images over 10 classes. All images were corrupted by introducing confounders, that is, 4 × 4 patches of pixels in randomly chosen corners whose shade is a function of the label in the training set and random in the test set (see [4] for details).
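A confounder of this kind can be injected as follows. This is a sketch under assumptions: the linear mapping from label to shade is a hypothetical choice (the exact function is described in [4]), and images are assumed to be 8-bit grayscale arrays.

```python
import numpy as np

def add_confounder(images, labels, train, rng, n_classes=10, size=4):
    """Paint a size x size patch into a random corner of each image."""
    out = images.copy()
    n, h, w = out.shape
    for i in range(n):
        # Shade tied to the label at training time, random at test time.
        cls = labels[i] if train else rng.integers(n_classes)
        shade = int(255 * cls / (n_classes - 1))  # assumed shade function
        top = rng.integers(2) * (h - size)        # 0 or h - size
        left = rng.integers(2) * (w - size)       # 0 or w - size
        out[i, top:top + size, left:left + size] = shade
    return out

rng = np.random.default_rng(0)
images = np.zeros((3, 28, 28), dtype=np.uint8)  # stand-in for fashion-MNIST images
labels = np.array([0, 3, 9])
train_images = add_confounder(images, labels, train=True, rng=rng)
test_images = add_confounder(images, labels, train=False, rng=rng)
```

Because the shade predicts the label perfectly in the training set but not in the test set, a model that relies on the patch will show a large gap between training and test accuracy.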
PASCAL VOC 2007 dataset. We used a subset of the PASCAL VOC 2007 dataset in our experiment. This subset includes 1470 training and 782 test images over 5 classes (horse, cat, bird, bus, dog). Only samples from the horse class contain confounding features, i.e. watermark text. We rescaled all images to 224 × 224 × 3 to use the VGG-16 network [53] as classifier, and we used the ImageNet-pre-trained weights as initial weights, as well as the ADAM optimizer [54]. We trained a default model without user feedback and a model with user feedback for 2k epochs. The explanation method was instantiated with input gradients (IG).
Figure 7: Mathematical intuition for the counterexample strategy, exemplified for linear classifiers. Two data features are shown, $\phi_1$ and $\phi_2$, of which only the first is truly relevant. Left: The positive example $x_i$ is not enough to disambiguate between the red and green classifiers. Middle: Counterexamples $\bar{x}_{i\ell}$ are obtained by randomizing the irrelevant feature while keeping the label of $x_i$. The counterexamples approximate a (local) orthogonality constraint. Right: The red classifier is inconsistent with the counterexamples and eliminated. See the Methods section Explanatory Interactive Learning with counterexamples for details. (Best viewed in color).
Sample collection. To demonstrate the significance of XIL, we applied it to deep plant phenotyping and plant disease detection, a growing and relevant field of research [55, 56, 57, 58, 59, 60]. To this end, we recorded a scientific, real-world dataset: a plant phenotyping dataset consisting of RGB and hyperspectral (HS) images of healthy and diseased sugar beet leaves. Then, we applied convolutional neural networks to classify the plants’ leaves into the categories control (healthy) and inoculated (diseased) and investigated the underlying reasons for the network’s predictions. As a model disease, Cercospora leaf spot (CLS) was used. This is caused by Cercospora beticola and is the most destructive leaf disease of sugar beet with worldwide economic importance.
The dataset used in this study corresponds to HS and RGB images of leaf discs of sugar beet cv. Isabella (KWS, Einbeck, Germany) inoculated with Cercospora beticola. Sugar beet seeds were pre-grown in small pots and piqued after the primary leaves were fully developed. The seedlings were then transferred into plastic pots (diameter of 17 cm) on commercial substrate (Topfsubstrat 1.5, Balster Erdenwerk, GmbH, Sinntal-Altengronau, Germany) under greenhouse conditions and watered as necessary. After reaching growth stage 16 according to BBCH scale [61] the plants were inoculated with C. beticola conidia which were collected from infested sugar beet leaves after incubation in a moist chamber for 48 hours. A spore suspension of 5 10
was sprayed onto leaves before the plants were transferred into plastic bags to achieve 100% RH for 48 hours. For image acquisition, leaf discs were stamped out with a cork borer of 2 cm diameter and placed on 10 g/l phytoagar (Duchefa Biochemie B.V, Haarlem, Netherlands), containing 0.34 mM benzimidazole, 10 g sucrose and 3 mg kinetin. To observe different symptom classes, sugar beet leaves of 9, 14 and 19 days after inoculation (dai) were used, since the first symptoms appeared 9 dai. As a control group, 18 leaf discs of untreated sugar beet plants were measured as well, and five technical replications with 6 discs each were used for each symptom group.
Each sample, both control and inoculated, was measured daily over five consecutive days such that a sample from 9 dai reappears four further times in the dataset as 10 to 13 dai. A few samples were discarded due to technical issues. The percentage of healthy leaves to unhealthy leaves was approximately 26% to 74%, respectively. For image acquisition, leaf discs on agar were placed on a linear stage at a distance of 53 cm to a Hyperspec VNIR E-series imaging sensor (Headwall Photonics, Bolton, MA, USA) in the range of 380 nm to 1010 nm. The VNIR sensor has a spectral resolution of 2-3 nm and a pixel pitch of 6.5 µm. The sensor was surrounded by eight lamps (Ushio Halogen Lamp J12V-150WA/80 (Marunouchi, Chiyoda-ku, Tokyo, Japan)) and the distance between lamps and leaves was 60 cm with a vertical orientation of 45°. Exposure times of 44 ms were used for the VNIR sensor.
The dataset consists of 2410 samples with 504 samples labeled as control and 1906 labeled as inoculated. Control samples were not re-used as inoculated samples. The collected hyperspectral raw data size was around 4TB. After preprocessing the data by cutting out the leaf discs into hyperspectral cubes the data size is 140 GB. Since there is a lot of redundancy in the wavelength resolution, we further sub-sampled the depth of the data cubes resulting in a final data size of 32GB.
Data preparation. As mentioned above, each sample was imaged over five consecutive days such that each sample, though slightly differing from day to day, is represented up to 5 times within the full dataset. In this way, a sample from 9 dai would occur for 4 further days (10-13 dai). To prevent the models from memorizing the structure of the individual leaf samples and correlating this to the corresponding labels, a precaution was taken to exclusively contain all days of one sample either in the training or validation dataset.
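Such a precaution corresponds to a split by sample identity rather than by image. A minimal sketch, assuming each image carries a hypothetical `sample_id` and borrowing the 0.75 training fraction used for the classifiers:

```python
import random

def grouped_split(sample_ids, train_fraction=0.75, seed=0):
    """Assign all images of one physical sample to the same side of the split."""
    groups = sorted(set(sample_ids))
    random.Random(seed).shuffle(groups)
    n_train = int(round(train_fraction * len(groups)))
    train_groups = set(groups[:n_train])
    train_idx = [i for i, g in enumerate(sample_ids) if g in train_groups]
    val_idx = [i for i, g in enumerate(sample_ids) if g not in train_groups]
    return train_idx, val_idx

# Toy example: 8 leaf discs, each imaged on 5 consecutive days.
sample_ids = [leaf for leaf in range(8) for day in range(5)]
train_idx, val_idx = grouped_split(sample_ids)
train_leaves = {sample_ids[i] for i in train_idx}
val_leaves = {sample_ids[i] for i in val_idx}
```

Splitting at the image level instead would let the same leaf disc appear on both sides, so that memorized leaf structure could inflate the validation accuracy.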
Removing confounders for the scientific dataset. It is essential to maintain the underlying assumption that the training and test data are drawn from the same distribution. If this is not the case, changes in accuracy might be due to artifacts of different data, rather than deficits of the model [62]. We applied two variations to the test samples of the HS dataset to remove the confounders: we set the background (everything but the plant tissue) (1) to the per-channel average of the non-tissue regions or (2) the per-channel average of the full images of the training data. We then evaluated the default and rrr revised CNNs on this modified test dataset. We focused here only on the HS data and model, due to the limitations of the RGB model’s performance.
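Both background replacements can be expressed with one helper. In this sketch, `tissue_mask` is a hypothetical boolean mask marking plant tissue, variant (1) is assumed to be computed per image, and the random arrays merely stand in for HS cubes.

```python
import numpy as np

def replace_background(cube, tissue_mask, fill=None):
    """cube: (H, W, C) image; tissue_mask: (H, W) bool, True on plant tissue."""
    out = cube.astype(float).copy()
    background = ~tissue_mask
    if fill is None:
        # Variant (1): per-channel average of the non-tissue region (assumed per image).
        fill = out[background].mean(axis=0)
    out[background] = fill
    return out

rng = np.random.default_rng(0)
cube = rng.normal(size=(4, 4, 3))    # stand-in for one test sample
train = rng.normal(size=(6, 4, 4, 3))  # stand-in for the training images
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                # tissue in the centre

variant_1 = replace_background(cube, mask)
# Variant (2): per-channel average of the full training images.
train_mean = train.mean(axis=(0, 1, 2))
variant_2 = replace_background(cube, mask, fill=train_mean)
```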
RGB/HS classification. The RGB images used for training the classifiers were generated from the hyperspectral data, by slicing the data at the corresponding RGB channels that were provided by the camera system (cf. Fig. 1 (A-Right)). Before training the RGB classifiers, the data was standard scaled following $z = (x - u)/s$, where $u$ is the mean and $s$ the standard deviation of the training samples.
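As a brief sketch of this scaling step, the statistics are estimated on the training samples only and then applied to any data. Computing them per channel is an assumption made here; the random arrays stand in for the RGB images.

```python
import numpy as np

def fit_scaler(train):
    """train: (N, H, W, C). Returns per-channel mean u and standard deviation s."""
    u = train.mean(axis=(0, 1, 2))
    s = train.std(axis=(0, 1, 2))
    return u, s

def standard_scale(x, u, s):
    """Apply z = (x - u) / s with statistics taken from the training data."""
    return (x - u) / s

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(8, 4, 4, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(2, 4, 4, 3))
u, s = fit_scaler(train)
z_train = standard_scale(train, u, s)
z_test = standard_scale(test, u, s)  # test data reuses the training statistics
```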
To train a classifier on the RGB images of sugar beet leaves we used a VGG-16 [53] network pre-trained on ImageNet [63] to finetune the network parameters using the RGB plant images. For training a batch size of 32, a learning rate of 1e-4 and a step learning rate scheduler set to reduce the learning rate at epochs 5 and 15 by a factor of 0.1 were used. Furthermore, the ADAM optimizer was used with L2 regularization 1e-5. Five separate cross-validation folds were trained until convergence, using a data split of 0.75 for training and 0.25 for testing. Convergence was reached after 30 epochs.
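The step schedule can be stated explicitly. The sketch below assumes zero-indexed epochs and that each reduction takes effect at the milestone epoch itself, which is one common convention and may differ from the exact behaviour of the scheduler used.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(5, 15), gamma=0.1):
    """Learning rate at a given epoch for a step schedule with the stated milestones."""
    drops = sum(epoch >= m for m in milestones)  # number of reductions applied so far
    return base_lr * gamma ** drops

schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under these assumptions the rate is 1e-4 for epochs 0 to 4, 1e-5 for epochs 5 to 14, and 1e-6 from epoch 15 until convergence at 30 epochs.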
To classify the HS data we trained a convolutional neural network (CNN) architecture with batch normalization using 3D convolution filters, rather than standard 2D filters, so that features are learned not only along the image dimensions but also along the spectral dimension. The network is built from four residual blocks, each containing one to three convolutional layers. The last two layers are fully connected layers with a final softmax activation function. The other layers use ReLU activations. During training we used dropout to prevent overfitting. The network’s parameters were trained with a stochastic gradient descent optimizer with momentum using a batch size of 10 HS images, a learning rate of 1e-4 and an L2 regularization of 1e-5.
Five separate cross-validation folds were trained until convergence, using a data split of 0.75 for training and 0.25 for testing. Convergence was reached after 100 epochs.
Analyzing classification strategies of the model. Based on the results of [31], in which the authors performed sanity checks over a variety of saliency methods, we chose to investigate our model’s explanations using Gradient-weighted Class Activation Mapping (grad-Cam) [20].
To analyze the strategies produced by the layer-wise relevance propagation method (LRP), the authors of [3] apply spectral clustering to the resulting heatmaps in a pipeline they termed ’SpRAy’. This clustering provides an overview of the range of the model’s decision strategies. We apply SpRAy in a similar way; however, rather than using the raw grad-Cam heatmaps, we first perform a discrete Fourier transformation on them to better differentiate the strategies which we had previously identified from single samples. In detail, the pipeline is as follows:
• Perform a discrete Fourier transform on downsized grad-Cam heatmaps.
• Using the Euclidean distance for the RGB data and the Cityblock distance for the HS data compute a k-nearest neighbor graph of the Fourier transformed heatmaps, represented as an adjacency matrix, C.
• Compute the affinity matrix as suggested in [64] as $\max(C, C^\top)$.
• Perform an eigengap analysis [64] to estimate the number of clusters, k, within the dataset.
• Perform spectral clustering on the affinity matrix, given k from the previous step.
• Perform a t-SNE analysis [23] on the similarity matrix, estimated from the affinity matrix as in [3] with a free parameter in [0, 1]; here we used 0.05.
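The first four steps of the pipeline can be sketched with NumPy alone. Several details are assumptions: the Fourier features are taken to be DFT magnitudes, the k-nearest-neighbor graph is symmetrized with an element-wise maximum, the unnormalized graph Laplacian is used for the eigengap, and the toy heatmaps are synthetic. The spectral clustering and t-SNE steps are omitted.

```python
import numpy as np

def fourier_features(heatmaps):
    """Magnitude of the 2D DFT of each heatmap, flattened to a vector."""
    spectrum = np.fft.fft2(heatmaps, axes=(-2, -1))
    return np.abs(spectrum).reshape(len(heatmaps), -1)

def knn_adjacency(F, k, metric="euclidean"):
    """Binary k-nearest-neighbor graph as an adjacency matrix C."""
    diff = F[:, None, :] - F[None, :, :]
    if metric == "euclidean":
        D = np.sqrt((diff ** 2).sum(-1))
    else:  # "cityblock"
        D = np.abs(diff).sum(-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    C = np.zeros(D.shape)
    nearest = np.argsort(D, axis=1)[:, :k]
    C[np.arange(len(F))[:, None], nearest] = 1.0
    return C

def eigengap_k(A, max_k=10):
    """Estimate the number of clusters from the largest gap in the Laplacian spectrum."""
    L = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian (assumption)
    eigvals = np.sort(np.linalg.eigvalsh(L))[:max_k + 1]
    return int(np.argmax(np.diff(eigvals))) + 1

# Synthetic heatmaps with two "strategies": flat maps and checkerboard maps.
flat = np.ones((4, 4))
checker = np.indices((4, 4)).sum(axis=0) % 2 * 2.0 - 1.0
heatmaps = np.stack([s * b for b in (flat, checker) for s in (0.9, 1.0, 1.1)])

F = fourier_features(heatmaps)
C = knn_adjacency(F, k=2, metric="cityblock")
A = np.maximum(C, C.T)  # symmetric affinity (assumed symmetrization)
k_est = eigengap_k(A)
```

On this toy input the two groups of heatmaps form two disconnected components of the graph, so the eigengap heuristic suggests two clusters.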
Applying XIL to CNNs for scientific dataset. We produced the matrix A (Eq. 1) corresponding to full tissue masks for each sample. Specifically, for each sample, we created a binary mask having values of zero within the tissue and values of one everywhere else, i.e. the background. In this way during training the gradients everywhere but on the tissue are to be minimized.
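The role of the mask can be illustrated in isolation. The sketch below follows the spirit of the right-for-the-right-reasons penalty of [4], a squared norm of the explanation restricted to the masked region; it is not a reproduction of Eq. 1, and `explanation` is a hypothetical array standing in for the gradients or activation map of one sample.

```python
import numpy as np

def background_mask(tissue):
    """A = 0 on plant tissue, 1 on the background."""
    return 1.0 - tissue.astype(float)

def right_reason_penalty(A, explanation):
    """Sum of squared explanation values inside the masked (background) region."""
    return float(np.sum((A * explanation) ** 2))

tissue = np.zeros((4, 4), dtype=bool)
tissue[1:3, 1:3] = True  # tissue in the centre of the image
A = background_mask(tissue)

on_tissue = np.where(tissue, 1.0, 0.0)      # explanation focuses on the tissue
on_background = np.where(tissue, 0.0, 1.0)  # explanation focuses on the background

penalty_good = right_reason_penalty(A, on_tissue)
penalty_bad = right_reason_penalty(A, on_background)
```

An explanation confined to the tissue incurs no penalty, whereas one that attends to the background is penalized in proportion to its squared magnitude there.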
The network models were retrained from the same initial values as in the default training mode (using only the cross-entropy loss), however, now using rrr. To choose the optimal regularization strength, the resulting explanations were visually assessed. The five cross-validation folds of HS-CNN were thus trained until convergence between 200 and 280 epochs using a regularization strength of […]. For the RGB-CNN with rrr, the learning rate was reduced to a constant learning rate of 5e-05. Although we applied a range of regularization strengths from 0.1 to 1000 to the RGB-CNN, no satisfactory convergence state could be reached in which the regularized model showed acceptable explanations for each cross-validation run. The accuracies in Tab. 1 and the strategies presented in Fig. 4(b) and Fig. S.1(b) correspond to grad-Cams of training the five cross-validation folds with a regularization strength of […].
Extended related work. Using XIL with ce or rrr, users either introduce counterexamples into the dataset and thus teach the learner not to depend on the irrelevant components or directly penalize the learner as soon as it uses irrelevant components, respectively. One important advantage of XIL is that the user does not have to be certain about the right reasons and instead can explore the learned reasons of the machine, in contrast to other procedures such as preprocessing the training set.
Recently, Selvaraju et al. [21] presented a framework (hint) similar to rrr but instead of penalizing the wrong reasons it advises the network to use a specific visual area (right reasons). As ce and rrr, the hint method could be embedded within the introduced XIL framework in case the users are certain about the right reasons. However, in many scientific applications such as the presented plant phenotyping dataset users are uncertain about what a valid explanation should be. In this case, removing wrong reasons might be preferable to applying right reasons.
The possibility of bi-directional exchange between user and model due to interaction [65] also distinguishes XIL from approaches for feature selection such as feature masking and approaches that embed prior knowledge into the training process, e.g. [66]. Lastly, interactions also allow that the user can provide incomplete explanations, in other words: only if it is actually required, the user can revise incorrect aspects of a model’s explanation.
Finally, we present the XIL framework here for visual tasks and visual explanations only. With our definition of XIL, it is also applicable to other data domains like natural language processing, see e.g. [4, 16]. However, we experienced that explanations, i.e. right and wrong reasons, are more difficult to define for this modality. In future work one should generally address the best ways to present explanations, even in multi-modal scenarios.
Details on participant recruitment and study procedure. The presented study is part of an extensive thesis work [15]. It was conducted as an online survey, the link of which was distributed via the social network Facebook and the forum of the student body of the department of computer science at TU Darmstadt. Due to the distribution on these channels a wide range of people with different ages and different backgrounds was generated. Each participant completed only one of the three test conditions with 33 participants in TC1, 36 participants in TC2 and 37 participants in TC3, totaling 106 participants overall.
The wording of the original TiA was modified by replacing “system” with “artificial intelligence (AI)”. The response format to each question was a 5-point rating scale from strongly disagree to strongly agree.
Statistical analysis of the user study. Samples with missing values were removed from the analysis and for all tests a significance level with alpha being 5% was used.
For all tests with the same sample/samples, the alpha level was corrected via the Bonferroni-Holm method. The corrected alpha level will be stated for every analysis. For testing the hypotheses one multi-factorial analysis of variances (MANOVA) and several one-factorial ANOVAs were conducted. The ANOVA, as well as the MANOVA, requires normal distribution of data, independence of data as well as homogeneity of the variances. To test the latter a Levene-Test was conducted before every ANOVA and the MANOVA. Normal distribution was presumed due to the sample sizes and as the samples were drawn randomly the independence of data was also presumed. A significant result of an ANOVA / MANOVA means that at least two of the groups differ significantly with respect to the dependent variable, but it is not stated which groups differ. Therefore, if the carried out analyses of variances were significant, post hoc tests were carried out to investigate which groups differed exactly. Post hoc tests were selected in this study as the hypotheses did not point out which groups should differ, which is why every possible comparison had to be considered. For post hoc testing, the Tukey-HSD-Test and the Pairwise-Test were performed.
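For reference, the Bonferroni-Holm procedure compares the ordered p-values against successively less strict thresholds, α/m, α/(m−1), …, α, and stops at the first non-significant test. The sketch below uses hypothetical p-values, not those of the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return per-test rejection decisions and the corrected alpha levels."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    thresholds = [None] * m
    for rank, i in enumerate(order):
        thresholds[i] = alpha / (m - rank)  # alpha/m, alpha/(m-1), ..., alpha
    reject = [False] * m
    for i in order:
        if p_values[i] > thresholds[i]:
            break  # all remaining (larger) p-values are not rejected either
        reject[i] = True
    return reject, thresholds

# Hypothetical p-values of four tests on the same sample.
reject, thresholds = holm_bonferroni([0.001, 0.04, 0.03, 0.005])
```

With four tests and α = .05, the corrected levels are .0125, .0167, .025 and .05, in order of increasing p-value.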
The TiA score of subjects being familiar with AI over the whole sample (all test conditions combined) was higher (M = 2.82, SD = .64) than the TiA score of subjects being unfamiliar with AI (M = 2.51, SD = .59). As the conducted Levene-Test (F(5, 99) = 1.8, p = .12, α = .05) was not significant, the homogeneity of variance assumption held. Therefore, the MANOVA was conducted with a significant result for the independent variable test condition (F(2, 99) = 10.10, p < .001, α = .025). The MANOVA was significant for the independent variable familiarity with AI (F(1, 99) = 7.12, p = .009, α = .025). It was not significant for the interaction of the two independent variables (F(2, 99) = .28, p = .75, α = .025).
For Fig. 5(a), in order to determine which test conditions differed significantly in their TiA score, a pairwise test was conducted as a post hoc test. The pairwise test showed significant differences between TC1 and TC3 (p = .0016, α = .05) as well as between TC2 and TC3 (p = .0003, α = .05).
For Fig. 5(b) the conducted Levene-Test was not significant (F(2, 96) = .59, p = .56, α = .05). Therefore, an ANOVA was conducted afterwards and showed a significant result (F(2, 96) = 33.83, p < .001, α = .0125). Trust in the correct rule learning by the AI was significantly different between the blocks. The conducted Tukey-HSD test found a significant difference in trust in the correct rule learning only between stage 1 and 3 (p < .001, α = .05) and between stage 2 and 3 (p < .001, α = .05).
For Fig. 5(c) the Levene-Test was not significant (F(2, 104) = .28, p = .75, α = .05). The ANOVA was significant (F(2, 104) = 23.19, p < .001, α = .0167). Therefore, a Tukey-HSD test was performed to investigate which blocks differed significantly. The test found only stage 1 and 3 (p < .001, α = .05) and stage 2 and 3 (p < .001, α = .05) to differ significantly with respect to trust in correct rule learning by the AI.
For Fig. 5(d) the conducted Levene-Test was not significant (F(2, 105) = 1.32, p = .27, α = .05). The subsequently conducted ANOVA was also not significant (F(2, 105) = 1.62, p = .20, α = .05). Therefore, there was no significant difference in trust in correct rule learning by the AI in TC3 and no post hoc test was performed.
The ML benchmark Fashion-MNIST is available at https://github.com/zalandoresearch/fashion-mnist. The PASCAL VOC2007 dataset is available at http://host.robots.ox.ac.uk/pascal/VOC/voc2007/. The RGB and hyperspectral data that support the findings of this study are available at https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2278.4 and in the code repository https://codeocean.com/capsule/4559958/tree. The user study is available at https://github.com/ml-research/xil/tree/master/Trust_Study.
The code and a fully runnable capsule to reproduce the figures and results of this article, including pre-trained models, can be found at https://codeocean.com/capsule/4559958/tree.
The authors confirm to have complied with all relevant ethical regulations, according to the Ethics Commission of the TU Darmstadt (https://www.intern.tu-darmstadt.de/gremien/ethikkommisson/auftrag/auftrag.en.jsp). Informed consent was obtained from each participant prior to commencing the user study.
ST and KK thank Antonio Vergari, Andrea Passerini, Samuel Kolb, Jessa Bekker, Xiaoting Shao, and Paolo Morettin for very useful feedback on the conference version of this article. Furthermore, the authors are thankful to Frank Jäkel for support and supervision on the user study, to Cigdem Turan for providing the figure sketches, and to Ulrike Steiner and Stefan Paulus for very useful feedback. PS, AKM, AB and KK acknowledge the support by BMEL funds of the German Federal Ministry of Food and Agriculture (BMEL) based on a decision of the Parliament of the Federal Republic of Germany via the Federal Office for Agriculture and Food (BLE) under the innovation support program, project “DePhenSe” (FKZ 2818204715). WS and KK were also supported by BMEL/BLE funds under the innovation support program, project “AuDiSens” (FKZ 28151NA187). ST acknowledges the support by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. [694980] “SYNTH: Synthesising Inductive Data Models”. XS and KK also acknowledge the support by the German Science Foundation project “CAML” (KE1686/3-1) as part of the SPP 1999 (RATIO). AKM was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2070 – 390732324.
The authors declare the following competing interests: HS is employed by LemnaTec GmbH.
Affiliations
Technical University of Darmstadt, Computer Science Department, Artificial Intelligence and Machine Learning Lab, Darmstadt, Germany Patrick Schramowski, Wolfgang Stammer, Franziska Herbert, Xiaoting Shao
Technical University of Darmstadt, Computer Science Department and Centre for Cognitive Science, Darmstadt, Germany Kristian Kersting
University of Trento, Department of Information Engineering and Computer Science, Trento, Italy Stefano Teso
University of Bonn, Institute of Crop Science and Resource Conservation (INRES) – Plant Diseases and Plant Protection, Bonn, Germany Anna Brugger
Author Contributions
PS and WS contributed equally to the work. PS, WS, ST, KK designed the study. ST, KK designed and published (AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society 2019) the preliminary version of this manuscript. PS, WS, XS, ST, and KK developed extensions of the basic XIL methods. PS, WS, AB, AKM, and KK interpreted the data and drafted the manuscript. AB and PS designed the phenotyping dataset. AB and HGL carried out the phenotyping dataset measuring. PS, WS, AB did the biological analysis. FH performed and analyzed the user study. AKM and KK directed the research and gave initial input. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Patrick Schramowski and Wolfgang Stammer.
[1] Guidotti, R. et al. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51, 1–42 (2018).
[2] Gilpin, L. H. et al. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE International Conference on data science and advanced analytics (DSAA), 80–89 (2018).
[3] Lapuschkin, S. et al. Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10, 1096 (2019).
[4] Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of International Joint Conference on Artificial Intelligence, 2662–2670 (2017).
[5] Simpson, J. A. Psychological foundations of trust. Current directions in psychological science 16, 264–268 (2007).
[6] Hoffman, R. R., Johnson, M., Bradshaw, J. M. & Underbrink, A. Trust in automation. IEEE Intelligent Systems 28, 84–88 (2013).
[7] Buciluǎ, C., Caruana, R. & Niculescu-Mizil, A. Model Compression. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, 535–541 (2006).
[8] Ribeiro, M. T., Singh, S. & Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144 (ACM, 2016).
[9] Lundberg, S. & Lee, S. An unexpected unity among methods for interpreting model predictions. CoRR abs/1611.07478 (2016). URL http://arxiv.org/abs/1611.07478.
[10] Settles, B. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1467–1478 (Association for Computational Linguistics, 2011).
[11] Shivaswamy, P. & Joachims, T. Coactive learning. Journal of Artificial Intelligence Research 53, 1–40 (2015).
[12] Kulesza, T. et al. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of International Conference on Intelligent User Interfaces, 126–137 (2015).
[13] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
[14] Lin, T. et al. Microsoft COCO: common objects in context. In Proceedings of European Conference on Computer Vision, 740–755 (2014).
[15] Herbert, F. P., Kersting, K. & Jäkel, F. Why Should I Trust in AI? Master’s thesis, Technical University Darmstadt (2019).
[16] Teso, S. & Kersting, K. Explanatory interactive machine learning. In Proceedings of AAAI/ACM Conference on AI, Ethics, and Society (AAAI, 2019).
[17] Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019).
[18] Judah, K. et al. Active imitation learning via reduction to iid active learning. In AAAI Fall Symposium Series (2012).
[19] Cakmak, M. et al. Mixed-initiative active learning. ICML 2011 Workshop on Combining Learning Strategies to Reduce Label Cost (2011).
[20] Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).
[21] Selvaraju, R. R. et al. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, 2591–2600 (2019).
[22] Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[23] Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. Journal of machine learning research 9, 2579–2605 (2008).
[24] Körber, M. Theoretical considerations and development of a questionnaire to measure trust in automation. In Congress of the International Ergonomics Association, 13–30 (Springer, 2018).
[25] Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
[26] Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015).
[27] Silver, D. et al. Mastering the game of go without human knowledge. Nature 550, 354–359 (2017).
[28] Zech, J. R. et al. Confounding variables can degrade generalization performance of radiological deep learning models. CoRR abs/1807.00431 (2018). URL http://arxiv.org/abs/1807.00431.
[29] Badgeley, M. A. et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine 2, 31 (2019).
[30] Chaibub Neto, E. et al. A permutation approach to assess confounding in machine learning applications for digital health. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 54–64 (ACM, 2019).
[31] Adebayo, J. et al. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems, 9505–9515 (2018).
[32] Chen, C. et al. This looks like that: Deep learning for interpretable image recognition. In Proceedings of Advances in Neural Information Processing Systems, 8928–8939 (2019).
[33] Dombrowski, A. et al. Explanations can be manipulated and geometry is to blame. In Wallach, H. M. et al. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 13567–13578 (2019).
[34] Odom, P. & Natarajan, S. Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI 5, 56 (2018).
[35] Narayanan, M. et al. How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation. CoRR abs/1802.00682 (2018). URL http://arxiv.org/abs/1802.00682.
[36] Kanehira, A. & Harada, T. Learning to explain with complemental examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8603–8611 (2019).
[37] Huk Park, D. et al. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8779–8788 (2018).
[38] Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1–114 (2012).
[39] Hanneke, S. et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning 7, 131–309 (2014).
[40] Roy, N. et al. Toward optimal active learning through monte carlo estimation of error reduction. ICML 441–448 (2001).
[41] Castro, R. M. et al. Upper and lower error bounds for active learning. In Proceedings of Conference on Communication, Control and Computing, 2.1, 1 (2006).
[42] Balcan, M.-F. et al. The true sample complexity of active learning. Machine learning 80, 111–139 (2010).
[43] Tong, S. & Koller, D. Support vector machine active learning with applications to text classification. Journal of machine learning research 2, 45–66 (2001).
[44] Krause, A. et al. Nonmyopic active learning of gaussian processes: an exploration-exploitation approach. In Proceedings of International Conference on Machine learning, 449–456 (ACM, 2007).
[45] Gal, Y. et al. Deep bayesian active learning with image data. In Proceedings of International Conference on Machine learning, 1183–1192 (2017).
[46] Schnabel, T. et al. Short-term satisfaction and long-term coverage: Understanding how users tolerate algorithmic exploration. In Proceedings of ACM International Conference on Web Search and Data Mining, 513–521 (ACM, 2018).
[47] Bastani, O., Kim, C. & Bastani, H. Interpreting blackbox models via model extraction. CoRR abs/1705.08504 (2017). URL http://arxiv.org/abs/1705.08504.
[48] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2921–2929 (2016).
[49] Cortes, C. et al. Support-vector networks. Machine learning 20, 273–297 (1995).
[50] Anders, C. J. et al. Analyzing imagenet with spectral relevance analysis: Towards imagenet un-hans’ed. arXiv preprint arXiv:1912.11425 (2019).
[51] Zaidan, O. et al. Using “annotator rationales” to improve machine learning for text categorization. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260–267 (2007).
[52] Small, K. et al. The constrained weight space svm: learning with ranked features. In Proceedings of International Conference on Machine learning, 865–872 (Omnipress, 2011).
[53] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of International Conference on Learning Representations (2015).
[54] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations (2015).
[55] Lau, E. High-throughput phenotyping of rice growth traits. Nature Reviews Genetics 15, 778–778 (2014).
[56] de Souza, N. High-throughput phenotyping. Nature Methods 36–36 (2009).
[57] Tardieu, F., Cabrera-Bosquet, L., Pridmore, T. & Bennett, M. Plant Phenomics, From Sensors to Knowledge. Current Biology 27, R770–R783 (2017).
[58] Pound, M. P. et al. Deep machine learning provides state-of-the-art performance in image-based plant phenotyping. Gigascience 6, gix083 (2017).
[59] Mochida, K. et al. Computer vision-based phenotyping for improvement of plant productivity: a machine learning perspective. GigaScience 8, giy153 (2018).
[60] Mahlein, A.-K. et al. Quantitative and qualitative phenotyping of disease resistance of crops by hyperspectral sensors: seamless interlocking of phytopathology, sensors, and machine learning is needed! Current opinion in Plant Biology 50, 156–162 (2019).
[61] Meier, U. et al. Phenological growth stages of sugar beet (Beta vulgaris l. ssp.) codification and description according to the general bbch scale (with figures). Nachrichtenblatt des Deutschen Pflanzenschutzdienstes 45, 37–41 (1993).
[62] Hooker, S., Erhan, D., Kindermans, P. & Kim, B. A benchmark for interpretability methods in deep neural networks. In Proceedings of Advances in Neural Information Processing Systems, 9734–9745 (2019). URL http://papers.nips.cc/paper/9167-a-benchmark-for-interpretability-methods-in-deep-neural-networks.
[63] Deng, J. et al. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009).
[64] Von Luxburg, U. A tutorial on spectral clustering. Statistics and computing 17, 395–416 (2007).
[65] Abdel-Karim, B. M., Pfeuffer, N., Rohde, G. & Hinz, O. How and what can humans learn from being in the loop? 1–9 (2020).
[66] Erion, G. G., Janizek, J. D., Sturmfels, P., Lundberg, S. & Lee, S. Learning explainable models using attribution priors. CoRR abs/1906.10670 (2019). URL http://arxiv.org/abs/1906.10670.