Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework
2023
·
arXiv
ABSTRACT

While anomaly detection stands among the most important and valuable problems across many scientific domains, anomaly detection research often focuses on AI methods that can lack the nuance and interpretability so critical to conducting scientific inquiry. We believe this exclusive focus on algorithms with a fixed framing ultimately blocks scientists from adopting even high-accuracy anomaly detection models in many scientific use cases. In this application paper we present the results of utilizing an alternative approach that situates the mathematical framing of machine learning based anomaly detection within a participatory design framework. In a collaboration with NASA scientists working with the PIXL instrument studying Martian planetary geochemistry as a part of the search for extra-terrestrial life, we report on over 18 months of in-context user research and co-design to define the key problems NASA scientists face when looking to detect and interpret spectral anomalies. We address these problems and develop a novel spectral anomaly detection toolkit for PIXL scientists that is highly accurate (93.4% test accuracy on detecting diffraction anomalies), while maintaining strong transparency to scientific interpretation. We also describe outcomes from a yearlong field deployment of the algorithm and associated interface, now used daily as a core component of the PIXL science team’s workflow, and directly situate the algorithm as a key contributor to discoveries around the potential habitability of Mars. Finally we introduce a new design framework which we developed through the course of this collaboration for co-creating anomaly detection algorithms: Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP), which provides a process for scientists and researchers to produce natively interpretable anomaly detection models.
This work showcases an example of successfully bridging methodologies from AI and HCI within a scientific domain, and provides a resource in ISHMAP which may be used by other researchers and practitioners looking to partner with other scientific teams to achieve better science through more effective and interpretable anomaly detection tools.

CCS CONCEPTS

• Applied computing → Physical sciences and engineering; • Computing methodologies → Anomaly detection; • Human-centered computing → Open source software; Interactive systems and tools.

KEYWORDS

anomaly detection, human-centered machine learning, scientific computing, x-ray spectroscopy

ACM Reference Format:

Austin P. Wright, Peter Nemere, Adrian Galvin, Duen Horng Chau, and Scott Davidoff. 2023. Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework. In 28th International Conference on Intelligent User Interfaces (IUI’23), March 27–31, 2023, Sydney, NSW, Australia. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3581641.3584036

Detecting and interpreting anomalies is one of the core tasks at the heart of the scientific process [24]. However, even while novel machine learning models are rapidly improving the state of the art in anomaly detection [34], scientific applications present an interesting and complex problem for anomaly detection. Because the goal of science is often not just to detect anomalies, but to understand their causes, developing both high-accuracy and interpretable anomaly detection models becomes one of the key ways to support scientific inquiry. While research has explored how algorithms can be made to explain their reasoning [18, 26, 35], this paper investigates how HCI methods can be deeply integrated within the framing of model development to drive interpretability as a user-defined quality that is considered a first-class objective rather than a post-hoc computed explanation.

In this work we present an application of just such a design process. We engaged in a multi-year collaboration with a team of scientists at NASA who analyze data from the PIXL instrument to understand Martian geochemistry, and thus the possibility of extra-terrestrial life [3, 7, 9, 13]. Our collaboration set forth with the following research questions:

(1) RQ1: Within the specific scientific workflow of PIXL scientists, what are the particular requirements that a modeling approach must satisfy?

(2) RQ2: In what ways and to what degree does the existing standard approach to anomaly detection through machine learning satisfy and violate these requirements?

(3) RQ3: How might a different anomaly detection modeling framework enable the development of more effective systems given these requirements?

In the course of addressing these questions we explored broader approaches to anomaly detection in a scientific context and report the following four contributions:

(1) Formative design study. We present the findings of an 18-month study in which, through a series of contextual inquiry interviews, we outlined the analytic workflow of the NASA PIXL science team, found key challenges faced by scientists in detecting and interpreting spectral anomalies in X-ray fluorescence (XRF) data, and developed a comprehensive model of how PIXL scientists approach anomaly detection. This study revealed three key design goals used to guide the development of tools assisting in this workflow.

(2) Novel spectroscopy anomaly detection algorithm. We describe a new method to automatically detect and classify diffraction and other spectral anomalies accurately (93.4% test accuracy), directly within raw unquantified spectra, providing a significant improvement over existing methods.

(3) Deployed algorithm and visualization system. We embedded this algorithm within a visualization tool that has been deployed and has now become a regular and important component of all PIXL-based analysis conducted over the past year, used daily by over 97 NASA scientists and NASA-affiliated scientists around the globe. We evaluate the success of the application further by examining some examples of novel planetary science enabled by the tool.

(4) ISHMAP framework. We introduce a novel framework, Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP), for the collaborative development of anomaly detection tools for scientific teams like PIXL. The framework was developed from the success of this application and can serve as a useful tool in itself for future collaborations in similar scientific anomaly detection settings. It integrates HCI and AI perspectives on model development and presents a different formulation of the problem of anomaly detection, specifically designed to fulfill the needs of scientific users. ISHMAP introduces a process for producing anomaly detection models that provide first-class scientific interpretations by default, ensuring scientific users’ buy-in, and that can also be tightly integrated with existing modeling techniques (including deep learning approaches).

By diving deep into this specific application, this paper looks to showcase one possible way to synthesize AI anomaly detection research with methods developed in HCI. We believe that by combining methods across disciplines, researchers may be better able to take on high priority problems like anomaly detection, in partnership with scientific communities, and to help drive discovery.

Figure 1: Overview of how we utilized ISHMAP to assist in the PIXL science mission. Using this collaborative process we were able to develop a novel interpretable anomaly detection model and deploy interactive visualizations within the widely used PIXLISE visual analytics program. This deployment proved to provide key insights in ongoing major scientific findings.

We offer evidence of how this bridging supported our own inquiry in the case of the NASA PIXL science team. In the next section, we situate this work within the larger landscape of HCI, AI, and astrobiology research.

2.1 The Search for Extraterrestrial Life

The search for extraterrestrial life is among the great contemporary scientific endeavors [9], and Mars forms a key component of that search [13]. A principal way scientists around the world build an understanding of Mars as a possible host for life is to study its planetary-scale geology and geochemistry over time [3, 7, 27]. NASA’s PIXL instrument supports that ongoing investigation by capturing co-aligned visible imaging and thousands of spatially localized pairs of X-ray fluorescence (XRF) spectra in a single experiment [2].

The PIXL instrument represents a generational change in the scale and sensitivity of extra-terrestrial XRF measurements[15, 36, 42]. While this can bring analytical leaps, it also means that the data are more sensitive to spectral anomalies that can lead scientists to misinterpret data.

Anomalies in XRF spectra have been historically identified manually. But with each experiment site including thousands of spectra to manually investigate, each with hundreds of peaks, anomalies become increasingly difficult and time consuming to manually identify. And missing spectral anomalies could lead to the misinterpretation of the elemental chemistry, mineralogy and ultimately the planetary scale environment that acted upon the samples under investigation. Therefore there is substantial scientific interest in performing reliable and interpretable anomaly detection on the incoming data from PIXL.

2.2 Bridging HCI and AI methods for Interpretable Modeling

While having more independent origins, the recent history of hybrid disciplines such as HCI-AI [22], HCML [17], and human-guided ML [16] reflects an interest in drawing on knowledge generated across fields to jointly inform the development of systems that touch on each discipline. HCI researchers, for example, have looked to understand the challenges of designing AI systems that fit with user needs [47], and to use new properties exposed by these systems as a resource for designers [5]. The complexity of dealing with some form of embedded intelligence led other researchers to introduce particular methods to structure ideation and iteration of AI and ML systems for ubicomp [10, 33] or dialog systems [46].

Alternatively, researchers in AI and ML have drawn upon expert knowledge to inform ML models [8], or to bring interactivity into learning systems [12], looking to leverage interaction to define ML model rules [19]. These methods have been applied to anomaly detection in cybersecurity, spatio-temporal, and behaviour modeling contexts[23].

In the sciences, this crossover has looked at interactivity and glass-box models as a way to support interpretable and configurable deployed machine learning models [43]. However, fieldwork in disciplines such as oceanography has shown that while interpretation and understanding of code and models is essential, it is insufficient for contributing within a larger scientific workflow. The primary driver of change will often come from anomalous “moments of flux”, which naturally lead to reconceptualisations that are not amenable to the fixed data perspective implicit in any traditional data science or machine learning model, as “a singular focus on problem-solving may marginalize opportunities for innovation that could drive community engagement, and, therefore, momentum and adoption” [25]. Therefore this work looks to expand the tradition of using participatory design practices in the context of AI approaches to science by enabling a more flexible modeling approach while maintaining established key aspects of interpretability.

This section presents the findings of our inquiry into how scientists analyze PIXL data, focusing primarily on anomalies. We started this work with a series of contextual inquiry interviews [44], conducted approximately every two weeks over the course of 18 months, to understand the different users of PIXL data. While we spoke to and collaborated with many dozens of scientists working with PIXL data, we focused our attention on five primary users: three spectroscopists who we will refer to as R1, R2, and R3; a sedimentologist, R4; and a geochemist, R5.

Our research question entering this study was: what are the primary constraints that any modeling intervention in this context must satisfy in order to be useful within the scientists’ analytic workflow (RQ1)? Through these interviews we were able to define three primary design constraints which elucidate the requirements for any anomaly detection method to be useful to PIXL scientists. While each design goal is firmly situated within the context of PIXL analysis, the underlying rationale and the aspects of PIXL data that inform each goal are not unique to PIXL and are likely applicable in a broad range of scientific applications.

3.1 Background on PIXL Science Workflow

In order to understand the context of the design goals for anomaly detection for PIXL, we must first establish some basic background on the data formats and processing steps scientists work with. During an experiment, PIXL operates on a specially designated sample known as a target. PIXL sends an X-ray beam into a location on the rock’s surface. At each location, PIXL’s X-ray beam causes the chemical elements in the target to fluoresce, which is captured in a data type called a spectrum [4], a 4096-index array of channels. Each channel records the count of electrons sensed at a distinct energy level, measured in kilo electron-volts, or keV. PIXL’s A and B detectors each record the distinctive fluorescence patterns emitted by each point on a target from two different phase angles. These distinctive responses take the form of fluorescence peaks, which are Gaussian peaks of a fixed width in channels over background measurements and which are dependent on the chemical composition at the X-ray beam location. During an experiment, PIXL’s camera also captures visible light images of a target. When data are returned to Earth, the X-ray beam geometry is reconstructed, and each spectral point is localized within each of the returned images. Overall, an experiment at a target returns a series of visible light images and around 4,000 spectral points, each with an A and a B spectrum and x and y coordinates within each image [2] (see Figure 2).

Once an experiment is conducted with the PIXL instrument on Mars, the spectral and image data from that experiment is sent back to Earth and analyzed primarily through the PIXLISE analysis user interface [39]. This data is analyzed and transformed in multiple steps by different subject matter experts. In the first step of analysis, spectroscopists translate the peaks in the spectra into elements that they believe are present in the target [20].

The spectroscopist then uses their list of elements to quantify the spectra, translating the intensity of the various peaks into an empirical estimate of the percent of the total elemental mass each element constitutes in the target [20]. PIXL uses the PIQUANT quantification algorithm [11, 21], and exposes those capabilities through the PIXLISE user interface. Each spectrum is quantified independently by PIQUANT, while spectroscopists determine the set of elements to quantify using the bulk sum of all of the spectra in the dataset.
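The data layout described in this subsection can be sketched as a small Python structure. This is an illustrative sketch only, not the PIXLISE data model: the names `SpectralPoint` and `bulk_sum` are hypothetical, and adding both detectors into the bulk sum is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List

N_CHANNELS = 4096  # channels per spectrum, as stated in the text


@dataclass
class SpectralPoint:
    """One beam location: paired A/B detector spectra plus image coordinates."""
    x: float
    y: float
    counts_a: List[int]  # detector A counts, one entry per channel
    counts_b: List[int]  # detector B counts, one entry per channel

    def __post_init__(self) -> None:
        # Reject spectra that do not have the expected number of channels.
        for name, counts in (("A", self.counts_a), ("B", self.counts_b)):
            if len(counts) != N_CHANNELS:
                raise ValueError(
                    f"detector {name}: expected {N_CHANNELS} channels, got {len(counts)}"
                )


def bulk_sum(points: List[SpectralPoint]) -> List[int]:
    """Channel-wise sum over every point in a scan (A and B added: an assumption)."""
    total = [0] * N_CHANNELS
    for p in points:
        for ch in range(N_CHANNELS):
            total[ch] += p.counts_a[ch] + p.counts_b[ch]
    return total
```

A scan of roughly 4,000 such points would then be a plain list of `SpectralPoint` objects, with `bulk_sum` giving the single aggregate spectrum used to choose the elements to quantify.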

After the spectra have been quantified into elemental weight percents, the broader science team begins to analyze the dataset. Since pure elements are uncommon in nature, the next task of the science team is to determine how the elements that have been detected and quantified combine to form some combination of the currently known 5700+ minerals [37]. The primary driver of this analysis is the quantified elemental weight percent map of the dataset, which is visualised using many standard geology and geochemistry data visualization techniques such as ternary diagrams and heatmaps, combined with visualizations unique to PIXLISE such as chord diagrams. These quantitative signals are importantly augmented with additional signals from the color, shape, and texture of the rock in the images, or its morphology, to consider mineral candidates.

Figure 2: Overview of data provided by PIXL instrument

As candidate minerals emerge, scientists then designate them as regions of interest, or ROIs, and evolve theories of the historical geologic processes that brought the rock and its individual ROIs into existence and altered them over time [3]. These theories become increasingly dependent on the contextual and visual information and on how minerals are situated as the discussion broadens to include astrobiologists, who can aggregate the details of the geochemistry, geophysics, and climate to build a long-term theory of the broader context of the target, site, and region, and of the implications for biological habitability.

3.2 Design Goal 1: Focus on Raw Data Over Processed and Quantified Data (G1)

When considering the problem of anomaly detection in this workflow, we first sought to understand what data structure to analyze: the raw spectra or the quantification. What we observed was that for many scientists the information from the PIXL instrument spectra was almost entirely mediated through elemental quantification of fluorescence phenomena and the visualization of these quantifications within PIXLISE. This means that for the most part, anomalies are found by discovering unexpected results in a quantification and backtracking to find some non-fluorescent spectral phenomenon that is causing an erroneous quantification. As R2 described:

image

This method has obvious downsides, as any anomaly that causes an error in quantification not unusual enough to merit deeper scrutiny can propagate misleading information. Furthermore, the total amount of time and effort spent on downstream correction can be much greater than early detection, especially given the scale of data that PIXL produces. Therefore we formed our first design constraint that we should perform analysis on raw and unquantified spectra in the hopes of catching phenomena that may be obfuscated through bulk sum quantification.

3.3 Design Goal 2: Robustness to Limited Ground Truth Labeling (G2)

The next constraint we found was that the actual amount of reliable ground truth labels currently existing in PIXL datasets is very limited. This is due to the requirement of manual processing by a small number of expert users to reach reliable conclusions, paired with the novel scale of data being produced by PIXL. R1 expressed their desire to help winnow down potential anomalies before digging into deeper analysis:

“What we want ... in the automation phase is reduce ... one data set a day with 5000 spectra down to ... a few spectra a week. Because ... this is a multi-day interrogation [for] one data set, and so ... with all of the other science outputs ... ... you then want to go flagged for looking at later."

Thus we formed our second constraint to be that our method must be robust to a small number of ground truth labels and thus provide a reasonable number of flags for experts to be able to precisely analyze.

3.4 Design Goal 3: Allow Differentiation of Anomalies by Scientific Causal Process (G3)

Another major constraint that was emphatically expressed to us throughout our initial interviews was the vital importance of understanding why given anomalies may be presented within the context of scientific models. For instance, R1 explained that spectra are currently examined manually because of the huge number of different ways to model a spectrum, which makes background knowledge a requirement:

“Fluorescence data is fitted manually for a reason. ... you could have something with ... 15 lines from rare earth elements in the spectra. There’s always some expert user who knows something about the sample fitting the data. ... because the combinations are infinite, that it’s not something ... automatically done.”

Furthermore, anomalous phenomena may contain useful information in themselves about the physical processes that cause them, and thus may be worthy of analysis in their own right. We found that the non-fluorescent phenomenon most often discovered when manually investigating quantification anomalies is diffraction, which is an effect often investigated on its own in the context of purpose-built X-ray diffraction instruments [6], but which has a signal response sometimes overlapping fluorescence peaks. Thus we hypothesized that such spectral anomalies, if sufficiently differentiated, could be used as another source of auxiliary information similar to the visual context imaging within the mineral identification process. What is currently lacking is a way to find and characterize anomalies early in the pipeline and then utilize these detections for improved downstream analysis. R1 stated the goal similarly:

“The ultimate test would be if fluorescence yields an ambiguous mineral that the diffraction can make unambiguous.”

Therefore we determined it is very important for scientists to be able to not only find anomalies but interpret and differentiate different kinds of anomalies in the context of helping them understand the underlying science.

3.5 Comparison to the Existing Approach: Evaluating Standard Machine Learning Based Anomaly Detection Methods Using

image

Once we had understood the domain and relevant design goals, we sought to evaluate standard approaches to the problem of anomaly detection and take into account any potential issues (RQ2), and to guide the specific problems we need to address in our solution (RQ3). These standard approaches for anomaly detection and machine learning can be broadly broken down into traditional methods and deep learning based methods. Traditional methods tend to be designed for tabular data and are thus generally not well suited for analysis of raw data as required by design goal G1, while deep learning models are well known for their adaptability to complex non-tabular data formats. Deep learning based anomaly detection methods tend to follow the general structure of training a model to encode data into a compact representation to find structural patterns and outliers [34]. The main classes of these methods that do not need extensive labeling (as required by design goal G2) are feature extraction methods and normality representation methods [34], both of which contain explicit assumptions that conflict with our design goals. Feature extraction methods assume that “The feature representations extracted by deep learning models preserve the discriminative information that helps separate anomalies from normal instances.” This violates our finding from goal G2, where measurement is expensive and thus sufficient data to form a comprehensive feature set that includes rare and nuanced anomaly classes without explicit labels is not available. Normality representing models take the form of latent space encoders such as auto-encoders or generative adversarial networks and assume that non-anomalous instances can be better represented and reconstructed by these encoding models than anomalous instances. We found this not to be true in practice. Under the constraint of goal G2, we had no way to preemptively separate normal from anomalous instances for semi-supervision, meaning anomalous instances must be included in the training data. This can cause problems, as the model will then learn to represent those instances just as well as normal instances. This makes further sense when considering that the choice of reconstruction loss function is “designed for dimension reduction or data compression, rather than anomaly detection. As a result, the resulting representations are a generic summarization of underlying regularities, which are not optimized for detecting irregularities.” This property, when paired with the lack of supervised labels from G2, fundamentally violates goal G3, as general purpose patterns explicitly do not prioritize truly rare or categorically different anomalies but will always prioritize either single point anomalies or simple sub-samples of classes of normal data which happen to be more rare.

Our goal of anomaly differentiation by scientific interpretation G3 shows clearly how scientific end users care most about the underlying causal processes as opposed to the most clear surface level empirical patterns. Lacking robust labeling, unsupervised methods all ultimately can do nothing but optimize in different ways for surface level empirical similarity without considering different causal processes. Thus even when such methods are able to discover some of the anomalies that are present, they fundamentally keep the task of sorting through the important vs unimportant classes of anomaly as a manual process. We hope to be able to reverse this order of operations, and allow scientists to differentiate the kinds of anomalies they care about first, and then detect them directly allowing for pre-sorted interpretations when processing new data.

When considering the weaknesses of deep learning and traditional data science based anomaly detection, a root cause of issues is that the problem framing involves either no direct input from scientists (which inevitably violates design goal G3) or only has input mediated through labeling (which in order to communicate required nuance would require a scale that violates design goal G2). Thus we set out to develop an alternative model development framework which can more effectively and efficiently incorporate scientists’ prior knowledge into anomaly detection models (RQ3).

A key insight in developing this framework is the differentiation between phenomena and data. Most existing methods focus purely on the space of data, and thus anomalies must be defined as individual data points. However scientists work in an ontology that is more abstracted from data space, where there are many underlying processes that can occur in the physical world and these processes can be measured in many different, or incomplete manners. What is important is not tied to a single datum, but rather what any subset or superset of data can imply about the underlying phenomena. Thus what we consider as phenomena can occur at multiple levels of scale. A single data point of a high dimension or complexity may contain within it multiple instances of different phenomena.

This form of analysis is one side of a trade-off. Data space analysis optimizes for completeness by encoding all measurable information contained in samples from the dataset, up to the limits of the size of the dataset, without regard for correctness or why any given datum has a particular set of features. By modeling phenomena we inherently limit completeness, as only phenomena considered explicitly can be modeled, which must by necessity be fewer than the innumerable underlying factors that could be modeled given perfect knowledge. However, what we gain is correctness, where phenomena that are of interest can be modeled more fittingly to their natural scale and be separated for interpretation by default rather than through post-hoc analysis of data space latent encodings.

Here we present a design framework, Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP, see Figure 3), that can provide a template for developers and scientists to collaborate on phenomena-based anomaly analysis. Guided by the design goals laid out previously, it can produce heuristic raw data feature extractors based on scientifically determined meaningful anomalous phenomena, and iterate adaptively based on the limited amount of scientist time available. We utilized this framework within the context of PIXL science, and here we discuss both the details of its application within PIXL science to enable novel scientific discoveries and the general principles, as a potential resource for other collaborations guided by similar design goals.

4.1 Scientist Description of Anomalous Phenomena

The entry point into ISHMAP, and the key differentiation between phenomena-centered and data-centered analysis, is starting with scientist-driven explicit identification of a specific anomalous phenomenon (Fig. 3A). The process begins by outlining a semantic class of anomalous phenomena within the ontology of the given scientific domain. This specific class is chosen by scientists as conveying some important information. With PIXL we started with the phenomenon of diffraction. Scientists chose to start with this phenomenon because diffraction only occurs with a particular and theoretically well-understood set of conditions involving crystal structure. This means that the presence of diffraction can tell scientists very important information about the physical structure of a sample that elemental composition alone cannot differentiate.

Once a phenomenon is decided upon, the first characteristic that must be determined is the scale at which the phenomenon is measured by the available data. Does the phenomenon occur as a specific kind of data point? Does it manifest as a pattern between adjacent points? Or does it occur, potentially multiple times, as a subset within a single data point? The scale of analysis is determined based on the prior understanding both of the physical phenomenon and of the characteristics of the measurement methods producing the available data. This scale determination is absolutely essential to enabling properly interpretable heuristics, as it determines the input of the heuristic function and thus the way it can express a phenomenon as the most natural explanation for how a phenomenon exhibits itself in a larger dataset. What must be decided for a scale determination is a sampling procedure to extract from the primal dataset all potential instances of the target phenomenon, as well as a map back to the primal dataset determining which parts of a specific data point or points are being included in a sample. For our example, scientists know that diffraction occurs as distinct peaks within spectra, and based on the known resolution properties of the PIXL instrument, these peak responses are assumed to be discrete signals within a window of size 0.2 keV, which is the full-width-half-max size of detectable Gaussian peaks for the detector [1]. This means that the sampled input for a heuristic will be all contiguous windows of width 0.2 keV, which can be individually evaluated as potential diffraction peaks.
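The sampling procedure and the map back to the primal dataset described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the energy calibration `kev_per_channel` is a free parameter because the text does not give the keV width of a channel.

```python
from typing import Iterator, List, Sequence, Tuple

WINDOW_KEV = 0.2  # window width stated in the text (detector full-width-half-max)


def window_width_channels(kev_per_channel: float, window_kev: float = WINDOW_KEV) -> int:
    """Convert the window width from keV to a whole number of channels."""
    if kev_per_channel <= 0:
        raise ValueError("kev_per_channel must be positive")
    # Keep at least 2 channels so a paired statistic over the window is defined.
    return max(2, round(window_kev / kev_per_channel))


def sliding_windows(
    counts_a: Sequence[int],
    counts_b: Sequence[int],
    width: int,
    stride: int = 1,
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (start_channel, window_a, window_b) for every contiguous window.

    start_channel is the map back to the primal dataset: the window covers
    channels [start_channel, start_channel + width) of the original spectra.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("detector A and B spectra must have equal length")
    if width < 1 or stride < 1:
        raise ValueError("width and stride must be positive")
    for start in range(0, len(counts_a) - width + 1, stride):
        yield (
            start,
            list(counts_a[start:start + width]),
            list(counts_b[start:start + width]),
        )
```

Keeping the start channel alongside each window means any later flag can be traced to an exact energy range in an exact spectrum, which is what lets a heuristic's output be shown in place on the original data.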

Once the correct scale of analysis is determined, scientists should then describe the differential causal process of the phenomenon with respect to the data measurement. What this entails is describing how the given phenomenon interacts with the measurement process and thus the differential between the described anomalous and non-anomalous data with respect to their underlying known or hypothesized causality. This description does not need to be extremely thorough, as it will later be translated through various lossy processes. It merely has to describe the primary ways in which this phenomenon differs from the default assumptions of the model, and how these differences manifest in the data. A well chosen scale determination will tend to greatly decrease the complexity of such descriptions when compared to descriptions that must work purely in the primal dataset. This description will form the basic starting point from which heuristics can be designed and iterated upon. For diffraction the causal process is well understood. Diffraction is an effect that occurs when the PIXL instrument is particularly aligned with a crystal structure in the sample: when PIXL sends X-rays of the correct frequency to resonate with the lattice, the response scatters with constructive interference, forming a response peak at that resonant frequency. These response peaks are similar in shape to fluorescence peaks, being Gaussian peaks with width determined by the detector resolution. However, their causal process differs in two ways: they can occur at arbitrary frequencies, as opposed to solely at elemental fluorescence frequencies, and the spatial dependence of the effect is sensitive enough that a diffraction response in one of PIXL’s two detectors is very unlikely to occur at the same frequency in the other detector. Once scientists have formed such a description of the scale of a phenomenon and the causal forward process that generates differential data measurement, developers can proceed to the next step of ISHMAP (Fig. 3B).
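For reference, the resonance condition described here corresponds to the textbook Bragg condition. The symbols below (lattice plane spacing d, glancing angle θ between the beam and the lattice planes, diffraction order n) are standard notation introduced for illustration and do not appear in the scientists' own description.

```latex
n\lambda = 2d\sin\theta, \qquad E = \frac{hc}{\lambda}
\;\Longrightarrow\; E_n = \frac{n\,hc}{2d\sin\theta}
```

Because d and θ depend on the crystal and on its orientation relative to each detector, the resonant energy is not tied to elemental fluorescence energies and will generally differ between the A and B detectors, which is the differential signature the heuristic in the next step relies on.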

4.2 Translation into Heuristic Model

After a definition of the phenomenon is provided by the scientists, it is then the job of the developer to translate this definition into a computable heuristic model (Fig. 3B). This stage of the framework can take many different forms depending on the nature of the description provided. The only requirement is that the end result of an iteration of development be a program that takes as input a sample of data of the form set forth by the phenomenon scale characterization and outputs a scalar value proportional to how well that sample conforms to the forward process of the anomaly characterization. The form of this program can depend on a number of factors, including the format of the forward process description, the amount and format of available data at the phenomenon scale, and the compute resources available. When designing the initial heuristic for diffraction we chose to manually implement a statistical test of the assumptions provided in the previous step. We defined a function that, given a window of spectrum counts of width 0.2 keV, tests the hypothesis that one of the two PIXL detectors contains a statistically significant response peak above the spectrum’s noise threshold while the other detector does not. This is done using a paired difference t-test [40] between the two detectors, pairing the counts channel by channel over the window. A paired test is used because the counts in each channel are not independent, since the underlying count for each channel depends on the X-ray frequency of that channel. We calculate and return the absolute value of the t-statistic for the window (since it does not matter which of the two detectors is the one where the diffraction is detected) as the measure of the statistical effect of potential diffraction.
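
As an illustration of the statistic described above, the following is a minimal sketch of an absolute paired-difference t-statistic computed over one window. The function name `diffraction_heuristic`, the list-based inputs, and the handling of a zero-variance difference are assumptions made for this sketch and are not taken from the deployed PIXLISE code.

```python
import math
from statistics import mean, stdev


def diffraction_heuristic(detector_a, detector_b):
    """Absolute paired-difference t-statistic over the channels of one window.

    detector_a and detector_b hold the counts of the two detectors for the
    same channels (hypothetical input format).
    """
    if len(detector_a) != len(detector_b):
        raise ValueError("both detectors must cover the same channels")
    n = len(detector_a)
    if n < 2:
        raise ValueError("need at least two channels")
    # Pair the detectors channel by channel.
    diffs = [a - b for a, b in zip(detector_a, detector_b)]
    spread = stdev(diffs)
    if spread == 0.0:
        # Assumption: a perfectly constant difference has no defined
        # t-statistic; report 0 for no difference, infinity otherwise.
        return 0.0 if mean(diffs) == 0 else math.inf
    # The sign is dropped because either detector may hold the peak.
    return abs(mean(diffs) / (spread / math.sqrt(n)))


peaked = diffraction_heuristic([10, 12, 30, 55, 31, 11, 9],
                               [11, 10, 12, 11, 10, 12, 10])
flat = diffraction_heuristic([10, 12, 11, 9, 10, 12, 11],
                             [11, 10, 12, 11, 10, 12, 10])
```

In this toy input the window with a single-detector peak yields a larger response than the window containing only noise, and swapping the two detectors leaves the value unchanged.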

Figure 3: Overview of the ISHMAP Design Framework. Starting with scientist-led descriptions of phenomena (A), translated by developers into a computable heuristic function (B), which enables sampling of archetypal data instances (C), iterated until heuristics can provide a clear signal for the high response samples, followed by sampling of low response samples to determine a classification threshold (D), which when determined allows the return of a finalized model of the target phenomenon (E). An example of a model resulting from this process can be seen in Figure 4.

While in the PIXL case the heuristic model took the form of hypothesis testing based on the assumptions provided by the scientists, there are many possible methods of formulating the heuristic. If the given scientific description is more easily expressed as Bayesian priors, then models (including deep learning based models) that utilize such a probabilistic formulation may be useful. Alternatively, if direct simulation techniques for particular anomalies are provided, then direct similarity comparisons based on forward process simulation may be a better fit. The role of the heuristic model in the ISHMAP framework is not to enforce a single modeling paradigm for all phenomena, but rather to provide an interface for different models to work in ensemble within the larger framework. The important components are that models must be chosen to have the most appropriate phenomenon-scale inputs, have comparable scalar heuristic outputs, and provide some way of parameterization based on scientist priors with the possibility of fine-tuning. If all of these requirements are met, then the heuristic can be calculated for each phenomenon-scale datum in the dataset, and these pairs of data and heuristic value can be utilized in the next step (Fig. 3C).
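
The interface requirement above (phenomenon-scale input, scalar output, one value per datum) can be sketched as follows. The type aliases, `score_dataset`, and the toy `peak_height` heuristic are hypothetical names introduced only for illustration; any model that maps a datum to a scalar could take the place of `peak_height`.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A heuristic maps one phenomenon-scale datum to a scalar response.
Window = Sequence[float]
Heuristic = Callable[[Window], float]


def score_dataset(data: Iterable[Window],
                  heuristic: Heuristic) -> List[Tuple[Window, float]]:
    """Pair every phenomenon-scale datum with its heuristic response."""
    return [(datum, float(heuristic(datum))) for datum in data]


def peak_height(window: Window) -> float:
    """Toy stand-in heuristic: largest count minus the window mean."""
    return max(window) - sum(window) / len(window)


windows = [[10, 11, 10, 12], [10, 40, 11, 9], [9, 9, 9, 9]]
scored = score_dataset(windows, peak_height)
```

Because every heuristic returns a comparable scalar, the resulting list of pairs can be sorted or sampled without knowing which modeling paradigm produced it.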

4.3 Heuristic Model Evaluation from Sampling High Response Archetypes

Given a version of the heuristic model, the next step in ISHMAP is to evaluate the heuristic and subsequently determine the kind of iteration needed for model refinement (Fig. 3C). This is done first by sampling data from the high end of the distribution of heuristic responses. If the heuristic model is performing well, these high response samples should form strong archetypes of the phenomenon to be modeled. In this phase of the process scientists should be given an opportunity to inspect the class of model archetypes and determine the coherence of the class. If the high response samples are consistently determined to be good examples of the desired phenomenon, then the heuristic model can be confirmed and the process can move on to the threshold tuning phase. Otherwise the set of high response samples will contain instances of phenomena other than the target.

In this case scientists must then determine what else is being included. If the set contains a meaningful number of instances of a distinct anomalous phenomenon, then we can recursively iterate the ISHMAP procedure to model this other class and thus differentiate it from the target. This is exactly what occurred when evaluating the initial diffraction heuristic. During this phase R2 pointed out:

“ And, okay, yeah, this is something actually that I think is good for me to point out to you, because the algorithm identifies this as a diffraction. And this is not diffraction. This is actually, I would say, an intensity mismatch that we’re seeing these not just in that peak. But in some other locations, other peaks in the spectra, I think there’s a little bit of intensity mismatch has to do with measuring the rough surface, like measuring like these larger grains. ”

What we had found was an additional class of anomalous phenomena due to surface roughness. This phenomenon was subsequently modeled using ISHMAP. The scientists’ background knowledge informed us that surface roughness effects are frequency independent, so that the roughness phenomenon-space includes whole spectra, and that the effect is expected to be a constant attenuation of the signal in a single detector. This background informed a heuristic roughness detector that calculates the mean detector difference across the whole frequency range of a spectrum, which is essentially the maximum likelihood estimate of an assumed constant attenuation factor. This heuristic proved to be highly effective in the first iteration at distinguishing roughness effects. Upon completion of this deeper level of ISHMAP iteration with an effective roughness detector heuristic, we could then separate diffraction from roughness effects in the high response region, leaving a now coherent class of diffraction instances using the original diffraction heuristic. Note that the phenomenon scale for roughness is not the same as that for diffraction, an example of how two different phenomena can have overlapping effects at one scale but can be easily differentiated at another scale. The recursive iteration and phenomenon-centric structure of ISHMAP ensures that all heuristics model phenomena at their native scale while still being able to provide information between scales.
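
A minimal sketch of the roughness heuristic as described, the mean detector difference over a whole spectrum, is given below. The function name and the list inputs are assumptions made for illustration; the sign is kept here so that it indicates which detector is attenuated, which is a choice of this sketch rather than something stated above.

```python
from statistics import mean


def roughness_heuristic(detector_a, detector_b):
    """Mean per-channel detector difference over a whole spectrum.

    Unlike the windowed diffraction heuristic, the inputs span the full
    frequency range, since roughness is assumed frequency independent.
    """
    if len(detector_a) != len(detector_b) or not detector_a:
        raise ValueError("detectors must be non-empty and the same length")
    return mean([a - b for a, b in zip(detector_a, detector_b)])


# Detector B reads consistently lower than detector A across the spectrum.
response = roughness_heuristic([100, 200, 300], [80, 160, 240])
```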

After differentiating all distinct classes of anomalies present in the high response sample set, the primary heuristic model can be directly iterated to optimize differentiation from non-anomalous normal data and associated background noise. This iteration loop adapts to the amount of scientist labeling available. The baseline modeling assumptions assure a certain minimum semantic coherence of the initial heuristic, making additional optimization optional, while the flexibility of the heuristic modeling interface allows for models that can benefit from additional feedback when it is available. Thus the expected availability of downstream scientist labeling is an important constraint to consider when choosing a heuristic model class. This phase of ISHMAP is considered complete, and we can continue to the next phase (Fig. 3D), after sufficient iteration to ensure that the high response samples of the heuristic model consistently represent archetypal instances of the target phenomenon.
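
The high response sampling used throughout this phase can be sketched as selecting the top-scoring data for scientist review. The name `sample_high_response` and the `(datum, score)` pair format are assumptions of this sketch.

```python
import heapq


def sample_high_response(scored, k):
    """Return the k (datum, score) pairs with the largest heuristic score,
    ordered from highest to lowest, for inspection by scientists."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


scored = [("w0", 0.3), ("w1", 4.2), ("w2", 1.7), ("w3", 9.1)]
archetypes = sample_high_response(scored, 2)
```

Here `archetypes` holds the two windows a reviewer would look at first; if reviewers find a second phenomenon among them, that phenomenon becomes the target of a nested ISHMAP iteration.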

4.4 Heuristic Threshold Tuning from Sampling Moderate Response Edge Cases

Once the heuristic model has been determined to be responding to the correct features, there remains the task of determining a threshold for the heuristic response value for the purpose of categorization. This process is very similar to the previous phase: samples with heuristic values around a potential threshold are generated and then evaluated by scientists. If the samples remain highly coherent then the threshold can be lowered, while if there are insufficient numbers of the correct anomalies in a set of samples the threshold can be raised. This process can be iterated and tuned depending on the comparative importance scientists place on false positives and false negatives. This phase (Fig. 3D) still contains the same opportunity for iteration present in the high response phase (Fig. 3C): if additional phenomena are found in the boundary region they can be recursively modeled, providing cleaner edge case regions. Furthermore, in this phase we additionally consider a special class of phenomenon we call ambiguities. In the region of a decision boundary for phenomenon detection, scientists will often find instances which are ambiguous in certain respects, making an underlying label difficult or impossible to assign. These ambiguous instances should not be confused with well-defined samples whose heuristic values happen to be near the decision boundary. Rather, ambiguities refer to examples where even the scientific ground truth may not be clear. These phenomena introduce challenges for any classification scheme. In ISHMAP they are addressed by simply modeling ambiguity as a distinct phenomenon: even though scientific priors will by definition be less robust, the features that scientists use to justify differences in interpretation are used as the basis for the heuristic modeling. Such features must necessarily exist, as otherwise scientists would not have any empirical basis for uncertainty. This additional iterative phase is repeated until either no more ambiguous instances are found, in which case a threshold can be determined exactly based on error tolerances, or scientist availability is exhausted, as the ambiguous class may often contain an effectively unbounded number of different and idiosyncratic features that could be modeled.
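
One possible way to organize the threshold tuning loop is sketched below: draw the samples whose response lies near a candidate threshold, have scientists review them, then lower or raise the threshold. The function names, the band `half_width`, the fixed `step`, and the `target_fraction` of confirmed samples are all assumptions of this sketch; in practice the target depends on how scientists weigh false positives against false negatives.

```python
def samples_near_threshold(scored, threshold, half_width):
    """Select (datum, score) pairs whose score lies within half_width of
    the candidate threshold, for review by scientists."""
    return [(d, s) for d, s in scored if abs(s - threshold) <= half_width]


def update_threshold(threshold, confirmed, total, step, target_fraction=0.9):
    """Lower the threshold when the reviewed edge cases are still coherent,
    raise it when too few of them were confirmed as the target anomaly."""
    if total == 0:
        return threshold  # nothing reviewed, nothing learned
    if confirmed / total >= target_fraction:
        return threshold - step
    return threshold + step


scored = [("a", 1.0), ("b", 4.8), ("c", 5.3), ("d", 9.0)]
edge_cases = samples_near_threshold(scored, threshold=5.0, half_width=0.5)
```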

When sampling edge cases of diffraction, scientists found two different ambiguous phenomena. The first were cases of single detector response where the potentially diffracting window had a significant difference between detectors, but the difference was not strongly peaked, as the prior expectation would be for a diffraction response, and was instead flatter. This was modeled by measuring the relative height of the detector difference peak in the center of the window to differentiate strongly from weakly peaking instances. The second class of ambiguous phenomena were cases where, instead of a consistent background in the non-diffracting detector, the baseline detector (the detector with the lower average count over the window) contained a small peak as well, attenuated compared to the other detector. This led to different interpretations by different scientists and thus was modeled as ambiguous, using as a heuristic the coefficient of variation of the baseline detector in a window. After differentiating these primary classes of ambiguous phenomena, the boundary region of the diffraction heuristic threshold became sufficiently clean to determine a threshold to classify diffraction, in this case with a bias placed on reducing false positive determinations.
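
The two ambiguity heuristics can be sketched as follows. The exact definitions used in the deployed model are not given above, so both formulas here are assumptions: `peak_prominence` divides the central absolute detector difference by the mean absolute difference in the window, and `baseline_cv` uses the population standard deviation divided by the mean.

```python
from statistics import mean, pstdev


def peak_prominence(detector_a, detector_b):
    """Central |difference| relative to the mean |difference| in the window.

    Values near 1 indicate a flat difference; larger values indicate a
    difference that is strongly peaked at the window center.
    """
    diffs = [abs(a - b) for a, b in zip(detector_a, detector_b)]
    average = mean(diffs)
    if average == 0:
        return 0.0
    return diffs[len(diffs) // 2] / average


def baseline_cv(detector_a, detector_b):
    """Coefficient of variation of the baseline detector, taken to be the
    detector with the lower average count over the window."""
    baseline = detector_a if mean(detector_a) <= mean(detector_b) else detector_b
    level = mean(baseline)
    if level == 0:
        return 0.0
    return pstdev(baseline) / level


flat = peak_prominence([20, 20, 20, 20, 20], [10, 10, 10, 10, 10])
sharp = peak_prominence([10, 12, 50, 12, 10], [10, 10, 10, 10, 10])
```

A low prominence would flag the first ambiguous class (flat single detector response), and a high baseline coefficient of variation would flag the second (a small attenuated peak in the baseline detector).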

Once all interfering phenomena have been modeled and differentiated, and a classification threshold is determined, the final output of ISHMAP is a deployable classification model (Fig. 3E). This model has the same fundamental classification structure as a decision tree, shown in Figure 4, since the aforementioned phenomena differentiation is powered by similar recursive iterations of ISHMAP-returned classifiers.
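
To make the tree-like structure concrete, the following sketch composes several thresholded heuristics into an ordered rule list, the simplest tree-shaped classifier. The ordering of the rules, the threshold values, the feature names, and the single combined "ambiguity" score are all hypothetical; Figure 4 shows the actual architecture.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Features = Dict[str, float]  # hypothetical precomputed heuristic responses


@dataclass(frozen=True)
class Rule:
    label: str
    heuristic: Callable[[Features], float]
    threshold: float


def classify(datum: Features, rules: Sequence[Rule],
             default: str = "no anomaly") -> str:
    """Return the label of the first rule whose response meets its threshold."""
    for rule in rules:
        if rule.heuristic(datum) >= rule.threshold:
            return rule.label
    return default


# Interfering phenomena are checked before the target phenomenon.
rules = [
    Rule("surface roughness", lambda d: d["roughness"], 5.0),
    Rule("ambiguous", lambda d: d["ambiguity"], 0.5),
    Rule("diffraction", lambda d: d["diffraction"], 3.0),
]
label = classify({"roughness": 1.0, "ambiguity": 0.1, "diffraction": 4.0}, rules)
```

Each rule corresponds to one completed ISHMAP iteration, so every branch of the resulting classifier can be traced back to a phenomenon description given by scientists.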

In order to evaluate our approach we conducted both a standard quantitative error analysis for the model component of the work (Section 5.1) and a qualitative evaluation of the success of the final deployed system (Section 5.2). In doing so we do not aim to show that ISHMAP as a design framework is verifiably superior to other design approaches, as this would require counterfactual information on the success of many different design frameworks on this same problem, which is well outside the scope of this work. Instead we merely wish to show that ISHMAP is capable of producing strong scientific outcomes in this application, and thus may prove useful in further applications.

Figure 4: The architecture of the output model (Fig. 3E) of utilizing ISHMAP to detect diffraction peaks.

5.1 Evaluation of Anomaly Detection Model

After completing the ISHMAP procedure for the target phenomenon of diffraction, we are left with a classification model (Fig. 3E / Fig. 4). In order to evaluate the real-world accuracy of the model within the context of its use in mineral identification, and to avoid information leakage and test our model’s generalizability, after developing our diffraction model using input from R2 and R3 we set out to test the model using labels from additional scientists (R1, R4, and R5) on different datasets. Multiple scientists were used in order to ensure reliable labels. We presented the three scientists with a representative random sample of 213 spectra that were not used in the training process. The sample was balanced to include 107 spectra sampled uniformly at random from spectra determined by the model as containing diffraction peaks, and 106 sampled uniformly from spectra determined as not containing diffraction (through exhibiting Surface Roughness, Ambiguity, or no anomaly). The scientists were then left to determine whether each of the presented spectra contained diffraction. While all three scientists were presented with all 213 spectra, some spent considerably less time and provided only a few labels due to their time constraints. Furthermore, we found that the individual scientists had varying sensitivities for positively determining diffraction in cases of uncertainty and thus would often disagree in their determinations. Therefore we could not form reliable ground truth labels for 16 spectra which had only two labels, from scientists who disagreed, nor for 45 spectra with only a single label from an individual scientist (which, as we found, is an unreliable indicator without a second opinion). This left 152 spectra (of which 144 had total consensus and 8 had a majority determination) which we were able to use as a basis for evaluation. Of this reliable ground truth set, the model correctly predicted the presence or absence of diffraction with 93.4% accuracy.
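
The label reliability rule described above can be sketched as follows: a spectrum receives a ground truth label only when at least two scientists labeled it and a strict majority agrees. The function names and the boolean label encoding are assumptions of this sketch.

```python
from typing import Optional, Sequence, Tuple


def ground_truth(labels: Sequence[bool]) -> Optional[bool]:
    """Return the agreed label, or None when the labels are not reliable."""
    if len(labels) < 2:
        return None  # a single opinion is treated as unreliable
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == negatives:
        return None  # for example, two scientists who disagree
    return positives > negatives


def evaluate(predictions: Sequence[bool],
             label_sets: Sequence[Sequence[bool]]) -> Tuple[int, int]:
    """Count correct predictions over the spectra with reliable labels."""
    correct = total = 0
    for predicted, labels in zip(predictions, label_sets):
        truth = ground_truth(labels)
        if truth is None:
            continue
        total += 1
        correct += int(predicted == truth)
    return correct, total


correct, total = evaluate(
    [True, False, True, True],
    [[True, True], [False, True, False], [True], [False, False, False]],
)
```

The reported counts are consistent with this rule: 213 - 16 - 45 = 152 = 144 + 8. An accuracy of 93.4% on 152 spectra would correspond to 142 correct predictions, although that count is inferred from the rounded percentage rather than reported directly.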

These results match the qualitative experience that scientists expressed when examining the outputs of the model, as in an interview with R2 they expressed that:

“Your tool works very well, and is finding [diffraction peaks] in almost all cases, ... And so there wasn’t really anything that was being identified incorrectly.”

This qualitative reliability, together with the satisfaction expressed by the domain-expert scientists who are the tool’s end users, forms the most relevant evaluation of the tool’s efficacy, since previous methods do not present any systematic alternative baseline for comparison. It is thus a strong example of the effectiveness of the ISHMAP framework for designing useful models for scientific users.

5.2 Impact of Deployed Interface

An important benefit of the ISHMAP collaborative design process is that once a model detecting a particular anomalous phenomenon is complete, there is already assurance that the model answers an actual problem of interest for scientists, and collaborating scientists have a built-in degree of ownership of and buy-in to the technique [38]. This means that practical deployments of tools that utilize such a model are much more likely to result in adoption into the scientific workflow. To showcase how this collaborative design and scientific interpretation-first modeling approach can not only perform well on general classification benchmarks but additionally result in meaningful new capabilities for scientific end users, we consider the deployment of our ISHMAP diffraction model within the PIXL science workflow. After finalizing the model we integrated the model outputs with the existing primary visual analytics toolkit used by PIXL science: PIXLISE [14, 31, 48].

The first full version of the model and associated visualizations was deployed in November 2021, with preliminary versions available to scientists as early as June 2021. At the time of writing, it is used daily by over 97 NASA scientists and NASA-affiliated scientists across the globe collaborating on the PIXL science mission. Within the PIXL science group discussion board the functionality of our tool was mentioned over 80 times in the context of working group discussions, with 39 instances highlighting diffraction informed geo-science interpretations that would not have otherwise been easily visible and 28 instances of corrected spectroscopy and quantification error detection. Now, whenever new data is beamed down from Mars, our model allows scientists to instantaneously discover diffraction peaks that can inform downstream mineral identification.

5.3 Diffraction Panel Interface

As soon as a dataset is loaded into PIXLISE, scientists can use the diffraction panel (Fig. 5A) to identify particular diffraction peaks. The diffraction panel includes a histogram of all of the energy levels where diffraction peaks have been detected. Scientists can choose a set of energy ranges in the histogram to select all locations which contain diffraction peaks in those ranges (Fig. 5A.1). This allows scientists to quickly discover where different kinds of diffraction peaks, and thus minerals, are located (as the diffraction energy is a direct function of the crystal structure of the underlying mineral), as can be seen in Figure 6. This is a particularly valuable feature for scientific interpretation, enabled by the fact that our diffraction model works at the correct diffraction scale framing codified by ISHMAP, as opposed to the default framing of anomalous spectra which a machine learning based model would utilize. Scientists can then further verify individual peak identifications from a sortable list of the detected diffraction peaks (Fig. 5A.2). Scientists can select a peak, see its corresponding spectrum and energy location on the PIXLISE spectrum view plot (Fig. 5A.3), and then confirm within the interface whether the classification is a correct determination of diffraction or a false positive. This is required because even a very accurate model makes errors. Allowing users to further refine the available data allows the model to be updated and improved continuously, and helps build trust with the scientists by ensuring that they always have the ability to override the interpretation of the model.
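
The histogram and range selection behaviour of the diffraction panel can be sketched with plain data structures. This is not the PIXLISE implementation: the `(location, energy)` peak format, the function names, and the 0.5 keV bin width are assumptions made for illustration.

```python
from collections import Counter


def energy_histogram(peaks, bin_width):
    """Count detected peaks per energy bin; bin i covers
    [i * bin_width, (i + 1) * bin_width)."""
    return Counter(int(energy // bin_width) for _, energy in peaks)


def locations_in_ranges(peaks, ranges):
    """Return the locations holding at least one peak whose energy falls
    inside any of the selected [low, high) ranges."""
    return {
        location
        for location, energy in peaks
        if any(low <= energy < high for low, high in ranges)
    }


# Hypothetical detected peaks as (beam location id, peak energy in keV).
peaks = [(1, 3.1), (2, 3.3), (2, 7.9), (5, 3.2)]
histogram = energy_histogram(peaks, bin_width=0.5)
selected = locations_in_ranges(peaks, [(3.0, 3.5)])
```

Selecting one energy range returns every location with a peak in that range, which is what lets peaks of similar energy be grouped spatially.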

5.4 Diffraction Map Visualization

In addition to the peak-specific workflow enabled by the diffraction panel, we have also implemented a more high-level visualization of anomaly structure via the diffraction map interface (Fig. 5B). Within the PIXL science team, diffraction maps have become an accessible, shareable, and invaluable piece of information for the process of mineral identification as new data comes down from PIXL.

The diffraction map visualization overlays a heatmap of the density of diffraction peaks or surface roughness anomalies at each beam location on top of the visual context image. This allows scientists to quickly find clusters of diffraction that are indicative of a crystal grain, group the locations within that cluster, and create regions of interest that can be applied towards further analysis. The diffraction map has become the preferred method of scientists to share findings of crystal structure, as R3 commented when looking at the diffraction map for the Beaujeu [28] dataset:

Figure 5: Overview of interface components within the PIXLISE application displaying the Guillaumes Mars dataset. (A) The Diffraction Panel enables quick identification and verification of individual diffraction peaks or grouped similar peaks. (B) The Diffraction Map displays the spatial distribution of diffraction peaks either over the whole spectrum range, for particular energy subsets, or in combination with custom defined expressions from other PIXLISE analysis tools.

“I found an interesting correlation between the regions marked with a high number of diffraction peaks, and the regions that we have geochemically identified as plagioclase... It is great to see the usefulness of the diffraction peak detection algorithm in practice.”

These maps can be customized in a number of ways. The default view is to show the density of diffraction present at all energy levels; this is the most broadly applicable view when lacking a particularly strong prior about the specific crystallography of the target. However, if a scientist has a hypothesis about a particular crystal configuration which would predict diffraction at predictable frequencies, there are two ways to visualise diffraction with more specificity. First, PIXLISE contains a custom domain-specific language for custom maps of expressions. The output of the diffraction model is a supported query within this language, which allows scientists to integrate anomaly information with other existing analysis. Second, the diffraction panel histogram selection supports the creation of maps directly. By being able to see the distribution of diffraction peak energy, scientists can analyze the distribution and find clusters without a priori knowledge, creating maps that, rather than showing the overall crystallographic structure of the sample, can isolate particular grains (Fig. 6).
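
The density underlying such a map can be sketched as a per-location peak count, optionally restricted to an energy range, and scaled for display. The `((x, y), energy)` peak format and both function names are hypothetical and do not reflect the PIXLISE expression language or its internals.

```python
from collections import Counter


def diffraction_density(peaks, energy_range=None):
    """Count detected peaks per beam location; when energy_range is given
    as (low, high), only peaks with low <= energy < high are counted."""
    counts = Counter()
    for location, energy in peaks:
        if energy_range is None or energy_range[0] <= energy < energy_range[1]:
            counts[location] += 1
    return counts


def normalize(counts):
    """Scale counts to [0, 1] so they can drive a heatmap colour scale."""
    largest = max(counts.values(), default=0)
    if largest == 0:
        return {}
    return {location: n / largest for location, n in counts.items()}


# Hypothetical peaks as ((x, y) beam location, peak energy in keV).
peaks = [((0, 0), 3.1), ((0, 0), 3.2), ((0, 1), 7.9), ((1, 1), 3.3)]
full_map = diffraction_density(peaks)
subset_map = normalize(diffraction_density(peaks, energy_range=(3.0, 3.5)))
```

The unrestricted call corresponds to the default view over all energy levels, while the restricted call corresponds to a map built from a histogram selection.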

5.5 Enabling Ongoing Scientific Discoveries about Martian Geology

In November of 2021, PIXL conducted a series of XRF scans of a sample with the codename Dourbes [29] at the Séítah formation [30] on the floor of the Jezero crater on Mars. This location presented an acute issue for the problem of mineral identification. Due to weathering, it is impossible to clearly identify crystal grains from the context imaging, and XRF information cannot sufficiently differentiate between all relevant physical properties. This information is extremely important for making inferences about the geological history of the site, and obtaining it has posed a significant challenge to previous Mars missions in similar situations.

Fortunately, due to the additional data collected by the PIXL instrument and the development of our suite of tools, scientists were rapidly able to visualize the crystal structure of the sample with the diffraction map (Fig. 6). The diffraction map functionality enabled robust spatial comparison of diffraction with elemental analysis, and thus scientists could make strong claims about particular grains of elements, their crystal properties, and thus the identified mineralogy. By going beyond the information available in standard fluorescence and elemental quantification, this comparison provided decisive evidence about the mineralogy of Séítah formation rocks, as expressed by a PIXL scientist in a recent abstract to be published in a top-tier scientific journal [41]:

“Collocated crystal sizes and mineral identities are critical for interpreting textural relationships in rocks and testing geological hypotheses, but it has been previously impossible to unambiguously constrain these properties ... Here we demonstrate that diffracted and fluoresced x-rays detected by the PIXL instrument ... provide information about the presence or absence of coherent crystalline domains in various minerals.”

Figure 6: Screenshots of diffraction maps of the Dourbes [29] dataset formed from different selections of diffraction peak energies in the diffraction panel. What can be seen is rich information regarding the spatial distribution of diffraction peaks at different energy levels, with peaks in close energy clusters also clustering spatially. This implies the presence of unique crystal grains which can be readily seen using these diffraction maps.

This finding has formed the central component of continuing research within the PIXL science team and is functionally enabled by the diffraction detection and visualization capabilities powered by our model. The effectiveness of the model in integrating with existing scientific workflows and assisting in high impact analysis immediately upon deployment further provides strong evidence of the effectiveness of the ISHMAP design framework.

6.1 Generalizability of Design Goals

While this work has showcased a single specific successful application, we hope that there are a number of useful insights from our solution that can be utilized in other applications as well. In evaluating the applicability of our framework to other use cases, the key deciding factor should be alignment with our outlined design goals. While we developed these goals strictly within the context of embedded user research for the particular domain of PIXL scientists, we justified the formulation of each goal based on aspects of scientists’ workflows that we found to be common among the different individuals and specialities within the fairly diverse PIXL team, as well as on aspects that, while exhibited in this specific workflow, are not necessarily unique to it. For instance, our design goal G1 is based on the observation that anomalies are more easily missed the further a dataset is from the ‘native’ scale of the anomalous phenomenon. Since, essentially by definition, anomalies are classes of phenomena to which default assumptions do not apply, any processing steps are likely to introduce errors with respect to anomaly detection. We can then say that a focus on raw data is an important design consideration not just for PIXL, but for any analytic workflow that includes steps that process data in an irreversible manner based on violable assumptions. Our design goal G2 essentially formalizes an extremely common design constraint of limited data and label availability. The entire branch of unsupervised machine learning studies various implications for modeling under such a restriction. So while discovering that this constraint was applicable to our specific domain is extremely important, it is also clear that there are many other domains that share a similar requirement. Finally, our design goal G3 similarly expresses a well-established and known weakness of deep learning based anomaly detection [34] and shows why it is important in our specific use case. What this all implies is that while the goals we developed are in one respect specifically tied to the PIXL science domain, we expect that there are very likely a large number of other domains that share these goals as well, and it is in these domains where ISHMAP may be a useful tool to structure model development.

6.2 Limitations of ISHMAP

While we present the substantial potential effectiveness of the ISHMAP framework, like all frameworks it is not universally applicable and has limitations on where it should be utilized. The primary driver of whether ISHMAP is appropriate is whether the design goals laid out are relevant. In prioritizing these goals other potential priorities are de-emphasized. In particular, the ISHMAP process can only help in discovering known anomaly classes, as well as anomaly classes that are discovered during the ISHMAP process. There are many domains, including scientific domains, where data sources are sufficiently novel to contain many potential ‘unknown unknown’ classes of anomaly for which users do not have a prior expectation. Since ISHMAP relies on explicit description of known phenomena, such unknown anomalies cannot be captured reliably. In such applications it has been shown that purer deep learning based methods can be highly effective in discovering such unique or point anomalies [34].

Furthermore the collaborative process may have substantial organizational overhead due to the requirement for consistent iteration and feedback between developers and scientists. Depending on the nature of a collaboration this overhead may occasionally present a greater manual effort burden than the effort of just providing more ground truth labels, undermining design goal G2. Thus when making the choice of whether to undertake an ISHMAP collaboration both sides must consider both the technical and organizational nature of the problem to determine if it is the right fit.

6.3 Opportunities for Future Work

In presenting the ISHMAP framework we have only taken a first step in improving scientific anomaly detection with human-centered AI methodology. While the results of our implemented detection tool with the PIXL team were a clear success, they only form a proof of concept for the methodology. We encourage researchers to replicate, evaluate, and refine the methodology in additional domains. Additionally, the framework itself presents clear opportunities for development. The flexibility of the framework makes it amenable to many different forms of utilization and deployment, and so future work to fine-tune the most effective processes, both technical and procedural, for scientific prior encoding, heuristic generation, and sample evaluation may greatly assist in more efficient and effective utilization of ISHMAP. Indeed, while ISHMAP was developed in order to address the shortcomings of pure deep learning based approaches, studying how to integrate deep learning models and all their expressive power within this more interpretation-focused framework may allow for the best of both methods. Finally, in the current formulation of ISHMAP, while the expertise of scientists is absolutely essential, the role of the model designer/developer is comparatively procedural. This leaves a potential opportunity to develop automated tools or interactive interfaces for scientists to engage in their side of the ISHMAP procedure entirely independently, which would greatly increase the potential for science teams to develop their own robust and interpretable anomaly detection models.

6.4 Reproducibility

The code discussed in this work is distributed across a number of repositories in the broader PIXLISE project, which is open-sourcing all of its constituent repositories at https://github.com/pixlise [32, 45]. Additionally, the deployed PIXLISE tool itself is publicly accessible. Anyone can request an account at https://www.pixlise.org/ and use all of the functionality of PIXLISE, including the anomaly detection functionality discussed in this work, on any public datasets. Most of the datasets used as examples in this work are publicly available either on PIXLISE or in raw data form directly from the NASA Planetary Data System (PDS) at https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/.

Many of history’s most important scientific discoveries can be attributed to the thorough analysis of anomalies [24]. Today, as scientific datasets get larger and more complex, so too do the methods used to find the anomalies within them, sometimes at the expense of the ability to interpret and explain the anomalies [34], which is a fundamental component of scientific analysis. In this work we sought to integrate human-computer interaction methodologies with the state of the art in AI to develop a method for anomaly detection that is both effective and interpretable. In collaboration with a world-leading science team at NASA, we conducted extensive user research to understand their specific analytic workflow, and developed design goals that represent the needs of scientists with respect to interpretable anomaly detection. Based on these design goals we introduced ISHMAP, a novel design framework for the development of scientific anomaly detection models. By utilizing ISHMAP to develop an anomaly detection toolkit that is used daily by NASA scientists around the world and contributes to ongoing scientific discoveries, we showcased a proof of concept for a method to enable better science by taking a human-centered approach to both the technical and scientific problems of anomaly detection. We hope this can assist scientists and researchers looking not only to detect, but to understand, anomalies in their data.

The research was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

[1] Abigail Allwood, Ben Clark, David Flannery, Joel Hurowitz, Lawrence Wade, Tim Elam, Marc Foote, and Emily Knowles. 2015. Texture-specific elemental analysis of rocks and soils with PIXL: The Planetary Instrument for X-ray Lithochemistry on Mars 2020. In 2015 IEEE Aerospace Conference. 1–13.

[2] Abigail C Allwood, Lawrence A Wade, Marc C Foote, William Timothy Elam, Joel A Hurowitz, Steven Battel, Douglas E Dawson, Robert W Denise, Eric M Ek, Martin S Gilbert, M.E. King, C.C. Liebe, T. Parker, D.A.K. Pedersen, D.P. Randall, R.F. Sharrow, M.E. Sondheim, G. Allen, K. Arnett, M.H. Au, C. Basset, M. Benn, J.C. Bousman, R.J. Calvet, L. Cinquini, B. Clark, S. Conaby, H.A. Conley, S. Davidoff, J. Delaney, T. Denver, E. Diaz, G.B. Doran, J. Ervin, M. Evans, D.O. Flannery, N. Gao, J. Gross, J. Grotzinger, B. Hannah, J.T. Harris, C.M. Harris, C.M. Heirwegh, C. Hernandez, E. Hertzberg, R.P. Hodyss, J.R. Holden, C. Hummel, M.A. Jadusingh, J.L. Jørgensen, J.H. Kawamura, A. Kitiyakara, K. Kozaczek, J.L. Lambert, P.R. Lawson, Y. Liu, K.M. Macneal, S. McLennan, P. McNally, P.L. Meras, J. Napoli, B.J. Naylor, P. Nemere, N. Pootrakul, R.A. Romero, R. Rosas, J. Sachs, M.E. Schein, T.P. Setterfield, V. Singh, E. Song, M.M. Soria, N.R. Tallarida, D.R. Thompson, M.M. Tice, L. Timmermann, V. Torossian, A. Treiman, S. Tsai, K. Uckert, J. Villalvazo, M. Wang, D.W. Wilson, S.C. Worel, P. Zamani, M. Zappe, and R. Zimmerman. 2020. PIXL: Planetary instrument for X-ray lithochemistry. Space Science Reviews 216, 8 (2020), 1–132. https://doi.org/10.1007/s11214-020-00767-7

[3] Jillian F. Banfield, John W. Moreau, Clara S. Chan, Susan A. Welch, and Brenda Little. 2001. Mineralogical Biosignatures and the Search for Life on Mars. Astrobiology 1, 4 (2001), 447–465. https://doi.org/10.1089/153110701753593856

[4] Burkhard Beckhoff, Birgit Kanngießer, Norbert Langhoff, Reiner Wedell, and Helmut Wolff. 2007. Handbook of practical X-ray fluorescence analysis. Springer Science & Business Media.

[5] Jesse Josua Benjamin, Arne Berger, Nick Merrill, and James Pierce. 2021. Machine Learning Uncertainty as a Design Material: A Post-Phenomenological Inquiry. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan)  (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 171, 14 pages. https://doi.org/10.1145/3411764.3445481

[6] David L Bish, DF Blake, DT Vaniman, SJ Chipera, RV Morris, DW Ming, AH Treiman, P Sarrazin, SM Morrison, RT Downs, et al. 2013. X-ray diffraction results from Mars Science Laboratory: Mineralogy of Rocknest at Gale crater. Science 341, 6153 (2013), 1238932.

[7] Janice L Bishop, Enver Murad, Melissa D Lane, and Rocco L Mancinelli. 2004. Multiple techniques for mineral identification on Mars: a study of hydrothermal rocks as potential analogues for astrobiology sites on Mars. Icarus 169, 2 (2004), 311–323.

[8] Mu-Huan Chung, Mark Chignell, Lu Wang, Alexandra Jovicic, and Abhay Raman. 2020. Interactive Machine Learning for Data Exfiltration Detection: Active Learning with Human Expertise. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 280–287. https://doi.org/10.1109/SMC42975.2020.9282831

[9] Christopher F. Chyba and Kevin P. Hand. 2005. Astrobiology: The Study of the Living Universe. Annual Review of Astronomy and Astrophysics 43, 1 (2005), 31–74.

[10] Scott Davidoff, Min Kyung Lee, Anind K. Dey, and John Zimmerman. 2007. Rapidly Exploring Application Design through Speed Dating. In Proceedings of the 9th International Conference on Ubiquitous Computing (Innsbruck, Austria) (UbiComp ’07). Springer-Verlag, Berlin, Heidelberg, 429–446.

[11] W. T. Elam, B. D. Ravel, and J.R. Sieber. 2002. A new atomic database for X-ray spectroscopic calculations. Radiation Physics and Chemistry 63, 2 (2002), 121–128.

[12] Jerry Alan Fails and Dan R. Olsen. 2003. Interactive Machine Learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces (Miami, Florida, USA)  (IUI ’03). Association for Computing Machinery, New York, NY, USA, 39–45. https://doi.org/10.1145/604045.604056


[13] Alberto G Fairén, Alfonso F Davila, Darlene Lim, Nathan Bramall, Rosalba Bonaccorsi, Jhony Zavaleta, Esther R Uceda, Carol Stoker, Jacek Wierzchos, James M Dohm, et al. 2010. Astrobiology through the ages of Mars: the study of terrestrial analogues to understand the habitability of Mars. Astrobiology 10, 8 (2010), 821–843.

[14] David Flannery, Scott Davidoff, Michael M. Tice, Abigail C. Allwood, William Timothy Elam, Christopher M. Heirwegh, Joel A. Hurowitz, Yang Liu, and Peter Nemere. 2021. Increasing Efficiency of Mars 2020 Rover Operations via Novel Data Analysis Software for the Planetary Instrument for X-ray Lithochemistry (PIXL). Proceedings of the 2021 Committee on Space Research (COSPAR) Scientific Assembly 43 (2021). Issue B0.2.

[15] R Gellert, R Rieder, J Brückner, BC Clark, G Dreibus, G Klingelhöfer, G Lugmair, DW Ming, H Wänke, A Yen, et al. 2006. Alpha Particle X-ray Spectrometer (APXS): Results from Gusev crater and calibration report. Journal of Geophysical Research: Planets 111, E2 (2006).

[16] Yolanda Gil, James Honaker, Shikhar Gupta, Yibo Ma, Vito D’Orazio, Daniel Garijo, Shruti Gadewar, Qifan Yang, and Neda Jahanshad. 2019. Towards human-guided machine learning. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 614–624.

[17] Marco Gillies, Rebecca Fiebrink, Atau Tanaka, Jérémie Garcia, Frédéric Bevilacqua, Alexis Heloir, Fabrizio Nunnari, Wendy Mackay, Saleema Amershi, Bongshin Lee, Nicolas d’Alessandro, Joëlle Tilmanne, Todd Kulesza, and Baptiste Caramiaux. 2016. Human-Centred Machine Learning. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (San Jose, California, USA)  (CHI EA ’16). Association for Computing Machinery, New York, NY, USA, 3558–3565. https://doi.org/10.1145/2851581.2856492

[18] David Gunning, Mark Stefik, Jaesik Choi, Timothy Miller, Simone Stumpf, and Guang-Zhong Yang. 2019. XAI—Explainable artificial intelligence. Science robotics 4, 37 (2019), eaay7120.

[19] Lijie Guo, Elizabeth M Daly, Oznur Alkan, Massimiliano Mattetti, Owen Cornec, and Bart Knijnenburg. 2022. Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules. In 27th International Conference on Intelligent User Interfaces. 537–548.

[20] Michael Haschke. 2014. Laboratory micro-X-ray fluorescence spectroscopy. Cham: Springer International Publishing 10 (2014), 978–983.

[21] C. M. Heirwegh, W. T. Elam, and L. P. O’Neil. 2022. The Focused Beam X-ray Fluorescence Elemental Quantification Software Package PIQUANT. Spectrochimica Acta Part B: Atomic Spectroscopy 196 (2022), 106520. https://doi.org/10.1016/j.sab.2022.106520

[22] Kori Inkpen, Stevie Chancellor, Munmun De Choudhury, Michael Veale, and Eric P. S. Baumer. 2019. Where is the Human? Bridging the Gap Between AI and HCI. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI EA ’19). Association for Computing Machinery, New York, NY, USA, 1–9. https://doi.org/10.1145/3290607.3299002

[23] Liu Jiang, Shixia Liu, and Changjian Chen. 2019. Recent research advances on interactive machine learning. Journal of Visualization 22, 2 (2019), 401–417.

[24] Thomas S Kuhn. 1970. The structure of scientific revolutions. Vol. 111. Chicago University of Chicago Press.

[25] Kateryna Kuksenok. 2016. Influence apart from adoption: How interaction between programming and scientific practices shapes modes of inquiry in four oceanography teams. Ph.D. Dissertation.

[26] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017).

[27] S. McMahon, T. Bosak, J. P. Grotzinger, R. E. Milliken, R. E. Summons, M. Daye, S. A. Newman, A. Fraeman, K. H. Williford, and D. E. G. Briggs. 2018. A Field Guide to Finding Fossils on Mars. Journal of Geophysical Research: Planets 123, 5 (2018), 1012–1040. https://doi.org/10.1029/2017JE005478

[28] NASA. 2021. PIXL Sol 140. https://bit.ly/34Mzpl6

[29] NASA. 2021. PIXL’s View of Dourbes. https://www.jpl.nasa.gov/images/pia25041-pixls-view-of-dourbes

[30] NASA. 2021. Two Perspectives of Séítah Rocks. https://www.jpl.nasa.gov/images/pia25023-two-perspectives-of-seitah-rocks

[31] Peter Nemere, Tom Barber, Adrian Galvin, Ryan Stonebraker, S. Michael Fedell, Austin P. Wright, David O. Flannery, Michael M. Tice, Yang Liu, W. Timothy Elam, Christopher M. Heirwegh, Abigail A. Allwood, Joel A. Hurowitz, Morgan Cable, and Scott Davidoff. 2021. PIXLISE Application. https://www.pixlise.org/

[32] Peter Nemere, Ryan Stonebraker, Adrian Galvin, Tom Barber, S. Michael Fedell, and Scott Davidoff. 2023. pixlise/pixlise-ui: Release 2.0.16. https://doi.org/10.5281/zenodo.7539750

[33] William Odom, John Zimmerman, Scott Davidoff, Jodi Forlizzi, Anind K. Dey, and Min Kyung Lee. 2012. A Fieldwork of the Future with User Enactments. In Proceedings of the Designing Interactive Systems Conference (Newcastle Upon Tyne, United Kingdom)  (DIS ’12). Association for Computing Machinery, New York, NY, USA, 338–347. https://doi.org/10.1145/2317956.2318008

[34] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 54, 2, Article 38 (Mar 2021), 38 pages. https://doi.org/10.1145/3439950

[35] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135–1144.

[36] R. Rieder, T. Economou, H. Wänke, A. Turkevich, J. Crisp, J. Brückner, G. Dreibus, and H. Y. McSween. 1997. The Chemical Composition of Martian Soil and Rocks Returned by the Mobile Alpha Proton X-ray Spectrometer: Preliminary Results from the X-ray Mode. Science 278, 5344 (1997), 1771–1774. https://doi.org/10.1126/science.278.5344.1771

[37] Steven W. Ruff and Jack D. Farmer. 2016. Silica deposits on Mars with features resembling hot spring biosignatures at El Tatio in Chile. Nature Communications 7, 1 (17 Nov 2016), 13554.

[38] Elizabeth B-N Sanders and Pieter Jan Stappers. 2008. Co-creation and the new landscapes of design. Co-design 4, 1 (2008), 5–18. https://doi.org/10.1080/15710880701875068

[39] David Schurman, Pooja Nair, Scott Davidoff, Adrian Galvin, Abigail Allwood, Yang Liu, David Flannery, Robert P. Hodyss, Santiago V. Lombeyda, Maggie Hendrie, Hillary Mushkin, and Christopher P. Heirwegh. 2019. PIXELATE: Novel visualization and computational methods for the analysis of astrobiological spectroscopy data. Proceedings of the 2019 Astrobiology Science Conference (AbSciCon).

[40] Student. 1908. The probable error of a mean. Biometrika (1908), 1–25.

[41] Michael M Tice, Joel A Hurowitz, Abigail C Allwood, Michael WM Jones, Brendan J Orenstein, Scott Davidoff, Austin P Wright, David AK Pedersen, Jesper Henneke, Nicholas J Tosca, et al. 2022. Alteration history of Séítah formation rocks inferred by PIXL x-ray fluorescence, x-ray diffraction, and multispectral imaging on Mars. Science Advances 8, 47 (2022), eabp9084.

[42] Scott J. VanBommel, Ralf Gellert, Jeff A. Berger, John L. Campbell, Lucy M. Thompson, Kenneth S. Edgett, Marie J. McBride, Michelle E. Minitti, Irina Pradler, and Nicholas I. Boyd. 2016. Deconvolution of distinct lithology chemistry through oversampling with the Mars Science Laboratory Alpha Particle X-Ray Spectrometer. X-Ray Spectrometry 45, 3 (2016), 155–161.

[43] Zijie J Wang, Alex Kale, Harsha Nori, Peter Stella, Mark E Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, and Rich Caruana. 2022. Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4132–4142.

[44] Dennis Wixon, Karen Holtzblatt, and Stephen Knox. 1990. Contextual design: an emergent view of system design. In CHI. 329–336.

[45] Austin P. Wright, Peter Nemere, Ryan Stonebraker, Adrian Galvin, and Scott Davidoff. 2022. pixlise/diffraction-peak-detection: 2.0 open source migration release. https://doi.org/10.5281/zenodo.6959138

[46] Qian Yang, Justin Cranshaw, Saleema Amershi, Shamsi T. Iqbal, and Jaime Teevan. 2019. Sketching NLP: A Case Study of Exploring the Right Things To Design with Language Intelligence. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300415

[47] Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020. ReExamining Whether, Why, and How Human-AI Interaction Is Uniquely Difficult to Design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA)  (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376301

[48] Constance Ye, Lukas Hermann, Shravya Bhat, Nur Yildirim, Dominik Moritz, and Scott Davidoff. 2021. PIXLISE-C: Exploring The Data Analysis Needs of NASA Scientists for Mineral Identification. In ACM Conference on Human Factors in computing systems Workshop on Human-Computer Interaction for Space Exploration (Honolulu, HI, USA) (SpaceCHI 2021). Association for Computing Machinery, New York, NY, USA, 5 pages.


Authors of Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework: Austin P. Wright, Peter Nemere, Adrian Galvin, Duen Horng Chau, and Scott Davidoff.