Approximating the Hotelling Observer with Autoencoder-Learned Efficient Channels for Binary Signal Detection Tasks

2020·Arxiv

Abstract

The objective assessment of image quality (IQ) has been advocated for the analysis and optimization of medical imaging systems. One method of obtaining such IQ metrics is through a mathematical observer. The Bayesian ideal observer is optimal by definition for signal detection tasks, but is frequently both intractable and non-linear. As an alternative, linear observers are sometimes used for task-based image quality assessment. The optimal linear observer is the Hotelling observer (HO). The computational cost of calculating the HO increases with image size, making a reduction in the dimensionality of the data desirable. Channelized methods have become popular for this purpose, and many competing methods are available for computing efficient channels. In this work, a novel method for learning channels using an autoencoder (AE) is presented. AEs are a type of artificial neural network (ANN) that are frequently employed to learn concise representations of data to reduce dimensionality. Modifying the traditional AE loss function to focus on task-relevant information permits the development of efficient AE-channels. These AE-channels were trained and tested on a variety of signal shapes and backgrounds to evaluate their performance. In the experiments, the AE-learned channels were competitive with and frequently outperformed other state-of-the-art methods for approximating the HO. The performance gains were greatest for the datasets with a small number of training images and noisy estimates of the signal image. Overall, AEs are demonstrated to be competitive with state-of-the-art methods for generating efficient channels for the HO and can have superior performance on small datasets.

Index Terms—Autoencoder, objective assessment of image quality, Hotelling observer, imaging system optimization, neural networks, numerical observers, representation learning

I. INTRODUCTION

Medical imaging systems are commonly optimized with consideration of a specific task [1]. Assessing performance of such systems requires an objective metric for image quality (IQ) [2]–[6]. For signal detection tasks, the Bayesian ideal observer (IO) has been advocated for producing a figure-of-merit for assessing IQ because it can maximize the amount of task-specific information in the measurement data [2]–[7]. For a binary signal detection task, the IO test statistic takes the form of a likelihood ratio. Using this likelihood ratio as a test statistic in turn maximizes the area under the receiver operating characteristics (ROC) curve [2]–[4], [7]. However, analytically determining the IO is generally difficult because it typically is a non-linear function and requires complete knowledge of the statistical properties of the image data.

Jason L. Granstedt is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, 61820 USA (e-mail: jasonlg@illinois.edu).

Weimin Zhou is with the Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, 63130 USA (e-mail: wzhou24@wustl.edu).

Mark A. Anastasio is with the Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, IL, 61820 USA (e-mail: maa@illinois.edu).

There has been recent progress in developing approximations for computing the IO test statistic [5], [6]. One line of research involves sampling-based methods that utilize Markov-chain Monte Carlo methods to approximate the IO, but the work in this area has so far been limited to relatively simple object models [3], [5], [8], [9]. Another recent development is the approximation of the IO with convolutional neural networks (CNNs) [10]. An alternative method for approximating the IO’s performance employs variational Bayesian inference [11]. This line of research has shown promise for implementing task-specific optimization of sparse reconstruction methods.

A common surrogate for the often-intractable IO is the Hotelling observer (HO) [12]–[15]. The HO implements the optimal linear test discriminant for maximizing the signal-to-noise ratio of the test statistic [16], [17]. Implementing the HO requires the estimation and inversion of a covariance matrix, which quickly grows as image size increases and can become intractable to compute [18]. There are a few different strategies for mitigating the computational cost of inverting a large matrix [12]. One method is to avoid a direct inversion by implementing an iterative approach to estimate the test statistic [2]. If the measurement noise covariance matrix is known and an estimate of the background covariance matrix is available, covariance matrix decomposition is a viable option [2], with the caveat that certain situations can lead to significant bias in the performance [19]. Alternatively, the test statistic can be learned directly from the images provided that there is a sufficient amount of data [10], [20]. The most commonly employed method, however, is the implementation of channels that approximate the HO [5], [21], [22]. These channels are linear transformations applied to reduce the dimensionality of the data, decreasing the computational costs of calculating the HO.

Conceptually, channels function by projecting the high-dimensional image data to a low-dimensional image manifold [23]. Image data frequently can be compressed into a reduced-dimensionality manifold [24], [25]. Ideally, the manifold embedding would preserve the important features of the data [26]. This projection operation is defined by the channel matrix. Channels are known as efficient if they approximate the original observer’s performance while reducing the dimensionality of the data [27]. Prior work in computing efficient channels includes Laguerre-Gauss (LG) [22], singular value decomposition (SVD) [28], partial least squares (PLS) [29], and the filtered channel observer (FCO) [30]. In addition to learning efficient channels, there are approaches that seek to mimic the human observer’s performance. An early approach learned the relationship between channel features and human observer performance with a support vector machine [31]. Another approach investigated optimization with respect to the HO and human observers on accelerated MRI reconstruction [32].

Autoencoders (AEs) are a type of artificial neural network (ANN) that are characterized by a mirror structure, with the target output of the network similar to the input [33]–[37]. They are designed to learn a lower-dimensional representation of the data called an embedding. The portion of the network that transforms the input to the embedding is known as the encoder, and the portion that transforms the embedding back into the original data space is known as the decoder. A good embedding is capable of significant data compression while retaining most of the information from the original data. The data compression qualities of AEs make them desirable to use in many tasks, and they have been applied in state-of-the-art systems for classification [38], noise reduction [39], and regression [40]. The widespread success of the AE is due to its ability to generate low-dimensional representations of images, which increases the efficiency of further processing by attenuating noise and reducing the data to its most important components. In general, a linear AE with optimal weights projects the data onto a subspace spanned by its top principal directions [41].

In this work, the problem of learning task-informed embeddings with an AE is explored. The AE is modified to learn the optimum transformation matrix that maximizes the amount of task-specific information encoded in its latent states. This learning task is demonstrated to be equivalent to learning efficient channels for the HO. To the best of our knowledge, this is the first time a connection has been established between numerical observer channels and autoencoder-generated embeddings. The considered model is a linear AE with one hidden layer and one set of tied weights, as described below. Numerical studies are performed with binary signal detection tasks that involve a range of signals and backgrounds. The performance of the AE-learned channels in these studies is compared to state-of-the-art channelized methods. The potential advantages and limitations of this new approach are also discussed.

The remainder of this work is organized as follows. In Sec. II an overview of binary signal detection theory is presented. The HO, CHO and AE are also reviewed in that section. A novel methodology for learning channels for the location-known binary signal detection task using an AE is developed in Sec. III. The numerical studies and results of the proposed method for approximating the HO are included in Secs. IV and V, along with a comparison to other state-of-the-art methods. Finally, the paper concludes with a discussion of the work in Sec. VI.

II. BACKGROUND

Consider the linear digital imaging system

$$\mathbf{g} = \mathcal{H} f(\mathbf{r}) + \mathbf{n}, \tag{1}$$

where $\mathbf{g}$ is the measured image data vector, $\mathcal{H}$ denotes a continuous-to-discrete (C-D) imaging operator that maps the object function to the image data vector, $f(\mathbf{r})$ is the object function with the 2-D spatial coordinate $\mathbf{r}$, and $\mathbf{n}$ is the random measurement noise. The object function $f(\mathbf{r})$ will be abbreviated as $f$ and can be either deterministic or stochastic, depending on the specification of the signal detection task.

A. Formulation of binary signal detection tasks

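The binary signal detection task considered involves the classification of an image by an observer into one of two hypotheses: signal-present ($H_1$) or signal-absent ($H_0$). The imaging processes under these two hypotheses can be described as

$$\begin{aligned} H_0 &: \mathbf{g} = \mathcal{H} f_b + \mathbf{n} = \mathbf{b} + \mathbf{n}, \\ H_1 &: \mathbf{g} = \mathcal{H}(f_b + f_s) + \mathbf{n} = \mathbf{b} + \mathbf{s} + \mathbf{n}, \end{aligned} \tag{2}$$

where $f_b$ and $f_s$ represent a background and a signal object, respectively, and $\mathbf{b} = \mathcal{H} f_b$ and $\mathbf{s} = \mathcal{H} f_s$ are the corresponding background and signal images. Depending on the imaging task, these can be either random or fixed.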

To perform a binary signal detection task, an observer computes a test statistic $t(\mathbf{g})$ that maps the measured image $\mathbf{g}$ to a real-valued scalar variable. This scalar is compared against a threshold $\tau$ to classify $\mathbf{g}$ as satisfying either $H_0$ or $H_1$. To characterize performance on the signal detection task, a receiver operating characteristic (ROC) curve can be plotted to depict the trade-off between the false-positive fraction (FPF) and the true-positive fraction (TPF) as the threshold $\tau$ is varied. The overall signal detection performance of the observer can be summarized by computing the area under the ROC curve (AUC) [42].
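As a concrete illustration of the thresholding procedure described above, the following minimal NumPy sketch sweeps a threshold over a set of test statistic values to obtain (FPF, TPF) operating points and computes an empirical AUC. The function names `roc_points` and `empirical_auc` are hypothetical, and the nonparametric (Mann-Whitney) AUC estimate is used here purely for simplicity; it is not the binormal fit employed in the numerical studies of this work.

```python
import numpy as np

def roc_points(t_absent, t_present):
    """Sweep the threshold over all observed statistics.

    Returns arrays (fpf, tpf) running from (0, 0) to (1, 1).
    """
    thresholds = np.sort(np.concatenate([t_absent, t_present]))[::-1]
    fpf = np.array([(t_absent > tau).mean() for tau in thresholds])
    tpf = np.array([(t_present > tau).mean() for tau in thresholds])
    # Append the operating point reached when the threshold is below all data.
    return np.append(fpf, 1.0), np.append(tpf, 1.0)

def empirical_auc(t_absent, t_present):
    """Mann-Whitney estimate: P(t_present > t_absent) + 0.5 * P(tie)."""
    diff = t_present[:, None] - t_absent[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

For well-separated statistics the empirical AUC approaches 1, while an observer that cannot distinguish the two hypotheses yields a value near 0.5.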

B. Bayesian Ideal Observer and Hotelling Observer

The IO is optimal and sets the upper limit for observer performance on binary signal detection tasks. The IO test statistic is defined as any monotonic transformation of the likelihood ratio that takes the form [2], [3], [43]

$$\Lambda(\mathbf{g}) = \frac{p(\mathbf{g} \mid H_1)}{p(\mathbf{g} \mid H_0)}, \tag{3}$$

where $p(\mathbf{g} \mid H_1)$ and $p(\mathbf{g} \mid H_0)$ are conditional probability density functions that describe the measured data $\mathbf{g}$ under hypothesis $H_1$ and $H_0$, respectively.

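An alternative to the IO for assessing signal detection performance is the HO. The HO test statistic is defined as

$$t_{HO}(\mathbf{g}) = \mathbf{w}_{HO}^T \mathbf{g}, \tag{4}$$

where $\mathbf{w}_{HO}$ is the observer template. Let $\bar{\mathbf{g}}(f)$ denote the conditional mean of the image data given an object function. Similarly, let $\bar{\mathbf{g}}_j$ denote the conditional mean averaged with respect to object randomness associated with $H_j$. The Hotelling template is defined as [2]

$$\mathbf{w}_{HO} = \left[\tfrac{1}{2}\left(\mathbf{K}_0 + \mathbf{K}_1\right)\right]^{-1} \Delta\bar{\mathbf{g}}, \tag{5}$$

where

$$\mathbf{K}_j = \left\langle \left\langle [\mathbf{g} - \bar{\mathbf{g}}_j][\mathbf{g} - \bar{\mathbf{g}}_j]^T \right\rangle_{\mathbf{g} \mid f} \right\rangle_{f \mid H_j}, \tag{6}$$

$$\Delta\bar{\mathbf{g}} = \bar{\mathbf{g}}_1 - \bar{\mathbf{g}}_0. \tag{7}$$

Here, $\mathbf{K}_j$ is the covariance matrix of the measured data $\mathbf{g}$ under the hypothesis $H_j$ and $\Delta\bar{\mathbf{g}}$ is the difference between the mean of the measured data $\mathbf{g}$ under the two hypotheses.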

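The signal-to-noise ratio (SNR) associated with the test statistic $t$ is another commonly employed figure of merit (FOM) for assessing signal detection performance and is given by [17]

$$\mathrm{SNR}_t = \sqrt{\frac{\left(\langle t \rangle_1 - \langle t \rangle_0\right)^2}{\tfrac{1}{2}\sigma_0^2 + \tfrac{1}{2}\sigma_1^2}}, \tag{8}$$

where $\langle t \rangle_j$ and $\sigma_j^2$ are the mean and variance of $t$ under the hypothesis $H_j$ ($j = 0, 1$). While the IO maximizes the AUC of an observer, the HO maximizes the SNR of the test statistic $t(\mathbf{g})$ [17].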
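To make the Hotelling template and the SNR of its test statistic concrete, the sketch below estimates both from sample images. It is a minimal illustration under stated assumptions: the images are stored as rows of NumPy arrays, the covariance matrices are replaced by sample estimates, and the function names `hotelling_template` and `snr` are hypothetical. A linear solve is used in place of an explicit matrix inverse, which is a numerical convenience rather than part of the definition.

```python
import numpy as np

def hotelling_template(g0, g1):
    """Estimate the Hotelling template from sample images.

    g0, g1: arrays of shape (num_images, n) holding signal-absent and
    signal-present images as rows.
    """
    delta_g = g1.mean(axis=0) - g0.mean(axis=0)        # difference of class means
    K = 0.5 * (np.cov(g0, rowvar=False) + np.cov(g1, rowvar=False))
    return np.linalg.solve(K, delta_g)                 # K^{-1} delta_g without forming K^{-1}

def snr(t0, t1):
    """SNR of a scalar test statistic given its values under each hypothesis."""
    pooled_var = 0.5 * t0.var(ddof=1) + 0.5 * t1.var(ddof=1)
    return abs(t1.mean() - t0.mean()) / np.sqrt(pooled_var)
```

The test statistic for a set of images `g` is then `g @ w`, where `w` is the returned template. The sample covariance is only invertible when the number of images comfortably exceeds the number of pixels, which is the practical limitation that motivates the channelized approach.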

C. Channels

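Computation of the HO can become intractable for large image sizes due to the cost of inverting the covariance matrix in Eqn. (5). Additionally, it may be difficult to estimate a full rank covariance matrix in limited-data cases. To mitigate this problem, a channelized version of the image $\mathbf{g}$ can be introduced as [17]

$$\mathbf{v} = \mathbf{T}\mathbf{g}, \tag{9}$$

where $\mathbf{v}$ is a channel-reduced image and $\mathbf{T}$ is an $m \times n$ matrix. The number of channels, $m$, determines the dimensionality reduction from the original data of size $n$. Applying the HO to the channel-reduced data yields the channelized HO (CHO) [17], with the test statistic taking the form of

$$t_{CHO}(\mathbf{v}) = \left[\mathbf{K}_v^{-1} \Delta\bar{\mathbf{v}}\right]^T \mathbf{v}. \tag{10}$$

Here, $\mathbf{K}_v = \tfrac{1}{2}[\mathbf{K}_{v,0} + \mathbf{K}_{v,1}]$ and $\Delta\bar{\mathbf{v}} = \bar{\mathbf{v}}_1 - \bar{\mathbf{v}}_0$, where $\mathbf{K}_{v,j} = \langle [\mathbf{v} - \bar{\mathbf{v}}_j][\mathbf{v} - \bar{\mathbf{v}}_j]^T \rangle_j$ and $\bar{\mathbf{v}}_j = \langle \mathbf{v} \rangle_j$ for $j = (0, 1)$.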
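The channelized computation can be sketched in a few lines. The example below assumes a channel matrix `T` of shape (m, n) is already available and that images are stored as rows; the function names `cho_template` and `cho_statistic` are hypothetical, and sample means and covariances stand in for the ensemble quantities.

```python
import numpy as np

def cho_template(T, g0, g1):
    """Estimate the channelized Hotelling template.

    T: (m, n) channel matrix; g0, g1: (num_images, n) image sets.
    """
    v0, v1 = g0 @ T.T, g1 @ T.T                        # v = T g applied to every row
    delta_v = v1.mean(axis=0) - v0.mean(axis=0)
    K_v = 0.5 * (np.atleast_2d(np.cov(v0, rowvar=False))
                 + np.atleast_2d(np.cov(v1, rowvar=False)))
    return np.linalg.solve(K_v, delta_v)               # only an m x m system is solved

def cho_statistic(T, w_v, g):
    """Apply the channels and the channelized template to images g."""
    return (g @ T.T) @ w_v
```

Because only an m × m system is solved, the cost of the inversion no longer depends on the number of pixels once the channels have been applied.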

It is desirable to minimize the number of channels to maximize computational efficiency, since the dimensionality of the channelized covariance matrix $\mathbf{K}_v$ grows with the number of channels. However, these channels should maximize the retained task-relevant information to provide an efficient approximation of the HO. Several methods exist for selecting efficient channels. One of the first was LG channels [27]. These channels are a combination of a Gaussian function with a Laguerre polynomial and were proposed due to their structural similarity with the Hotelling template for certain detection tasks. These channels are suitable for a smooth rotationally symmetric signal on a lumpy background, but may have suboptimal performance for arbitrary signals and more complex backgrounds [29].
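For readers unfamiliar with LG channels, the sketch below builds a small LG channel matrix. It assumes the commonly used form $u_p(r \mid a_u) = \frac{\sqrt{2}}{a_u} \exp(-\pi r^2 / a_u^2)\, L_p(2\pi r^2 / a_u^2)$, where $L_p$ is the Laguerre polynomial of order $p$; this expression is not given in the text above and is stated here as an assumption. The function name `lg_channels`, the width parameter `a_u`, and the choice of centering the channels on the image are likewise illustrative.

```python
import numpy as np
from numpy.polynomial.laguerre import lagval

def lg_channels(size, num_channels, a_u):
    """Return a (num_channels, size*size) matrix of Laguerre-Gauss channels."""
    yy, xx = np.mgrid[0:size, 0:size]
    c = (size - 1) / 2.0                               # channels centered on the image
    r2 = (xx - c) ** 2 + (yy - c) ** 2
    rows = []
    for p in range(num_channels):
        coeffs = np.zeros(p + 1)
        coeffs[p] = 1.0                                # selects the order-p Laguerre polynomial
        u = (np.sqrt(2.0) / a_u) * np.exp(-np.pi * r2 / a_u**2) \
            * lagval(2.0 * np.pi * r2 / a_u**2, coeffs)
        rows.append(u.ravel())
    return np.stack(rows)
```

Each row of the returned matrix is one vectorized channel, so the matrix can be used directly as a channel matrix applied to vectorized images.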

An alternative to LG channels are SVD channels [28]. These channels are singular vectors that form a basis for image vectors in the range of the imaging operator. The most efficient set of channels constructed from this method involved decomposing the noiseless signal image by use of the singular vectors and choosing the top m of them to form the channel set. However, this method is computationally expensive and system-specific.

Two current state-of-the-art methods for generating efficient channels that work on arbitrary signals and backgrounds without any specific knowledge of the imaging system are partial least squares (PLS) [29] and filtered channel observer (FCO) [30]. PLS applies a data reduction technique that iteratively constructs a number of latent vectors that maximize the covariance between the data and the true image labels. PLS represents an attractive method to use in limited-data cases and/or large image sizes and works well with noisy and heavily correlated data. However, the technique suffers a notable degradation of performance when the amount of available image data is small [29].

FCO channels were initially developed as anthropomorphic channels to approximate human signal detection performance for irregularly-shaped signals [30]. However, FCO channels have been explored as efficient channels for the HO [30], [44]. The FCO convolves a selected set of baseline channels with the signal before computing the observer template. For this work, LG channels were selected as the baseline set of channels due to both LG’s past success [27] and similar decisions with the FCO method in more recent work [44]. This realization of the FCO method will be referred to as convolutional LG.

D. Neural Networks for Approximating the IO

A feed-forward ANN is a system of computational units associated with tunable parameters called weights [45], [46]. A feed-forward ANN is capable of approximating any continuous function if it has a sufficiently complex architecture [47], [48]. ANNs have been employed to form numerical observers, with the focus on directly estimating the test statistic [10], [43], [49]. Kupinski et al. [43] utilized conventional fully connected neural networks (FCNNs) to approximate the IO on low-dimensional extracted image features. Zhou and Anastasio extended this work to higher-dimensional data and allowed for native processing of image data by replacing the FCNN with a convolutional neural network [10], [49]. However, both of these approaches focus on learning the test statistic directly and may require a large amount of training data to accurately approximate the IO.

E. Autoencoders

A specialized type of ANN is the autoencoder (AE) [33]–[37]. The AE is characterized by a mirror structure, with the input of the network similar to the target output. An AE has three distinct components: an encoder, an embedding, and a decoder. The encoder transforms the input to the embedding, which generally has a significantly reduced dimensionality compared to the input. The decoder transforms the embedding into the target output. In a canonical AE, the decoder is specified to reconstruct an approximation of the input to the encoder. AEs are frequently employed for their data compression properties in state-of-the-art systems for classification [38], regression [40], noise reduction [39], anomaly detection [50], and image recovery [51] tasks. Additional performance improvements can be made by injecting additional information into the AE training process. Studies have shown that exploiting a priori information through implicitly defined nonparametric functions can introduce task-specific information in the training of AEs [52], [53].

In contrast to previous work with ANNs, an AE is usually trained in an unsupervised way [38]. One aspect of AEs that has recently been considered is the concept of tied weights [54]. Tied weights further enforce the mirror-like structure of the AE by forcing the encoder and decoder matrices to be transposes of one another. Tied-weight AEs have been shown to perform similarly to untied-weight AEs, but require less data to train because of the reduction in parameters.

In general, the layers in an AE specify many sets of matrix multiplications with added bias terms and nonlinear transformations. By restricting the operations to only matrix multiplications, linear AEs can be obtained. In these cases, the encoder and decoder can each be described by a transformation matrix that transforms to or from the data embedding. Such a simplified network is considered in this work since this configuration’s encoder has a natural parallel with the channel matrix in the CHO. The input to the network is a noisy image and the target output is either the input image or a related version of the input image, depending on the task.

An optimization problem is solved to determine the weights of the AE by minimizing a reconstruction loss. The solution of the optimization problem is computed by minimizing the loss function using a variation of the backpropagation algorithm [55]. The traditional loss function for an AE is the mean squared error between the input and the output of the network. Given $N$ vectorized background images $\mathbf{g}_i$ of size $n \times 1$, the traditional loss function corresponding to a zero-bias linear AE is [33]

$$L(\mathbf{W}_e, \mathbf{W}_d) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{W}_d \mathbf{W}_e \mathbf{g}_i - \tilde{\mathbf{g}}_i \right\|_2^2, \tag{11}$$

where $\mathbf{W}_e$ and $\mathbf{W}_d$ are the weight matrices that parameterize the encoder and decoder of the AE, respectively. The target reconstruction is represented by $\tilde{\mathbf{g}}_i$, which can be the same as or different from the input data but is usually closely related. For example, in denoising problems the target output is a clean version of the input image.
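The traditional linear AE loss can be written compactly in NumPy. The sketch below is illustrative only: images are assumed to be stored as rows of `G`, and the names `linear_ae_loss` and `pca_tied_weights` are hypothetical. The second helper builds tied weights from the top right-singular vectors of the (uncentered) data, which gives a quick numerical check of the link between linear AEs and principal directions.

```python
import numpy as np

def linear_ae_loss(W_e, W_d, G, G_target):
    """Mean squared reconstruction error of a zero-bias linear AE.

    W_e: (m, n) encoder, W_d: (n, m) decoder,
    G, G_target: (N, n) inputs and reconstruction targets stored as rows.
    """
    recon = G @ W_e.T @ W_d.T                          # row-wise W_d W_e g_i
    return np.mean(np.sum((recon - G_target) ** 2, axis=1))

def pca_tied_weights(G, m):
    """Encoder built from the top-m right-singular vectors of G (decoder is its transpose)."""
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:m]
```

With `W = pca_tied_weights(G, m)`, the value `linear_ae_loss(W, W.T, G, G)` equals the sum of the discarded squared singular values of `G` divided by the number of images.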

III. METHOD - AUTOENCODER-LEARNED CHANNELS

A method for learning efficient channels for the CHO with an AE is described below. The AE weights are related to the CHO framework to show how the learned data embeddings correspond to more traditional channels.

A. Autoencoder Channels and Linear Autoencoders

The learned weights of an AE have an additional interpretation when considered in the framework of a signal detection task. The weights define a mapping from the high-dimensional image space to a low-dimensional embedding space. This is conceptually equivalent to the CHO channel matrix $\mathbf{T}$. The AE weights can be employed as channels for the CHO by setting $\mathbf{T}$ equal to the encoder weight matrix $\mathbf{W}_e$ in Eqn. (9). Intuitively, these AE-learned channels capture the data most important for reconstructing the image.

The loss function in Eqn. (11) causes the AE to encode the entirety of the input image. This makes the traditional AE suboptimal for learning channels because a significant portion of the data embedding is dedicated to reconstructing certain components of the background and noise that may not be highly relevant to the detection task. To circumvent this, as described below, information about the signal can be incorporated into the AE training process to preserve task-specific information.

B. Task-Specific Autoencoders

A novel modification to the loss function to improve the learned data embedding and resulting signal detection performance for AE-channels is presented here. Ideally, the entirety of the AE embedding would be dedicated to the task-specific information. This would minimize the proportion of the embedding that is dedicated to extraneous information and lead to a more efficient set of channels. By changing the AE’s target reconstruction to just the mean signal image, the background and noise are suppressed during the reconstruction process. This results in an embedding in which the signal can be accurately represented. This new approach minimizes the MSE between the reconstructed image and the estimated signal image and takes the form of

$$L_{task}(\mathbf{W}_e, \mathbf{W}_d) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{W}_d \mathbf{W}_e \mathbf{g}_i - \mathbb{1}_{H_1}(\mathbf{g}_i)\, \Delta\bar{\mathbf{g}} \right\|_2^2, \tag{12}$$

where $\Delta\bar{\mathbf{g}}$ is defined in Eqn. (7) and $\mathbb{1}_{H_1}(\cdot)$ is the indicator function that returns 1 if the signal is present and 0 otherwise. Note that this loss function uses label information, and thus is a supervised learning algorithm. Considering the background as noise permits the entire capacity of the embedding to focus on the task-specific information. Using the signal template as the target image assists the training process in identifying an embedding that preserves task-specific information. The indicator function and alteration to the desired output also break the traditional AE’s connection to principal directions [41]. As shown below, this modification to the loss function is capable of generating efficient channels for the CHO. A diagram of the AE with both the traditional and task-based approach for the signal detection task is provided in Fig. 1, with a sample reconstruction from AEs trained using both loss functions shown in Fig. 2. Both the task-specific and traditional loss functions can be minimized by use of a gradient-descent method, with specific implementation details provided in Sec. IV-D.
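A minimal NumPy sketch of minimizing the task-specific loss for a tied-weight linear AE is given below. It is illustrative rather than a reproduction of the training setup used in this work: plain full-batch gradient descent is substituted for Adam, the gradient is derived by hand for the tied case (encoder `W`, decoder `W.T`), and the function names, step size, iteration count, and initialization scale are assumptions chosen for a small toy problem.

```python
import numpy as np

def task_loss_and_grad(W, G, y, delta_g):
    """Task-specific loss and its gradient for tied weights.

    W: (m, n) encoder (decoder is W.T); G: (N, n) images as rows;
    y: (N,) labels, 1 for signal-present and 0 for signal-absent;
    delta_g: (n,) estimated signal image.
    """
    target = y[:, None] * delta_g[None, :]             # indicator times estimated signal
    R = G @ W.T @ W - target                           # reconstruction residuals
    loss = np.mean(np.sum(R ** 2, axis=1))
    M = (2.0 / len(G)) * (G.T @ R)                     # derivative w.r.t. W^T W
    return loss, W @ (M + M.T)

def train_channels(G, y, m, lr=5e-3, steps=300, seed=0):
    """Full-batch gradient descent (assumed settings); returns channels and loss history."""
    rng = np.random.default_rng(seed)
    delta_g = G[y == 1].mean(axis=0) - G[y == 0].mean(axis=0)
    W = 1e-2 * rng.standard_normal((m, G.shape[1]))    # small random initialization
    losses = []
    for _ in range(steps):
        loss, grad = task_loss_and_grad(W, G, y, delta_g)
        losses.append(loss)
        W -= lr * grad
    return W, np.array(losses)
```

The returned `W` can be used as a channel matrix for the CHO. The step size must be matched to the scale of the image data; the value above is suited only to toy data with unit-variance noise.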

IV. NUMERICAL STUDIES

Numerical simulation studies were conducted to evaluate the performance of the proposed method for learning efficient channels for the CHO. All simulations addressed background-known-statistically (BKS) signal detection tasks. Four distinct binary signal detection tasks were considered. Using a lumpy background, a location-known task and a signal-known-statistically (SKS) task were considered. These tasks enabled the HO to be determined both using covariance matrix decomposition [2] and direct computation according to Eqn. (5). These observers will be referred to as HO-CMD and HO-Direct, respectively. On a breast phantom background, two location-known signal detection tasks using signals of different shapes and sizes were considered. These tasks allowed for the evaluation of channelized methods on a more realistic medical imaging task. ROC curves were fit by use of a binormal model [42], [56], [57] with the fitted AUC values reported. The experimental results are reported in distinct sections based on the image background model, and the details for each signal detection task and the training of the neural networks are given in the appropriate subsections.

Fig. 1: Diagram of the proposed method. A noisy image is the input to an encoder, which is mapped to an m-dimensional latent space to encode the information. The encoder transformation matrix is given by $\mathbf{W}_e$. The embedded representation is then multiplied by the decoder transformation matrix, $\mathbf{W}_d$, to return to image space and generate the output image. Two different loss functions are considered for training this model. The traditional loss function computes the MSE between the output image and the input image. This approach attempts to reconstruct the entire input image. The second considered loss function is the task-specific loss, which calculates the difference between the output image and estimated signal image $\Delta\bar{\mathbf{g}}$. This loss maximizes the signal-specific information of the input image.

Fig. 2: Reconstructed images corresponding to each of the loss functions. The grayscale in each case is adjusted to maximize visibility. (a) is the input image, which contains the faint Gaussian elliptical signal in (b). The traditional AE with 20 channels reconstructs the image in (c) while the task-specific tied-weight AE with 10 channels reconstructs the image in (d). Note that the reconstructed image in (c) is noticeably less noisy than the input image (a), which it is attempting to reconstruct. This is due to the limited number of latent states in the model embedding the largest structures in the input images. Noise cannot be effectively encoded for an image, so it is attenuated.

Fig. 3: Sample generation for signal-present images used in the lumpy background experiments. The grayscale in each case was adjusted to maximize visibility. The signal image (a) was added to the lumpy background (b) and the Gaussian noise (c) to produce the composite dataset image (d). The signal image is an elliptical Gaussian signal with width

A. Signal detection tasks that utilize a lumpy background model

Two different signal detection tasks were performed on a lumpy background model [58] with an idealized parallel-hole collimator system [3]. Further details about each of the components are provided below.

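1) Lumpy Background: A stochastic lumpy object model was used as the background [58]

$$f_b(\mathbf{r}) = \sum_{n=1}^{N_b} l(\mathbf{r} - \mathbf{r}_n \mid a, s), \tag{13}$$

where $N_b$ is the number of lumps, which is sampled from a Poisson distribution with the mean set to 5, and $l(\mathbf{r} - \mathbf{r}_n \mid a, s)$ is the lumpy function modeled by a symmetric 2D Gaussian function with amplitude $a$ and width $s$

$$l(\mathbf{r} - \mathbf{r}_n \mid a, s) = a \exp\left(-\frac{(\mathbf{r} - \mathbf{r}_n)^T(\mathbf{r} - \mathbf{r}_n)}{2s^2}\right). \tag{14}$$

Here, $\mathbf{r}_n$ is the uniformly-sampled position of the $n$-th lump. The magnitude and width of the lumps were set to the frequently-employed values of a = 1 and s = 7. An example of a signal-present image in the dataset with a circular signal is located in Fig. 3.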
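The lumpy object model described above can be sampled with a short NumPy routine. The sketch below is a simplified illustration: it evaluates the object directly on a pixel grid and omits the imaging operator and measurement noise, it assumes lump centers are drawn uniformly over the field of view, and it assumes a Gaussian lump profile with a $2s^2$ term in the exponent. The function name `lumpy_background` and the `size` argument are hypothetical.

```python
import numpy as np

def lumpy_background(size, mean_lumps=5, a=1.0, s=7.0, rng=None):
    """Sample one lumpy background on a size x size grid."""
    rng = np.random.default_rng() if rng is None else rng
    yy, xx = np.mgrid[0:size, 0:size]
    image = np.zeros((size, size))
    for _ in range(rng.poisson(mean_lumps)):           # Poisson-distributed lump count
        cx, cy = rng.uniform(0, size, 2)               # uniformly sampled lump center
        image += a * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * s**2))
    return image
```

Passing a seeded generator, for example `lumpy_background(64, rng=np.random.default_rng(0))`, makes the sampled backgrounds reproducible.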

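2) Imaging system: The stylized imaging system in these studies was a linear C-D mapping describing an idealized parallel-hole collimator system with a point response function [3], [59]

$$h_k(\mathbf{r}) = \frac{h}{2\pi w^2} \exp\left(-\frac{(\mathbf{r} - \mathbf{r}_k)^T(\mathbf{r} - \mathbf{r}_k)}{2w^2}\right), \tag{15}$$

where $\mathbf{r}_k$ is the location of the $k$-th measurement, with the height h = 40 and the width w = 0.5.

3) Signals: The signal function was a 2D Gaussian function

$$f_s(\mathbf{r}) = A \exp\left(-[\mathbf{R}_\theta(\mathbf{r} - \mathbf{r}_c)]^T \mathbf{D}^{-1} [\mathbf{R}_\theta(\mathbf{r} - \mathbf{r}_c)]\right), \tag{16}$$

where A = 0.2 is the amplitude and $\mathbf{r}_c$ is the coordinate of the signal location. Here, $\mathbf{R}_\theta$ is the Euclidean rotation matrix that rotates the Gaussian by an angle of $\theta$ and is given by

$$\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \tag{17}$$

and $\mathbf{D}$ is a scaling matrix that controls the width of the Gaussian along each axis and is given by

$$\mathbf{D} = \begin{bmatrix} 2\sigma_1^2 & 0 \\ 0 & 2\sigma_2^2 \end{bmatrix}. \tag{18}$$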

For both experiments involving the lumpy background, the elliptical Gaussian signal was set to have the parameters 5 and . The image size was selected to be with the signal centered at . The value of varied depending on the type of task.
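A rotated elliptical Gaussian signal of the kind described above can be generated as follows. This is a sketch under assumptions: the signal is taken to be $A \exp(-\mathbf{u}^T \mathbf{D}^{-1} \mathbf{u})$ with $\mathbf{u}$ the rotated offset from the signal center and $\mathbf{D} = \mathrm{diag}(2\sigma_1^2, 2\sigma_2^2)$, and the function name, argument names, and any width values passed to it are hypothetical rather than the settings used in this study.

```python
import numpy as np

def elliptical_gaussian_signal(size, center, amplitude, sigma1, sigma2, theta):
    """Evaluate a rotated elliptical Gaussian on a size x size pixel grid."""
    yy, xx = np.mgrid[0:size, 0:size]
    dx, dy = xx - center[0], yy - center[1]
    # Rotate the offset from the signal center by the angle theta.
    u = np.cos(theta) * dx - np.sin(theta) * dy
    v = np.sin(theta) * dx + np.cos(theta) * dy
    # Apply the inverse scaling matrix, diag(2*sigma1^2, 2*sigma2^2)^{-1}.
    return amplitude * np.exp(-(u**2 / (2.0 * sigma1**2) + v**2 / (2.0 * sigma2**2)))
```

Rotating by a quarter turn is equivalent to exchanging the two widths, which provides a simple sanity check on the orientation convention.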

4) Detection Tasks: The first signal detection task employed , forcing the signal to take the same orientation in each image. Thus, the signal location and shape were fixed. The signal template was computed according to Eqn. (7), which resulted in a noisy estimate of the signal. The second signal detection task sampled uniformly from the set . This allowed for four distinct orientations of the elliptical Gaussian. The mean signal was also computed with Eqn. (7), which resulted in a noisy estimate of the signal averaged across the four possible realizations.

5) Dataset Generation: A training set of 60 000 unique background images with noise was generated for the lumpy object model. The background images were generated separately from the signal image in Eqn. (16) using the appropriate background model. Each background image was summed with a unique noise vector drawn from an i.i.d. Gaussian distribution with a mean of 0 and standard deviation . These images were then paired, with half designated for signal present and half for signal absent. Each signal-present image was summed with the signal image to generate the final training data set of 30 000 paired images. Another set of 5000 paired images was generated for determining the channel covariance matrix after the channels had been learned, and a further set of 5000 paired images was held out as a testing dataset.

Fig. 4: Sample estimated signals and images from the VICTRE breast phantom dataset. The grayscale was adjusted in each case to maximize visibility. (a) and (c) contain the mean signal image of the spiculated mass and microcalcification cluster, respectively. (b) and (d) are sample images of those corresponding signals embedded into a fatty breast phantom, which is the easiest detection class in the dataset.

B. Location-known tasks that utilize a breast phantom dataset

Two further signal detection tasks were performed on a breast phantom background employing the VICTRE dataset [44]. This dataset contains simulated digital mammography (DM) images and was employed previously in a location-known human observer study to evaluate imaging systems [44]. The images are divided into four categories of breast types of decreasing difficulty for lesion detection: extremely dense, heterogeneously dense, scattered fibroglandular, and fatty. The signals in the dataset are microcalcification clusters and spiculated masses. For each signal, there are associated signal-absent and signal-present images. The signal remains constant in location and shape throughout all the signal-present images, but a clean signal image is not available. An estimation of the signal is obtained from the difference of the mean signal-present and signal-absent images according to Eqn. (7), making this a location-known task [44].

For each type of signal, 12500 total images were selected from the dataset to form training, validation, and testing sets of 5000, 625, and 625 paired images, respectively. The breast types selected maintained the proportions of the VICTRE study [44]. The signals were estimated by taking the mean of the signal-present images and subtracting the mean of the signal-absent images for the combined training and validation dataset. Sample images and estimated signals are included in Fig. 4.

C. AE Topology

The considered network topology was a tied-weight AE with no nonlinearities or bias terms. This structure parallels the CHO formulation in Eqn. (9), as the AE is learning the transformation matrix $\mathbf{T}$. Tied weights were chosen because they couple the encoder and the decoder by enforcing $\mathbf{W}_e = \mathbf{W}_d^T$, making the encoder a transpose of the decoder. This formulation prevents loss of information that may solely exist in the decoder since only the encoder is employed as the transformation matrix. Additionally, tied-weight AEs have fewer parameters to train and thus perform better in the limited-data experiments considered [60].

Fig. 5: Performance of the CHO on varying training dataset sizes for the lumpy background model. (a) contains the results for the location-known elliptical signal while (b) contains the SKS elliptical signal results. The error bars correspond to the standard deviation of the fitted AUC values. The HO-CMD is provided as an estimate of the upper bound of the HO, given an infinite number of images, and is included to benchmark the efficiency of the channels for all methods.

D. Experimental Parameters

Fig. 6: Performance of the CHO on varying training dataset sizes for the VICTRE breast phantom model. (a) contains the results for the spiculated mass signal while (b) contains the microcalcification cluster results. The error bars correspond to the standard deviation of the fit AUC values. For both signals, the HO-Direct diverged for most of the dataset sizes due to insufficient data and the more complex background. The analytic HO estimate is not included as the clean images are not available to generate an estimate. The matched filter is also omitted in (b) since it had an AUC of 0.53, significantly less than the channelized methods.

1) Training Details: AE-channels were determined by minimizing the modified autoencoder loss function in Eqn. (12). The models were trained in TensorFlow [61] using the Adam algorithm [55]. The AE weights were initialized using a truncated normal initializer with a standard deviation of 5e-6. The models were trained for 500 epochs. When the considered dataset contained more than 500 images, pre-training the models on a subset of 500 images for 500 epochs to burn in the network sometimes improved performance. A mini-batch size of 250 was employed, with an equal number of signal-present and signal-absent images in each mini-batch. The learning rate was set to 5e-3 for the VICTRE phantom background study and 1e-5 for the lumpy background study. All networks were trained on a single NVIDIA TITAN X GPU.
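Two of the training details above, the truncated normal initialization and the class-balanced mini-batches, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the training code used in the study: the helper names `truncated_normal` and `balanced_batches` are hypothetical, the truncation at two standard deviations follows the documented behavior of TensorFlow's truncated normal initializer, and the image arrays are random stand-ins.

```python
import numpy as np

def truncated_normal(rng, shape, stddev):
    """Draw N(0, stddev^2) samples, redrawing any beyond two standard deviations."""
    w = rng.normal(scale=stddev, size=shape)
    bad = np.abs(w) > 2 * stddev
    while bad.any():
        w[bad] = rng.normal(scale=stddev, size=int(bad.sum()))
        bad = np.abs(w) > 2 * stddev
    return w

def balanced_batches(rng, present, absent, batch_size=250):
    """Yield mini-batches holding batch_size // 2 images from each class."""
    half = batch_size // 2
    n = min(len(present), len(absent))
    idx_p = rng.permutation(len(present))[:n]   # shuffle each class separately
    idx_a = rng.permutation(len(absent))[:n]
    for start in range(0, n - half + 1, half):
        images = np.concatenate([present[idx_p[start:start + half]],
                                 absent[idx_a[start:start + half]]])
        labels = np.concatenate([np.ones(half), np.zeros(half)])
        yield images, labels

rng = np.random.default_rng(0)
W = truncated_normal(rng, (20, 256), stddev=5e-6)   # 20 channels, 256 pixels (illustrative)
present = rng.normal(size=(500, 256)) + 1.0         # stand-in signal-present images
absent = rng.normal(size=(500, 256))                # stand-in signal-absent images
batches = list(balanced_batches(rng, present, absent))
```

Any incomplete final batch is dropped in this sketch so that every mini-batch keeps the equal class split.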

Several reference methods were implemented to compare against the AE-learned channels, including convolutional LG [30], partial least squares [29], and the matched filter. The HO-Direct [17] was also computed on each subset using Eqn. (5). A grid search on the entire training dataset for each background was used to select the parameters for all methods, with the number of channels capped at 20. This grid search also implicitly provided multiple random initializations for the AE.

Fig. 7: Location-known elliptical channels for the 30 000 paired image dataset on the lumpy background. The grayscale is constant and fixed. Note that some channels have redundant functionality, such as 1 and 6. In this case, removing channel 6 only results in a loss of 0.0005 AUC.

2) Evaluation: Each model was trained across a range of restricted-size subsets of the training data. The VICTRE case detailed in Sec. IV-B contained subsets of size K = 250, 500, 1000, 2000, and 5000 image pairs. The larger lumpy background experiments detailed in Sec. IV-A also considered sets of 10 000, 15 000, 20 000, 25 000, and 30 000 image pairs.

The standard train-validate-test scheme [62] was employed to evaluate performance. The AE and competing methods were given the training data and signal estimate to operate on, with the performance evaluated on the validation data to select the best set of parameters. Once the parameters were determined for each method, the CHO was numerically determined according to Eqn. (10) using the combined training subset and validation dataset to compute . The final models were then evaluated on the testing set to obtain the AUC values.
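The final evaluation step can be illustrated with a small NumPy sketch. It assumes the standard CHO form, in which the template is the inverse channelized covariance applied to the mean channel-output difference; Eqn. (10) itself is not reproduced here. The reported AUC values are fit values, whereas this sketch substitutes the simpler empirical Wilcoxon-Mann-Whitney estimate. The function names, the one-dimensional Gaussian signal, the white-noise background, and the random channels are all hypothetical stand-ins.

```python
import numpy as np

def cho_template(T, present, absent):
    """Estimate the CHO template in channel space from labeled images."""
    v1, v0 = present @ T.T, absent @ T.T        # channel outputs, (n, n_channels)
    dv = v1.mean(axis=0) - v0.mean(axis=0)      # mean channel-output difference
    K = 0.5 * (np.cov(v1, rowvar=False) + np.cov(v0, rowvar=False))
    return np.linalg.solve(K, dv)               # solves K w = dv

def cho_statistic(T, w, images):
    """Scalar test statistic for each image."""
    return (images @ T.T) @ w

def empirical_auc(t_present, t_absent):
    """Wilcoxon-Mann-Whitney AUC estimate; ties count one half."""
    diff = t_present[:, None] - t_absent[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

rng = np.random.default_rng(0)
n_pixels = 64
x = np.arange(n_pixels)
signal = np.exp(-((x - 32) ** 2) / (2 * 3.0 ** 2))       # hypothetical signal
T = np.vstack([signal, rng.normal(size=(2, n_pixels))])  # 3 stand-in channels

def draw(n, with_signal):
    """White-noise stand-in images, optionally with the signal added."""
    noise = rng.normal(size=(n, n_pixels))
    return noise + signal if with_signal else noise

# Template from "training plus validation" data, AUC from held-out test data.
w = cho_template(T, draw(1000, True), draw(1000, False))
auc = empirical_auc(cho_statistic(T, w, draw(1000, True)),
                    cho_statistic(T, w, draw(1000, False)))
```

Since the covariance is estimated in the low-dimensional channel space, the linear solve involves only an n_channels by n_channels matrix, which is the practical appeal of channelization.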

The HO-CMD was also computed for the experiments on lumpy backgrounds to analyze the efficiency of the channels for each method. The empirical background covariance matrix was calculated using the combined training and validation datasets for a total of 70 000 noiseless background images. This method was unavailable for estimating the HO of the VICTRE experiments as noiseless images were not available.

V. RESULTS

The results for the limited-image tests for the lumpy model and VICTRE breast phantom model are provided in Figs. 5 and 6. The traditional AE was also tested, but failed to exceed an AUC of 0.55 in any of the four experiments. Overall, the proposed method was competitive with the state-of-the-art channelized methods for both the lumpy background and VICTRE phantom background cases. For the lumpy background cases, the AE-channels performed significantly better than the PLS channels for all but the largest dataset sizes, where performance was comparable. Because convolutional LG is not a learning method and its parameters were tuned at the maximum dataset size, its performance was relatively static across dataset sizes; nevertheless, these were the best-performing channels for the majority of the lumpy dataset sizes considered. However, both the PLS and AE channels outperformed convolutional LG when sufficient images were available. The HO-Direct had inferior performance to both the AE and convolutional LG channels while also requiring significantly more computation to evaluate. Thus, some channelized methods outperformed the standard method of computing the HO. The HO-CMD serves as an estimate of the upper bound on HO performance.

In the VICTRE background case, the AE-learned channels outperformed every other tested method for the smaller training subsets. Given a sufficient amount of data, the AE and PLS channels approached the same AUC and were approximately equivalent. This occurred more quickly for the larger spiculated mass signal than for the smaller microcalcification clusters. The HO-Direct also had substandard performance in most cases due to the degeneracy of the covariance matrix in the data-constrained experiments. In these ill-conditioned cases the test statistic was estimated by solving a linear system, but the resulting low AUC demonstrates the superiority of channelized methods for calculating an observer for this more complicated background.

During the course of the experiments, it was observed that the convolutional LG channels were especially sensitive to the quality of the estimated signal. When provided with the signal used to generate the data in both the location-known and SKS lumpy experiments, the method outperformed all other competitors. When fewer images were available and thus there was more noise in the signal image, such as in the VICTRE phantom dataset, the performance degraded significantly. Although the AE-learned channels attempt to reconstruct the given signal image directly, and thus would seem to be impacted more by noise, the method was more robust to error in the estimated signal than the convolutional LG approach. This is likely due to the innate denoising that AEs exhibit as a result of their limited embedding dimensionality.

The learned channels for the 30 000 location-known lumpy image case are included in Fig. 7. Many of the channels are similar to one another in the features they extract, and can be removed without significant loss of performance. These extraneous channels likely exist due to the AE training process. Random initializations generate a different starting location for each channel, and the channels are then iteratively optimized by the AE training process. During this process, the channels are updated to better jointly reconstruct the signal image. Thus, even if the final model makes inefficient use of its full channel budget, the channels are influenced by their interactions during the training process. One of the limitations of this approach is its sensitivity to the random initialization, which can result in models of dramatically varying quality even with the same structure.
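One simple way to screen a learned channel matrix for the kind of redundancy noted above is to compare channels by the absolute cosine similarity of their weight vectors. The sketch below is an assumption-laden illustration and not the procedure used in this study: `redundant_pairs` and its threshold are hypothetical, and the channel matrix is random stand-in data in which one channel is deliberately constructed as a rescaled, sign-flipped copy of another.

```python
import numpy as np

def redundant_pairs(T, threshold=0.95):
    """Return channel index pairs (i, j), i < j, with |cosine similarity| > threshold."""
    unit = T / np.linalg.norm(T, axis=1, keepdims=True)  # unit-norm channel rows
    sim = np.abs(unit @ unit.T)                          # sign and scale are ignored
    i, j = np.triu_indices(len(T), k=1)                  # each unordered pair once
    keep = sim[i, j] > threshold
    return list(zip(i[keep].tolist(), j[keep].tolist()))

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 256))                        # 6 stand-in channels
T[5] = -2.0 * T[0] + 0.01 * rng.normal(size=256)     # near-duplicate of channel 0
pairs = redundant_pairs(T)
```

A high similarity only flags candidates for removal; the actual cost of dropping a channel, such as the 0.0005 AUC loss quoted for Fig. 7, still has to be measured by recomputing the CHO without it.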

VI. DISCUSSION AND CONCLUSION

This study demonstrated that AEs are capable of learning efficient CHO channels for both location-known and certain SKS signal detection tasks. Data embeddings and observer channels were demonstrated to be fundamentally related, with the task of optimizing a data embedding to preserve signal-specific information equivalent to determining an efficient channel selection for the CHO. Furthermore, the presented method of computing channels is capable of meeting or exceeding the performance of state-of-the-art methods on the investigated tasks.

Channels were learned for the CHO by minimizing the reconstruction loss of an AE. Modification of the AE loss function to focus only on task-specific information involving the signal was found to have a significant benefit over using the traditional AE approach. Empirical sweeps over the network topology revealed that the AE could efficiently approximate the HO for a wide range of cases utilizing comparable numbers of channels to other approaches. The proposed method was equivalent to state-of-the-art approaches for the lumpy background and significantly superior on the more complicated VICTRE breast phantom dataset, demonstrating the robustness and versatility of the method.

Performance improvements were especially noticeable for low numbers of training-set images as the AE-learned channels plateaued to higher AUC values sooner than other learning-based methods. However, the AE-learned channels were sensitive to the random initialization of the weights and frequently learned redundant channels. The training scheme can likely be further improved with a more robust approach to weight initialization.

Opportunities for future work include expanding the current channels to the IO and extending the formulation to both more sophisticated SKS cases and 3D input images. The channels should work directly for any standard Markov chain Monte Carlo method for estimating the IO. Although the current form of the loss function for learning AE-channels requires knowing the signal centroid, it could be generalized by considering convolutional AEs [63]. The superior performance of AE-learned channels on smaller datasets and medically realistic phantoms also expands the applicability of the method to real-world cases, and the method should be tested on experimental data to identify remaining challenges in tuning the AE.

ACKNOWLEDGMENT

This work was supported in part by grants NIH NS102213, NIH EB020604, and NSF DMS1614305.

REFERENCES

[1] R. F. Wagner and D. G. Brown, “Unified SNR analysis of medical imaging systems,” Physics in Medicine & Biology, vol. 30, no. 6, p. 489, 1985.

[2] H. H. Barrett and K. J. Myers, Foundations of Image Science. John Wiley & Sons, 2013.

[3] M. A. Kupinski, J. W. Hoppin, E. Clarkson, and H. H. Barrett, “Ideal-Observer computation in medical imaging with use of Markov-Chain Monte Carlo techniques,” JOSA A, vol. 20, no. 3, pp. 430–438, 2003.

[4] S. Park, H. H. Barrett, E. Clarkson, M. A. Kupinski, and K. J. Myers, “Channelized-Ideal Observer using Laguerre-Gauss channels in detection tasks involving non-Gaussian distributed lumpy backgrounds and a Gaussian signal,” JOSA A, vol. 24, no. 12, pp. B136–B150, 2007.

[5] S. Park and E. Clarkson, “Efficient estimation of Ideal-Observer performance in classification tasks involving high-dimensional complex backgrounds,” JOSA A, vol. 26, no. 11, pp. B59–B71, 2009.

[6] F. Shen and E. Clarkson, “Using Fisher information to approximate Ideal-Observer performance on detection tasks for lumpy-background images,” JOSA A, vol. 23, no. 10, pp. 2406–2414, 2006.

[7] W. Vennart, “ICRU report 54: Medical imaging - The assessment of image quality,” Radiography, vol. 3, no. 3, pp. 243–244, April 1996. [Online]. Available: https://doi.org/10.1016/S1078-8174(97)90038-9

[8] X. He, B. S. Caffo, and E. C. Frey, “Toward realistic and practical Ideal Observer (IO) estimation for the optimization of medical imaging systems,” IEEE Transactions on Medical Imaging, vol. 27, no. 10, pp. 1535–1543, 2008.

[9] C. K. Abbey and J. M. Boone, “An Ideal Observer for a model of X-ray imaging in breast parenchymal tissue,” in International Workshop on Digital Mammography. Springer, 2008, pp. 393–400.

[10] W. Zhou, H. Li, and M. Anastasio, “Approximating the Ideal Observer and Hotelling observer for binary signal detection tasks by use of supervised learning methods,” IEEE Transactions on Medical Imaging, 04 2019.

[11] Y. Chen, Y. Lou, K. Wang, M. A. Kupinski, and M. A. Anastasio, “Reconstruction-aware imaging system ranking by use of a sparsity-driven numerical observer enabled by variational Bayesian inference,” IEEE Transactions on Medical Imaging, vol. 38, no. 5, pp. 1251–1262, May 2019.

[12] H. H. Barrett, K. J. Myers, C. Hoeschen, M. A. Kupinski, and M. P. Little, “Task-based measures of image quality and their relation to radiation dose and patient risk,” Physics in Medicine & Biology, vol. 60, no. 2, p. R1, 2015.

[13] I. Reiser and R. Nishikawa, “Task-based assessment of breast tomosynthesis: Effect of acquisition parameters and quantum noise,” Medical Physics, vol. 37, no. 4, pp. 1591–1600, 2010.

[14] A. A. Sanchez, E. Y. Sidky, and X. Pan, “Task-based optimization of dedicated breast CT via Hotelling observer metrics,” Medical Physics, vol. 41, no. 10, 2014.

[15] S. J. Glick, S. Vedantham, and A. Karellas, “Investigation of optimal kVp settings for CT mammography using a flat-panel imager,” in Medical Imaging 2002: Physics of Medical Imaging, vol. 4682. International Society for Optics and Photonics, 2002, pp. 392–403.

[16] H. H. Barrett, T. Gooley, K. Girodias, J. Rolland, T. White, and J. Yao, “Linear discriminants and image quality,” Image and Vision Computing, vol. 10, no. 6, pp. 451–460, 1992.

[17] H. H. Barrett, J. Yao, J. P. Rolland, and K. J. Myers, “Model observers for assessment of image quality,” Proceedings of the National Academy of Sciences, vol. 90, no. 21, pp. 9758–9765, 1993.

[18] H. H. Barrett, K. J. Myers, B. D. Gallas, E. Clarkson, and H. Zhang, “Megalopinakophobia: its symptoms and cures,” in Medical Imaging 2001: Physics of Medical Imaging, vol. 4320. International Society for Optics and Photonics, 2001, pp. 299–308.

[19] M. A. Kupinski, E. Clarkson, and J. Y. Hesterman, “Bias in Hotelling observer performance computed from finite data,” in Medical Imaging 2007: Image Perception, Observer Performance, and Technology Assessment, vol. 6515. International Society for Optics and Photonics, 2007, p. 65150S.

[20] W. Zhou, H. Li, and M. A. Anastasio, “Learning the Hotelling observer for SKE detection tasks by use of supervised learning methods,” in Medical Imaging 2019: Image Perception, Observer Performance, and Technology Assessment, vol. 10952, 2019. [Online]. Available: https://doi.org/10.1117/12.2512607

[21] H. H. Barrett, C. K. Abbey, B. D. Gallas, and M. P. Eckstein, “Stabilized estimates of Hotelling-observer detection performance in patient-structured noise,” in Medical Imaging 1998: Image Perception, vol. 3340. International Society for Optics and Photonics, 1998, pp. 27–44.

[22] B. D. Gallas and H. H. Barrett, “Validating the use of channels to estimate the ideal linear observer,” Journal of the Optical Society of America A, vol. 20, no. 9, pp. 1725–1738, Sep 2003. [Online]. Available: http://josaa.osa.org/abstract.cfm?URI=josaa-20-9-1725

[23] J. B. Tenenbaum, “Mapping a manifold of perceptual observations,” in Advances in Neural Information Processing Systems 10, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds. MIT Press, 1998, pp. 682–688. [Online]. Available: http://papers.nips.cc/paper/1332-mapping-a-manifold-of-perceptual-observations.pdf

[24] D. Beymer and T. Poggio, “Image representations for visual learning,” Science, vol. 272, no. 5270, pp. 1905–1909, 1996. [Online]. Available: https://science.sciencemag.org/content/272/5270/1905

[25] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991, PMID: 23964806. [Online]. Available: https://doi.org/10.1162/jocn.1991.3.1.71

[26] K. Q. Weinberger and L. K. Saul, “Unsupervised learning of image manifolds by semidefinite programming,” International Journal of Computer Vision, vol. 70, no. 1, pp. 77–90, Oct 2006. [Online]. Available: https://doi.org/10.1007/s11263-005-4939-z

[27] K. J. Myers and H. H. Barrett, “Addition of a channel mechanism to the ideal-observer model,” Journal of the Optical Society of America A, vol. 4, no. 12, pp. 2447–2457, Dec 1987. [Online]. Available: http://josaa.osa.org/abstract.cfm?URI=josaa-4-12-2447

[28] S. Park, J. M. Witten, and K. J. Myers, “Singular vectors of a linear imaging system as efficient channels for the Bayesian ideal observer,” IEEE Transactions on Medical Imaging, vol. 28, no. 5, pp. 657–668, May 2009.

[29] J. M. Witten, S. Park, and K. J. Myers, “Partial least squares: A method to estimate efficient channels for the Ideal Observers,” IEEE Transactions on Medical Imaging, vol. 29, no. 4, pp. 1050–1058, April 2010.

[30] I. Diaz, C. K. Abbey, P. A. Timberg, M. P. Eckstein, F. R. Verdun, C. Castella, and F. O. Bochud, “Derivation of an observer model adapted to irregular signals based on convolution channels,” IEEE Transactions on Medical Imaging, vol. 34, no. 7, pp. 1428–1435, 2015.

[31] J. G. Brankov, Y. Yang, L. Wei, I. El Naqa, and M. N. Wernick, “Learning a channelized observer for image quality assessment,” IEEE Transactions on Medical Imaging, vol. 28, no. 7, pp. 991–999, 07 2009. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/19211351

[32] A. R. Pineda, “Laguerre-Gauss and sparse difference-of-Gaussians observer models for signal detection using constrained reconstruction in magnetic resonance imaging,” in Medical Imaging 2019: Image Perception, Observer Performance, and Technology Assessment, vol. 10952, 2019. [Online]. Available: https://doi.org/10.1117/12.2512813

[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 318–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=104279.104293

[34] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006, PMID: 16764513. [Online]. Available: https://doi.org/10.1162/neco.2006.18.7.1527

[35] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science (New York, N.Y.), vol. 313, pp. 504–7, 08 2006.

[36] Y. Bengio and Y. Lecun, Scaling learning algorithms towards AI. MIT Press, 2007.

[37] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1756025

[38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), W. W. Cohen, A. McCallum, and S. T. Roweis, Eds. ACM, 2008, pp. 1096–1103.

[39] M. Nishio, C. Nagashima, S. Hirabayashi, A. Ohnishi, K. Sasaki, T. Sagawa, M. Hamada, and T. Yamashita, “Convolutional auto-encoder for image denoising of ultra-low-dose CT,” Heliyon, vol. 3, no. 8, p. e00393, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2405844016321600

[40] Z. Zhang, Y. Song, and H. Qi, “Age progression/regression by conditional adversarial autoencoder,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4352–4360, 2017.

[41] D. Kunin, J. M. Bloom, A. Goeva, and C. Seed, “Loss landscapes of regularized linear autoencoders,” CoRR, vol. abs/1901.08168, 2019. [Online]. Available: http://arxiv.org/abs/1901.08168

[42] C. E. Metz, “ROC methodology in radiologic imaging.” Investigative Radiology, vol. 21, no. 9, pp. 720–733, 1986.

[43] M. A. Kupinski, D. C. Edwards, M. L. Giger, and C. E. Metz, “Ideal Observer approximation using Bayesian classification neural networks,” IEEE Transactions on Medical Imaging, vol. 20, no. 9, pp. 886–899, 2001.

[44] A. Badano, C. G. Graff, A. Badal, D. Sharma, R. Zeng, F. W. Samuelson, S. J. Glick, and K. J. Myers, “Evaluation of Digital Breast Tomosynthesis as Replacement of Full-Field Digital Mammography Using an In Silico Imaging Trial,” JAMA Network Open, vol. 1, no. 7, p. e185474, 11 2018. [Online]. Available: https://doi.org/10.1001/jamanetworkopen.2018.5474

[45] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.

[46] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[47] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[48] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T

[49] W. Zhou and M. A. Anastasio, “Learning the Ideal Observer for SKE detection tasks by use of convolutional neural networks,” in Medical Imaging 2018: Image Perception, Observer Performance, and Technology Assessment, vol. 10577. International Society for Optics and Photonics, 2018, p. 1057719.

[50] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, pp. 15:1–15:58, 2009.

[51] A. Mousavi, A. B. Patel, and R. G. Baraniuk, “A deep learning approach to structured signal recovery,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2015, pp. 1336–1343.

[52] J. Snoek, R. Adams, and H. Larochelle, “On nonparametric guidance for learning autoencoder representations,” in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, N. D. Lawrence and M. Girolami, Eds., vol. 22. La Palma, Canary Islands: PMLR, 21–23 Apr 2012, pp. 1073–1080. [Online]. Available: http://proceedings.mlr.press/v22/snoek12.html

[53] J. Snoek, R. P. Adams, and H. Larochelle, “Nonparametric guidance of autoencoder representations using label information,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 2567–2588, Sep. 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2503308.2503324

[54] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1953039

[55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[56] C. E. Metz and X. Pan, “‘Proper’ binormal ROC curves: theory and maximum-likelihood estimation,” Journal of Mathematical Psychology, vol. 43, no. 1, pp. 1–33, 1999.

[57] X. Pan and C. E. Metz, “The ‘proper’ binormal model: parametric receiver operating characteristic curve estimation with degenerate data,” Academic Radiology, vol. 4, no. 5, pp. 380–389, 1997.

[58] J. Rolland and H. H. Barrett, “Effect of random background inhomogeneity on observer detection performance,” JOSA A, vol. 9, no. 5, pp. 649–658, 1992.

[59] M. A. Kupinski, E. Clarkson, J. W. Hoppin, L. Chen, and H. H. Barrett, “Experimental determination of object statistics from noisy images,” JOSA A, vol. 20, no. 3, pp. 421–429, 2003.

[60] J. L. Granstedt, W. Zhou, and M. A. Anastasio, “Autoencoder embedding of task-specific information,” in Medical Imaging 2019: Image Perception, Observer Performance, and Technology Assessment, vol. 10952, 2019. [Online]. Available: https://doi.org/10.1117/12.2513120

[61] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.

[62] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.

[63] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, June 2007, pp. 1–8.
