Neural Memory Plasticity for Anomaly Detection
2019
·
arXiv

In the domain of machine learning, Neural Memory Networks (NMNs) have recently achieved impressive results in a variety of application areas including visual question answering, trajectory prediction, object tracking, and language modelling. However, we observe that the attention-based knowledge retrieval mechanisms used in current NMNs restrict them from achieving their full potential, as the attention process retrieves information based on a set of static connection weights. This is suboptimal in a setting where there are vast differences among samples in the data domain, such as anomaly detection, where there is no consistent criterion for what constitutes an anomaly. In this paper, we propose a plastic neural memory access mechanism which exploits both static and dynamic connection weights in the memory read, write and output generation procedures. We demonstrate the effectiveness and flexibility of the proposed memory model in three challenging anomaly detection tasks in the medical domain: abnormal EEG identification, MRI tumour type classification and schizophrenia risk detection in children. In all settings, the proposed approach outperforms the current state-of-the-art. Furthermore, we perform an in-depth analysis demonstrating the utility of neural plasticity for the knowledge retrieval process and provide evidence on how the proposed memory model generates sparse yet informative memory outputs.

Keywords: Neural Memory Networks, Anomaly Detection, Neural Plasticity, Abnormal EEG Identification, MRI Tumour Type Classification, Schizophrenia Risk Detection.

Neural Memory Networks (NMNs) have recently achieved tremendous success on larger knowledge bases via the use of an external memory to explicitly store and retrieve relevant information [1, 2, 3, 4, 5, 6]. They elevate the temporal model capability of Recurrent Neural Networks (RNNs) [7] to capture both long-term and short-term relationships where the added capacity is utilised to store the information that constitutes these relationships. The evolution of the memory occurs through read and write functions, which are both differentiable and trained alongside the rest of the components of the network.

Plasticity is a biological process which refers to the human brain’s ability to change throughout life by forming new connections among neurons and degrading unwanted connections [8]. Most recently, the authors of [9] employed plasticity in neural networks and optimised it along with the rest of the parameters through backpropagation. Their evaluations demonstrate encouraging results, with the added dynamicity in connections helping to capture temporal relationships. However, we argue that the full potential of plasticity in knowledge discovery tasks is yet to be explored, as plasticity has only been exploited for vanilla neural networks and has not yet been applied to memory networks. Hence, its capacity for knowledge discovery is yet to be fully enabled.

Although plasticity in neural networks has some potential to model temporal relationships, the capacity of such networks is limited and hence they fail to recover long-term dependencies. On the other hand, even though a memory network is analogous to the human brain, the naive structure of the read and write mechanisms in NMNs hinders their potential for knowledge discovery. Firstly, the attention mechanism employed for the extraction of relevant information tries to embed the stored knowledge as a fixed dimensional vector. In contrast, in biological brains, the connections change with the aid of plasticity [9]. Secondly, the Long Short-Term Memory (LSTM) function, which is commonly employed in NMNs to read from and write to the memory over time, has been shown to focus more on recent history as opposed to discovering long-term relationships [10]. By carefully evaluating the merits of both neural plasticity and NMN techniques, we propose a new knowledge retrieval structure for NMNs based on trainable neural plasticity. In order to demonstrate the ability of our proposed architecture to achieve the full potential of neural plasticity, we evaluate our memory model in an anomaly detection setting. Anomaly detection is a fundamental, yet challenging task in machine learning, primarily due to the lack of consistent criteria for what constitutes an anomaly. A possible solution to this problem is to develop a memory architecture which optimally compares and contrasts the different characteristics that arise through analysing the long-term relationships within the data domain.

A plastic memory network architecture as we have proposed would allow the underlying framework to learn a vast range of subject and problem specific characteristics from the data by temporally varying the level of attention that it pays, in the memory read and write operations, to different salient information cues. As a result, the same underlying approach can be applied to a range of applications. To demonstrate this ability, we explore a range of anomaly detection tasks in the medical domain. The application of machine learning techniques to automatically detect anomalies in medical data is particularly attractive considering its consistency and non-subjectivity, along with its cost-effectiveness, eliminating the need for the extensive training of human practitioners which is required to master manual screening [11]. There exist numerous medical anomaly detection tasks, ranging from identifying abnormal EEG recordings [12] and detecting tumours in Magnetic Resonance Imaging (MRI) [13] to anomaly detection in medical wireless sensor networks [14]. However, medical data itself poses new challenges as there exists significant variability among subjects and across different conditions. For instance, identifying an anomaly in an Electroencephalogram (EEG) is inherently difficult even for trained professionals, as there exists significant variability among patients in the manifestation of any abnormality, which is accentuated further by the variability in the operating conditions [12].

To demonstrate the applicability of our proposed technique to a breadth of applications in the medical domain, this paper investigates abnormal EEG identification, MRI tumour type classification, and schizophrenia risk detection tasks, and proposes a unified system which learns these patient-specific and problem-specific characteristics from the data. Our evaluations on these challenging abnormality detection tasks, which involve both one and two-dimensional signals, confirm the viability of the method for real-world applications. Through extensive experimental evaluations, we demonstrate that neural plasticity enhances the knowledge retrieval process in NMNs where the memory is translated into very different forms, which are learned over time, and allows us to filter out the most salient information. Furthermore, we analyse the utility of plasticity in terms of model activations and statistical interactions, and demonstrate how it acts as an attention mechanism during memory access. The main contributions of the proposed work can be summarised as follows:

We propose a novel memory addressing mechanism for NMNs which fa-

image

We outperform state-of-the-art methods in three challenging tasks in the

image

We interpret the model learning process in terms of activations from the

image

We would like to emphasise that even though we are demonstrating our approach on three different applications specifically in the medical domain, the varied nature of these problems demonstrates how the proposed model can be directly applied to any anomaly detection problem in different domains where modelling long-term relationships is necessary. Possible application areas include detecting anomalies in daily human activities and sports activities [16], anomaly detection in vehicle driving [17], and detecting anomalies in stock exchanges [18] and in credit card transactions [19].

2.1. Neural Memory Networks

The authors in [1, 5, 3, 2] and our prior works [6, 20, 21, 4] have extensively demonstrated the effectiveness of what are termed “memory modules” to store and retrieve relevant information, and capture relationships between different input sequences in the data domain. These dependencies are missed by models such as LSTMs and Gated Recurrent Units (GRUs) as they consider dependencies only within a given input sequence. Due to this ability external memory modules have gained traction in numerous domains, including language modelling [1], visual question answering [3], trajectory prediction [6, 20], object tracking [5], and saliency modelling [4].

Even though they exhibit great potential for capturing salient information, we observe several factors that hinder their capabilities. Firstly, the LSTM functions utilised for the memory read and write processes have been demonstrated to focus more on recent dependencies and to completely ignore long-term dependencies [10]. Furthermore, recent works [9] have shown that the attention-based knowledge retrieval mechanism is not ideal in a memory unit as the stored information is temporally evolving, requiring the weighted connections to be updated over time.

Inspired by the work of Miconi et al. [9], where they introduce the concept of differentiable plasticity, we propose a novel mechanism to retrieve and update neural memory models. It should be noted that [9] investigated plasticity only in vanilla deep neural networks such as LSTMs, hence its full potential for knowledge discovery is yet to be exploited.

2.2. Neural Plasticity

“Neural plasticity”, the strengthening and weakening of connections between the neurons using the neural activity as the basis, has been extensively investigated within artificial neural networks [22]. However, these investigations were conducted before the dawn of deep learning, hence extensive research is still required to utilise its full potential in knowledge retrieval. Plastic methods build upon the “Hebbian rule”: neurons that fire together, wire together [23]. For instance, in [24] Nolfi and Parisi propose to evolve networks with “auto-teaching” inputs and utilise them to provide an error signal for the network weight adjustment over their lifetime. In [25] the authors propose eight different rules to inject the biological evolution of rules when updating the neural network parameters. In [26] the authors use separate neural networks as evolving agents to learn Hebbian-like learning rules for simple navigation tasks. However, until recently, neural plasticity has not been investigated with deep neural networks.

Most recently, Miconi et al. [9] demonstrated how neural plasticity can be tuned in deep neural networks together with other parameters using gradient descent. Inspired by the success of their system when extracting salient information cues, we propose the development of neural memory plasticity where the memory access mechanisms in neural memory models are made plastic to provide varying levels of attention to the stored information, introducing a novel knowledge retrieval paradigm in NMNs.

In a separate line of work, Harris et al. [27] investigated the concept of a plasticity based working memory for visual recognition tasks. However, this is different from the proposed work as they are not utilising an external memory. In contrast, [27] uses the neural plasticity itself as the temporal modelling mechanism.

2.3. Anomaly Detection

In the domain of machine learning anomaly detection is primarily regarded as an unsupervised learning task. For instance, in [28] the authors try to detect anomalies in network traffic through a geometric pattern based framework. For video based anomaly detection, numerous methods including pixel-level features [29], trajectory features [30] and spatio-temporal features [31] are proposed which are subsequently categorised through an unsupervised learning paradigm. Numerous works have also exploited deep learning in an unsupervised setting for anomaly detection [32, 33, 34]. Please refer to [35] for a complete review of these methods.

In the medical domain, supervised anomaly detection methods have been preferred due to the inherent difficulties present in medical data. For instance, detecting abnormalities in an EEG recording is challenging in an unsupervised setting as abnormal artefacts are not clearly evident as there exist numerous natural variations among subjects. Therefore, what is defined as normal for one subject can be abnormal for another, requiring learning of abnormal and normal scenarios in order to discriminate. Hence most approaches in the medical domain have used supervised learning [13, 36, 37, 38].

With the recent spectacular success of deep learning methods for automatically learning task specific features, hand engineered features have been replaced by deep learned features for medical anomaly detection. Convolutional Neural Networks (CNNs) and recurrent neural networks such as LSTMs have been extensively applied to detect abnormal behaviour. However, as noted by [12, 13], abnormalities can take many forms and there exist subtle differences between subjects, differences in the regions where the data is captured, etc. Hence we propose to augment the capacity of the modelling framework through the introduction of an external memory which can be used to store the observed knowledge and map the long-term dependencies between data samples.

Recently, a neural memory network based approach for anomaly detection was proposed in [39], where the authors try to memorise the patterns within the normal data in order to detect abnormal instances. However, this approach is quite distinct from the proposed approach as we are learning our memory model from both normal and abnormal data and as such the memory learns to store distinctive characteristics from both normal and abnormal data streams.

image

Figure 1: Overview of an external memory which is composed of input, output and write controllers. The input controller determines what facts within the input are used to query the memory. The output controller determines what portion of the memory is passed as the output for that query. Finally the write controller updates the memory state and propagates it to the next time step.

In this section we introduce the structure of a typical NMN and its basic operations, and how they can be augmented to facilitate plasticity.

3.1. Neural Memory Networks

As shown in Fig. 1, a typical memory module is composed of 1) a memory stack for information storage, 2) a read controller to query the knowledge stored in the memory, 3) a write controller for memory update, and 4) an output controller which controls what results are passed out from the memory.

Let $M \in \mathbb{R}^{l \times k}$ be a memory stack with $l$ memory slots where each slot contains an embedding of dimension $k$. We represent the state of memory at time instance $t-1$ as $M_{t-1}$. In a typical memory implementation [1, 4, 21], first the read controller passes the input, $x_t$, at the current time instance $t$ through a read function composed of a Long Short-Term Memory (LSTM) cell such that,

image

Following [40], using a softmax function we quantify the similarity between the content stored in each slot of $M_{t-1}$ and the query vector, $q_t$, such that,

image

Now the output controller can retrieve the memory output, $m_t$, for the current state by,

image

Finally, the write controller, which also uses an LSTM cell, generates an

image

and updates the memory using,

image

where $I$ is a matrix of ones, $e_l \in \mathbb{R}^{l}$ and $e_k \in \mathbb{R}^{k}$ are vectors of ones, and $\otimes$ denotes the outer product, which duplicates its left vector $l$ or $k$ times to form a matrix.
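The read, output and write steps described above can be illustrated with a minimal sketch. Since the equations themselves are not reproduced here, the sketch assumes a commonly used formulation: dot-product similarity followed by a softmax, an attention-weighted sum as the memory output, and an erase-and-add update for the write. The function names are hypothetical, and the LSTM read and write functions are replaced by placeholder vectors, so this should be read as an approximation of the mechanism rather than the exact update used in [1, 4, 21].

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(memory, query):
    """Attention-based read: returns (attention weights, memory output)."""
    # Similarity between the query and each of the l slots (dot product).
    scores = [sum(m * q for m, q in zip(slot, query)) for slot in memory]
    attention = softmax(scores)
    # Memory output: attention-weighted sum of the slots (dimension k).
    k = len(query)
    output = [sum(a * slot[j] for a, slot in zip(attention, memory))
              for j in range(k)]
    return attention, output

def write_memory(memory, attention, update):
    """Assumed erase-and-add update: attended slots move towards `update`."""
    return [
        [(1.0 - a) * value + a * u for value, u in zip(slot, update)]
        for a, slot in zip(attention, memory)
    ]

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # l = 3 slots, k = 2
query = [2.0, 0.0]        # placeholder for the read LSTM output q_t
attention, m_t = read_memory(memory, query)
update = m_t              # placeholder for the write LSTM output
new_memory = write_memory(memory, attention, update)
```

Note that the attention weights here depend only on the current query and memory content through fixed operations, which is the property the plastic formulation in Sec. 3.2 sets out to relax.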

3.2. Injection of Plasticity for Memory Components

We follow the formulation of the Hebbian rule proposed in [9] for its flexibility and simplicity.

We define a fixed component and a plastic component for each pair of neurons $i$ and $j$, and the plastic component is stored in a Hebbian trace $\mathrm{Hebb}_{i,j}$, which evolves over time based on the inputs and outputs.

Formally, let there be two input layers, each with $k$ neurons, and let $w \in \mathbb{R}^{k \times k}$ be the fixed weights and $\mathrm{Hebb} \in \mathbb{R}^{k \times k}$ store the Hebbian trace. Then a sample input $x_t$ to the first layer at time instance $t$ is passed to the next layer such that,

image

where $\alpha$ is a coefficient which controls the contribution from the fixed and plastic terms of a particular weight connection and $\eta$ is the learning rate of the plastic components.
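As an illustration of the rule described above, the following sketch implements one step of a plastic layer in the form given by Miconi et al. [9]: the effective weight of each connection is the fixed weight plus a coefficient-scaled Hebbian trace, and the trace is a running average of the products of pre- and post-synaptic activity. The tanh non-linearity, the zero initial trace and the function name are assumptions made for this example; the exact form used in the proposed controllers may differ.

```python
import math

def plastic_forward(x, w, alpha, hebb, eta):
    """One step of a plastic layer; returns (output, updated Hebbian trace)."""
    k = len(x)
    # Effective weight of connection i -> j: w[i][j] + alpha[i][j] * hebb[i][j].
    y = [
        math.tanh(sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x[i]
                      for i in range(k)))
        for j in range(k)
    ]
    # Trace update: running average of pre-/post-synaptic activity products.
    new_hebb = [
        [eta * x[i] * y[j] + (1.0 - eta) * hebb[i][j] for j in range(k)]
        for i in range(k)
    ]
    return y, new_hebb

k = 2
w = [[0.5, -0.5], [0.25, 0.75]]        # fixed weights (learned by gradient descent)
alpha = [[0.1, 0.1], [0.1, 0.1]]       # plasticity coefficients (also learned)
hebb = [[0.0] * k for _ in range(k)]   # trace assumed zero at the start of a sequence
eta = 0.5                              # learning rate of the plastic component

x = [1.0, 0.0]
y1, hebb = plastic_forward(x, w, alpha, hebb, eta)
y2, hebb = plastic_forward(x, w, alpha, hebb, eta)
```

Presenting the same input twice yields a stronger response on the second presentation for connections whose pre- and post-synaptic activities agree in sign, which is the "fire together, wire together" behaviour the trace is meant to capture.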

Thus, using the formulation in Eq. 6, we replace the components of the read, write and output controllers to facilitate plasticity such that,

image

3.3. Abnormality Detection

To evaluate the abnormality detection accuracy of the proposed memory architecture we conduct three experiments. Two experiments are conducted with EEG data and one with MRI data. Fig. 2 illustrates the structure of the proposed model.

image

Figure 2: Overview of the Proposed Abnormality Detection Framework with proposed Plastic NMN: We map the input sequences using two layers of LSTMs. The resultant embedding is utilised to retrieve the salient information from the stored knowledge in the memory and the retrieved vector is passed through the output and write controllers, which determine the memory output at the current time instance and how to update the memory, respectively. These controllers utilise a combination of fixed weights and plastic components. A dense layer with softmax activation is used to determine the classification of the input.

For the EEG data we consider a short time window of the EEG signals, hence requiring temporal modelling within the time window considered. In the MRI experiment, as the MRI is a spatial representation, motivated by [3] we extract spatial features from the MRI using a pre-trained ResNet 50 [41] and represent each element of the extracted feature block as an element in a sequence. The process is illustrated in Fig. 3.

image

Figure 3: Feature Extraction from MRI data: We utilise a ResNet [41] CNN architecture pre-trained on ImageNet [42] as our visual feature extractor, and extract features from the “Activation-85” layer which has an output dimensionality of 14 × 14 × 256. Then we consider each element in the extracted feature block individually and map that to a sequence, row wise, from top left to bottom right, which contains 14 × 14 = 196 elements.

This allows us to analyse the correspondences between pixels. For modelling the short-term relationships within the sequence we use LSTMs. For extracting the relevant attributes through long-term dependencies we employ the proposed memory architecture. The normal/abnormal classification or the tumour type classification (i.e. 3 classes: Meningioma, Pituitary, and Glioma) is generated through a dense layer with softmax classification.
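The row-wise mapping of the feature block to a sequence (Fig. 3) can be sketched as follows. The example uses nested Python lists and a synthetic feature block in place of real ResNet activations; the function name is hypothetical.

```python
def feature_map_to_sequence(feature_map):
    """Flatten an H x W x C feature block row-wise into H*W vectors of length C."""
    return [cell for row in feature_map for cell in row]

# Synthetic stand-in for the 14 x 14 x 256 block: every channel of a cell
# stores the cell's row-major position so the ordering is easy to inspect.
H, W, C = 14, 14, 256
feature_map = [[[float(r * W + c)] * C for c in range(W)] for r in range(H)]
sequence = feature_map_to_sequence(feature_map)
```

Each of the 196 sequence elements is a 256-dimensional vector, ordered from the top-left to the bottom-right cell of the feature block.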

As opposed to unsupervised abnormality detection, which is frequently used in video tasks, we follow the baseline algorithms in the medical domain and use supervised learning, enabling direct comparison.

The following subsections discuss the dataset details, experimental setup and results of the experiments conducted for abnormal EEG detection, MRI tumour type classifications and EEG based schizophrenia risk detection.

4.1. Abnormal EEG Identification

4.1.1. Dataset

For this experiment we use the TUH Abnormal EEG database [43] which is the world’s largest publicly available dataset of its type [37]. This dataset consists of 1488 abnormal EEGs and 1529 normal EEGs and is demographically balanced with respect to the age and gender of patients. For training and testing of the systems we utilise the splits provided by the authors of the dataset.

The EEG signals were obtained at 250 Hz and we extract 60 second samples (i.e. 15,000 data points within a window) using a sliding window approach with 50% overlap between two consecutive windows. Similar to [12] we utilise only the T5-O1 channel of the EEG recordings as input to our model.

We perform min-max scaling of the input and no other pre-processing is performed. The final decision for the entire EEG is obtained through majority voting.
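The pre-processing and decision steps described above (sliding windows with 50% overlap, min-max scaling, and majority voting over window-level predictions) can be sketched as below. The helper names are hypothetical, and the sampling rate, window duration and signal are toy values chosen to keep the example small rather than the values used in the experiment.

```python
from collections import Counter

def sliding_windows(signal, window_len, overlap=0.5):
    """Split a 1D signal into fixed-length windows with fractional overlap."""
    step = max(1, int(window_len * (1.0 - overlap)))
    return [signal[s:s + window_len]
            for s in range(0, len(signal) - window_len + 1, step)]

def min_max_scale(window):
    """Rescale a window to [0, 1]; a constant window maps to all zeros."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0 for _ in window]
    return [(v - lo) / (hi - lo) for v in window]

def majority_vote(labels):
    """Recording-level decision: the most frequent window-level label."""
    return Counter(labels).most_common(1)[0][0]

# Toy values: a 10 s trace sampled at 10 Hz, cut into 4 s windows.
sampling_rate_hz = 10
window_len = sampling_rate_hz * 4
signal = [float(i % 7) for i in range(sampling_rate_hz * 10)]
windows = [min_max_scale(w) for w in sliding_windows(signal, window_len)]

# Window-level predictions would come from the classifier; these are made up.
decision = majority_vote(["abnormal", "normal", "abnormal"])
```

In this sketch a tie between labels is resolved in favour of the label encountered first; how ties are handled in the experiments is not stated.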

4.1.2. Experimental Setup

From the training set we use an 80% - 20% split for training and validation. We train the model using the Adam [44] optimiser and binary cross entropy loss for 50 epochs. Similar to [9] we share the same values for all $\eta$s (i.e. $\dot{\eta}$, $\hat{\eta}$ and $\tilde{\eta}$). For both LSTMs we maintain the same embedding dimension $k$, which is also used as the embedding dimension of the memory.

Hyper-parameters $k$, $l$, and $\eta$ are evaluated experimentally using the validation set and these evaluations are shown in Fig. 4. As $k = 80$, $l = 25$, and $\eta = 0.50$ provide the best accuracies on the validation set we use these parameters for model training.

image

Figure 4: Hyper-parameter evaluation for the Abnormal EEG Evaluation (see Sec. 4.1). System performance as a single parameter (l, k or η) is changed while the others are held constant is reported. Values we selected for the three parameters in this experiment are

image

4.1.3. Results

Experimental results are presented in Tab. 1. For comparisons we report the results of the k-Nearest Neighbour (kNN) and Random Forest (RF) classifiers utilised by [43], where the authors first apply class dependent PCA to extract features from the EEG window and train the kNN and RF classifiers using those features. In [45] the authors apply a 2D CNN model on four channels of the EEG signal. In [12], the authors propose a 1D CNN model that uses the T5-O1 channel of the EEG as the input. This model is extended in [37] where the authors combine one-dimensional convolution layers together with GRU layers. To further illustrate the utility of the proposed plasticity based memory addressing mechanism we evaluate the performance of the baseline memory model proposed in [1]. For this model, similar to the proposed abnormality detection model in Sec. 3.3, we use 2 layers of LSTMs followed by the memory model in [1] and a dense layer with softmax activation for generating the final classification. The hyper-parameters of this model, memory length, l, and embedding dimension, k, are evaluated experimentally using the validation set and we set l = 25 and k = 50.

Table 1: Evaluations of abnormal EEG identification using the dataset proposed in [43]

image

When comparing the results we observe a significant performance boost with the introduction of deep learning techniques compared to traditional classifiers such as kNN and RF. With the introduction of GRU layers the performance of the 1D-CNN model is improved in [37]. Though we observe a slight increase in performance with the addition of a memory component in the memory model of [1], we do not observe a substantial accuracy increase, mainly due to the limitations of the memory read and update mechanisms. However, with the proposed memory model we provide the utility of long-term dependency modelling and allow the model to extract the most salient components for decision making using the proposed plastic components in the memory addressing mechanism. This allows us to outperform all the baseline models by a significant margin.

4.2. MRI Tumour Type Classification

4.2.1. Dataset

We use the MRI database provided by [46]. The dataset is comprised of 3064 brain tumour MRI images taken from 233 patients. Types of brain tumours in the dataset are meningioma (1426 samples), glioma (708 samples), and pituitary (930 samples).

4.2.2. Experimental Setup

Due to the small dataset size we perform data augmentation as per [47], where we apply random rotations and flipping to augment the data. Then using a ResNet 50 model [41] pre-trained on ImageNet [42] we extract features from the Activation 85 layer. Then, as shown in Fig. 3, we generate a sequence of features which are used as the input to the proposed model. As there are no standard training/testing splits for this dataset, similar to [38], we apply 5-fold cross validation where 80% of the training data is used for model training while the remaining 20% is used for validation. We train the model using the Adam [44] optimiser with categorical cross entropy loss for 50 epochs. Similar to Sec. 4.1.2 we evaluate hyper-parameters $k$, $l$, and $\eta$ experimentally and show these evaluations in Fig. 5. Based on this evaluation we set $k = 150$, $l = 30$, and $\eta = 0.55$.
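A minimal sketch of how 5-fold splits with an 80%/20% partition could be generated is given below. The function name and the fixed random seed are assumptions for illustration, and the sketch splits by image index; whether the folds were formed per image or per patient is not stated.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, held_out_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal-sized folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# 3064 images, as in the dataset described in Sec. 4.2.1.
splits = list(k_fold_indices(n_samples=3064, n_folds=5))
```

Each fold holds out roughly 20% of the samples, and every sample is held out exactly once across the five folds.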

image

Figure 5: Hyper-parameter evaluation for the MRI Tumour Type Classification Evaluation (see Sec. 4.2). System performance as a single parameter (l, k or η) is changed while the others are held constant is reported. Values we selected for the three parameters in this experiment

image

4.2.3. Results

We present the evaluation results in Tab. 2 where we compare the proposed method against a series of baselines. In [47] the authors utilise a pre-trained VGG-19 architecture and fine-tune it for the classification task. In [36] the authors use a shallow CNN architecture which comprises a single convolution layer. In [13] the authors investigate the utility of capsule networks for exploiting the spatial relationships within the image. In [38] the authors propose the use of the GIST descriptor with PCA for feature extraction as opposed to using deep learned features. In addition, we utilise the baseline memory model of [1] in our comparisons. The structure of this model is as defined in Sec. 4.1.3 and we have evaluated the hyper-parameters l and k experimentally and values of l = 30 and k = 120 are chosen.

Table 2: Evaluations of MRI tumour type classification using the dataset proposed in [46]

image

When comparing the results it is clear that there exists a higher sensitivity within the model performance, especially among the deep learned models, due to data scarcity. Even though we expect the baseline memory model of [1] to obtain better performance compared to other deep learning baselines, it has obtained poorer performance, mainly due to the higher dimensionality of the MRI data, where a better memory access scheme is required to retrieve the most salient information. In contrast, the proposed method, exploiting the neural plasticity of the knowledge retrieval process, has been able to optimally utilise the limited training data that is available and effectively capture the salient features for the task at hand.

4.3. EEG based Schizophrenia Risk Detection

4.3.1. Dataset

We use the EEG recordings from the auditory oddball trials conducted in [48] where the subjects listen to different tones of which some are frequent and some are less frequent. Several studies in neuroscience research have indicated a reduction in the amplitude of brain response to auditory change detection in patients who risk development of schizophrenia [49], and a number of studies have subsequently employed the auditory oddball paradigm for detection of schizophrenia [50, 51, 52].

The dataset includes EEG recordings from children aged 9 to 12 years, including 65 children with an increased Risk of Schizophrenia (RSz) due to a positive family history of schizophrenia (in at least one first- or second-degree relative) and/or their presenting multiple replicated developmental antecedents of schizophrenia, and 39 Typically Developing (TD) children who presented none of those developmental antecedents or family history of schizophrenia in first-, second-, or third-degree relatives. Stimuli used for the auditory oddball paradigm were 1600 tones at 1000 Hz, including 1360 (85%) standard tones of 25 ms duration and 240 (15%) deviant tones of 50 ms duration. These standard and deviant tones were presented in pseudo-random order to avoid successive deviant stimuli, with an isochronous inter-stimulus interval of 300 ms. The participants passively listened to the auditory oddball task and their electrocortical data were recorded according to the international standard 10-10 system of electrode placement. Please refer to [53, 48] for details.

We select the Fz, FCz, Cz, CPz and Pz channels from the EEG recordings. As pre-processing we apply a 50 Hz notch filter and artefact rejection to the ocular channels to remove blinks [48]. After the occurrence of the stimulus (i.e. standards/deviants) we extract a 300 ms window from each of the selected channels of the EEG and perform min-max scaling of each channel separately. No other pre-processing is performed.

4.3.2. Experimental Setup

Due to the unavailability of standard training and testing splits, we adopt a 5-fold cross validation of the training set where we select 80% of the training data for training the model and 20% for validation. In order to generate the final classification of each participant (i.e. RSz / TD) we utilise majority voting.

Similar to Sec. 4.1.2 we evaluate hyper-parameters $k$, $l$, and $\eta$ experimentally and present these evaluations in Fig. 6. As $k = 100$, $l = 30$, and $\eta = 0.65$ provide the best accuracies we use these parameters for model training.

image

Figure 6: Hyper-parameter evaluation for the EEG Based Schizophrenia Risk Detection Evaluation (see Sec. 4.3). System performance as a single parameter (l, k or η) is changed while the others are held constant is reported. Values we selected for the three parameters in this

image

4.3.3. Results

Evaluations of the proposed method along with the baselines are reported in Tab. 3. To the best of our knowledge there are no existing machine learning models that attempt classification of schizophrenia risk using EEGs.

The authors of [48] evaluate the peak amplitudes elicited by the mismatch between the deviant and standard tones in the auditory oddball paradigm and demonstrate that there exists a statistical difference between children at risk for schizophrenia relative to typically developing children. Hence, as the first baseline model, we extract the early positive (between 160 - 290 ms after the stimulus) and early negative (between 80 - 200 ms after the stimulus) peak amplitude values, which are baselined to the average amplitude of the 100 ms window before the stimulus; these features are subsequently passed through an SVM classifier.
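The peak amplitude features of this first baseline can be sketched as follows. The sampling rate, the epoch layout (100 ms of pre-stimulus signal followed by the post-stimulus response) and the function name are assumptions made for the example, and the SVM classification step is omitted.

```python
def peak_amplitude_features(epoch, sampling_rate_hz, stimulus_index):
    """Baseline-corrected early positive and early negative peak amplitudes."""
    def to_index(ms):
        # Convert a latency relative to the stimulus into a sample index.
        return stimulus_index + int(round(ms * sampling_rate_hz / 1000.0))

    # Baseline: mean amplitude of the 100 ms window before the stimulus.
    pre_stimulus = epoch[to_index(-100):to_index(0)]
    baseline = sum(pre_stimulus) / len(pre_stimulus)
    corrected = [v - baseline for v in epoch]

    # Largest value in 160-290 ms and smallest value in 80-200 ms.
    early_positive = max(corrected[to_index(160):to_index(290) + 1])
    early_negative = min(corrected[to_index(80):to_index(200) + 1])
    return early_positive, early_negative

# Synthetic single-channel epoch at a hypothetical 1000 Hz sampling rate:
# a flat trace at 2.0 with one dip at 150 ms and one peak at 250 ms.
epoch = [2.0] * 400
stimulus_index = 100
epoch[stimulus_index + 150] = 1.0
epoch[stimulus_index + 250] = 5.0
features = peak_amplitude_features(epoch, 1000, stimulus_index)
```

The sketch assumes the epoch contains at least 100 ms of signal before the stimulus; otherwise the baseline window would be incomplete.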

In order to evaluate the performance of a deep learned model we pass the 5 input channels of the EEG through a single 2D convolution layer with 32 kernels, each with a kernel size of 64 × 5. The resultant feature vector is passed through an LSTM and the final classification is obtained by a dense layer with softmax activation. Finally, we use the baseline memory model of [1], where the structure of this model is defined in Sec. 4.1.3 and hyper-parameters l and k are evaluated using the validation set and are set to l = 25 and k = 100.

Table 3: Evaluations of EEG based schizophrenia detection task using the dataset proposed

image

When comparing the results it is evident that use of the standard peak amplitude feature is not sufficient to obtain good segregation between the RSz and TD groups. Furthermore, baseline CNN and memory models have not been able to generate satisfactory performance. We believe this is due to the inherent challenges in the task as there is less clear separation between the RSz and TD groups. However, through the utilisation of the proposed memory model and via augmenting the read and write mechanisms, we are able to attain superior classification results.

In this section we provide qualitative evidence indicating the importance of neural plasticity in the memory addressing mechanism and interpret what the model has learnt in terms of model activations.

5.1. Importance of Neural Plasticity in NMNs

We extract the output of the memory model in [1] (i.e Eq. 3) and the output of the proposed memory model (i.e Eq. 12) for the experiment outlined in Sec. 4.3. The model in [1] utilises attention to extract relevant information from the stored knowledge in the neural memory while the proposed method exploits a combination of fixed weights and neural plasticity components.

Figure 7: Visualisation of memory outputs from the memory model of [1] (i.e. Eq. 3) and the Plastic NMN (Proposed) method (i.e. Eq. 12). Colours from blue to yellow correspond to low-high values.

In Fig. 7 we visualise the extracted activations for 3 sample inputs, where colours from blue to yellow correspond to low to high output values. From this illustration it is evident that the memory outputs are significantly sparser in the proposed method. This indicates that neural plasticity is able to systematically strengthen or weaken the connections based on their importance, which leads to the identification of the most salient information for decision making.
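The visual comparison above could also be summarised numerically. The following minimal sketch measures sparsity as the fraction of near-zero entries in a memory output vector; the tolerance and both example vectors are hypothetical and are not values produced by either model.

```python
def sparsity(vector, tol=1e-3):
    """Fraction of entries whose magnitude is at most `tol`."""
    return sum(abs(v) <= tol for v in vector) / len(vector)

# Hypothetical memory outputs, for illustration only.
dense_output = [0.42, 0.35, 0.51, 0.47, 0.39, 0.44, 0.50, 0.38]
sparse_output = [0.0, 0.0, 0.91, 0.0, 0.0, 0.0, 0.77, 0.0]

dense_score = sparsity(dense_output)
sparse_score = sparsity(sparse_output)
```

A higher score means that fewer entries carry the information passed on to the classification layer.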

In addition, using Figs. 8 and 9 we provide memory embedding space visualisations in order to illustrate how the proposed memory network discriminates between different classes in the schizophrenia risk detection task presented in Sec. 4.3 as well as the multi-class MRI tumour type classification problem presented in Sec. 4.2.

We randomly sample 500 inputs from the test set, extract their memory outputs, apply PCA [54] and plot the resulting embeddings in 2D. Fig. 8 presents the resultant plot, where the TD and RSz classes are indicated based on the ground truth class identities. We observe a clear separation between the TD and RSz classes. In Fig. 9 we illustrate the memory embedding plot for the MRI tumour type classification task in Sec. 4.2. Similar to the previous analysis, we extract memory outputs for a randomly selected sample of 500 MRI inputs from the test set and apply PCA to plot the memory outputs in 2D. The glioma, meningioma and pituitary tumour types are indicated by red diamonds, blue stars and green circles, respectively, based on their ground truth classes. We observe clear separation between the three classes. This demonstrates that even though plasticity has led to additional sparsity in the memory output, the resultant sparse vectors are sufficient to discriminate between the classes. It also highlights a deficiency of the attention based memory output generation process, which leads to much denser outputs and incurs additional overhead on the classification layer, as it has to discriminate between dense vectors.
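The projection step can be sketched with a plain SVD based PCA. This is a minimal example under stated assumptions: the array `memory_outputs` is synthetic random data standing in for 500 extracted memory outputs of dimension k = 100, and `pca_2d` is a helper name introduced here, not part of the authors' implementation.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
# Synthetic stand-in for 500 memory outputs with k = 100.
memory_outputs = rng.normal(size=(500, 100))
embedding = pca_2d(memory_outputs)          # one 2D point per input
```

Each row of `embedding` can then be drawn as a point and coloured by its ground truth class, as in Figs. 8 and 9.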

Figure 8: 2D illustration of extracted memory embeddings for 500 randomly selected samples from the test set of the schizophrenia risk detection task presented in Sec. 4.3.

Figure 9: 2D illustration of extracted memory embeddings for 500 randomly selected samples from the test set of the MRI tumour type classification task in Sec. 4.2.

To further demonstrate the strengths of neural plasticity in the proposed method, in Figs. 10 and 11 we illustrate the attention weights of the baseline memory model of [1], and the fixed weights ˆw and Hebbian trace ˆHebb of the output controller of the proposed method, once the training has completed, for the EEG based schizophrenia risk detection evaluation in Sec. 4.3 and the MRI tumour type classification evaluation in Sec. 4.2, respectively. In the EEG based schizophrenia risk detection evaluation, as k = 100, each of the plots is of dimension 100 × 100. In Fig. 11 we observe dimensions of 150 × 150 for the proposed model, as k is set to 150 in the MRI tumour type classification evaluation, while the attention weights of the baseline memory model have a dimension of 120 × 120. Colours from blue to yellow correspond to low-high connection weight values.

When analysing the plots it is clear that the attention weight matrix of the baseline memory module is denser than the fixed weights in ˆw, demonstrating that only a subset of connections is required in all scenarios. It should be emphasised that in the baseline memory model the attention weights are fixed once the training completes. In contrast, the ˆHebb of the proposed method evolves over time, and its visualisation illustrates a subset of connections that change over time to retrieve the salient components that are required at that particular time step. We argue that this process facilitates the sparser fixed weight matrix, as this can simply encode core information common to all patients, with the Hebbian trace able to adapt to extract salient information for each individual case.

Fig. 12 visualises how the Hebbian trace, ˆHebb, changes between the training and testing instances. We visualise ˆHebb once the training has completed (i.e. at the start of testing) and when the testing process has been completed (i.e. once it has seen all samples in the testing dataset). The changes in the Hebbian trace demonstrate that different connections are needed in order to extract salient information in different cases, as opposed to having fixed connections throughout.
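To make the interplay between fixed and plastic connections concrete, the sketch below follows the general form of differentiable plasticity in [9]: a fixed weight matrix plus a per-connection coefficient multiplied by a Hebbian trace that keeps updating at test time. It is not the read, write or output equation of the proposed model (those are defined earlier in the paper); the function name, the tanh non-linearity, the learning rate `eta` and all numerical values are assumptions for illustration.

```python
import numpy as np

def plastic_step(x_pre, w, alpha, hebb, eta=0.1):
    """One step of a plastic layer in the style of differentiable plasticity.

    w     : fixed weights, frozen once training completes
    alpha : per-connection plasticity coefficients, also frozen
    hebb  : Hebbian trace, updated for every input
    """
    x_post = np.tanh(x_pre @ (w + alpha * hebb))
    # Running average of pre/post co-activations.
    hebb = (1.0 - eta) * hebb + eta * np.outer(x_pre, x_post)
    return x_post, hebb

k = 4  # tiny embedding size for illustration (the paper uses k = 100 or 150)
rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(k, k))
alpha = np.full((k, k), 0.5)
hebb_start = np.zeros((k, k))

hebb = hebb_start.copy()
for _ in range(10):                      # ten hypothetical test-time inputs
    x = rng.normal(size=k)
    _, hebb = plastic_step(x, w, alpha, hebb)

# Largest change in any connection of the trace, analogous to comparing
# the start-of-testing and end-of-testing panels of a trace visualisation.
trace_change = np.abs(hebb - hebb_start).max()
```

Here `w` and `alpha` never change inside the loop, while `hebb` drifts with every input, which mirrors the contrast between static and dynamic connections discussed above.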

5.2. What is the model learning?

In this subsection we attempt to interpret what the proposed model detects as salient when distinguishing between RSz and TD groups in the experiment outlined in Sec. 4.3.

Figure 10: Visualisation of attention weights of the baseline memory, and the fixed weight and Hebbian trace of the proposed output controller for the EEG Based Schizophrenia Detection Evaluation in Sec. 4.3. Colours blue to yellow correspond to low-high connection strengths.

Figure 11: Visualisation of attention weights of the baseline memory, and the fixed weight and Hebbian trace of the proposed output controller for the MRI Tumour Type Classification Evaluation in Sec. 4.2. Colours blue to yellow correspond to low-high connection strengths.

Figure 12: Visualisation of the Hebbian trace, ˆHebb, evolving over time.

First, we extract activations from the first LSTM layer of the proposed model (see Fig. 2) for four randomly selected examples, which are presented in Fig. 13. The model pays more attention to the areas of the input EEG which correspond to sudden fluctuations in the waveforms, identifying important events such as peaks and valleys generated in the mismatch trials.

Figure 13: Activations from the first LSTM layer of the proposed model and its corresponding inputs.

In order to determine the exact channels that contribute to decision making we adapt the statistical interaction analysis framework proposed in [15]. This method ranks the input-layer weights of the neural network by the statistical interactions they form with the first hidden layer. In particular, the utilised pairwise interaction ranking scheme ranks all pairs of input features according to their interaction strengths. Fig. 14 illustrates the output. As our input contains 5 channels (i.e. Fz, FCz, Cz, CPz and Pz), the figure illustrates the possible pairwise connections between these 5 inputs. We observe higher interaction strengths between channels Fz - FCz and FCz - Cz, which supports the findings reported in [48], which also indicate a higher degree of activity in the frontal lobe of the brain in the auditory mismatch paradigm.
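A simplified sketch of such a pairwise ranking is given below, in the spirit of [15] rather than as a reproduction of it. For every hidden unit, the strength of a channel pair is taken as the smaller of the two absolute input weights, scaled by that unit's influence on the output, and the strengths are summed over hidden units. The weight matrix `W1`, the influence vector `z` and the function name are hypothetical values chosen for illustration, not weights from the trained model.

```python
from itertools import combinations
import numpy as np

def pairwise_strengths(W1, z, names):
    """Simplified pairwise interaction strengths from first-layer weights.

    W1    : (hidden_units, inputs) first-layer weight matrix
    z     : (hidden_units,) non-negative influence of each hidden unit
    names : one label per input feature
    """
    A = np.abs(W1)
    return {
        (names[i], names[j]): float(np.sum(z * np.minimum(A[:, i], A[:, j])))
        for i, j in combinations(range(len(names)), 2)
    }

channels = ["Fz", "FCz", "Cz", "CPz", "Pz"]
# Hypothetical weights: unit 0 mixes Fz and FCz, unit 1 mixes FCz and Cz.
W1 = np.array([[0.9, 0.8, 0.0, 0.0, 0.0],
               [0.0, 0.7, 0.6, 0.0, 0.0]])
z = np.array([1.0, 1.0])

strengths = pairwise_strengths(W1, z, channels)
ranked = sorted(strengths, key=strengths.get, reverse=True)
```

With 5 channels there are 10 possible pairs; in this toy setting the two top-ranked pairs are Fz - FCz and FCz - Cz because only those pairs share a hidden unit.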

Figure 14: Visualisation of feature importance of EEG channels in terms of the pairwise interaction based analysis proposed in [15]. Colours from blue to yellow correspond to low-high importance.

5.3. Hardware and Time Complexity Details

The implementation of the proposed model is completed using Keras [55] with a Theano [56] backend. The proposed model does not require any special hardware such as GPUs to run, and the model used for the abnormal EEG identification task presented in Sec. 4.1 has 170K trainable parameters. We measured the runtime of the proposed method using the test set of the TUH EEG database [43] used in Sec. 4.1. The proposed model is capable of generating 1000 predictions (i.e. it takes 1000 input sequences, each 60 seconds in length, and generates 1000 classifications) in 13.16 seconds on a single core of an Intel Xeon E5-2680 2.50 GHz CPU.
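For readers who wish to run a comparable measurement, a minimal timing harness is sketched below, together with the per-prediction figures implied by the numbers above. The helper `seconds_per_prediction` and its `predict` argument are placeholders introduced for this example and are not part of the authors' code.

```python
import time

def seconds_per_prediction(predict, inputs):
    """Average wall-clock time of `predict` over `inputs`, in seconds."""
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    return (time.perf_counter() - start) / len(inputs)

# Figures reported in the text: 1000 predictions in 13.16 s.
reported_latency = 13.16 / 1000          # seconds per 60 s EEG sequence
reported_throughput = 1000 / 13.16       # predictions per second

# Usage with a trivial stand-in for a model's predict function.
measured = seconds_per_prediction(lambda x: x, [1, 2, 3])
```

The reported figures correspond to roughly 13 ms per 60 second input sequence, or about 76 predictions per second on a single CPU core.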

Figure 15: Runtimes for different values of (a) the memory length, l, and (b) the embedding dimension, k.

In the same experimental setting, we measured the time required to generate 1000 predictions for different lengths of the memory module, l, and different embedding dimensions, k. Results are given in Fig. 15. With the memory length, l, the runtime grows approximately linearly, while with the embedding dimension, k, it grows exponentially. This is because an increase in the embedding dimension increases the dimensionalities of the fixed weight and plasticity components in the memory, which incurs significant additional computational overhead.
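One contributing factor can be made concrete with a small count. Assuming that the fixed weight matrix and the Hebbian trace are each of size k × k (as the 100 × 100 and 150 × 150 plots in Figs. 10 and 11 suggest), the number of entries that must be stored and updated grows with the square of k. The helper below is a hypothetical illustration of that count only; it does not model the measured runtimes.

```python
def plastic_entries(k, n_matrices=2):
    """Entries in `n_matrices` square k x k matrices
    (e.g. one fixed weight matrix plus one Hebbian trace)."""
    return n_matrices * k * k

counts = {k: plastic_entries(k) for k in (50, 100, 150)}
```

Doubling k from 50 to 100 therefore quadruples the number of entries in these components, whereas the memory length l does not affect their size under this assumption.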

In this paper, we propose a plastic neural memory network architecture which exploits the advances in Neural Memory Networks (NMNs) and neural plasticity. We introduce plasticity in the memory access mechanism, which allows the underlying framework to pay varying levels of attention to the different problem specific and subject specific information that it needs to retrieve from the stored knowledge within the memory. We point out and illustrate the drawbacks of current attention-based knowledge retrieval processes in NMNs and demonstrate how neural plasticity can be used to overcome these deficiencies. Through the evaluation conducted on three challenging anomaly detection tasks in the medical domain, we demonstrate that our proposed memory architecture is able to outperform all considered baselines. Through visualisation of the mechanisms of the proposed memory architecture, we provide evidence of the power of our memory addressing process to capture the salient information cues that are needed for anomaly detection. The varied nature of the evaluations, which include both one and two-dimensional data, demonstrates how the proposed model can be directly applied to any anomaly detection or classification problem where modelling long term relationships is necessary. In future work, we will explore applications of neural memory plasticity for encoding multimodal inputs, and settings where the sparsity of the generated memory outputs can be utilised to summarise and represent denser input representations.

K.R.L was supported by an Australian Research Council Future Fellowship (FT170100294). Funding for collection of EEG data in the schizophrenia risk sample was provided by a National Institute for Health Research (UK) Career Development Fellowship (CDF/08/01/015) and BIAL Foundation Research Grants (36/06 and 194/12).

[1] T. Munkhdalai, H. Yu, Neural semantic encoders, in: Proceedings of the conference. Association for Computational Linguistics. Meeting, Vol. 1, NIH Public Access, 2017, p. 397.

[2] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, R. Socher, Ask me anything: Dynamic memory networks for natural language processing, in: International conference on machine learning, 2016, pp. 1378–1387.

[3] C. Xiong, S. Merity, R. Socher, Dynamic memory networks for visual and textual question answering, in: International conference on machine learning, 2016, pp. 2397–2406.

[4] T. Fernando, S. Denman, S. Sridharan, C. Fookes, Task specific visual saliency prediction with memory augmented conditional generative adversarial networks, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1539–1548.

[5] T. Yang, A. B. Chan, Learning dynamic memory networks for object tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 152–167.

[6] T. Fernando, S. Denman, A. McFadyen, S. Sridharan, C. Fookes, Tree memory networks for modelling long-term temporal dependencies, Neurocomputing 304 (2018) 64–81.

[7] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001).

[8] Y. Perwej, F. Parwej, A neuroplasticity (brain plasticity) approach to use in artificial neural network, International Journal of Scientific & Engineering Research 3 (6) (2012) 1–9.

[9] T. Miconi, J. Clune, K. O. Stanley, Differentiable plasticity: training plastic neural networks with backpropagation, International Conference on Machine Learning (ICML).

[10] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, Enhancing and combining sequential and tree lstm for natural language inference, Annual meeting of the association for computational linguistics (ACL).

[11] M. Mahmud, M. S. Kaiser, A. Hussain, S. Vassanelli, Applications of deep learning and reinforcement learning to biological data, IEEE transactions on neural networks and learning systems 29 (6) (2018) 2063–2079.

[12] Ö. Yıldırım, U. B. Baloglu, U. R. Acharya, A deep convolutional neural network model for automated identification of abnormal eeg signals, Neural Computing and Applications (2018) 1–12.

[13] P. Afshar, A. Mohammadi, K. N. Plataniotis, Brain tumor type classification via capsule networks, in: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 3129–3133.

[14] G. Pachauri, S. Sharma, Anomaly detection in medical wireless sensor networks using machine learning algorithms, Procedia Computer Science 70 (2015) 325–333.

[15] M. Tsang, D. Cheng, Y. Liu, Detecting statistical interactions from neural network weights, International Conference on Learning Representations.

[16] Z. Ghafoori, S. M. Erfani, S. Rajasegarar, J. C. Bezdek, S. Karunasekera, C. Leckie, Efficient unsupervised parameter estimation for one-class support vector machines, IEEE transactions on neural networks and learning systems 29 (10) (2018) 5057–5070.

[17] M. Zhang, C. Chen, T. Wo, T. Xie, M. Z. A. Bhuiyan, X. Lin, Safedrive: online driving anomaly detection from large-scale vehicle data, IEEE Transactions on Industrial Informatics 13 (4) (2017) 2087–2096.

[18] Y. Cao, Y. Li, S. Coleman, A. Belatreche, T. M. McGinnity, Adaptive hidden markov model with anomaly states for price manipulation detection, IEEE transactions on neural networks and learning systems 26 (2) (2014) 318–330.

[19] A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, G. Bontempi, Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems 29 (8) (2017) 3784–3797.

[20] T. Fernando, S. Denman, S. Sridharan, C. Fookes, Pedestrian trajectory prediction with structured memory hierarchies, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2018, pp. 241–256.

[21] T. Fernando, S. Denman, S. Sridharan, C. Fookes, Learning temporal strategic relationships using generative adversarial imitation learning, in: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 113–121.

[22] A. Soltoggio, K. O. Stanley, S. Risi, Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks, Neural Networks.

[23] D. O. Hebb, The organization of behavior: A neuropsychological theory, Psychology Press, 2005.

[24] S. Nolfi, D. Parisi, Auto-teaching: networks that develop their own teaching input, in: Free university of brussels, Citeseer, 1993.

[25] E. T. Rolls, S. M. Stringer, On the design of neural networks in the brain by genetic evolution, Progress in Neurobiology 61 (6) (2000) 557–579.

[26] M. Maniadakis, P. Trahanias, Modelling brain emergent behaviours through coevolution of neural agents, Neural Networks 19 (5) (2006) 705–720.

[27] E. Harris, M. Niranjan, J. Hare, A biologically inspired visual working memory for deep networks, arXiv preprint arXiv:1901.03665.

[28] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, A geometric framework for unsupervised anomaly detection, in: Applications of data mining in computer security, Springer, 2002, pp. 77–101.

[29] E. B. Ermis, V. Saligrama, P.-M. Jodoin, J. Konrad, Motion segmentation and abnormal behavior detection via behavior clustering, in: 2008 15th IEEE International Conference on Image Processing, IEEE, 2008, pp. 769–772.

[30] S. Wu, B. E. Moore, M. Shah, Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2054–2060.

[31] X.-X. Zhang, H. Liu, Y. Gao, D. H. Hu, Detecting abnormal events via hierarchical dirichlet processes, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2009, pp. 278–289.

[32] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, L. S. Davis, Learning temporal regularity in video sequences, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.

[33] J. Masci, U. Meier, D. Cireşan, J. Schmidhuber, Stacked convolutional auto-encoders for hierarchical feature extraction, in: International Conference on Artificial Neural Networks, Springer, 2011, pp. 52–59.

[34] Y. S. Chong, Y. H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, in: International Symposium on Neural Networks, Springer, 2017, pp. 189–196.

[35] B. Kiran, D. Thomas, R. Parakkal, An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos, Journal of Imaging 4 (2) (2018) 36.

[36] N. Abiwinanda, M. Hanif, S. T. Hesaputra, A. Handayani, T. R. Mengko, Brain tumor classification using convolutional neural network, in: World Congress on Medical Physics and Biomedical Engineering 2018, Springer, 2019, pp. 183–189.

[37] S. Roy, I. Kiral-Kornek, S. Harrer, Deep learning enabled automatic abnormal eeg identification, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 2756–2759.

[38] A. Gumaei, M. M. Hassan, M. R. Hassan, A. Alelaiwi, G. Fortino, A hybrid feature extraction method with regularized extreme learning machine for brain tumor classification, IEEE Access.

[39] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, A. v. d. Hengel, Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection, arXiv preprint arXiv:1904.02639.

[40] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.

[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 248–255.

[43] S. Lopez, G. Suarez, D. Jungreis, I. Obeid, J. Picone, Automated identification of abnormal adult eegs, in: 2015 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), IEEE, 2015, pp. 1–5.

[44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[45] L. S, Automated identification of abnormal eegs, MS thesis, Temple University.

[46] J. Cheng, W. Huang, S. Cao, R. Yang, W. Yang, Z. Yun, Z. Wang, Q. Feng, Enhanced performance of brain tumor classification via tumor region augmentation and partition, PloS one 10 (10) (2015) e0140381.

[47] M. Sajjad, S. Khan, K. Muhammad, W. Wu, A. Ullah, S. W. Baik, Multi-grade brain tumor classification using deep cnn with extensive data augmentation, Journal of computational science 30 (2019) 174–182.

[48] J. M. Bruggemann, H. V. Stockill, R. K. Lenroot, K. R. Laurens, Mismatch negativity (mmn) and sensory auditory processing in children aged 9–12 years presenting with putative antecedents of schizophrenia, International journal of psychophysiology 89 (3) (2013) 374–380.

[49] B. Moghaddam, D. Javitt, From revolution to evolution: the glutamate hypothesis of schizophrenia and its implication for treatment, Neuropsychopharmacology 37 (1) (2012) 4.

[50] K. S. Shin, J. S. Kim, D.-H. Kang, Y. Koh, J.-S. Choi, B. F. O’Donnell, C. K. Chung, J. S. Kwon, Pre-attentive auditory processing in ultra-high-risk for schizophrenia with magnetoencephalography, Biological psychiatry 65 (12) (2009) 1071–1078.

[51] K. S. Shin, J. S. Kim, S. N. Kim, Y. Koh, J. H. Jang, S. K. An, B. F. O’Donnell, C. K. Chung, J. S. Kwon, Aberrant auditory processing in schizophrenia and in subjects at ultra-high-risk for psychosis, Schizophrenia bulletin 38 (6) (2011) 1258–1267.

[52] M. Bodatsch, S. Ruhrmann, M. Wagner, R. Müller, F. Schultze-Lutter, I. Frommann, J. Brinkmeyer, W. Gaebel, W. Maier, J. Klosterkötter, et al., Prediction of psychosis by mismatch negativity, Biological psychiatry 69 (10) (2011) 959–966.

[53] K. R. Laurens, S. Hodgins, G. L. Mould, S. A. West, P. L. Schoenberg, R. M. Murray, E. A. Taylor, Error-related processing dysfunction in children aged 9 to 12 years presenting putative antecedents of schizophrenia, Biological Psychiatry 67 (3) (2010) 238–245.

[54] I. Jolliffe, Principal component analysis, Springer, 2011.

[55] F. Chollet, et al., Keras (2015).

[56] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, Y. Bengio, Theano: A cpu and gpu math compiler in python, in: Proc. 9th Python in Science Conf, Vol. 1, 2010, pp. 3–10.

Thank you Tharindu Fernando, Simon Denman, David Ahmedt-Aristizabal, Sridha Sridharan, Kristin Laurens, Patrick Johnston, Clinton Fookes, who authored Neural Memory Plasticity for Anomaly Detection 🙏 This page is the html of their arXiv pdf, with no changes made other than format. Please cite their work