Combating Corrupt Messages in Sparse Clustered Associative Memories

2014·Arxiv

Abstract

Abstract

In this paper we analyze and extend the neural network based associative memory proposed by Gripon and Berrou. This associative memory resembles the celebrated Willshaw model with an added partite cluster structure. In the literature, two retrieving schemes have been proposed for the network dynamics, namely SUM-OF-SUM and SUM-OF-MAX. They both offer considerably better performance than Willshaw and Hopfield networks, when comparable retrieval scenarios are considered. Former discussions and experiments concentrate on the erasure scenario, where a partial message is used as a probe to the network, in the hope of retrieving the full message. In this regard, SUM-OF-MAX outperforms SUM-OF-SUM in terms of retrieval rate by a large margin. However, we observe that when noise and errors are present and the network is queried by a corrupt probe, SUM-OF-MAX faces a severe limitation as its stringent activation rule prevents a neuron from reviving back into play once deactivated. In this manuscript, we categorize and analyze different error scenarios so that both the erasure and the corrupt scenarios can be treated consistently. We make an amendment to the network structure to improve the retrieval rate, at the cost of an extra scalar per neuron. Afterwards, five different approaches are proposed to deal with corrupt probes. As a result, we extend the network capability, and also increase the robustness of the retrieving procedure. We then experimentally compare all these proposals and discuss pros and cons of each approach under different types of errors. Simulation results show that if carefully designed, the network is able to preserve both a high retrieval rate and a low running time simultaneously, even when queried by a corrupt probe.

Index Terms—Associative Memory, Recurrent Neural Networks, Maximum Clique Problem, Branch and Bound Algorithm, Partite Graph

I. INTRODUCTION

A. Background

Associative memories are devices that map pairs of inputoutput patterns and retrieve information from its context directly. Thus they behave differently than traditional memory systems where explicit addresses are required to retrieve the content, There are two phases in using an associative memory: storing and retrieving. In the storing phase, the task is to store (learn) all messages of interest into the memory. In the retrieving phase, given a probe, a modified version of a particular message, one is expected to retrieve (decode) the originally stored version reliably and efficiently.

Associative memories have a variety of uses in different fields, e.g., communication networks [1], signal and image processing [2], database engines [3], anomaly detection systems [4], compression algorithms [5] and face recognition systems [6], to name a few. Some implementations provide query time independent of the number of stored messages.

Neural networks are among the most popular approaches to implement associative memories, e.g., linear associators [7], [8], Willshaw networks [9], [10] are early examples of such attempts. In early 1980’s, the seminal work of Hopfield [11], [12] on associative memories brought back research interest for the neural network community. For the history and developments of associative memories, see Palm’s recent survey [13] and the references therein.

Quite recently, Gripon and Berrou propose a new family of sparse neural networks for associative memories [14], [15] which we refer to as the Clustered Sparse Associative Memory (CSAM). In short, an CSAM is a modification of Willshaw networks with partite cluster structures. It resembles the model proposed by Moopenn et al. [16] with the original retrieving scheme, SUM-OF-SUM [15]. But it also allows for neuron self excitations, as well as a new retrieving scheme SUM-OF- MAX [17] which improves the retrieval rate by a large margin. A brief description of CSAMs and these two basic retrieving schemes will be given in Section II.

B. Related Work

Gripon and Berrou propose the network structure in [14]. In [15], they show that using the same amount of storage, CSAMs outperform Hopfield networks in diversity (the number of patterns a network can store for a targeted performance), capacity (the maximum amount of stored information in bits for a targeted performance) and efficiency (the ratio between capacity and the amount of information in bits consumed by the network when capacity reaches its maximum) simultaneously. They later interpret CSAMs using the formalism of error correcting codes [17] and propose a new decoding scheme called SUM-OF-MAX, which significantly decreases retrieval error. Jiang et al. [18] modify CSAMs to store long sequences by incorporating directed links. Aboudib et al. [19] extend the structure so that messages of different lengths can be stored in the same network. They also summarize criteria to build possible retrieving schemes and study the number of iterations required by each scheme. Yao et al. [20] discover a previously overlooked problem that the network may converge to a bogus fixed point and propose heuristics to mitigate the issue. A novel post-processing algorithms is also developed, customized to the partite structure of CSAMs, which brings notably better retrieval rates than the standard SUM-OF-MAX scheme.

Aside from the architectural and algorithmic aspects of the network mentioned above, efficient implementations and applications are being developed as well. Jarollahi et al. [21] use the field programmable gateway array (FPGA) to implement SUM-OF-SUM on a small sized network. Later in [22], they implement SUM-OF-MAX which runs faster, thanks to bitwise operations replacing the resource demanding summation and comparison units required by SUM-OF-SUM. The same group of authors also develop a content addressable memory in [23] saving 90% of the energy consumption. Larras et al. [24] develop an analog version of the network, which consumes less energy. Meanwhile, it is more efficient in terms of both speed and circuit surface, comparing with an equivalent digital circuit. After analyzing the convergence and computational properties of both SUM- OF-SUM and SUM-OF-MAX, Yao et al. [25] propose a hybrid scheme and successfully implement the network on a GPU. An acceleration of is witnessed without any loss of accuracy.

C. Contributions

Former discussions and experiments with CSAMs concentrate on the erasure scenario, where a partial message is used as a probe to the network, in the hope of retrieving the full message. In this regard, SUM-OF-MAX outperforms SUM-OF- SUM in terms of retrieval rate by a large margin [25]. However, we argue that real world data applications may include more challenging scenarios, where inputs contain errors. SUM-OF- MAX faces a severe limitation in such scenarios as its stringent activation rule prevents a neuron from being reactivated. The contributions of this manuscript are: 1) We categorize and analyze different errors when using an CSAM so that the erasure scenario and the corruption scenario can be treated consistently.

2) We make an amendment to the existing network structure to improve the retrieval rate further at the cost of an extra scalar per neuron.

3) We propose five different approaches to deal with errors, thus extend the network capability and also increase the robustness of retrieving procedure. The numerical experiments in Section VI compare all proposals and modifications under different types of errors to see the pros and cons of each approach. Both simulated data and the famous USPS dataset are tested. Experimental results show that if carefully designed, the network is able to preserve both a high retrieval rate and a low running time simultaneously, even when queried by a corrupt probe.

D. Paper Organization

The rest of the paper is organized as follows. Section II reviews the structure of CSAMs and two existing retrieving algorithms, SUM-OF-SUM and SUM-OF-MAX. We explain the reason in brief why SUM-OF-MAX should be favored over SUM-OF-SUM. Section III categorizes and analyzes different errors. We discuss three basic types of errors and how SUM- OF-SUM and SUM-OF-MAX behave against them. Section IV

Fig. 1. A toy network with 5 clusters of 4 neurons each. We number clusters and neurons clockwise. Two messages are stored.

describes the structural amendment we make to CSAMs so that retrieving bias can be mitigated and better retrieval rates can be achieved, at the cost of an extra scalar for each neuron. In Section V, five different approaches are proposed along with their pseudo code implementations. Section VI compares numerically all proposals under different errors separately in details using both simulated and real world data. The manuscript concludes in Section VII.

II. SPARSE CLUSTERED ASSOCIATIVE MEMORIES

A. Structure

The structure of an CSAM [14] is closely coupled with stored patterns. Consider a C-symbol tuple, , with each symbol taking L possible values, i.e., . We call such a tuple a message. In this setup, a network of n = CL neurons is used, with C clusters corresponding to different symbols, each having L neurons representing . If , the neuron in the cluster, neuron(c, l), activates. Since a symbol can only take one value at a time, for a given message, in each cluster, only one single neuron activates accordingly. The locations of the active neurons express a particular message.

An CSAM is a binary valued neural network in the sense that the weights on connections between a pair of neurons can either exist (1) or not (0). Initially, no edge exists. When storing a message, the corresponding active neurons and the connections in between are added into the network, forming a clique (complete sub-graph). Fig. 1 depicts a toy CSAM with C = 5 clusters and L = 4 neurons each. We label clusters and neurons clockwise. Two messages are stored in this particular network instance, and the black clique expresses the message .

Since a stored message corresponds to a clique in the network, it is equivalent to retrieve a message or its corresponding clique. More precisely, when a probe is given (either a partial message , or a corrupt one ), the network dynamic aims at converging to a clique containing similar neurons. Previous studies [15], [17], [25], [20] focus on the erasure (partial message) scenario. However, corrupt probes are more often in real world applications. Think of digit images in USPS dataset, we are more likely to work with noisy images (black pixels becomes white or vice versa) than partial images. Even if the erasure scenario is considered, it is often tricky to recognize where the missing pixels are in advance. Unfortunately, this is a prerequisite for the existing retrieving algorithms SUM-OF-SUM [15] and SUM-OF-MAX [17] to initialize correctly. Therefore, both need to be revised to extend their use.

B. Retrieving Algorithms

Active neurons are considered as energy sources emitting signals along the edges. Two basic iterative retrieving algorithms exist for CSAMs. The set of active neurons at the end of the iterative retrieval process corresponds to the network response to the probe.

1) SUM-OF-SUM: The default rule SUM-OF-SUM [14], [15] for CSAMs is also applied in the model of Moopenn et al. [16]. Initially, neurons corresponding to the remaining probes are active, transmitting signals, whereas all neurons in the missing clusters deactivate. After each iteration, neurons receive variable numbers of signals. Only the neurons with the most signals in each cluster remain active in the next iteration.

Let denote the indicator function of whether neuron(c, l) connects to , i.e.,

Let denote the indicator function of the potential for neuron(c, l) in iteration t, i.e.,

We denote by the score, i.e., the count of the number of signals neuron(c, l) receives at iteration t. The SUM-OF-SUM retrieving algorithm are given by

Essentially, Eq. (3) counts the score for each neuron. It involves summing over all clusters and all neurons within each cluster, hence the name SUM-OF-SUM. Eq. (4) finds the neurons with the most signal in each cluster, and Eq. (5) keeps them active.

2) SUM-OF-MAX: In contrast, SUM-OF-MAX [17] activates a neuron if and only if it receives signals from every other clusters plus the self excitation. Multiple signal contributions from the same cluster do not sum up. However, in order for SUM-OF-MAX to succeed, it is prerequisite to activate all the neurons in the missing clusters initially, exactly opposite to SUM-OF-SUM.

SUM-OF-MAX modifies the procedure to be

The formulation above is essentially the same as in [15], [17], with the reinforcement factor as suggested in [25].

3) Advantages of SUM-OF-MAX: Following the terminologies in [19], Eq. (3) and Eq. (6) are dynamic rules computing neuron potential scores, whereas Eq. (4), Eq. (5) and Eq. (7) are activation rules determining neuron activities. In other words, SUM-OF-SUM counts individual signals as the dynamic rule and selects the local winners with the most signals as the activation rule; SUM-OF-MAX counts clusterwise signals as the dynamic rule and eliminates losers short of signals as the activation rule.

This iterative process continues until the network converges if it ever does. As a matter of fact, we construct a counter example in [25] showing that SUM-OF-SUM might oscillate even in very basic cases. In addition, we manage to prove the convergence of SUM-OF-MAX. This is one of the reasons we prefer SUM-OF-MAX over SUM-OF-SUM. The second reason is that SUM-OF-SUM counts individual signals possibly propagating decoding errors from iteration to iteration, whereas SUM- OF-MAX counts clusterwise contributions which stops signals from the same cluster summing up. This explains why SUM- OF-MAX outperforms SUM-OF-SUM in terms of retrieval rate by a large margin, especially in challenging scenarios, e.g., either the number of stored messages or the number of erased symbols increases. For detailed performance comparisons and different initialization schemes, see [17], [25]. The last but not least reason is that SUM-OF-MAX sometimes leads to simpler hardware. For instance, the authors in [22] implement SUM-OF-MAX using AND and OR gates only, eliminating the requirement for inefficient comparison and addition units as in SUM-OF-SUM.

III. SOURCE OF ERRORS

Let us analyze the source of errors which might influence the network. We assume no error in the storing phase. Associative memories can be viewed as a decoder in a communication system. Due to the imperfect nature of communication channels, the receiver tries to recover a noisy signal from a prede-fined codebook (stored messages). Therefore, a straightforward type of errors is the channel error, which in our context is a corrupt symbol in the probe.

Interestingly, another less obvious type of errors also exists, as we look closely into the way how CSAMs work. Keep in mind that CSAMs do not manipulate symbols directly. A message has to be first transformed to a sparse binary representation, upon which the network operates. This encoding step itself and the following retrieving iterations might introduce errors as well. We call this type the retrieving error.

Retrieving error may occur in three basic ways and the combination of them:

• Insertion. Multiple neurons activate in a cluster.

• Omission. The desired neuron deactivates.

• Shift. A spurious neuron activates in place of the expected one. We can also think of a shift being an omission followed by an insertion at a wrong position. Although similar, an omission error is a more general but challenging situation than a partial probe in the erasure scenario. The latter assumes both the number and the positions of the missing symbols are known in advance to the retrieving procedure. This is not the case for omission errors. It is the responsibility of the retrieving algorithm to make reasonable decisions on the fly. Note that a shift error is indistinguishable from a channel error assuming the binary representations are flawless. Therefore, if a retrieving algorithm covers all three aspects of retrieving errors, it should perform well against channel errors as well, and in both erasure and corruption scenarios.

SUM-OF-SUM and SUM-OF-MAX behave differently towards the three aspects. The differences root in the dynamic rules, i.e., counting the individual or clusterwise signals. SUM-OF- SUM erroneously accumulates neuron scores by the insertion errors, to which SUM-OF-MAX is innately immune. This is exactly the reason that SUM-OF-MAX outperforms SUM-OF-SUM by a large margin for the erasure scenario, as verified by the experiments in [25]. To confirm this claim, think of a loaded network after the storing phase. Given a partial probe, during the initialization phase, the neurons in the missing clusters unconditionally deactivate, whereas the neurons corresponding to the remaining symbols activate. After one iteration, all the cliques containing the remaining neurons are active, among which the desired message is included. The missing clusters possibly contain multiple active neurons in this case, which is as if insertion errors happen.

SUM-OF-MAX is a monotonically decreasing procedure, in the sense that an inactive neuron can never revive back into play. It performs poorly against omission and shift errors, because as soon as the remaining active neurons fail to form a clique during the retrieving process, all neurons throughout the network will deactivate immediately due to Eq. (7). This phenomenon is proved in [25] and witnessed in [20] when applying heuristics to bypass the bogus fixed point. To visualize such a scenario, let us revisit the black clique in Fig. 1, but with a corrupt probe presented. It is easy to check that neuron(1, 2) does not have any edge connecting to other neurons, thus cannot transmit signals. Consequently, neuron(2, 4), neuron(3, 1), neuron(4, 1) and neuron(5, 2) do not receive any signal from cluster 1, and deactivate altogether. This is very disappointing because a small perturbation can easily collapse the whole retrieval.

IV. AMENDMENT TO STRUCTURE

Before making adaptations to the existing algorithms to cope with corrupt probes right away, we detour to make an amendment to the network structure. Note that a probe might associate with multiple messages (cliques), then a natural question is which clique should be favored above the others during the retrieving process. Previous work regards the number of a neuron’s incident edges as a popularity measure of its counterpart symbol. Throughout the stored messages, if symbol often takes the value, i.e., , then neuron(c, l) tends to have more edges connected to it. In [20], the heuristics proposed to bypass the bogus fixed point depend implicitly on this measure. When developing the clique finding algorithm later in the same work, we explicitly exploit this concept to determine the expanding order in the branch-and-bound recursive search, which brings us not only acceleration but also better retrieval rate. However, this measure is biased in itself. Since the binary weights on the edges indicate either existence (1) or not (0), they leave us no clue whether an edge is shared by multiple messages. This is a common situation, especially in a network with heavy loads. The authors in [19] illustrate an example showing decoding failures in the case of overlapping edges.

In this work, we propose to fix this bias. In addition to be aware of its incident edges, neuron(c, l) also keeps track of a frequency scalar during the storing phase, counting the number of times when , i.e.,

where 1 is the indicator function.

The simple construct introduces little overhead, because the number of additional variables grows linearly with the number of neurons in the network, not quadratically like the number of potential edges does. To be precise, in a network of C clusters with L neurons, the symmetric adjacency matrix is sufficient to describe the whole associate memory. Therefore,bits are consumed by the network itself. The overhead introduced by the frequency is at most where m is the number of stored messages. It has been proved in [15] that m can not be proportional to , so the overhead is neglectable. Nevertheless, the retrieving process can benefit from this construct considerably. One plausible way to use the frequency is to build preference or confidence if a probe retrieves multiple messages. We can also construct normalized measures, e.g., average frequency per neuron in a retrieved clique, to extend CSAMs to store messages of variable lengths [19].

V. RETRIEVING CORRUPT MESSAGES

For description convenience, we should be acquainted with the aggregated representation s, v and W. Here s and v are vectors of length n = CL, respectively the stacked vectors of and by the cluster index, and W is the adjacency matrix describing all with the diagonal elements being 1. Colons (:) are seen occasionally in subscripts, indicating a particular index looping over all possible values. For instance, stands for the scores of all neurons in cluster c, with the neuron index ranging from 1 to L. We will use all the notations whichever appears to be the most convenient to us. For the completeness of our discussion, we also illustrate two building modules: Joint [25] and FindClique [20] with small modifications in Appendix. In brief, Joint is to apply

one iteration of SUM-OF-SUM followed by SUM-OF-MAX, an optimized way to retrieve the fixed point state after the network converges; FindClique is to find cliques in the fixed point state corresponding to stored messages. As mentioned in Section IV, FindClique in the original work [20] determines the expanding order according to the number of a neuron’s incident edges. Since Section IV fixes this bias, we change the measure to the frequency vector f, the stacked version of all .

A. Direct Approach

Applying standard SUM-OF-SUM to the corruption scenario is able to produce meaningful retrievals only when the network is lightly loaded. In this direction, [19] discusses several more complicated variants, but they usually require user defined parameters which are not so obvious to determine1.

At this point, we argue that the performance of the direct approach degenerates drastically when the network becomes saturated. Another huge problem with this approach is that it is not guaranteed to converge after a finite number of iterations thus some iteration cap is required to terminate the algorithm.

B. Direct Plus Approach

The direct approach of applying SUM-OF-SUM is not a satisfactory solution to the corruption scenario. We see in Section III that counting individual signals makes SUM-OF- SUM sensitive to insertions. Although SUM-OF-MAX is not affected by insertion errors because its dynamic rule counts clusterwise signals, naively applying it does not work well against omission and shift errors due to its stringent activation rule, also mentioned in Section III. Therefore, a possible scheme is to hybrid clusterwise signals from SUM-OF-MAX with the activation rule of SUM-OF-SUM (winner-takes-all) in Algorithm 1. We call it the direct plus approach. Two features are different from the direct approach:

• Clusterwise signals replace individual signals.

• A clique finding procedure is applied at the end of the algorithm. Although not completely, these two modifications do mitigate the convergence problem and change the characteristics of the direct approach, making the direct plus approach a strong competitor in the comparison; see Section VI.

C. Construct Approach

As mentioned before, SUM-OF-MAX is a monotonically decreasing procedure. It finds a superset of neurons that includes the desired message and eliminates improbable neurons gradually. We propose to go along the opposite direction in Algorithm 2, adding probable neurons into a subset of neurons gradually and constructing the cliques from scratch. Instead of having the clique finding procedure at the end as a post-processing step in [20], FindClique is now served as an integrated part of the algorithm. The basic idea is to think of

all active neurons comprising an active set. Every iteration, we add some neurons into this active set, and try to find a clique within it. The retrieving process terminates once a clique is encountered. Note that FindClique is modified slightly from [20] in that only the neurons with their clusterwise signals being exactly C can go into the recursive procedure. This condition is not required there in [20] as it is ensured by SUM- OF-MAX implicitly.

Line 2–12 compute the clusterwise signal for each neuron. Line 13–19 update the effective adjacency matrix, since not all the edge weights in the original adjacency matrix W are relevant to the active set thus to the clique finding procedure. Line 23–24 look through all the inactive neurons and add one with the most signals into the active set.

Unfortunately, the plain construct approach in Algorithm 2 runs discouragingly slow due to nested levels of loops. Three tricks can help to reduce the running time significantly. The first trick lies when we compute the clusterwise signals at line

Algorithm 2: The construct approach of retrieving corrupt messages.

Input: The adjacency matrix W, the input probe v Output: The retrieved clique Q

1 repeat

2 for to n do 3 4 foreach c do

5 foreach l do

6 if then

7 8 break

9 end

11 end

12 end

13 for to n do 14 for to n do 15 if then

16 17 end

18 end

19 end

// A variant of the clique finding procedure in [20], see Appendix.

20 obtain a smaller graph that only contains neurons with their clusterwise signals being exactly C 21 split into C sets with each accommodating the neurons in a different cluster

22 FindClique(R)

// Set an inactive neuron with the most clusterwise signals active.

23 arg max24 = 1

25 until

2–12. Since the score of each neuron increases monotonically with the number of active neurons, we do not need to update the score function for every neuron. Instead, we only update the neurons whose clusterwise signal is less than the number of clusters C.

Notice that updating the effective adjacency matrix at line 13–19 inside the big loop is a time consuming operation even in a network of a reasonable size, since the adjacency matrix is quadratic in the number of neurons. In fact, it is not necessary to update the whole adjacency matrix. The active set enlarges incrementally, which means all neurons in the current iteration still exist in the next. Therefore, only the elements associated with the newly added neuron need to be updated, i.e., if neuron(c, l) is added, only the row (and column) of the adjacency matrix needs to be updated. This modification brings the time complexity down from to O(n).

The third trick is in line 23–24, rather than adding one neuron with the most clusterwise signals into the active set, we subjoin multiple neurons at a time in the hope of finding a clique in earlier iterations and quitting the algorithm as soon as possible. There are plenty of subjoining schemes that could be applied, which balance between retrieval rate and running time. For instance, we could add all neurons with the most clusterwise signals into the active set at one shot, or we could look into different clusters and add the neurons with the most signals in each cluster. The more neurons are added each time, the sooner the algorithm can find a clique, and the shorter the running time will be. However, adding too many neurons potentially increase the number of cliques that can be found in the network; the probability of retrieving the correct one decreases. Due to good empirical performance, we propose to check through all inactive neurons, subjoining it if its score is not lower than all the previous neurons.

D. Delegate Approach

The partial probe problem in erasure scenarios has already been solved elegantly by the joint scheme [25] followed by the clique finding procedure [20]. Therefore, if somehow we can transform the corrupt probe problem into the partial probe problem, Joint and FindClique can be exploited untouched. We propose such a delegate approach illustrated in Algorithm 3.

In Algorithm 3, line 2, 4, 6 are equivalent to Eq. (3), Eq. (4) and Eq. (5) respectively. Therefore, Algorithm 3 essentially runs one iteration of SUM-OF-SUM at the very beginning. Line 8–11 check if the result coincides with the probe. If the result is identical with the probe for some symbols, the algorithm builds up confidence about the correctness of these symbols. The different clusters are then treated as missing symbols in the erasure scenario, relying on Joint and FindClique to rescue. Note that checking individual signals at line 2 can also be replace by clusterwise signals instead, which is believed resilient to insertion errors.

E. Cut-and-paste Approach

The delegate approach smartly converts a more difficult problem into a known one and solved by existing techniques.

Algorithm 4: The cut-and-paste approach of retrieving corrupt messages.

Input: The adjacency matrix W, the input probe v Output: The retrieved clique Q

// Search for all cliques of different sizes in the probe by incrementally enlarging the targeted cluster.

1 // A set containing the active neurons of the probe. 2 // A set of sets containing all cliques that can be found in the probe. 3 for to n do 4 if then

8 end

18 if connected then

22 end

23 sort A according to clique sizes and the sum of the neuron frequencies if two cliques are of the same size

24 repeat

25 // The positions of the missing clusters. 26 27 28 29 foreach do 30 31

33 // Can be replaced by clusterwise signals. 34 foreach do 35 foreach l do

36 if else 0 37 end

38 end // A variant of the clique finding procedure in [20], see Appendix. // The input graph G(V, E) is determined by W and v, possibly a bogus fixed point.

39 split V into C sets with each accommodating the neurons in a different cluster 40 FindClique(R) 41 until

However, it comes with its own problem. After the first iteration, clusters different from the probe are assumed to be missing symbols. This operation does not ensure the remaining “confident” symbols can actually form a clique. It is possible that no clique is found at the end of the algorithm, thus the retrieved “message” might never be stored in the network.

The cut-and-paste approach described in Algorithm 4, although resembles the delegate approach, tries to solve this particular flaw. Initially, we cut out all cliques (not necessarily of size C) that can be found in the probe, and sort them in descending order according to the clique size and node frequency if multiple cliques are of the same size (line 1–23). Then we paste them one at a time back into the network, assuming the rest clusters are missing, and resort to Joint and FindClique (line 24–41). The purpose of the cut stage is to make sure the remaining confident symbols are always a partial stored message, thus fixing the problem of the delegate approach.

It is necessary to explicitly differentiate the cliques found in the cut stage and the cliques reported as retrieved messages. The cliques found in the cut stage, different in size, populate a candidate set. Each member (clique) is assumed to be the confident symbols in the erasure scenario. Therefore, it is not difficult to understand the rationale that the clique size and node frequency should be used to sort the candidates. A larger clique encourages the retrieved message to be similar with the probe, since more symbols are assumed to be confident. A higher node frequency increases the probability of reporting the correct message if two cliques are of the same size. Keep in mind that finding all cliques in the cut stage is essential to the success of this approach, since a clique might contain erroneous neurons, enlarging the clique size solely by chance and might not correspond to any stored message.

To better understand the cut-and-paste approach, let us revisit again the toy network in Fig. 1. A corrupt probe is given as . We first find out all cliques residing in the probe. In this example, five cliques exist , and . We sort them in descending order according to clique size, so the largest clique will be favored. We then consider it as the partial probe in an erasure scenario, the original message can be successfully retrieved. Hypothetically, if the largest clique does not retrieve a stored message, then one of the smaller cliques will be tested. Among the four cliques of the same size, the frequency scalars will be checked to determine the priority.

One maybe hesitate to search for all cliques in a probe because it is NP-hard. However, this is not completely a disaster for cut-and-paste. Qualitatively speaking, because of the sparsity that each cluster ideally accommodate one active neuron, the running time grows mildly with the number of clusters C. Moreover, since CSAMs are C-partite graphs, the maximum value of the clusterwise signals is an upper bound of all possible clique sizes. In addition, the upper bound is tight if no insertion error occurs.

VI. EXPERIMENTS

All the simulations are performed on a 2.4GHz Intel Q6600 with 8GB memory.

A. Simulated Data

We compare five approaches proposed in Section III under different types of errors separately to better understand the pros and cons of each of them. We simulate on C = 8 clusters, L = 128 neurons each, with increasing number of stored messages. 2000 messages are used for testing. Note that the running time for SUM-OF-SUM is topped by a cap of 10 iterations. Otherwise, it may run arbitrarily slow due to its convergence problem.

2000 4000 6000 8000 10000 12000 14000 16000 18000 0

2000 4000 6000 8000 10000 12000 14000 16000 18000 0

Fig. 2. Comparisons of different approaches proposed in Section III under insertion errors. For (a) and (b), the first neuron of two clusters are activated. For (c) and (d), the first neuron of all clusters are activated. For (e) and (f), all neurons of two clusters are activated.

1) Insertion Errors: Insertion errors can happen in two distinct ways: (1) A large number of symbols suffer from light insertion errors (e.g., one insertion for each cluster). (2) One or two symbols suffer from heavy insertion errors (e.g., all neurons in a cluster are activated). We show the comparison in Fig. 2. For Fig. 2a and Fig. 2b, the first neuron of two clusters are activated unconditionally. For Fig. 2c and Fig. 2d, the first neuron of all clusters are activated. For Fig. 2e and Fig. 2f, all neurons of two clusters are activated.

We see from Fig. 2a and Fig. 2c that the direct plus, construct and cut-and-paste approach perform extremely well, showing great advantages over the direct and delegate approach. In terms of running time, the cut-and-paste is the biggest winner, which runs orders of magnitude faster. The interesting part is in the case where few symbols are experiencing heavy load insertion errors as shown in Fig. 2e and Fig. 2f. This is the only case where cut-and-paste loses the competition. The direct approach stops retrieving anything useful. The cut-and-paste approach falls behind the other three alternatives and it loses its advantage in quick running time completely.

2) Omission Errors: Omission errors are equivalent to the erasure scenario without knowing the locations of the missing symbols in advance, which the algorithms are required to determine on the fly. For Fig. 3a and Fig. 3b, two symbols suffer from omission errors. For Fig. 3c and Fig. 3d, four symbols are affected.

We see from Fig. 3 that the direct plus, delegate and cut-and-paste approach run almost identically, outperforming the direct and construct approach. As omission errors increase, the direct approach is the most sensitive one so that it performs the worst in Fig. 3c. In terms of running time, the cut-and-paste approach beats every other approaches by a large margin as always. The running time of the direct approach seems linearly growing with the number of stored messages.

3) Shift Errors: We implement the shift errors by altering each symbol at a specified probability independently. Shift errors may deteriorate in two ways as well: (1) The number of affected symbols is increasing. (2) The probability of altering symbols is increasing. Fig. 4 compares all approaches under these trends. For Fig. 4a and Fig. 4b, two symbols are subjected to shift errors with probability independently. For Fig. 4c and Fig. 4d, all symbols are subjected to shift errors with probability . For Fig. 4e and Fig. 4f, two symbols are subjected to shift errors for sure.

We see from Fig. 4 that the cut-and-paste approach again tops both in retrieval rate and running time. The direct and direct plus approach perform more or less the same, with the direct plus approach winning the running time. The delegate approach does not perform well under shift errors, even worse than the direct approach.

4) All Errors Combined: In this part, we compare the proposals when different errors are mixed together. All symbols suffer from shift errors where each symbol is altered with probability independently. The first three clusters are subjected to insertion errors with the first neuron in each cluster activated. The last cluster experiences omission errors with every neuron deactivated. Both the retrieval rate and

Fig. 6. The symbol retrieval rates and message retrieval rate comparison between the CSAM and the celebrated Willshaw network on the USPS dataset. Each USPS image is treated as two messages. We consider 8 successive pixels as a symbol. Therefore, C = 16, L = 256.

running time are plotted in Fig. 5.

We see from Fig. 5a that directly applying SUM-OF-SUM performs the worst comparing with any other approaches, meanwhile the running time increases with the number of stored messages in Fig. 5b. The construct and direct plus approach performs almost the same in terms of retrieval rate, which wins over the delegate approach. The cut-and-paste approach performs excitingly well, e.g., when 12000 messages are stored in the network, it can still correctly retrieve more than 20%, whereas the other approaches retrieve nothing. To our surprise, although cut-and-paste is built purely on time demanding clique finding procedures, the running time is exceptionally fast in Fig. 5b. We vary the probability of shift errors, the result is consistent with the comparison in Fig. 5.

We conclude that the cut-and-paste approach outperforms all the rest proposals in both retrieval rate and running time notably. It should be considered as the default retrieving algorithm for CSAMs.

B. USPS Dataset

All the simulated data above are uniformly distributed, which is not a realistic assumption for real world applications. In this sub-section, we test the effectiveness of CSAMs with the celebrated Willshaw network upon which many associative memory systems are constructed, using the famous USPS dataset. In this dataset, 10 digits from 0 to 9 are taken from US postal envelopes, and each of them has 1100 image samples of the size . To make a fair comparison, we do not allow any fancy pre-processing to extract useful features except for binarizing the images to black and white. We treat each image as 2 messages, and every 8 successive pixels as a symbol. Therefore, and . This is a challenging task since the dataset itself is highly skewed, and also because one message is only regarded as successfully retrieved only when all 16 symbols are recovered.

2000 4000 6000 8000 10000 12000 14000 16000 18000 0

Fig. 3. Comparisons of different approaches proposed in Section III under omission errors. For (a) and (b), two symbols suffer from omission errors. For (c) and (d), four symbols are affected.

Out of the 11000 images, we randomly sample 5000 as our stored messages (10000 messages in total), and then corrupt 1000 images as the probes into the network. In Fig. 6, we run the experiment 10 times to average out the retrieval rate as a function of the number of corrupt clusters (symbols). The standard deviation is also plotted for each data point.

Note that CSAMs outperform the Willshaw network considerably in terms of both the message retrieval rate and the symbol retrieval rate across different numbers of corrupt symbols. For instance, when 4 out of 16 symbols are corrupt, the Willshaw network can hardly retrieve a complete image, whereas the CSAM is able to answer 40% of the queries correctly. The small deviations of the plots indicate the performance of the CSAM is stable and robust. To better visualize the results, we also plot some example images in Fig. 7. The first row contains the original images. The second row is the corrupt probes, with 5 symbols (i.e., 40 pixels) being contaminated. The third row is for the retrievals of the Willshaw network. The last row is for the cut-and-paste approach of the CSAM. It is not difficult to tell that the retrievals from CSAMs are superior than those from the Willshaw network.

VII. SUMMARY

CSAMs are recently invented recurrent associative memo- ries with a partite cluster structure featuring iterative retrieving schemes. Two basic retrieving algorithms exist, namely SUM- OF-SUM and SUM-OF-MAX. Due to its close connection to a LDPC-like sparse binary coding, it should be resilient towards noise and errors in nature. However, in former work, most attention is paid to the erasure scenario where a partial message is given as the probe to the network. In this work, we extend CSAMs to cope with corruption scenarios where a corrupt probe is present.

We categorize and analyze the source of different errors when using an CSAM, so that the treatment to both erasure and corruption scenarios can be unified. Then we make an amendment to the network structure. Previously, we assume the incident degree as a measure of a neuron’s popularity. Unfortunately, this is problematic when the network becomes

2000 4000 6000 8000 10000 12000 14000 16000 18000 0

2000 4000 6000 8000 10000 12000 14000 16000 18000 0

Fig. 4. Comparisons of different approaches proposed in Section III under shift errors. For (a) and (b), two symbols are subjected to shift errors with probability independently. For (c) and (d), all symbols are subjected to shift errors with probability . For (e) and (e), two symbols are subjected to shift errors for sure.

Fig. 5. Comparison among all proposals when different errors are mixed together. The first three symbols are subjected to insertion errors. The last symbol experiences omission errors. The rest of the symbols have shift error with probability

Fig. 7. Example plots of digit images with CSAMs and Willshaw networks. The first row are original stored images. The second row are corrupt probes with 5 symbols (40 pixels) damaged. The third row is for the Willshaw network. The last row is for the cut-and-paste rule of the CSAM.

saturated. At the cost of an extra scalar per neuron, we are able to fix this bias and improve further the retrieval rate.

Five different retrieving schemes are proposed to deal with corrupt probes. We implement all algorithms and compare them under different types of errors separately to better understand the pros and cons of each scheme. Out of all approaches, cut-and-paste performs notably better than the others in both the retrieval rate and running time simultaneously, except for the rare case when all neurons in a few clusters activate, which makes the running time significantly worse. We suggest cut-and-paste be the default retrieving algorithm for CSAMs. The USPS dataset is also tested to demonstrate the effectiveness of CSAMs, compared with the Willshaw network, where an exciting improvement is witnessed.

ACKNOWLEDGEMENT

This work was funded, in part, by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds Qu´eb´ecois de la recherche sur la nature et les technologies (FQRNT) and the European Research Council project NEUCOD.

REFERENCES

[1] S. Kaxiras and G. Keramidas, “Ipstash: a set-associative memory ap- proach for efficient ip-lookup,” in INFOCOM 2005. Proc. 24th Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 2, Miami, FL, USA, 2005, pp. 992–1001.

[2] M. E. Valle, “A class of sparsely connected autoassociative morpholog- ical memories for large color images,” IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 1045–1050, 2009.

[3] C. S. Lin, D. C. P. Smith, and J. M. Smith, “The design of a rotating associative memory for relational database applications,” ACM Transactions on Database Systems, vol. 1, no. 1, pp. 53–65, March 1976.

[4] L. Bu and J. Chandy, “FPGA based network intrusion detection us- ing content addressable memories,” in IEEE Symposium on FieldProgrammable Custom Computing Machines, Napa, CA, USA, 2004, pp. 316–317.

[5] K.-J. Lin and C.-W. Wu, “A low-power CAM design for LZ data compression,” IEEE Transactions on Computers, vol. 49, no. 10, pp. 1139–1145, October 2000.

[6] H. Zhang, B. Zhang, W. Huang, and Q. Tian, “Gabor wavelet associative memory for face recognition,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 275–278, 2005.

[7] J. A. Anderson and E. Rosenfeld, Neurocomputing: foundations of research, ser. Bradford Books. MIT Press, 1988, vol. 1.

[8] J. A. Anderson, A. Pellionisz, and E. Rosenfeld, Neurocomputing 2: Directions of Research, ser. Bradford Books. MIT Press, 1993, vol. 2.

[9] D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins, “Non- holographic associative memory.” Nature, vol. 222, pp. 960–962, 1969.

[10] D. Willshaw, “Models of distributed associative memory.” Ph.D. disser- tation, Edinburgh University, 1971.

[11] J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.

[12] ——, “Neurons with graded response have collective computational properties like those of two-state neurons,” Proceedings of the National Academy of Sciences, vol. 81, no. 10, pp. 3088–3092, 1984.

[13] G. Palm, “Neural associative memories and sparse coding,” Neural Networks, vol. 37, pp. 165–171, January 2013.

[14] V. Gripon and C. Berrou, “A simple and efficient way to store many messages using neural cliques,” in IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), Paris, France, 2011, pp. 1–5.

[15] ——, “Sparse neural networks with large learning diversity,” IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1087–1096, 2011.

[16] A. Moopenn, J. Lambe, and A. Thakoor, “Electronic implementation of associative memory based on neural network models,” IEEE Transactions on Systems, Man and Cybernetics, vol. 17, no. 2, pp. 325–331, 1987.

[17] V. Gripon and C. Berrou, “Nearly-optimal associative memories based on distributed constant weight codes,” in Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 2012, pp. 269–273.

[18] X. Jiang, V. Gripon, and C. Berrou, “Learning long sequences in binary neural networks,” in International Conference on Advanced Cognitive Technologies and Applications, Nice, France, 2012, pp. 165–170.

[19] A. Aboudib, V. Gripon, and X. Jiang, “A study of retrieval algorithms of sparse messages in networks of neural cliques,” Neurocomputing Letters, submitted for publication, http://arxiv.org/abs/1308.4506v1.

[20] Z. Yao, V. Gripon, and M. Rabbat, “Improving sparse associative memories by escaping from bogus fixed points,” IEEE Transactions on Neural Networks and Learning Systems, submitted for publication, http://arxiv.org/abs/1308.6003.

[21] H. Jarollahi, N. Onizawa, V. Gripon, and W. Gross, “Architecture and implementation of an associative memory using sparse clustered networks,” in IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Korea, 2012, pp. 2901–2904.

[22] ——, “Reduced-complexity binary-weight-coded associative memories,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013, pp. 2523–2527.

[23] H. Jarollahi, V. Gripon, N. Onizawa, and W. Gross, “A low-power content-addressable-memory based on clustered-sparse-networks,” in IEEE International Conference on Application-specific Systems, Architectures and Processors, Washington, DC, USA, 2013, pp. 305–308.

[24] B. Larras, C. Lahuec, M. Arzel, and F. Seguin, “Analog implementation of encoded neural networks,” in IEEE International Symposium on Circuits and Systems, Beijing, China, 2013, pp. 1–4.

[25] Z. Yao, V. Gripon, and M. Rabbat, “A massively parallel associative memory based on sparse neural networks,” IEEE Transactions on Neural Networks and Learning Systems, submitted for publication, http://arxiv. org/abs/1303.7032.

APPENDIX

TWO BUILDING MODULES FOR ALGORITHMS RETRIEVING CORRUPT MESSAGES

Input: The adjacency matrix W, the probe vector and the set of the missing clusters P. Output: A fixed point v, possibly a bogus fixed point

1 2 foreach do 3 foreach l do

4 if else 0 5 end

6 end

7 8 repeat

9 // Only missing clusters are investigated.

12 // Convert to global index. 13 if then

14 // Only active neurons are investigated; see [25]. 15 else

16 dead false 17 foreach c do

18 signal false 19 foreach l do

20 if then 21 signal true 22 break

23 end

24 end

25 if signal = false then

26 dead true 27 break

28 end

29 end

34 until

designed for accessibility and to further open science