Model-based Classification and Novelty Detection For Point Pattern Data

2017·Arxiv

Abstract

Abstract

Point patterns are sets or multi-sets of unordered elements that can be found in numerous data sources. However, in data analysis tasks such as classification and novelty detection, appropriate statistical models for point pattern data have not received much attention. This paper proposes the modelling of point pattern data via random finite sets (RFS). In particular, we propose appropriate likelihood functions, and a maximum likelihood estimator for learning a tractable family of RFS models. In novelty detection, we propose novel ranking functions based on RFS models, which substantially improve performance.

Index Terms—Classification, novelty detection, naive Bayes model, point pattern data, multiple instance data, point process, random finite set.

I. INTRODUCTION

Point patterns are sets or multi-sets of unordered points (or elements) that can be found in numerous data sources. In natural language processing and information retrieval, the ‘bag-of-words’ representation treats each document as a collection or set of words [1], [2]. In image and scene categorization, the ‘bag-of-visual-words’ representation–the analogue of the ‘bag-of-words’–treats each image as a set of its key patches [3]. In data analysis for the retail industry as well as web management systems, transaction records such as marketbasket data [4], [5] and web log data [6] are sets of transaction items. Other examples of point pattern data could be found in drug discovery [7], protein binding site prediction [8].

One simple approach to the classification problem for point patterns is via the na¨ıve Bayes (NB) classifier, see for example [2], [3], [6]. However, the broader task of learning from point pattern data is more appropriately posed as a Multiple Instance Learning (MIL) problem [9], [10], since multiple instance data or ‘bags’ are indeed point patterns. According to the recent review article [9], there are three paradigms for multiple instance classification, namely Instance-Space (IS), EmbeddedSpace (EM), and Bag-Space (BS). These paradigms differ in the way they exploit data at the local level (individual data points within each bag) or at the global level (the bags themselves as data points). IS is the only paradigm exploiting data at the local level. At the global level, the ES paradigm maps all point patterns to vectors of fixed dimension, which are then processed by standard classifiers for vectors. On the

Dinh Phung gratefully acknowledges support from the Air Force Office of Scientific Research under award number FA2386-16-1-4138.

other hand, the BS paradigm addresses the problem at the most fundamental level by operating directly on the point patterns. The philosophy of the BS paradigm is to preserve the information content of the data, which could otherwise be corrupted through the data transformation process. However, existing methods in the BS paradigm are confined to distancebased approaches [9], while statistical modelling tools and model-based approaches have been overlooked.

In this paper, we introduce statistical models for point pattern data using Random Finite Set (RFS) theory [11], [12], [13]. In particular, we propose appropriate likelihood functions, and a maximum likelihood estimator for learning (from training data) a tractable family of models, called iid-cluster RFSs. Further, in novelty detection where observations are ranked according to their likelihoods, we show that the standard RFS densities are not suitable for point patterns and proposed novel ranking functions that substantially improve performance.

II. A MOTIVATING EXAMPLE

The objective of a classifier is to assign a class label to an unseen data point X that consists of m feature vectors . In the Bayesian framework, the optimal class label is given by argmax, where is the class posterior probability, p(y) is the prior probability of class y, and

is the data model or data likelihood.

The data model used by the na¨ıve Bayes (NB) classifier [14, pp. 718] imposes a conditional independence assumption among the features so that

where is the conditional probability of the i-th feature given class y (see also [15, pp. 82–89], [16, pp. 380– 381]).

In novelty detection or semi-supervised anomaly detection [17], the data likelihood plays an even more important role. In this approach, data are ranked according to a data likelihood (learned from normal data), and data points with likelihoods lower than a threshold are considered as anomalies [18].

Consider the following example on anomalous patterns of daily fallen apples. The apples land on the ground independently from each other, and the probability distribution of the landing positions of the apples is shown in Fig. 1 (the thick solid line). The number of apples and their landing positions for each day are recorded and we are interested in detecting anomalous behavior of the daily apple landing pattern.

Figure 1. Distribution of landing positions. Position is 3 times less likely than which are equally likely. Credit: clipartbest.com (apple tree clipart)

Suppose that on day 1 we observed one apple landing at , and on day 2 we observed two apples landing at and , which of the patterns on these two days is more likely to be an anomaly? To detect anomalies in this setting, it is natural to rank the observed patterns in order of their likelihoods. In this illustration, we follow [2], [3], [6] and use the NB likelihood1

where is pdf of landing positions shown in Fig. 1.

more likely to be an anomaly than the pattern observed on day 2. However, if we measure distance in centimeters, then

and hence the pattern observed on day 2 is more likely to be an anomaly than that of day 1. The likelihood (2) yields contradictory results on the same scenario with different units of measurement!

Note that in the above analysis, the units of the likelihoods were overlooked because is measured in units of or while is measured in units of or . The observed inconsistency arises from the incompatibility in the unit of measurement in the likelihoods and , i.e., we are not “comparing apples with apples”. In general, the unit of the likelihood (2) depends on number of features in X, i.e. the cardinality of X. Hence, comparing the likelihood (2) of different observations is not meaningful, unless they have the same number of features or the features are intrinsically unitless.

Apart from the inconsistency with unit of measurement, such point pattern likelihood also suffers from another problem associated with cardinality. Let us revisit the fallen apples example, however to eliminate the effect of the unit mismatch, this time we restrict ourselves to a finite number of landing positions, by discretizing the interval into 201 points and round the landing positions to the nearest of these points (Fig. 2). Thus, instead of a probability density of the landing positions on the interval we now have a (unitless) probability mass function (pmf) on the discrete set , see Fig. 2.

Figure 2. Distribution of discrete landing positions.

Fig. 3a shows 4 ‘normal’ patterns of fallen apples, each with about 10 locations i.i.d. from the pmf of Fig. 2. Two new observations and whose features are also i.i.d. from the same pmf are shown in Fig. 3b.

(b) New observations. Note that, by NB likelihood, we have

Figure 3. Examples of normal data and anomaly.

Since has only 1 feature whereas the ‘normal’ observations each has around 10 features, it is intuitively obvious that is anomalous. However, its likelihood is much higher than that of – a normal datum (0.009 versus ). This counter intuitive behavior cannot be attributed to the measurement unit inconsistency because the pmf of the features is unitless.

III. MODELS FOR POINT PATTERN DATA

The likelihood (2) was used in the above discussions to illustrate discrepancies with measurement unit and cardinality. However, these discrepancies arise even in the full joint likelihood (1). In this section, we propose models for point pattern data using random finite set, which could address these issues.

A. Random Finite Set

Point patterns can be modeled as random finite sets (RFSs), or simple finite point processes. Point process theory, in general, is concerned with abstract random counting measures. RFSs are geometrically more intuitive and thus better suited for the type of discussions in this article. The likelihood of a point pattern of discrete features is straightforward since this is simply the product of the cardinality distribution and the joint probability of the features given the cardinality. The difficulties arise in continuous feature spaces. In this work, we only consider continuous feature spaces.

A random finite set (RFS) X of X is a random variable taking values in F(X) [19], [11], [12], [20], [13]. In essence, an RFS is a finite-set-valued random variable that is random in the number of elements, as well as the values of the elements. An RFS X can be completely specified by a discrete (or categorical) distribution that characterizes the cardinality |X|, and a family of symmetric joint distributions that characterizes the distribution of the points (or features) of X, conditional on the cardinality. Analogous to random vectors, the probability density of an RFS (if it exists) is essential in the modeling of point pattern data. The probability density of an RFS is the Radon-Nikodym derivative of its probability distribution relative to the dominating measure , defined for each (measurable) , by [19], [12], [21], [22]:

where U is the unit of hyper-volume in X, and is the indicator function for T . The measure is the unnormalized distribution of a Poisson point process with unit intensity u = 1/U when X is bounded. Note that is unitless and consequently the probability density p is also unitless.

In general the probability density of an RFS, with respect to , evaluated at can be written as

where Pr(|X| = m) is the cardinality distribution, and is a symmetric joint probability density of the points given the cardinality, see [23, p. 27] ((Eqs. (1.5), (1.6), and (1.7)), [12], [21].

B. Likelihoods for point pattern data

Instead of a random vector X, we propose to model each point pattern as an RFS X. A general form for the likelihood of X is given by (4), which can capture the cardinality information as well as the dependence between the features. The RFS data model also avoids the unit of measurement inconsistency since the probability density with respect to is unitless.

Imposing the ‘na¨ıve’ conditional independence assumption among the features on the model in (4) reduces to the iid-cluster RFS model [11]

where is a probability density on X, referred to as the feature density, and , with by convention, is the finite-set exponential notation.

When is a Poisson distribution we have the celebrated Poisson point process (aka, Poisson RFS)

where is the mean cardinality. The Poisson model is completely determined by the intensity function [12],[21], [22]. Note that the Poisson cardinality distribution is described by a single non-negative number , hence there is only one degree of freedom in the choice of cardinality distribution for the Poisson model.

Given the training data, a key task in learning is to compute estimates of the underlying parameters of the model. Learning the general model (4) is computationally intensive. The iid-cluster model (5), on the other hand, provides a good trade-off between tractability and flexibility.

C. Maximum Likelihood Estimation

This subsection presents a solution to learning the parameters of an iid-cluster RFS model using maximum likelihood (ML) estimation.

Given a finite list of observations and a parametrized probability density on Z, we denote

Learning the underlying parameters of an iid-cluster model amounts to estimating from training data. Furthermore, the MLE of the iid-cluster model parameters separates into the MLE of the cardinality distribution parameters and MLE of the feature density parameters . This is stated more concisely in the following Proposition.

Proposition 1. Let be N i.i.d. realizations of an iid-cluster RFS with parametrized cardinality distribution and feature density . Then the MLE of is given by

where is the disjoint union of . Proof. Using (8), we have

Hence, to maximize the likelihood we simply maximize the first and last bracketed terms in (11) separately. This is achieved with (9) and (10). QED.

Observed from Proposition 1 that the MLE of the feature density parameters is identical to that used in NB. For example, if the feature density is a Gaussian , then the parameters of its ML estimates are:

Consequently, the iid-cluster model requires only one additional task of computing the MLE of the cardinality distribution parameters, which is relatively inexpensive.

, the MLE of the cardinality distribution is given by

Since there are K parameters , we require a suf-ficiently large dataset (significantly larger than K). For a small dataset, a cardinality distribution with a small number of parameters should be used to avoid over-fitting, e.g. Poisson, i.e. where its MLE is given by

Conceptually, Proposition 1 can be extended to the general RFS model (4), which relaxes the na¨ıve independence assumption. The MLE of the cardinality distribution parameters is computed as for the iid-cluster RFS model. However, for the feature distribution, instead of we need to compute the MLE of the parameters for the joint densities, i.e.

which is far more complex in general. Imposing additional assumptions such as TAN may provide some simplifications. Alternative models such as mixture of iid-cluster RFSs [24] are also promising. However, these are topics for future research.

D. Numerical experiments

This subsection presents two classification experiments with simulated and real-world data. These experiments involve learning from training data, and then use the learned models to classify new observations. The results are bench-marked against the NB model. The first experiment uses simulated data so as to illustrate the benefit of cardinality information when the features between the classes are similarly distributed. The second experiment uses real-world data from the Texture images dataset [25].

1) Classification with simulated data: In this experiment,

Fig. 4 plots the features and cardinalities of data sampled from these models.

Figure 4. Simulated data for the experiment in section III-D1. Left: Features of the data. Right: Cardinality histogram of the data.

Both the NB and RFS models are trained via ML estimation on a fully observed training dataset set consisting of 900 data points (300 per cluster). For NB, the data models are three 2-D Gaussians (one for each cluster). For the RFS models, we use three Poisson RFSs with 2-D Gaussian feature distributions (which are in fact the same as the Gaussians learned from the NB models).

After training, we evaluate the learned models by classifying a test set consisting of 1500 data points (500 per cluster, also simulated from the described model). The evaluation are run 10 times with 10 different test sets, and the average accuracies2 are shown in Fig. 5a. Observed that the RFS model can exploit the cardinality information and delivers better performance than NB.

Figure 5. Performance of classification by NB and Poisson RFS. Note that the error-bars (on the peak of each column) are standard deviations of accuracies, which show that Poisson RFS works more stably than NB.

2) Classification with real data: The second experiment

involves classification of the two classes “T14 brick1” and “T15 brick2” from Texture images dataset [25]. Fig. 6 show some example images from these classes.

Figure 6. Example images from 2 classes “T15 brick2” and “T14 brick1” in Texture dataset. Circles mark detected SIFT keypoints.

Features are extracted from each image by applying the SIFT algorithm3, followed by Principal Component Analysis (PCA) to convert the 128-D SIFT features into 2-D features. Thus each image is compressed into a point pattern of 2-D features. Fig. 7 plots the 2-D features of the images in the Texture images dataset.

Figure 7. Extracted data from images of the Texture dataset. Left: 2- D features (after applying PCA to the SIFT features). Right: histogram of cardinalities of the extracted data.

The model parameters are learned by using MLE on a training dataset containing 30 images per class. For the NB model, we use 2-D Gaussian distributions to model the data; and for the RFS model, we use Poisson RFSs with 2-D Gaussian feature distributions. After training, the learned models are evaluated on a test set containing all remaining images of the classes (10 images per class). The performance is evaluated with 4-fold cross validation, and the average results are shown in Fig. 5b. Observe that the RFS model outperforms NB, since it can exploit cardinality information of the data.

IV. RANKING OF DATA

The previous section shows that RFS models avoid the unit inconsistency and improve the classification performance by exploiting cardinality information in point pattern data. However, using the RFS density as the data ranking function for novelty detection does not result in a good performance4. Due to the non-uniformity of the reference measure, the probability densities for RFSs presented in the previous section do not provide a consistent ranking of the data. In this section we propose a consistent ranking function for the iid-cluster model.

A. Ranking Function

where C is an unknown constant.

To determine a suitable C, consider first the special case where is uniform on a bounded state space X. In such case, all finite set subsets of X with the same cardinality are equally likely, and a consistent ranking function should satisfy , given |X| = m. This condition can be generalized to non-uniform by replacing with its expected value, given |X| = m

Combining (17) with (18) and using the i.i.d. property of the features in iid-cluster models yields:

where is the squared -norm of , which has units of , and hence is unitless.

B. Numerical experiments

In this section we compare the performance of the proposed ranking function (19) with the NB likelihood and Poisson RFS likelihood in novelty detection. The threshold is set at the 10-quantile of the ranking function values5 of the training dataset (consisting of only normal data). Observations ranked below this threshold are classified as anomalies.

1) Novelty detection with simulated data: In this experi-

ment, normal data are samples, having cardinalities between 40 and 60, drawn from a Poisson RFS with mean cardinality 48 and a 2-D Gaussian feature distribution

The same dataset containing 500 normal data points is used to train the NB and Poisson RFS models with Gaussian feature distributions via MLE (subsection III-C).

Three types of anomalies are considered: low-cardinality anomaly (cardinality ), high-cardinality anomaly (cardinality ), and feature anomaly which has the same cardinality with normal data, but the Gaussian feature distribution now has a mean of . Novelty detection is performed on a test set containing 200 normal observations and 300 anomalies (100 for each type). The tests are run 10 times with 10 different randomly sampled test sets. The averaged results shown in Fig. 8a indicated that the proposed ranking function yields superior performance.

(a) Simulated data (b) Texture data Figure 8. Novelty detection results by 3 ranking functions: NB likelihood, Poisson RFS likelihood, and proposed ranking function.

2) Novelty detection with real data: This experiment uses

the real dataset from the second experiment in subsection III-D. Normal data are taken from the “T14 brick1” class and anomalous test data are taken from the “T15 brick2” class. A 4-fold cross-validation is used: for training we use 30 images from the normal class and for testing we use the remaining images from normal class (10 images) and 10 images from abnormal class (different at each time).

As shown in Fig. 8b, ranking the data using the NB likelihood and Poisson RFS likelihood could not detect any anomalies, while the proposed ranking function achieved an F1 score6 around 0.91. Moreover, Fig. 9 verified that only the proposed ranking function provides a consistent ranking, while the NB and Poisson RFS likelihoods even rank anomalies higher than normal data.

Figure 9. Boxplots of likelihoods computed by 3 models, namely NB likelihood, RFS density, and the proposed ranking function, for a fold of Texture dataset. The solid line going through each graph is the threshold (the 10-quantile).

V. CONCLUSIONS

In this paper, we have introduced statistical models for point pattern data using Random Finite Set theory. Such models provide the means for developing model-based classification and novelty detection for point pattern data. In particular we proposed a maximum likelihood method for learning the parameters of an iid-cluster RFS–the analogue of the na¨ıve Bayes model for point patterns. For novelty detection, we proposed novel ranking functions based on RFS models, which substantially improve performance. Our results also contribute to the Bag-Space paradigm in multiple instance learning where statistical models are not available.

REFERENCES

[1] T. Joachims, “A probabilistic analysis of the rocchio algorithm with tfidf for text categorization.” DTIC Document, Tech. Rep., 1996.

[2] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in AAAI-98 workshop on learning for text categorization, vol. 752. Citeseer, 1998, pp. 41–48.

[3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on statistical learning in computer vision, ECCV, 2004.

[4] S. Guha, R. Rastogi, and K. Shim, “Rock: A robust clustering algorithm for categorical attributes,” in Data Engineering, 1999. Proceedings., 15th International Conference on. IEEE, 1999, pp. 512–521.

[5] Y. Yang, X. Guan, and J. You, “Clope: a fast and effective clustering algorithm for transactional data,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002, pp. 682–687.

[6] I. V. Cadez, S. Gaffney, and P. Smyth, “A general probabilistic frame- work for clustering individuals and objects,” in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2000, pp. 140–149.

[7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-P´erez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial intelligence, vol. 89, no. 1, pp. 31–71, 1997.

[8] F. Minhas and A. Ben-Hur, “Multiple instance learning of calmodulin binding sites,” Bioinformatics (Oxford, England), vol. 28, no. 18, pp. i416–i422, 2012.

[9] J. Amores, “Multiple instance classification: Review, taxonomy and comparative study,” Artificial Intelligence, vol. 201, pp. 81–105, 2013.

[10] J. Foulds and E. Frank, “A review of multi-instance learning assump- tions,” The Knowledge Engineering Review, vol. 25, no. 01, pp. 1–25, 2010.

[11] D. J. Daley and D. Vere-Jones, An introduction to the theory of point processes. Springer, 1988, vol. 2.

[12] J. Moller and R. P. Waagepetersen, Statistical inference and simulation for spatial point processes. Chapman & Hall CRC, 2003.

[13] R. P. Mahler, Advances in statistical multisource-multitarget information fusion. Artech House, Inc., 2014.

[14] S. Russell and P. Norvig, Artificial intelligence: A modern approach. Prentice Hall, 2003.

[15] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press, 2012.

[16] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.

[17] V. J. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[18] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.

[19] D. Stoyan, W. S. Kendall, and J. Mecke, Stochastic geometry and its applications. John Wiley & Sons, 1995.

[20] R. P. Mahler, Statistical multisource-multitarget information fusion. Artech House, Inc., 2007.

[21] B.-N. Vo, S. Singh, and A. Doucet, “Sequential monte carlo methods for multitarget filtering with random finite sets,” Aerospace and Electronic Systems, IEEE Transactions on, vol. 41, no. 4, pp. 1224–1245, 2005.

[22] H. G. Hoang, B.-N. Vo, B.-T. Vo, and R. Mahler, “The Cauchy- Schwarz divergence for Poisson point processes,” IEEE Trans. Inf. Theory, vol. 61, no. 8, pp. 4475–4485, 2015.

[23] M. van Lieshout, Markov point processes and their applications. Imperial College Press, 2000.

[24] D. Phung and B.-N. Vo, “A random finite set model for data clustering,” in Proc. 17th Annu. Conf. Information Fusion, Salamanca, Spain, 2014.

[25] S. Lazebnik, C. Schmid, and J. Ponce, “A sparse texture representation using local affine regions,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 8, pp. 1265–1278, 2005.

[26] A. Vedaldi and B. Fulkerson, “Vlfeat: An open and portable library of computer vision algorithms,” http://www.vlfeat.org/, 2008.

designed for accessibility and to further open science