Shape and Centroid Independent Clustring Algorithm for Crowd Management Applications

2016·Arxiv

Abstract

Abstract

Clustering techniques play an important role in data mining and its related applications. Among the challenging applications that require robust and real-time processing are crowd management and group trajectory applications. In this paper, a robust and low-complexity clustering algorithm is proposed. It is capable of processing data in a manner that is shape and centroid independent. The algorithm is of low complexity due to the novel technique to compute the matrix power. The algorithm was tested on real and synthetic data and test results are reported.

trajectory; matrix power

I. INTRODUCTION

Clustering is the process of grouping similar objects together in an unsupervised way. Clustering algorithms play vital roles in data mining applications including group trajectory and crowd management. Those applications feature some unique challenges include unpredictable shapes of groups and real-time processing requirement. In crowd management, things change rapidly and response has to be very fast to avoid any catastrophic consequences in case of crowd crush. For those reasons, a clustering algorithms developed for crowd management application showed neutralize the effect of shape and it should feature low computational complexity.

Clustering is an NP-hard problem [1]. This has motivated the development of simplified algorithms that sacrifice performance to save time. Clustering algorithms can be grouped into different categories, the main two groups of them are:

partitioned into non-overlapped clusters, and usually the number of clusters is predefined. This group includes k-means family [2,3], provide simple and fast clustering, but it suffers from its random seeding techniques

2. Hierarchical clustering: which perform a multilevel

clusteing either additively (bottom-up) by merging the nearest clusters; or divisively (top-down) by breaking down large cluster that contains sparse data. Although methods belong to this group achieve near-optimum results, it requires distance computation between every pair of samples, which is unreasonable in the case of large data [4].

The clustering is a function of similarity, cluster shape and centroid selection. [5]. The effect of cluster shape is very clear in the shape of data that forms a ring shape, in which the cluster centroid will exist in the middle where no data actually exist. While the selection of cluster centroid should guarantee that every group of data has a one and only one head, centroid selection methods (e.g. random generation, lowest ID or Nodal degree based) will never guarantee that as a centroid might be generated away from data or two centroid might be selected in the same cluster.

In this paper we present a method that can cancel the effect of the cluster head and cluster shape. Our method guarantees that a spatially close data of any shape will be grouped in one and only one cluster, and that cluster will include nothing but those data. The paper also involves proposing a novel method to calculate the matrix power which is essential to find the connection of a graph. Then that novel technique is used in the proposed clustering algorithm.

II. THE PROPOSED CLUSTRING ALGORITHM

A clustering algorithm is proposed. It returns a class label for each one of the N input nodes of the set. It also ranks clusters upon their node count. This outcome of cluster size ranking is driven by the requirement of crowd management problem, where recognizing dense and big groups is of great interest since they are usually associated with accidents and crowd crush.

The algorithm parameter is the radius distance r that limits

the size of the region of searching about neighbors. The algorithm proposed in this paper searches only within radius r and returns the found neighbors if any. The steps of the algorithm are explained in the following

A. Adjacency Matrix Calculation

The algorithm starts by calculating the adjacency matrix A of the binary matrix where an element

distance r from one another. The Euclidian distance is used in calculating A.

B. A Low-Complexity Novel Method for Matrix Power Calculation

From the adjacency matrix, the matrix power G of size

is calculated. An ordinary way to calculate G is

This step is important in order to find the connection of

graph at a certain node

indirect neighbors. That is, after rising to the power k, the algorithm will be able, in the upcoming steps, to determine the direct neighbor of a node

neighbors of the 1st neighbor (2nd neighbor), the direct neighbor of those 2st neighbor (3rd neighbor), and so on until the neighbor. The worst case is when the N nodes are in a form of a chain such that each node

neighbors of each other as illustrated in Fig. 1. The dashed lines in the figure represent indirect neighborhood when k=1. In such case, it is enough for the first and the last nodes to have at least one common indirect neighbor. In this example, both

Hence, the algorithm will be able to place

same cluster. After calculating G, all elements of G that are nonzero are set to 1 causing G to be a binary matrix.

Calculating G using (1) implies computational complexity

of , which is a high complexity. Here, a novel less

complex method of finding G is proposed. Instead realizing (1) by means of repeatedly multiplying A by itself k times, a shortcut is applied by utilizing partial (intermediate) products. The first partial product that is produced is . Instead of proceeding to multiply , we would utilize and multiply it by itself producing . Likewise, is produced immediately from . The method proceeds in calculating the square of the last partial product until the smallest product greater than is reached. That implies a number repeating multiplication 2/log2 times where. Hence the

complexity of the proposed method and the ordinary one mentioned earlier is depicted in Fig. 2. The proposed method can be realized by the pseudocode in Fig. 3.

It is important to note that the results of (1) and the new method are not necessarily identical where the new method could produce results that are equal to greater than (1)'s. in fact, that acceptable in clustering applications. Recall that the value of k in (1) is the minimum exponent to guarantee and common indirect neighbor between the first and the last nodes in a worst case scenario of a chain (Fig. 1). Raising A to a power greater than k yet satisfies the above condition of common neighbor and leads to the same algorithm outcomes without introducing any errors. However, in (1), increasing the value of k increases the algorithm complexity further. On the other hand, the new

C. Clustering Step

Fig. 1: Example of a node chain connection for

Fig. 2: Complexity of calculating matrix power

Fig. 3: Pseudocode of the proposed method to calculate

1. For a node , check if it has not been assigned to a cluster

yet, i.e., its corresponding is equal to 0. If that is true, then set and proceed to 2. Otherwise, go to 4.

2. For is used as mask vector M. 3. For all nodes that have not been assigned to a cluster

yet, i.e., their corresponding is equal to 0, the vector R, which is the , is masked by M. If the result is a zero vector, then do not fall in the same cluster c. Otherwise, they belong to the same cluster and, hence, set . Do this step for all whose label

4. Increment i and repeat steps 1 to 4 as long as there is at

least one element in L still equal to 0.

Finally, after all nodes are assigned to clusters, frequency of occurrence of each cluster label value is calculated and assigned to vector F. That is, each element integer showing how many nodes are in cluster c. From there can keep track of the largest cluster.

The pseudocode of the proposed algorithm is listed in Fig. 4. The bottleneck of this algorithm is the step of calculating the matrix power. Hence, the algorithm has complexity of

III. TESTING AND RESULTS

The algorithm was tested on real data and on synthetic data. Some synthetic data were generated to mimic some real-life scenarios of crowd behavior. On the other hand, the real data is GPS a set of coordinates of cars moving in a motorcade made for the purpose of this study. Clustering results presented in this section are directed graphically and color-coded upon cluster size. The largest cluster is marked in red. The next largest one is marked in green. The third largest is in blue. Other colors are used for smaller clusters, but we will focus on the three largest clusters.

Among the synthetic data, a hand-crafted data set of groups of points of various shapes was generated as depicted in Fig. 5. In Fig. 5 (a), a number of 7 clusters are returned. The largest cluster (red) is more like a crowd of 19 nodes. The second largest cluster (green) has 14 nodes in the worst case form discussed earlier, which is chain of single nodes. Cluster 3 (blue) has 13 nodes forming a shape of thicker chain than cluster 2.

The data of Fig. 5 (b) consists of 4 cluster. The largest was generated to span long distance and to contain branches that fork and then meet again. The remaining clusters are chains of various lengths.

Fig. 5 (c) illustrates a very important scenario in crowd management, which is highly dense crowd surrounded by lightly scattered nodes. That could represent many possible scenarios in real-life such as a group of people rushing to an exit, a mas fight, pedestrians who got their way blocked unexpectedly, …etc. it is clear how the algorithm managed to spot the danger zone which forms the largest cluster marked in red.

Fig. 4: The proposed clustering algorithm

Fig. 5 (d)–(f) illustrates the algorithm outcomes of randomly scattered nodes of low, medium and high density respectively. In the low density crowd, there are many clusters of small number (around 3 nodes per cluster for the first few largest clusters). When the density increases, that ratio becomes around 7 per cluster. For the high density data, the number clusters is less despite the high total node count. That is because nodes are within very short distance from each other.

Samples of the real motorcade GPS data are illustrated in Fig. 5 (g)–(i). Selected time instances showing when some nodes join and disjoin the group. At the initial time instance all cars started form rest. At instance , two cars have been in a distance that is long enough form others to be recognized by the algorithm as a separate cluster. Later at , they rejoined again forming one cluster. An animated illustration of full trip clustering can be downloaded at

ACKNOWLEDGMENT

REFERENCES

[1] T. F. Gonzalez, “On the computational complexity of clustering and related problems,” in System Modeling and Optimization: Proceedings of the 10th IFIP Conference New York City, USA, August 31 --September 4, 1981, R. F. Drenick and F. Kozin, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1982, pp. 174–182.

[2] H. Steinhaus, “Sur la division des corp materiels en parties,” Bull. Acad. Polon. Sci, vol. 1, pp. 801–804, 1956.

[3] Guojun Shi, Bingkun Gao and Li Zhang, "The optimized K-means algorithms for improving randomly-initialed midpoints," Measurement, Information and Control (ICMIC), 2013 International Conference on, Harbin, 2013, pp. 1212-1216.

[4] L. Rokach and O. Maimon, “Data Mining and Knowledge Discovery Handbook,” O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2005, pp. 321–352.

[5] M. Gerla and J. Tzu-Chieh Tsai, “Multicluster, mobile, multimedia radio network,” Wirel. Networks, vol. 1, no. 3, pp. 255–265, 1995.

Fig. 5: Clustering algorithm test results

designed for accessibility and to further open science