In this paper, we study predictive pattern mining. The goal of predictive pattern mining is discovering a set of patterns from databases that are needed for constructing a good predictive model. Predictive pattern mining problems can be interpreted as feature selection problems in supervised machine learning tasks such as classifications and regressions. The main difference between predictive pattern mining and ordinal feature selection is that, in the former, the number of possible patterns in databases are extremely large, meaning that we cannot naively search over all the patterns in databases. We thus need to develop algorithms that can exploit some structures among patterns such as trees or graphs for efficiently discovering good predictive patterns.
To be concrete, suppose that there are D patterns in a database, which is assumed to be extremely large. For the i-th transaction in the database, let represent the occurrence of each pattern.
We consider linear predictive model in the form of
where is a set of patterns that would be selected by a mining algorithm, and
and b are the parameters of the linear predictive model. Here, the goal is to select a set of predictive patterns in A and find the model parameters
and b so that the predictive model in the form of (1) has good predictive ability.
Existing predictive pattern mining studies can be categorized into two approaches. The first approach is two-stage approach, where a mining algorithm is used for selecting the set of patterns A in the first stage, and the predictive model is fitted by only using the selected patterns in A in the second stage. Two-stage approach is computationally efficient because the mining algorithm is run only once in the first stage. However, two-stage approach is suboptimal as predictive model building procedure because it does not directly optimize the predictive model. The second approach is direct approach, where a mining algorithm is integrated in a feature selection method. An advantage of direct approach is that a set of patterns that are useful for predictive modeling is directly searched for. However, the computational cost of existing direct approach is usually much greater than two-stage approach because the mining algorithm is run multiple times. For example, in a stepwise feature selection method, the mining algorithm is run at each step in order for finding the pattern that best improves the current predictive model.
In this paper, we study a direct approach for predictive pattern mining based on sparse modeling. In the literature of machine-learning and statistics, sparse modeling has been intensively studied in the past two decades. An advantage of sparse modeling is that the problem is formulated as a convex optimization problem, and it allows us to investigate several properties of solutions from a wide variety of perspectives [1]. In addition, many efficient solvers that can be applicable to high-dimensional problems (although not as high as the number of patterns in databases as we consider in this paper) have been developed [2].
Predictive pattern mining algorithms based on sparse modeling have been also studied in the literature
[3, 4, 5]. All these studies rely on a technique developed in the context of boosting [6]. Roughly speaking, in each step of the boosting-type method, a feature is selected based on a certain criteria, and an optimization problem defined over the set of features selected so far is solved. Therefore, when the boosting-type method is used for predictive pattern mining tasks, one has to search over the database as many times as the number of steps in the boosting-type method.
Our main contribution in this paper is to propose a novel method for sparse modeling-based predictive pattern mining. Denoting the set of patterns that would be used in the optimal predictive model as , the proposed method can find a set of patterns ˆ
, i.e., ˆA contains all the predictive patterns that are needed for the optimal predictive model. It means that, if we solve the sparse modeling problem defined over the set of patterns ˆA, then it is guaranteed that the resulting predictive model is optimal. The main advantage of the proposed method over the above boosting-type method is that a mining algorithm is run only once for finding the set of patterns ˆA.
The proposed method is inspired by recent safe feature screening studies [7, 8, 9, 10, 11, 12, 13, 14, 15]. In ordinary feature selection problems, safe feature screening allows us to identify a set of features that would never be used in the optimal model before actually solving the optimization problem It means that these features can be safely removed from the training set. Unfortunately, however, it cannot be applied to predictive pattern mining problems because it is computationally intractable to apply safe feature screening to each of extremely large number of patterns in a database for checking whether the pattern can be safely removed out or not.
In this paper, we develop a novel method called safe pattern pruning (SPP). Considering a tree structure defined among patterns in the database, the SPP method allows us to prune the tree in such a way that, if a node corresponding to a pattern in the database is pruned out, then it is guaranteed that all the patterns corresponding to its descendant nodes would never be needed for the optimal predictive model. The SPP method can be effectively used in predictive pattern mining problems because we can identify an extremely large set of patterns that are irrelevant to the optimal predictive model by exploiting the tree structure among patterns in the database, A superset ˆcan be obtained by collecting the set of patterns corresponding to the nodes that are not pruned out by the SPP method.
1.1 Notation and outline
We use the following notations in the rest of the paper. For any natural number n, we define [n] := {1, . . . , n}. For an n-dimensional vector v and a set represents a sub-vector of v whose elements are index by I. The indicator function is written as
), i.e., I(z) = 1 if z is true, and I(z) = 0 otherwise. Boldface 0 and 1 indicate a vector of all zeros and ones, respectively.
Here is the outline of the paper. 2 presents problem setup and existing methods.
3 describes our main contribution where we introduce safe pattern pruning (SPP) method.
4 covers numerical experiments for demonstrating the advantage of the SPP method.
5 concludes the paper.
Figure 1: Two examples of tree structures defined among patterns in databases.
We first formulate our problem setting.
2.1 Problem setup
In this paper we consider predictive pattern mining problems. Let us consider a database with n records, and denote the dataset as , where
is a labeled undirected graph in the case of graph mining, while it is a set of items in the case of item-set mining. The response variable
is defined on R and on
for regression and classification problems, respectively. Let T be the set of all patterns in the database, and denote its size as D := |T |. For example, T is the set of all possible subgraphs in the case of graph mining, while T is the set of all possible item-sets in the case of item-set mining. Alternatively,
is represented as
a D-dimensional binary vector whose t-th element is defined as
The number of patterns D is extremely large in all practical pattern mining problems. It implies that any algorithms that naively search over all D patterns are computationally infeasible.
In order to study both regression and classification problems in a unified framework, we consider the following class of convex optimization problems:
where is a gradient Lipschitz continuous loss function and
0 is a tuning parameter. We refer the problem (2) as primal problem and write the optimal solution as
. When f(z) :=
and
:=
,
:= 1,
:=
], the general problem (2) is reduced to the following
-penalized regression
problem defined over D + 1 variables:
On the other hand, when f(z) := max
and
:=
:=
:= 0
], the general problem (2) is reduced to the following
-penalized classification problem defined over D + 1 variables:
Remembering that D is extremely large, we cannot solve these -penalized regression and classification problems in a standard way.
The dual problem of (2) is defined as
where for regression problem in (3), and
= 0 for classification problem in (4). The dual optimal solution is denoted as
.
The key idea for handling an extremely large number of patterns in the database is to exploit the tree structure defined among the patterns. Figure 1 shows tree structures for graph mining (left) and item-set mining (right). As shown in Figure 1, each node of the tree corresponds to each pattern in the database. Those trees are constructed in such a way that, for any pair of a node t and one of its descendant node , they satisfy the relation
, i.e., the pattern
is a superset of the pattern t. It suggests that, for such a
pair of t and ,
and, conversely
2.2 Existing method
To the best of our knowledge, except for the boosting-type method described in 1 and its extensions or modifications [3, 4, 5], there is no other existing method that can be used for solving the convex optimization problem (2) for predictive pattern mining problems defined over an extremely large number of patterns D. The boosting-type method solves the dual problem (5). The difficulty in the dual problem is that there are extremely large number of constraints in the form of
. Starting from the optimization problem (5) without these constraints, in each step of the boosting-type method, the most violating constraint is added to the problem, and an optimization problem only with the constraints added so far is solved. In optimization literature, this approach is generally known as the cutting-plane method, for which its effectiveness has been also shown in some machine learning problems (e.g., [16]). The key computational trick used by [3, 4, 5] is that, for finding the most violating constraint in each step, it is possible to efficiently search over the database by using a certain pruning strategy in the tree as depicted in Figure 1. This method is terminated when there is no violating constraints in the database.
In each single step of the boosting-type method, one first has to search over the database by a mining algorithm, and then run a convex optimization solver for the problem with the newly added constraint. Boosting-type method is computationally expensive because these steps must be repeated until all the constraints corresponding to all the predictive patterns in are added. In the next section, we propose a novel method called safe pattern pruning, by which the optimal model is obtained by a single search over the database and a single run of convex optimization solver.
In this section, we present our main contribution.
3.1 Basic idea
It is well known that penalization in (2) makes the solution
sparse, i.e., some of its element would be zero. The set of patterns which has non-zero coefficients are called active and denoted as
, while the rest of the patterns are called non-active. A nice property of sparse learning is that the optimal solution does not depend on any non-active patterns. It means that, after some non-active patterns are removed out from the dataset, the same optimal solution can be obtained. The following lemma formally states this well-known but important fact.
Lemma 1. Let ˆA be a set such that ˆ
, and
(
) be the objective function of (2) in which
= 0 is substituted:
Then, the optimal solution of the original problem (2) is given by
Lemma 1 indicates that, if we have a set of patterns ˆ, we have only to solve a smaller optimization problem defined only with the set of patterns in ˆA. It means that, if such an ˆA is available, we do not have to work with extremely large number of patterns in the database.
In the rest of this section, we propose a novel method for finding such a set of patterns ˆby searching over the database only once. Specifically, we derive a novel pruning condition which has a property that, if the condition is satisfied at a certain node, then all the patterns corresponding to its descendant nodes and the node itself are guaranteed to be non-active. After traversing the tree, we simply define ˆA be the set of nodes which are not pruned out. Then, it is guaranteed that ˆA satisfies the condition in Lemma 1. The proposed method is inspired by recent studies on safe feature screening. We thus call our new method as
safe pattern pruning (SPP).
3.2 Main theorem for safe pattern pruning
The following theorem provides a specific pruning condition that can be used together with any search strategies on a tree. Let be a set of nodes in a subtree of T having t as a root node and containing all descendant nodes of t. We derive a condition for safely screening the entire
) out, which is computable at the node t without traversing the descendant nodes. This means that, our rule, called safe pattern pruning rule, tells us whether a pattern
) has a chance to be active or not based on the information available at the root node of the subtree t. An important consequence of the condition below is that if the condition holds, i.e., any
) cannot be active, then we can stop searching over the subtree (pruning the subtree).
Theorem 2 (Safe pattern pruning (SPP) rule).
Given an arbitrary primal feasible solution ( ˜) and an arbitrary dual feasible solution ˜
, for any node
), the following safe pattern pruning criterion (SPPC) provides a rule
where
for ], and
The proof of Theorem 2 is presented in 3.3.
SPPC(t) depends on three scalar quantities and
. The first two quantities
and
are obtained by using information on the pattern t, while the third quantity
does not depend on t. Noting that all these three quantities are non-negative, the SPP rule would be more powerful (have more chance to prune the subtree) if these three quantities are smaller. The following corollary is the consequence of the simple fact that the first two quantities
and
at a descendant node are smaller than those at its ancestor nodes.
Corollary 3. For any node ),
The proof of Corollary 3 is presented in Appendix. This corollary suggests that the SPP rule would be more powerful at deeper nodes.
The third quantity represents the goodness of the pair of primal and dual feasible solutions measured by the duality gap, the difference between the primal and dual objective values. It means that, if sufficiently good pair of primal and dual feasible solutions are available, the SPP rule would be powerful. We will discuss how to obtain good feasible solutions in
3.4.
3.3 Proof of Theorem 2
In order to prove Theorem 2, we first clarify the condition for any pattern to be non-active by the following lemma.
Proof of Lemma 4 is presented in Appendix. Lemma 4 indicates that, if an upper bound of is smaller than 1, then we can guarantee that
= 0. In what follows, we actually show that SPPC(t) is an upper bound of
for
).
In order to derive an upper bound of , we use a technique developed in a recent safe feature screening study [15]. The following lemma states that, based on a pair of a primal feasible solution ( ˜
) and a dual feasible solution ˜
, we can find a ball in the dual solution space
in which the dual optimal solution
exists.
Lemma 5 (Theorem 3 in [15]). Let ( ˜) be an arbitrary primal feasible solution, and ˜
be an arbitrary dual feasible solution. Then, the dual optimal solution
is within a ball in the dual solution space
with the center ˜
and the radius
:=
( ˜
(˜
.
See Theorem 3 and its proof in [15]. This lemma tells that, given a pair of primal feasible and dual feasible solutions, we can bound the dual optimal solution within a ball.
Lemma 5 can be used for deriving an upper bound of . Since we know that the dual optimal solution
is within the ball in Lemma 5, an upper bound of any
can be obtained by solving the
following convex optimization problem:
Fortunately, the convex optimization problem (7) can be explicitly solved as the following lemma states.
Lemma 6. The solution of the convex optimization problem (7) is given as
Proof of Lemma 6 is presented in Appendix.
Although UB(t) provides a condition to screen any , calculating UB(t) for all
is computationally prohibiting in our extremely high dimensional problem setting. In the next lemma, we will show that SPPC(
UB(
) for
), i.e., SPPC(t) in Theorem 2 is an upper bound of UB(
), which enables us to efficiently prune subtrees during the tree traverse process.
Lemma 7. For any ),
Finally, by combining Lemmas 4, 5, 6 and 7, we can prove Theorem 2.
Proof of Theorem 2.
3.4 Practical considerations
Safe pattern pruning rule in Theorem 2 depends on a pair of a primal feasible solution ( ˜) and a dual feasible solution ˜
. Although the rule can be constructed from any solutions as long as they are feasible, the
power of the rule depends on the goodness of these solutions. Specifically, the criterion SPPC(t) depends on the duality gap ( ˜
(˜
) which would vanish when these primal and dual solutions are optimal. Roughly speaking, it suggests that, if these solutions are somewhat close to the optimal ones, we could expect that the SPP rule is powerful.
3.4.1 Computing regularization path
In practical predictive pattern mining tasks, we need to find a good penalty parameter based on a model selection technique such as cross-validation. In model selection, a sequence of solutions with various different penalty parameters must be trained. Such a sequence of the solutions is sometimes referred to as a regularization path [17]. Regularization path of the problem (2) is usually computed from larger
to smaller
because more sparse solutions would be obtained for larger
. Let us write the sequence of
s as
. When computing such a sequence of solutions, it is reasonable to use warm-start approach where the previous optimal solution at
is used as the initial starting point of the next optimization problem at
. In such a situation, we can also make use of the previous solution at
as the feasible solution for the safe pattern pruning rule at
.
In sparse modeling literature, it is custom to start from the largest possible at which the primal solution
is given as = 0 and
= ¯y, where ¯y is the sample mean of
. The largest
is given as
In order to solve this maximization problem over the database, for a node t and ), we can use the
and this upper bound can be exploited for pruning the search over the tree.
Algorithm 1 shows the entire procedure for computing the regularization path by using the SPP rule.
In this section, we demonstrate the effectiveness of the proposed safe pattern pruning (SPP) method through numerical experiments. We compare SPP with the boosting-based method (boosting) discussed in 2.2.
4.1 Experimental setup
We considered regularization path computation scenario described in 3.4.1. Specifically, we computed a sequence of optimal solutions of (2) for a sequence of 100 penalty parameters
evenly allocated between
and 0
in logarithmic scale. For solving the convex optimization problems, we used coordinate gradient descent method [18]. The optimization solver was terminated when the duality gap felled below 10
. In both of SPP and boosting, we used warm-start approach. In addition, the solution at the previous
was also used as the feasible solution for constructing the SPP rule at the next
. We used gSpan algorithm [19] for mining subgraphs. We wrote all the codes (except gSpan part in graph mining experiment) in C++. All the computations were conducted by using a single core of an Intel Xeon CPU E5-2643 v2 (3.50GHz) with 64GB MEM.
4.2 Graph classification/regression
We applied SPP and boosting to graph classification and regression problems. For classification, we used CPDB and mutagenicity datasets, containing n = 648 and n = 4377 chemical compounds respectively, for which the goal is to predict whether each compound has mutagenicity or not. For regression, we used Bergstrom and Karthikeyan datasets where the goal is to predict the melting point of each of the n = 185 and n = 4173 chemical compounds. All datasets are downloadable from http://cheminformatics.org/datasets/.
We considered the cases with 10}, where maxpat indicates the maximum number of edges of subgraphs we wanted to find.
Figure 2 shows the computation time of the two methods. In all the cases, SPP is faster than boosting, and the difference gets larger as maxpat increases. Figure 2 also shows the computation time taken in traversing the trees (traverse) and that taken in solving the optimization problems (solve). The results
Figure 2: Computation time comparison for graph classification and regression. Each bar contains compu- tational time taken in the tree traverse (traverse) and the optimization procedure (solve) respectively.
indicate that traverse time of SPP are only slightly better than that of boosting. It is because the most time-consuming component of gSpan is the minimality check of the DFS (depth-first search) code, and the traverse time mainly depends on how many different nodes are generated in the entire regularization path computation process. In terms of solve time, there are large difference between SPP and boosting. In SPP, we have only to solve a single convex optimization problem for each
. In boosting, on the other hand, convex optimization problems must be repeatedly solved every time a new pattern is added to the working set. Figure 4 shows the total number of traversed nodes in the entire regularization path computation process. Total number of traversed nodes in SPP is much smaller than those of boosting, which is because one must repeat searching over trees many times in boosting.
4.3 Item-set classification/regression
We applied SPP and boosting to item-set classification and regression problems. For classification, we used splice dataset (n = 1000 and the number of items d = 120) and a9a dataset (n = 32561 and d = 123). For regression, we used dna dataset (n = 2000 and d = 180) and protein dataset (n = 6621 and d = 714). All datasets were obtained from LIBSVM Dataset site [20]. We considered the cases with
, where maxpat here indicates the maximum size of item-sets we wanted to find.
Figure 3 compares the computation time of the two methods. In all the cases, SPP is faster than boosting. Here again, Figure 3 also shows the computation time taken in traversing the trees (traverse) and that taken in solving the optimization problems (solve). In contrast to the graph mining results, traverse time of SPP are much smaller than that of boosting because it simply depends on how many nodes are traversed in total. Figure 5 shows the total number of traversed nodes in the entire regularization path computation process. Especially when is small where the number of active patterns are large, boosting needed to traverse large number of nodes, which is because the number of steps of boosting is large when there are
Figure 3: Computation time comparison for item-set classification and regression. Each bar contains computational time taken in the tree traverse (traverse) and the optimization procedure (solve) respectively.
large number of active patterns.
In this paper, we introduced a novel method called safe pattern pruning (SPP) method for a class of predictive pattern mining problems. The advantage of the SPP method is that it allows us to efficiently find a superset of all the predictive patterns that are used in the optimal predictive model by a single search over the database. We demonstrated the computational advantage of the SPP method by applying it to graph classification/regression and item-set classification/regression problems. As a future work, we will study how to integrate the SPP method with a technique for providing the statistical significances of the discovered patterns [21, 22].
Figure 4: # of traverse nodes for graph classification and regression.
Figure 5: # of traverse nodes for item-set classification and regression.
[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[2] Peter B¨uhlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[3] Hiroto Saigo, Tadashi Kadowaki, and Koji Tsuda. A linear programming approach for molecular qsar analysis. In International workshop on mining and learning with graphs (MLG), pages 85–96. Citeseer, 2006.
[4] Hiroto Saigo, Takeaki Uno, and Koji Tsuda. Mining complex genotypic features for predicting hiv-1 drug resistance. Bioinformatics, 23(18):2455–2462, 2007.
[5] Hiroto Saigo, Sebastian Nowozin, Tadashi Kadowaki, Taku Kudo, and Koji Tsuda. gboost: a mathe- matical programming approach to graph classification and regression. Machine Learning, 75(1):69–89, 2009.
[6] Ayhan Demiriz, Kristin P Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.
[7] Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.
[8] Zhen J Xiang, Hao Xu, and Peter J Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems, pages 900–908, 2011.
[9] Jie Wang, Jiayu Zhou, Peter Wonka, and Jieping Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pages 1070–1078, 2013.
[10] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and R´emi Gribonval. A dynamic screening principle for the lasso. In Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European, pages 6–10. IEEE, 2014.
[11] Jun Liu, Zheng Zhao, Jie Wang, and Jieping Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[12] Jie Wang, Jiayu Zhou, Jun Liu, Peter Wonka, and Jieping Ye. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pages 1053–1061, 2014.
[13] Zhen James Xiang, Yun Wang, and Peter J Ramadge. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.
[14] Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Mind the duality gap: safer rules for the lasso. In Proceedings of the 32nd International Conference on Machine Learning, pages 333–342, 2015.
[15] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems, pages 811–819, 2015.
[16] Thorsten Joachims. Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, 2006.
[17] Mee Young Park and Trevor Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659–677, 2007.
[18] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable mini- mization. Mathematical Programming, 117(1-2):387–423, 2009.
[19] Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on, pages 721–724. IEEE, 2002.
[20] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[21] Shinya Suzumura, Kazuya Nakagawa, Mahito Sugiyama, Koji Tsuda, and Ichiro Takeuchi. Selective inference approach for statisticall sound predictive pattern mining. submitted to KDD2016.
[22] Aika Terada, Mariko Okada-Hatakeyama, Koji Tsuda, and Jun Sese. Statistical significance of combi- natorial regulations. Proceedings of the National Academy of Sciences, 110(32):12996–13001, 2013.
[23] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[24] Taku Kudo, Eisaku Maeda, and Yuji Matsumoto. An application of boosting to graph classification. In Advances in neural information processing systems, pages 729–736, 2004.
Proof of Corollary 3
Proof. For any pair of a node t and ),
First consider the case where =
. When
, from (9),
. When
Next, consider the case where . When
, from (9),
When , from (10),
. Furthermore, it is clear that
. Since
0, SPPC(
SPPC(
Proof of Lemma 4
Proof. Based on the convex optimization theory (see, e.g., [23]), the KKT optimality condition of the primal
problem (2) and the dual problem (5) is written as
Proof of Lemma 6
Proof. Let := [
. First, note that the objective part of the optimization problem (7) is
rewritten as
Thus, we consider the following convex optimization problem:
Let us define the Lagrange function
and then the optimization problem (12) is written as
The KKT optimality conditions are summarized as
where note that 0 because the problem does not have a minimum value when
= 0. Differentiating the
Lagrange function w.r.t. and using the fact that it should be zero,
By substituting (15) into (13),
Since the objective function is a quadratic concave function w.r.t. , we obtain the following by considering the condition (14c):
By substituting this into (15),
Then, from (16), the solution of (12) is given as
and the minimum objective function value of (12) is
Then, substituting (17) into (11), the optimal objective value of (7) is given as
Proof of Lemma 7
Proof. First, using the bound introduced in [24],
Next, it is clear that
By combining them,