The ABACOC Algorithm: a Novel Approach for Nonparametric Classification of Data Streams

2015·Arxiv

Abstract

Abstract

Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless —traditional tuning methods are problematic in streaming settings— and avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all above mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach of increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance against all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream.1

I. INTRODUCTION

As pointed out in various papers —see, e.g., [15], [21]— stream mining poses unique challenges to machine learning: examples must be efficiently processed one at a time as they arrive from the stream, and an up-to-date predictive model must be available at all times. Incremental learning systems are well suited to address these requirements: the key difference between a traditional (batch) learning system and an incremental one is that the latter learns by performing small adjustments to the current predictor. Each adjustment uses only the information provided by the current example in the stream, allowing an efficient and timely update of the predictive model. This is unlike batch learning, where training typically involves a costly global optimization process involving multiple passes over the data.

Another important feature of stream mining is that the true structure of the problem is progressively revealed as more data are observed. In this context, nonparametric learning methods, such as decision trees or nearest neighbour (NN), are especially effective, as a nonparametric algorithm is not committed to any specific family of decision surfaces. For this reason, incremental algorithms for decision trees [7], [22], [19], [8], [4] and nearest neighbour [28] are extremely popular in stream mining applications.

Since in nonparametric methods the model size keeps growing to fit the stream with increasing accuracy, we seek a method able to improve predictions while growing the model as slowly as possible. However, as the model size cannot grow unbounded, we also introduce a variant of our approach that prevents the model size from going beyond a given limit. In the presence of concept drift [15], [26], bounding the model size may actually improve the overall predictive accuracy, provided the data point supporting the model are selected in the right way.

A further issue in stream mining concerns the way prediction methods are evaluated —see, e.g., [13] for a discussion. In this paper, we advocate the use of the online error (also called sequential risk, prequential risk, or prequential error [13]). This quantity measures the average of the errors made by the sequence of incrementally learned models, where one first tests the current model on the next example in the stream and then uses the same example to update the model. The sequential risk is therefore measured on each individual stream and does not specifically require stochastic assumptions on the way the stream is generated.

In this paper, we propose a novel incremental and nonparametric approach for the classification of data streams. We present four different instances of our approach (called BASE, BASE-ADJ, AUTO, and AUTO-ADJ) characterized by an increasing degree of adaptivity to the data. In particular, AUTOADJ is fully parameterless, a feature especially important in streaming settings where tuning is a hard task. Even though our algorithms are instance-based like nearest neighbour, the learned models are significantly smaller than those produced by competing baselines and more accurate when the online performance is measured against the model size. Finally, our methods (except BASE) are natively multiclass and can dynamically accommodate new classes as they appear in the stream.

In a nutshell, our algorithms work by incrementally covering the input space with balls of possibly different radii. Each new example that falls outside of the current cover becomes the center of a new ball. Examples are classified according to NN over the ball centers, where each ball predicts according to the majority of the labels of previous examples that fell in that ball. The set of balls is organized in a tree structure [17], so that predictions can be computed in time logarithmic in the number of balls. In order to increase the ability of the model to fit new data, the radii of the balls shrink, thus making room for new balls. The shrinking of the radius may depend on time or, in the more sophisticated variants of our algorithms, on the number of classification mistakes made by each ball classifier. Similarly to decision trees, where leaves are split according to their impurity, our method locally adapts the complexity of the model by allocating more balls in regions of the input space where the stream is harder to predict. A further improvement concerns the relocation of the ball centers in the input space: as our methods are completely incremental, the positioning of the balls depends on the order of the examples in the stream, which may result in a model using more balls than necessary. In order to mitigate this phenomenon, while avoiding a costly global optimization step to reposition the balls, we also consider a variant in which a K-means step is used to move the center of a ball being updated towards the median of the data points that previously fell in that ball. A further modification which we consider is aimed at keeping the model size bounded even in the presence of an arbitrarily long stream. This is achieved by introducing a randomized mechanism for discarding balls when the size bound is reached. Specifically, the mechanism discards a ball with probability proportional to the mistake rate of the ball classifier. The underlying idea is to get rid of the model parts that contribute the most to the global error and may replaced by a better arrangement of balls.

In summary, we introduce a simple and flexible approach for nonparametric classification of data streams. Our approach is fully modular: we predict using majority voting, but a fully trainable classifier could be used instead. The simplest version of our approach, applicable to streams with binary labels, enjoys strong theoretical guarantees: its mistake rate on any arbitrary stream converges to that of the best classification function that satisfies a certain regularity condition. The more complex versions of our approach learn multiclass classifiers without knowning the number of distinct labels in advance. We empirically show that our methods are excellent at trading-off classification accuracy with model size. Our most sophisticated method is fully parameterless. Finally, we show that a simple modification of our approach allows to keep the model size bounded, outperforming other methods measured at the same value of model size.

The rest of the paper is organized as follows. Section II discusses related work. In Section III, we define the problem setting. In Section IV, we present our nonparametric classifi-cation approach. In Section IV-A, we discuss the theoretical properties of our approach and derive a formal performance guarantee for the simplest algorithm. We then introduce three more sophisticated versions that are empirically more effective. In Section VI, we test the behaviour of our algorithms against state-of-the-art baselines. In Section VII, we introduce a simple modification of our approach to keep the model size bounded. Finally, Section VIII concludes the paper.

II. RELATED WORK

Within the vast area of stream mining [11], we focus our analysis of related work on the subarea that is most relevant to this study: nonparametric methods for stream classification. The most important approaches in this domain are:

Incremental decision and rule tree learning systems, such as Very Fast Decision Tree (VFDT) [7] and Decision Rules (RULES) [12] which use an incremental version of the split function computation —see also [22], [19], [8], [4].

Incremental variants of NN, such as Condensed Nearest Neighbour (CNN) [27] that stores only the misclassified instances, Lazy-Tree (L-Tree) [28] condensing historical stream records into compact exemplars, and IBLStreams [23], an instance-based learning algorithms removing outliers or examples that have become redundant.

Incremental kernel-based algorithms, such as the kernel Perceptron [10] with Gaussian kernels.2

Note that our methods do not belong to any of the above three families: they do not perform a recursive partition of the feature space as decision trees, they do not allocate (or remove) instances based on the heuristics used by IBLStreams, and they do not use kernels.

As we explain next, our most basic algorithm is a variant for classification tasks of the algorithm proposed in [16] for nonparametric regression in a streaming setting. A similar algorithm was previously proposed in [14] and analyzed without resorting to stochastic assumptions on the stream generation. A preliminary instance of our approach, without any theoretical analysis, was developed in [5] for an action recognition application in video feeds.

III. PROBLEM SETTING

Our analysis applies to streams of data points belonging to an arbitrary metric space and depends on the metric dimension of data points in the stream. This notion of dimension extends to general metric spaces the traditional notions of dimension (e.g., Euclidean dimension and manifold dimension) [2]. The metric dimension of a subset S of a metric space is d if there exists a constant such that, for all , S has an -cover of size at most (an -cover is a set of balls of radius whose union contains S). In practice, the metric dimension of the stream may be much smaller than the dimension of the ambient space X. This is especially relevant in case of nonparametric algorithms, which typically have a bad dependence on the dimensionality of the data. Note that our algorithms do not require knowledge of d: the metric dimension of the stream is automatically estimated from the data.

The learner receives a sequence of examples, where each data point is annotated with a label from a set Y = {1, . . . , K} of possible class labels, which may change over time. The learner’s task is to predict each label minimizing the overall number of prediction mistakes over the data stream.

We derive theoretical performance guarantees for BASE, the simplest algorithm in our family (Algorithm 2), without making stochastic assumptions on the way the examples in the stream are generated. Note that this is a very strong type of guarantee: our results hold on any individual stream of annotated data points.

IV. ADAPTIVE BALL COVERING

The adaptive ball covering at the roots of our method was previously used in a theoretical work [16]. Here, we distillate the main ideas behind that approach in a generic algorithmic approach (the template Algorithm 1) called ABACOC (Adaptive BAll COver for Classification). We then present our methods as specific instances of this generic template.

A. The BASE Algorithm

Our first instance of ABACOC is BASE (Algorithm 2), a randomized variant for binary classification of the ITBR (Incremental Tree-Based Regressor) algorithm proposed in [16]. BASE shrinks the radius (line 28) of the balls depending on (1) an estimate of the metric dimension of the stream and (2) the number of data points so far observed from the stream. This implies that the radii of all the balls shrink at the same rate. In the prediction phase, the ball nearest to the input example is considered and a randomized binary prediction is made based on the class distribution estimate locally computed in the ball. Laplace estimators (line 5) and randomized predictions (lines 6–8) are new features of BASE that were missing in ITBR.

We now analyze the performance of BASE using the notion of regret [1]. The regret of a randomized algorithm is defined as the difference between the expected number of classification mistakes made by the algorithm over the stream and the expected number of mistakes made by the best element in a fixed class of randomized classifiers. A randomized binary classifier is a mapping , where f(x) is the probability of predicting label +1. We consider the class of L-Lipschitz predictors w.r.t. the metric of the space. Namely,

Hence, a predictor is Lipschitz if, when we perturb the data point x, the prediction changes by an amount linear in the perturbation size. Lipschitz functions are a standard reference in the analysis of nonparametric algorithms.

The regret of BASE generating randomized predictions

is defined by (see also [14])

For the BASE algorithm we can prove the following regret bound against any Lipschitz randomized classifier, without any assumption on the way the stream is generated. Moreover, similarly to ITBR, the regret upper bound depends on the unknown metric dimension d of the space, automatically estimated by the algorithm.

The proof is in the next Section V. Note that the algorithm does not know L, hence the regret bound above holds for all values of L simultaneously. This theorem tells us that BASE is not an heuristic, but rather a principled approach with a specific performance guarantee. The performance guarantee implies that, on any stream, the expected mistake rate of BASE converges to that of the best L-Lipschitz randomized classifier at rate of order .

Next, we generalize the BASE algorithm to multiclass clas-sification, and make some modifications aimed at improving its empirical performance.

V. PROOFS

We use the following well-known fact: if 1) for predicting using a randomized label {0, 1}, then .

Even if our algorithm is different from ITBR, we can still use the following lemma from ITBR analysis [16]. In the following, we say that a phase ends each time condition in line 15 of BASE is verified and use to denote the time steps included in phase i. Finally, denotes the maximum number of balls used in phase i.

Lemma 1 ([16]): Suppose BASE is run with parameter . The following invariants hold throughout the procedure for all phases :

• .

• For any we have .

Define . Unlike the analysis in [16], here we cannot use a bias-variance decomposition. So, the key in the proof is to decompose the regret in two terms with behaviour similar to the bias and variance terms in the stochastic setting.

data points in the stream. Assume that . Then, in any phase i and for any we have that

Proof: We use the notation to say that is assigned to a ball with center . We also denote by the number of points assigned to a ball of center . Define

For each in , we proceed by upper bounding the error as a sum of two components

Using the definition of and the Lipschitz property of f, we have

The prediction strategy in each ball is equivalent to the approach followed in [9] (see also Exercise 8.8 in [1]). The only important thing to note is that the first prediction of the algorithm in a ball is made using the probability of the closest ball, even if it is further than , instead of at random as in the original strategy in [9]. It is easy to see that this adds an additional 0.5 to the regret stated in [9]. So we have

To bound we use Lemma 1, while to bound the last term, we have

We finish with the proof of Theorem 1.

Proof: Let I denote the number of phases up to time T. Let . We use Lemma 2 in each phase and sum over the phases, to have

where in the second inequality we use Jensen’s inequality, and in the second to last inequality the first statement of Lemma 1.

A. The BASE algorithm with ball adjustment

A natural way of generalizing the BASE algorithm to the multiclass case is by estimating the class probabilities in each ball. Note that this approach is naturally incremental w.r.t. the number of classes: new bins for counting are created on the fly as data points of new classes arrive.

Recall that the BASE algorithm greedly covers the input space. In particular, balls are always centered on input points. However, constraining the centers on data points is an intuitively sub-optimal strategy: it might be possible to cover the same region with a smaller number of balls if we could freely move their centers. As a full optimization of the position of the centers is not realistic in a streaming scenario, we introduce the BASE-ADJ variant which makes a partial optimization by using a step of the K-means algorithm [18]. More precisely, BASE-ADJ (Algorithm 3, only the main changes w.r.t. BASE are shown) moves the center of each ball towards the average of the correct classified data points falling into it. In this way, the center of the ball tends to move towards the centroid of a cluster of points of a certain class. We expect this variant to generate less balls and also to have a better empirical performance.

We drop from BASE-ADJ the Laplace correction of class estimates and the randomization in the computation of the predicted label. Although these ingredients were used in the theoretical analysis, we noticed that they do not significantly affect the empirical results. Hence, BASE-ADJ always predicts the class with the largest class probability estimate (majority voting on the collected labels) within the ball closest to the current data point.

B. The AUTO algorithm: automatic radius

One of the biggest issues with BASE (and ITBR) is the use of a common radius for all the balls. In fact, in line 28 of Algorithm 2 we have that the radii shrink uniformly with time t at rate , where is the estimated

Algorithm 4 AUTO and AUTO-ADJ

Input: procedure 2: // wait until at least two different labels fed

10: end if

11: end procedure 12: procedure ) 13: total class counts 14: 15: Predict 16: end procedure 17: procedure ) 18: // shrink radius on errors

22: // update ball centre if correct prediction

26: Updates label counts in the ball using

27: end procedure 28: procedure ) 29: 30: ball mistakes count 31: center updates count (for AUTO-ADJ) 32: Initialize label counts in the ball using

metric dimension. However, we would like the algorithm to use smaller balls in regions of the input space where labels are more irregularly distributed and bigger balls in easy regions, where labels tend to be the same.

In order to overcome this issue, in this section we introduce two other instances of ABACOC: AUTO and AUTO-ADJ. In these variants we let the radius of each ball shrink at a rate depending on the number of mistakes made by each local ball classifier, lines 20 and 36 in Algorithm 4. Moreover, in order to get rid of the parameter used to estimate the metric dimension, we initialize the radius of each ball to the distance to its closest ball, line 29 in Algorithm 4. In other words, everytime a new ball is added its radius is set equal to the distance to the nearest already-existing ball.

AUTO-ADJ differs from AUTO because it implements the same strategy, introduced in BASE-ADJ, for updating the position of the centers. Note that this strategy, coupled with the shrinkage depending on the number of mistakes, makes a ball stationary once it is covering a region of the space that contains data points always annotated with the same label.

Using balls of different radii makes it impossible to work with the automatic estimate of the metric dimension used in BASE, BASE-ADJ and ITBR. For this reason, we further simplify the algorithms by resorting to a fixed estimate of the intrinsic dimension d as an input parameter.

VI. EXPERIMENTS

In this section, we describe baselines and datasets used in the experiments and report on the obtained results. We conducted an extensive evaluation on standard machine learning datasets for the streaming setting. Generally, in real applications for high-speed data streams, when the system cannot afford to revise the current model after each observation of a data point, stream sub-sampling is used to keep the model size and the prediction efficiency under control. In order to emphasize the distinctive features of our approaches (i.e., good trade-off between accuracy and model size), we tested the online (prequential) performance using sub-sampling —see Algorithm 5. In this setting, the algorithms have access to each true class label only with a certain probability. By varying this probability, we can explore different model sizes for each baseline algorithm and compare the resulting performances. Note also that, while in this work we only consider random sub-sampling, different and more active sampling schedules could be also envisioned.

A. Baseline and datasets

We considered eleven popular datasets for stream mining listed in Table I.

TABLE I. DATASETS USED FOR BENCHMARKING.

As indicated in the table, datasets are from the Stream Data Mining repository (SDMR) [29], the Data Sets with Concept Drift repository (DF) [25], the Massive Online Analysis (MOA) collection3, and the LIBSVM classification repository4. In all experiments, we measured the online accuracy (prequential error in [13] or “Interleaved Test-Then-Train” validation in MOA5). This is the average performance when each new example in the stream is predicted using the classifier trained only over the past examples in the stream —see Algorithm 5 (line 6).

In a pre-processing phase, the categorical attributes were binarized. BASE and BASE-ADJ received normalized input instances (Euclidean norm) allowing the input parameter (space diameter) to be set to 1. We compared our ABACOC methods BASE6 (Algorithm 2), BASE-ADJ (Algorithm 3), AUTO and AUTO-ADJ (Algorithm 4) against some of the most popular incremental nonparametric baselines (see Section II) in the stream mining literature: K-NN with parameter K = 3 (NN3) (see next paragraph for a justification of this choice), Condensed Nearest Neighbor [27] (CNN), a streaming version of NN which only stores mistaken points, the multiclass Perceptron with Gaussian kernel [3] (K-PERC), a decision tree algorithm for streaming data [7] (VDFT), and a recent algorithm for learning decision rules on streaming data [12] (RULES). For VDFT and RULES we used the implementation available in MOA, while K-PERC was run using the code in DOGMA [20]. The ABACOC algorithms were implemented in MATLAB7. We did not consider the LTree [28] and IBLStreams [23] methods described in Section II as L-Tree is an efficient approximation of NN (outperformed by NN, see [28]) and IBLStreams never performs better than RULES (both implemented in MOA) on our datasets.

Where necessary, the parameters of the competitor methods were individually tuned on each dataset using an algorithm-specific grid of values in order to obtain the best online performance. Hence, the results of the competitors are not worse than the ones obtainable with a tuning of the parameters using standard cross-validation methods. For our methods, we used the Euclidean distance as metric . Based on preliminary experiments, we noticed that the parameter does not affect significantly the performance in AUTO and AUTO-ADJ, so we set it to 2. With fixed to this value, our methods are essentially parameterless, which is a very attractive feature in a streaming setting where cross-validation can not be easily applied.

B. Comparison among our methods

First, we compared the empirical behaviour of all our algorithms on the two-dimensional dataset banana,8 in Figure 1. The simplicity of this dataset allows us to show visually the difference between the four algorithms. BASE is seen to have many overlapping balls. On the other hand, AUTO has balls

Fig. 1. Empirical behaviours of all versions of ABACOC algorithm on 2000 datapoints of the banana dataset. The intensity of the colour of each ball is proportional to the conditional class probability of the two classes.

Fig. 2. Model size and online performance averaged over all datasets in Table I of our four methods. Performances are computed by normalizing each performance relative to the best performer for each dataset, and then averaging over the datasets.

of different radii and not so overlapping. Finally, BASE-ADJ and AUTO-ADJ, the variants of BASE and AUTO that update the centers of the balls, have a smaller number of balls than BASE and AUTO respectively. Also, note how the use of a varying shrinking radius in AUTO and AUTO-ADJ results in bigger balls that cover very large regions of the space. To verify the intuition emerged from Figure 1, we empirically tested the performance of our methods on the entire benchmark of Table I, running Algorithm 5 with rate = 1. In Figure 2(a), we show the resulting model sizes in terms of the stream length percentage used to represent the models (fraction of input samples used as ball centers) of each method averaged over all datasets in our benchmark suite. Figure 2(b) shows the average normalized accuracy of each method as a fraction of the accuracy of the best-performing method on each dataset. Note that, due to the adjustment procedure added to BASEADJ and AUTO-ADJ, they use a small fraction of data to represent their models while achieving a performance better than, respectively, BASE and AUTO. Finally, we observe that AUTO-ADJ simultaneously achieves the smallest model and the best performance.

C. Comparison against baselines

We now turn to describing the sub-sampling experiments. In a streaming setting, the model size and thus the computational efficiency of the prediction system is a key feature. The goal of the experiments is to show the trade-off between online performance and model size for each algorithm. The

Fig. 3. Online performance against model size averaged over the datasets. The model size is relative to the stream length, whereas the online performance is measured relative to the top-performing method on each dataset without restriction on model size.

model size is measured by: the number of balls used to cover the feature space (ABACOC), the number of stored instances (K-PERC, NN, CNN), the number of leaves (VFDT) or rules (RULES) used to partition the feature space.

We ran all the methods using values rate = {1%, 3%, 5%, 10%} and the same random seeds for all algorithms.9 In Figure 3, we plot the normalized online performance against model size, averaged over the datasets. The model size is relative to the stream length, whereas the online performance is measured relative to the top-performing method on each dataset without restriction on model size. As we can see from the plot, NN3 saturates the model size and achieves a slightly better overall performance on the larger model sizes. However, it suffers at low budget values and small model sizes. CNN works better than K-PERC and decision trees. VFDT and RULES use very little memory but have a worse performance than the other methods. BASE-ADJ improves on the performance of BASE. AUTO attains a better performance than BASE and AUTO-ADJ achieves the overall best trade-off between accuracy and model size. In fact, as we can see in Figure 3, the AUTO-ADJ curve dominates the other ones. Moreover, it attains 90% of the best full-sampling methods while using only 1.5% of the data to represent the model. Because of the better performance exhibited by our methods with respect to the baselines at the same model size values, we can infer that our methods have a better way of choosing the data points that define their models.

VII. CONSTANT MODEL SIZE

In this section we propose a simple method for making the memory footprint bounded, even in the presence of an arbitrarily long data stream. When the model size reaches a given limit, the algorithm starts to discard the examples supporting the model that are judged to be less informative for the prediction task. More precisely, it is reasonable to discard the local classifiers that are making the largest number of mistakes. This happens essentially for two reasons: 1) the optimal decision surface in that region is complex and/or the noise rate is high; 2) there is concept drift [26], that is the optimal decision surface is locally changing over time. Removing local classifiers with a high mistake rate may then help because: we are discarding classifiers that are making essentially random decisions; moreover, we make room for new classifiers that rely on fresh statistics (good in case of concept drift) and are possibly better positioned to capture a complex decision surface. Thus, in order to curb the memory footprint, we propose a simple approach based on deleting existing balls whenever a given budget parameter is attained. This is crucial for real-time applications, as NN search in the prediction phase is logarithmic on the number of balls. The probability of deleting any given ball is proportional to the number of mistakes made so far by the associated classifier. Namely, after the budget is reached, whenever a new ball is added an existing ball i is discarded according to the Laplacecorrected probability

where is the number of mistakes made by ball . We run the experiments in the same setting of Section VI-C, where we did not make any restriction on the sub-sampling rate (rate = 1 in Algorithm 5). We added to AUTO-ADJ a constant model size bound. With respect to sub-sampling, here the algorithm has more control over the data points that support the model. We report in Table II and in Table III the performance with budget 10% and 1% of the method AUTOADJ with constant budget, called AUTO-ADJ FIX, compared to NN3 and AUTO-ADJ which performed the best in the previous experiments using the same final model sizes. As

TABLE II. SUMMARY OF THE ONLINE PERFORMANCE (LEFT) AND MODEL SIZE (RIGHT) ON THE FULL BENCHMARK SUITE OF THE BEST THREE ALGORITHMS RUN WITH BUDGET OF THE TOTAL STREAM LENGTH (MODEL SIZE IS ALSO EXPRESSED AS A FRACTION OF THE STREAM LENGTH).

TABLE III. SUMMARY OF THE ONLINE PERFORMANCE (LEFT) AND MODEL SIZE (RIGHT) ON THE FULL BENCHMARK SUITE OF THE THREE BEST ALGORITHMS RUN WITH BUDGET OF THE TOTAL STREAM LENGTH (MODEL SIZE IS ALSO EXPRESSED AS A FRACTION OF THE STREAM LENGTH).

we can observe from these tables, AUTO-ADJ FIX generally outperforms the other methods at the same model sizes. This is very evident on the datasets with drift, such as electricity, and when the budget limit is very small (1% of the total stream length). Along the same lines of Figure 3, we show in Figure 4 the overall performance of the compared methods using all the budget/rate values {1%, 3%, 5%, 10%}. AUTOADJ FIX clearly outperforms all the other methods. This is not surprising, as AUTO-ADJ FIX has a better way of choosing the data points supporting the model as opposed to the random selection imposed on the other methods.

VIII. CONCLUSION AND FUTURE WORKS

We presented an intuitive and easy to implement approach for nonparametric classification of data streams. Our more sophisticated algorithms feature the most appealing traits in stream mining applications: nonparametric classification, incremental learning, dynamic addition of new classes, small model size, fast prediction at testing time (logarithmic in the model size), essentially no parameters to tune. We empirically

Fig. 4. Online performance against model size, averaged over the datasets. The model size is relative to the stream length, whereas the online performance is measured relative to the top-performing method on each dataset without restriction on model size.

showed the effectiveness of our approach in different scenarios and against several standard baselines. In addition, we proved strong theoretical guarantees on the online performance of the most basic version of our approach.

Further research will focus on finding a confidence measure for the prediction scores, which could be used in a semisupervised framework (e.g., active learning). Another interesting line of research is concerned with finding a more sophisticated and theoretically justified strategy to keep the model size bounded. A further, very challenging research line is in the direction of taming the curse of dimensionality problem that affects all nonparametric approaches. For instance, we plan on investigating notions of local dimensions that allow to perform dimensionality reduction locally and incrementally.

REFERENCES

[1] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[2] K. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, 2005.

[3] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. The Journal of Machine Learning Research, 3:951–991, 2003.

[4] R. De Rosa and N. Cesa-Bianchi. Splitting with confidence in decision trees with application to stream mining. In Neural Networks (IJCNN), The 2015 International Joint Conference on. IEEE, 2015.

[5] R. De Rosa, N. Cesa-Bianchi, I. Gori, and F. Cuzzolin. Online action recognition via nonparametric incremental learning. In Proceedings of the 25th British Machine Vision Conference (BMVC 2014), 2014.

[6] R. De Rosa, F. Orabona, and N. Cesa-Bianchi. The abacoc algorithm: a novel approach for nonparametric classification of data streams. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 2015.

[7] P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71–80. ACM, 2000.

[8] P. Duda, M. Jaworski, L. Pietruczuk, and L. Rutkowski. A novel application of hoeffding’s inequality to decision trees construction for data streams. In Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.

[9] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38(4):1258– 1270, 1992.

[10] Y. Freund and R. E. Schapire. Large margin classification using the Perceptron algorithm. Machine learning, 37(3):277–296, 1999.

[11] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. A survey of classification methods in data streams. In Data Streams, pages 39–59. Springer, 2007.

[12] J. Gama, P. Kosina, et al. Learning decision rules from data streams. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1255. Citeseer, 2011.

[13] J. Gama, R. Sebastiao, and P. P. Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317–346, 2013.

[14] E. Hazan and N. Megiddo. Online learning with prior knowledge. In Proceedings of 20th Annual Conference on Learning Theory, pages 499–513. Springer, 2007.

[15] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97–106. ACM, 2001.

[16] S. Kpotufe and F. Orabona. Regression-tree tuning in a streaming setting. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 1788–1796, 2013.

[17] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACMSIAM Symposium on Discrete Algorithms, SODA ’04, pages 798– 807, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.

[18] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. 1967.

[19] P. Matuszyk, G. Krempl, and M. Spiliopoulou. Correcting the usage of the hoeffding inequality in stream mining. In Advances in Intelligent Data Analysis XII, pages 298–309. Springer, 2013.

[20] F. Orabona. DOGMA: a MATLAB toolbox for Online Learning, 2009. Software available at http://dogma.sourceforge.net.

[21] J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243–272, 2012.

[22] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski. Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Transactions on Knowledge and Data Engineering, 25(6):1272–1279, 2013.

[23] A. Shaker and E. H¨ullermeier. Iblstreams: a system for instance-based classification and regression on data streams. Evolving Systems, 3(4):235–249, 2012.

[24] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. The Journal of Machine Learning Research, 2:67–93, 2002.

[25] Tsymbal. Data sets with concept drift, 2006. Available online at http: //www.win.tue.nl/mpechen/data/DriftSets/.

[26] A. Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106, 2004.

[27] D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine learning, 38(3):257–286, 2000.

[28] P. Zhang, B. J. Gao, X. Zhu, and L. Guo. Enabling fast lazy learning for data streams. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 932–941. IEEE, 2011.

[29] X. Zhu. Stream data mining repository, 2010. Available online at http://www.cse.fau.edu/xqzhu/stream.html.

designed for accessibility and to further open science