Due to the now ubiquitous presence of massive unlabeled datasets, recent years have seen an explosion in the search for computationally efficient, noise tolerant learning strategies that minimize the required amount of labeled data to learn a classifier. Active learning is a formalization of the PAC-learning paradigm for unlabeled data. In active learning, the learning algorithm has access both to either a stream or pool of unlabeled data, and an oracle which can label the data on request. The complexity of learning certain classes is then defined by their query complexity, the number of oracle calls required to almost learn the classifier with high probability. The goal in active learning is to adaptively choose data to send to the oracle in such a way that one uses much fewer queries than in the labeled case.
While active learning saw initial success in the noise-free regime with simple concept classes such as thresholds in one dimension, lower bounds [1] soon showed that important classes such as linear separators gave no improvement over PAC-learning, even in only two dimensions. However, subsequent work showed that slight tweaks to the model could overcome this barrier. Balcan and Long [2] showed that by assuming that the data was drawn from a log-concave distribution – a wide set of distributions including Gaussian distributions and uniform distributions over convex sets, learning homogeneous (through the origin) linear separators could be done in exponentially fewer queries than in the PAC model. Later, Balcan and Zhang [3] extended this to the more general class of s-concave distributions, a generalization of log-concavity that includes fat-tailed distributions as well. Rather than restricting the power of the adversary, Kane, Lovett, Moran, and Zhang [4] studied the effect on query complexity of empowering the learner. By allowing the learner to ask more complicated questions of the oracle, such as comparing two points, Kane et al. [4] showed that non-homogeneous linear separators in two-dimensions can be learned in exponentially fewer labeled samples than the PAC case. Later, Kane, Lovett, and Moran [5] extended this to higher dimensions using a complicated set of queries, and Hopkins, Kane, and Lovett [6] did the same by assuming weak concentration and anti-concentration on the distribution – conditions once again satisfied by s-concave distributions.
While query efficient algorithms in high dimensions are an important step towards the use of active learning on real world data, it is equally important that algorithms be computationally efficient and noise tolerant. In an early work, Castro and Nowak [7] provided query efficient algorithms for thresholding in one dimension in the presence of bounded (Massart [8]) and unbounded (Tsybakov [9]) noise under the uniform distribution on [0, 1]. Soon after, Balcan, Broder, and Zhang [10] extended these results to d-dimensional homogeneous hyperplanes over a uniform distribution on a ball. Years later, Hanneke [11] offered a more general analysis for Tsybakov noise based off of the distributional complexity measure the disagreement coefficient, and later with Yang provided a distribution-free analysis [12]. In another vein of work, Balcan and Long [2] provided an algorithm for learning d-dimensional homogeneous hyperplanes over nearly isotropic log-concave distributions with optimal query complexity for Tsybakov noise [13], a result which was later extended by Awasthi, Balcan, Haghtalab and Urner [14] to be computationally efficient for Massart noise when the distribution is restricted to uniform over the unit ball. Similarly, Balcan and Zhang [3] gave a computationally efficient algorithm for learning the more difficult adversarial noise model over s-concave distributions. Concurrently, Xu et al. [15] proposed using comparison queries as a sub-routine in previous algorithms to deal with noise in a computationally efficient manner, improving the overall query complexity along the way.
The comparison based methods of Xu et al. [15], however, do not carry over to the algorithmic technique proposed by Kane et al. [4] for learning non-homogeneous linear separators. Kane et al.’s technique is based upon logical inference. Viewing concept classes as the sign of an underlying family of functions, they build a learner via a linear program with constraints given by query solutions. As a result, the learners created by Kane et al.’s method actually fall into a stronger model than PAC-learning called Reliably and Probably Useful (RPU)-learning [16], variants of which have been studied more recently under a variety of names (e.g. KWIK learning [17], perfect selective classification [18], or confident learning [4]). In this model, the learner is not allowed to err, but may instead output “I don’t know” a small fraction of the time. While Kane et al.’s RPU-learner is computationally efficient, it is not tolerant to noise – the linear program is sensitive to errors in both labels and comparisons. This raises a natural question: can the inference based algorithms of Kane et al. be extended to noisy scenarios, and if so, does a strong reliability guarantee remain? In this work we answer these questions in the positive for Massart and Tsybakov noise. In both cases our algorithms satisfy a noisy version of RPU-learning: with high probability the learner makes no errors at all. Due to their similarity to RPU-learners, we call learners that satisfy this property Almost Reliable and Probably Useful (ARPU). Indeed, taking the limit of our reliability condition returns exactly the RPU model. Our work provides the first query and computationally efficient algorithm for PAC or ARPU-learning non-homogeneous linear separators in the presence of Massart noise over s-concave distributions, as well as more generally for hypothesis classes with finite inference dimension or small average inference dimension. In addition, we provide the first algorithm for ARPU-learning non-homogeneous linear separators under the Tsybakov Low Noise Condition.
Similar to how Xu et al. [15] use comparisons as a subroutine for correcting label errors, we use an approximate sorting scheme (modified from a seminal work from Braverman and Mossel [19] on sorting with noisy comparisons) to create a small set of points whose labels and comparisons are correct with high probability. We then feed this cleaned set into an inference LP, and repeat the process in a boosting style algorithm based off of the framework of [4]. By carefully curating the cleaned set at each step, we are able to use a symmetry argument from [4] to prove that our learners have good coverage, while the guarantees of [19] and the inference framework give reliability.
Our algorithms require the use of comparison queries, an addition which we show is necessary in many cases for active PAC and ARPU-learning. Along with recalling lower bounds from [6] which show comparisons are necessary for efficiently active learning non-homogeneous hyperplanes, we show that in the noiseless case it is impossible to ARPU-learn the uniform distribution over in a finite number of label queries. Further, even with the addition of a margin assumption we show the existence of simple distributions which require a number of label queries that is exponential in dimension. Because Massart noise and certain instantiations of Tsybakov noise subsume the noiseless case, these results prove the existence of a large gap between labels and comparisons for noisy ARPU-learning.
Our paper proceeds as follows. In Sections 1.1, 1.2, and 1.3 we cover preliminaries, our main results, and our main techniques respectively. In Section 2 we present query and computationally efficient algorithms for ARPU-learning hypothesis classes with finite inference dimension or super exponential average inference dimension under the Massart noise model, as well as a lower bound for ARPU-learning using only labels. In Section 3 we present algorithms for ARPU-learning linear separators with margin and finite inference dimension or over distributions with weak distributional conditions under the Tsybakov Low Noise Condition, as well as a lower bound for ARPU-learning a corresponding distribution with margin using only labels
1.1 Preliminaries
1.1.1 Basic definitions
A hypothesis class is a pair (X, H), where X is a set, and H is a class of functions . Each function
is called a hypothesis. We refer to
as the associated concept class. For example, when H is the class of
affine functions, then the associated concept class
is the class of d-dimensional half-spaces.
We consider the binary classification problem, where we want to predict the binary label y for each instance x. We assume access to an underlying unknown distribution over X and a label oracle
. Querying
with unlabeled
generates a label
, drawn from unknown distribution
Note that querying
on the same point again would generate the same answer. We use notation
to denote the joint distribution over examples x and labels from
1.1.2 PAC-Learning
Probably Approximately Correct (PAC) learning is a probabilistic framework due to Valiant [20] and Vapnik and Chervonenkis [21] for learning adversarially chosen classifiers and input distributions. In this model, given a set X and a set H of hypotheses , an adversary first chooses distribution
with the marginal distribution
, we call this realizable case learning. With no knowledge of the choice of distribution the learner draws labeled samples from
with the goal of outputting c = sign(h) for some hypothesis
which minimizes loss over
In the realizable case, a hypothesis class (X, H) is called PAC-learnable if , there exists a learner A, where no matter the choice of the adversary, outputs a concept A(S) such that:
Here is called the sample complexity, and must be
to be PAC-learnable.
1.1.3 RPU-Learning
Reliable and Probably Useful (RPU) learning is an alternative learning framework in which the learner is not allowed to make errors, but may instead respond “I don’t know”, notated by “”. Introduced by Rivest and Sloan [16], RPU learning was later studied under the name of Perfect Selective Classification by El-Yaniv and Weiner [18], and confident learning by Kane, Lovett, Moran, and Zhang [4]. Since it is easy to make a reliable learner by simply always outputting “
”, our learner must be useful, and with high probability cannot output “
” more than a small fraction of the time. Let A be a reliable learner and A(S) be the concept returned by the learner A on training sample S, then we define the loss of A(S) as the measure of unlearned samples:
We will commonly refer to . Sample complexity and learnability are then defined analogously to PAC-learning. Note that any point which is not labeled “
” by an RPU-learner is labeled correctly.
1.1.4 Comparison Queries
Following the framework of [4], our learner will have access to more information than just the label of a point. We focus on one particularly natural additional query, the ability to compare points. A comparison query measures the relative distance of two points to the decision boundary. In other words, say that our goal is to identify photographs of diseased vs healthy patients. A comparison query asks: “which patient looks healthier?”. Formally, given an underlying function and two points
, a comparison query asks which one of
is bigger. Equivalently:
Similar to our label oracle , we define a comparison oracle
with two points
generates a comparison result
, which is drawn from an unknown distribution
Along with their added theoretical power [4, 6], comparison queries are already used in practice in recommender systems [22] and ranking systems [19], and in some scenarios have better accuracy than label queries [15].
1.1.5 Inference Dimension
Inference dimension is a combinatorial complexity measure introduced by Kane et al. [4] to characterize the query complexity in active learning when the learner is allowed to ask a more complicated set of questions. Given a set of binary queries Q, let Q(S) denote the answers to all such queries on the sample be an unlabeled sample. For
denote the statement that answers to binary queries from Q on the sample S determine the label of x, when the learned concept is sign(h(x)), corresponding to an hypothesis h. We will often say for shorthand that S “infers” x, and sometimes drop the underlying classifier h. In this case the underlying function is assumed to be the Bayes optimal classifier. Inference dimension with respect to some query set Q is defined as follows.
Definition 1.1 (Inference dimension). The inference dimension of (X, H) is the minimal number k such that for every , and every
there exists
If no such k exists then the inference dimension of (X, H) is defined as
Inference dimension is a worst case measure. Since we will be dealing with varying levels of distribution dependence, we will also take advantage of an average case version of inference dimension introduced in [6].
Definition 1.2 (Average Inference Dimension). We say has average inference dimension g(n), if:
Average inference dimension is used to prove that the inference dimension of a finite sample drawn from cannot be too large with high probability. This allows us to build query efficient algorithms for hypothesis class with infinite inference dimension by proving that large finite samples do not take too many queries to learn with high probability.
1.1.6 Noisy Learning
Before we discuss our relaxation of RPU-learning, we formalize the presence of noise in our distributions. Given a hypothesis class (X, H), we assume the Bayes optimal classifier is some hypothesis with decision boundary
. Note that
itself can have non-zero error. To measure the noise in our model we define the conditional probability distributions
Note that for all the noise models discussed below, querying on the same point again (and similarly querying
with the same pair of points again) would generate the same answer. This is a realistic model for the case where the oracle is a human expert who may err with some probability across different inputs, but will always return the same answer on the same input.
Massart Noise Massart, or bounded noise, is a well studied model of noise throughout statistics and learning theory [2, 8, 15]. Massart noise is a tractable and realistic generalization of the standard random classification noise model [23], where the oracle flips its response with probability p < 1/2. Similar to [14, 15], we say “noisy” oracles and
satisfy Massart noise with parameter
if the conditional label and comparison distributions are such that
Equivalently, we say that ) satisfies Massart noise with parameter
, if an adversary constructs
(resp.
) by first taking the “clean” oracle
(resp.
) and then flipping the result of the oracle with probability at most
Tsybakov Low Noise Condition Massart error is restrictive in that the distributions and
are bounded away from
– in reality, this may not be the case as examples approach the decision boundary. The Tsybakov Low Noise Condition (TNC) [9] offers an alternative: the closer an example is to the decision boundary, the closer its error to 1/2. There is a natural extension of this intuition to comparison queries as well: comparisons made between arbitrarily close points should be arbitrarily noisy. A number of variants of TNC have been studied in the literature. Here we will follow the variant studied in [7, 24]. Let
be the Bayes optimal classifier. We say
satisfies the Tsybakov Low Noise Condition with parameters, m < M,
In other words, far away from the decision boundary satisfies Massart noise, but approaches 1/2 at a polynomial rate as x approaches the decision boundary. Similarly,
satisfies the Tsybakov Low Noise Condition with parameters,
Similar to the label case, satisfies Massart noise for pairs of points
which differ greatly with respect to
and approaches 1/2 at a polynomial rate as
approaches 0.
Generalized Tsybakov Low Noise Condition The Tsybakov Low Noise Condition upper and lower bounds correctness by a particular function of distance. We will consider the direct generalization of this model where these bounds are replaced with arbitrary monotone increasing functions . We say
satisfies the Generalized Tsybakov Low Noise Condition with parameters
Similarly, we say satisfies the Generalized Tsybakov Low Noise Condition with parameters
For notational convenience, we will sometimes write for
. In addition, since we will often need to compose
, we will use the simplified notation:
where c is some constant.
1.1.7 ARPU-Learning
RPU learning suffers from an inability to deal with noise. We introduce the learning framework Almost Reliable and Probably Useful Learning (ARPU-Learning), a relaxation of RPU-learning that allows for noise, but keeps stronger reliability guarantees than PAC-learning. Recall that given a distribution for a reliable learner A and sample S, we define the loss of A(S) as the measure of unlearned samples:
We will commonly refer to as the coverage of A(S). A model is a pair
where Q is a set of oracles
and
is a set of distributions over X. In ARPU-Learning, given a hypothesis class (X, H) and a model
, an adversary chooses a distribution
and the “noisy” oracles
, which induces a distribution
Definition 1.3 (ARPU-Learnable). We say that a hypothesis class (X, H) is ARPU-learnable under model , there exists a learner A which is
Note that in both Equations (5) and (6) the probability is over the randomness of the algorithm, sample S, and noisy oracles chosen by the adversary. Also, in comparison to PAC learning, all point which are not labeled “
” by an ARPU-learner are labeled correctly with high probability and setting
reduces exactly to RPU learning. Sample complexity and learnability are then defined equivalently to PAC-learning. Finally, we will refer to learners that satisfy condition (5) as
-useful, and learners that satisfy condition (6) as
-reliable. While the logical inference technique previously used to build RPU learners [4, 6] are very sensitive to noise, we show in later sections how to modify those techniques to build ARPU-learners.
1.1.8 Passive vs Active learning
PAC-learning traditionally is applied to supervised learning, where the learning algorithm receives pre-labeled samples. We call this paradigm passive learning. In contrast, active learning refers to the case where the learner receives unlabeled samples and may adaptively query a labeling or comparison oracle. Similar to the passive case, for active learning we study the query complexity as the minimum number of queries to learn some pair (X, H) in either the PAC, RPU or ARPU-learning model. In general, passive learners learn concept classes up to error in
samples. We add to a long line of work [2–5, 7, 14] showing that active learning can achieve such learning in only polylog
queries on important concept classes.
1.2 Our Results
In this work, we study ARPU-learning (Section 1.1.7) under two widely studied noise models: Massart Noise and the Generalized Tsybakov Low Noise Condition.
1.2.1 Notation
We use notation where X is the instance space, H is the set of hypothesis from is the class of linear separators in
(corresponding to affine functions
), and
is the class of linear separators in
with margin
. Since previous work [2, 14] refers to the class of homogeneous linear separators as simply “linear separators,” we will often refer to
as “non-homogeneous linear separators” to differentiate our results. For noise models,
is the set of all oracles which satisfy Massart noise with parameter
is the set of all oracles which satisfy the Generalized Tsybakov Low Noise Condition with parameters
is the set of all oracles satisfying the Tsybakov Low Noise Condition with parameters (
). A model is a pair
where Q is a set of oracles
is a set of distributions over X. For distributions over instance space
1. is the class of all continuous distributions over X,
2. is the class of all log-concave distribution on
3. is the class of all s-concave distributions on
4. is the class of all isotropic s-concave distributions on
5. is the class of all continuous distributions D which satisfy the following concentration and anti-concentration inequalities:
6. For hypothesis class is the class of all continuous distributions
over X such that
has average inference dimension
We will call an algorithm sample (respectively time) efficient if it uses samples (respectively time), and query efficient if it uses
queries. Finally, for some parameter n (e.g. dimension, error) and function
, for the sake of readability we will often use the notation
to ignore multiplicative factors that are logarithmic in f(n).
1.2.2 Massart Noise
To begin, we show that under the Massart noise model, finite inference dimension (Definition 1.1) implies computationally efficient ARPU-learning with exponentially better query complexity than any passive PAClearneris the set of all oracles which satisfy Massart noise with parameter
is the class of all continuous distributions over X, and a model is a pair
where Q is a set of oracles
and
is a set of distributions over X. Note that in the ARPU-Learning model (Definition 1.3), given a hypothesis class (X, H) and a model
, an adversary chooses a distribution
from
and the “noisy” oracles
Theorem 1.4 (Finite Inference Dimension ARPU-Learning under Massart Noise). Let the hypothesis class
, have inference dimension k with respect to comparison queries. Then, (X, H) is ARPUlearnable under model
, uses only
unlabeled samples, and has a query complexity of
for
To put this result into context, we note two lower bounds which together with Theorem 1.4 show a separation between passive and active learning, and label only and comparison based ARPU-learning. In the case of passive, comparison based PAC-learning, we recall the lower bound from [6]. For label only APRU-learning, we present a lower bound novel to this work:
Lemma 1.5. The query complexity of 1/4-reliably, 1/8-usefully ARPU learning under model
is infinite:
Together, these bounds show that comparison based active learning provides not only an exponential improvement in query complexity over any passive PAC-learner, but also an infinite improvement over any active ARPU-learner using only labels. Further, Theorem 1.4 provides the first algorithm for learning noisy non-homogeneous linear separators in two dimensions which is time, sample, and query efficient in the sense of Section 1.2.1, since the inference dimension of is 5 [4]. If the instance space has bounded bit-complexity or minimal-ratio, the result also implies an efficient learner for higher dimensional
non-homogeneous linear separators.
Bounded bit-complexity and minimal-ratio, however, are assumptions that may not hold on real-world data. Instead, we will take a path inspired by the recent explosion of work in data science [25] that focuses on weakly restricting the distribution over data to beat lower bounds based off of improbable adversarial examples. While inference dimension itself is not applicable in this scenario, we will employ its average case variant, average inference dimension (Definition 1.2). In particular, we provide a computationally efficient algorithm for learning under Massart noise under the assumption that the hypothesis class and distribution have super-exponential average inference dimension, a fact true for non-homogeneous linear separators and comparison queries across a wide range of distributions [6]. Given a hypothesis class (X, H), recall is the class of all continuous distributions
has average inference dimension
and function of dimension f(d). Then,
Theorem 1.6 (Average Inference Dimension ARPU-Learning under Massart Noise). Consider any hypothesis class
, and corresponding class of distributions
. Then, (X, H) is ARPU-learnable under model
in time
, uses only
unlabeled samples, and has a query complexity of
for small enough
To see the applicability of Theorem 2.8, we note that Hopkins et al. [6] proved that a wide range of distributions lie in . In particular, following [6], we say two distributions
affinely equivalent if there is an invertible affine map
. Hopkins et al. [6] proved that distributions which may be affinely transformed to a distribution with anti-concentration and concentration (i.e. to a distribution in
) lie in
, a condition satisfied by s-concave distributions
. Recall that
is the class of all s-concave distribution,
, on
and
is the class of both homogeneous and non-homogeneous linear separators in
. Then, as a direct corollary to Theorem 1.6, we have
Corollary 1.7. The hypothesis class is ARPU-learnable under model
for small enough
Previous work showed a similar result for homogeneous linear separators over nearly isotropic log-concave distributions [26] and isotropic s-concave distributions [3] with label queries. However, their techniques cannot be extended to the non-homogeneous case due to a lower bound on the query complexity of active label-only learners [6]. Thus it is only by leveraging the additional power of comparison queries that we extend efficient learning to non-homogeneous linear separators over (not necessarily isotropic) s-concave distributions.
1.2.3 Generalized Tsybakov Low Noise Condition
While Massart noise is a clean theoretical model, its assumption that the noise is bounded away from 1/2 is not necessarily reminiscent of the real world. This motivates us to study a variant of the Tsybakov Low Noise Condition, a model in which noise is unbounded as data approaches the Bayes optimal classifier. However, learning in this unbounded regime is harder, as evidenced by the polynomial query lower bounds of [12, 13, 15]. In order to ARPU-learn in this regime, we need to introduce several restrictions not present for our Massart algorithms. First, instead of allowing any hypothesis class with finite inference dimension, we will only consider (non-homogeneous) linear separators. Second, we will either assume some margin , or that the distribution satisfies certain weak concentration and anti-concentration bounds. To begin, we consider learning hypothesis classes over any continuous distribution with finite inference dimension and margin. Recall
is the set of all oracles which satisfy the Generalized Tsybakov Low Noise Condition with parameters
is the class of linear separators in
with margin
from X, and
is the class of all continuous distributions over X.
Theorem 1.8 (Finite Inference Dimension and Margin ARPU-Learning under GTNC). Let
and
have inference dimension k with respect to comparison queries. Then for small enough
,
is ARPU-learnable under model
with query complexity:
Where
We prove in addition that while ARPU-learning may no longer be impossible using only labels when margin is introduced, it still suffers from query inefficiency due to the curse of dimensionality.
-dimensional hypercube
modified to have a ball of radius
about each point. The query complexity of ARPU-learning under model
is at least:
In the above example, has inference dimension
by a minimal-ratio argument from [4]. Theorem 1.4 thus gives an algorithm using only poly(d) queries, demonstrating the exponential gap in query complexity between label only and comparison based ARPU-learning with Tsybakov noise. Due to margin causing bounded error in label queries, another way to view this result is the statement that comparison queries with unbounded error exponenentially improve the query complexity of ARPU-learning using only labels with bounded error.
Similar to the case of Massart noise, we may drop the restrictive assumptions of finite inference dimension and margin by assuming weak distributional requirements. Unlike in the case of Massart, here we deal with the requirements directly rather than assuming average inference dimension. Recall is the class of all continuous distributions D with the following properties:
1.
2.
Theorem 1.10 (Concentration and Anti-Concentration ARPU-learning under GTNC). For small enough
, the hypothesis class
is ARPU-learnable under model
query complexity:
where
Since isotropic s-concave distributions satisfy these conditions [3, 6], we get the immediate corollary for TNC noise under isotropic s-concave distributions. Recall that is the class of all isotropic (0 mean, identity variance) s-concave distribution on
is the class of non-homogeneous linear separators in
Corollary 1.11. The hypothesis class is ARPU-learnable under model
with query complexity:
Where
This result similarly extends previous work on homogeneous linear separators over isotropic log-concave distributions [2, 13] to the non-homogeneous case. In comparison to Hanneke and Yang’s [12] distribution free algorithm for label only PAC-learning, Corollary 1.11 provides an improved query complexity for and more importantly provides the reliability guarantees of the ARPU-learning model.
Finally, note that unlike Theorems 1.4, 1.6, and 1.8, Corollary 1.11 has polynomial rather than polylogarithmic dependence on . This is unavoidable, as we prove a lower bound also polynomial in
Lemma 1.12. The query complexity of actively PAC-learning under model
is at least
Thus the main advantage of comparisons in this regime is their added reliability.
1.3 Techniques
1.3.1 Inference Dimension
Our algorithms will follow the form of the learning technique for hypothesis classes with finite inference dimension (Definition 1.1) introduced in [4]. Drawing and querying a subsample S, Kane et al. build a weak learner by defining a Linear Program (LP) with constraints given by the query responses Q(S), and objective function defined by the input point to be labeled. Through a symmetry argument, Kane et al. are able to show that if S is large enough with respect to the inference dimension, the coverage of this weak learner will be at least 3/4. Since we will rely on this argument throughout our paper, we offer a brief description here.
The expected coverage of the learner may be viewed as the probability that a randomly drawn point from the distribution is inferred by the LP. Since our weak learner is built from some finite sample from the same distribution, symmetry gives that this is equivalent to the probability that any of |S| + 1 points can be inferred from the other |S|. Kane et al. then provide the following observation for |S| = n and inference dimension k which proves that setting n = 4k gives coverage at least 3/4.
Observation 1.13 (Observation 3.4 [4]). Let the hypothesis class (S, H), |S| = n, have inference dimension k for the set of binary queries Q. Then , there exists a subset
of size
such that
Inference dimension on its own, however, is restrictive. Using only comparisons and labels, the inference dimension of linear separators in three or more dimensions is infinite, which implies the existence of realizable distributions with query complexity [4]. To get around this barrier, we will introduce weak distributional assumptions and instead employ the framework of average inference dimension introduced in [6]. Average inference dimension (Definition 1.2) allows us to build algorithms for hypothesis classes with infinite inference dimension, as long as the distribution it is over is sufficiently nice. We will take advantage of a reduction from average to worst case inference dimension to prove such results:
Observation 1.14 (Observation 3.6 [6]). Let (D, X, H) have average inference dimension Then (S, H) has inference dimension k with probability:
1.3.2 Noisy Sorting
The linear program used as a weak learner relies heavily on the correctness of Q(S), making noisy oracles a challenging problem. To retain correctness and reliability, we rely on using extra points outside of the linear program to help identify the true answers Q(S). This idea is not all together new. Contemporaneously with Kane et al., Xu, Zhang, Singh, Miller and Dubrawski [15] suggested using noisy comparisons as a sub-routine in older active learning algorithms to correct for noise in labels. However, as Xu et al. [15] point out, this technique does not work for Kane et al.’s [4] algorithm which requires corrected comparisons as well. Instead, we adopt and adapt a noisy sorting algorithm from Braverman and Mossel [19].
Braverman and Mossel [19] study the problem of recovering the best possible ranking from an ordered set with access to a noisy comparison oracle . In particular, given a ground set S of size n, Braverman and Mossel aim to find an order
that minimizes the number of discrepancies with the measured comparisons
, denoted by the order relation
If the oracle flips comparisons with probability exactly p < 1/2 and the true ordering has a uniform prior, Braverman and Mossel [19] note that this scoring function has a nice probabilistic interpretation: it is a Maximum Likelihood ordering
Braverman and Mossel [19] call finding such an ordering the Noisy Signal Aggregation (NSA) problem, and provide a randomized algorithm that uses only comparisons for oracles satisfying Massart noise with parameter
. Further, they provide an important structural insight into MLE orderings: with high probability, no point in an MLE order has moved further than
from its position in the true order.
Theorem 1.15 (Optimal Ranking [19]). Let S be a set of size n with underlying order 1, . . . , n and an MLE order for S under comparisons given by an oracle
satisfying Massart noise with parameter
with probability at least
as long as is at least exponential in
This pointwise movement allows us to determine with high probability comparisons between points that are well-separated throughout an MLE order. By using only such separated points to build our inference LP, our algorithms are almost reliable – a point can only be mislabeled if some well-separated comparison is wrong, a low probability event.
While Braverman and Mossel’s algorithm is query efficient and has a strong pointwise movement guarantee, its exponential time complexity in the error parameter is the main limiting factor in the computational efficiency of our algorithm for Massart noise. The existence of an efficient (polynomial in error) sorting scheme that retains some sub-linear (not necessarily logarithmic) point-wise movement bound under Massart noise would immediately imply computationally and query efficient algorithms for Massart noise for poly-logarithmic in
, rather than for
as we require. Follow up works on Braverman and Massart’s algorithm [27–29] made progress in this direction, providing algorithms with significantly improved time complexity, but only work for
bounded from below by some constant. Providing an algorithm that remains efficient while
goes to 0 is an open problem.
1.3.3 Cluster Detection and Inference
Braverman and Mossel’s noisy sorting algorithm works well in the case of bounded error, but noise models with unbounded error require a different approach. The particular model we examine in this case, the Generalized Tsybakov Low Noise Condition, is a distance based error metric. This means that as points approach each other in function value, their comparisons “look random”. We can use this fact to detect clusters of points close in function value by testing whether comparisons between them look like they have been drawn at random. In particular, we define a natural measure of randomness that we call equitability:
Definition 1.16 (Equitability). Let S be a set with comparisons denoted by on each pair of elements. For an element
denote the number of elements
-equitable if
We prove a bi-directional equivalence between clusters and equitable sets: any cluster is equitable with high probability, and any equitable set contains a large cluster with high probability.
If a sample has no cluster, we will prove that a modified version of noisy sorting is sufficient to learn. On the other hand, if we detect a cluster, another approach is required. To handle this case we prove a novel structural lemma regarding the inference power of clusters, showing that any cluster of size contain a point that can be inferred from the rest.
2.1 Lower Bounds
In this section we provide two lower bounds: the first to separate comparisons from label only ARPU learning, and the second to explain our restriction to continuous distributions. Our label only lower bound uses the same distribution that shows an exponential gap in active PAC learning between labels and comparisons [1, 4], a circle, except in the case of ARPU learning, the gap is infinite.
Lemma 2.1 (Restatement of Lemma 1.5). The query complexity of 1/4-reliably, 1/8-usefully ARPU learning -coverage under model
is infinite:
Proof. By Yao’s minimax principle it is enough to show that the adversary may pick a distribution over hyperplanes such that no learner can 1/4-reliably and 1/8-usefully learn with coverage 1/2. In particular, assume that the adversary picks a uniform distribution over all tangent hyperplanes to the circle. This may be equivalently thought of as the adversary picking a single point on the circle to be negative, and the rest to be positive. Note that the probability that a learner which queries a finite number of points finds the negative point is 0.
Let the learner fix an optimal strategy, querying whatever points they desire. With probability 1, the learner will always query the same set of points since they receive all positive labels. The learner is then left to label 1/2 the measure of the circle blind, since all points except a measure 0 set (the queried points) are equally likely to be the negative point. No matter which set the learner chooses, the probability that it mislabels a point is at least 1/2, violating the ARPU-learning requirement that the learner must must label at least 1/2 of the points with probability at least 5/8 while making no errors.
Note that this lower bound holds even in the noiseless case, which is strictly weaker than Massart as the adversary may simply choose no noise.
Second, we justify why our upper bounds are only for continuous distributions, as the inference dimension framework was initially developed for the worst case rather than distributional model. However, with the introduction of noise, we observe that learning up to arbitrary error is no longer possible over some distributions.
Observation 2.2. Let (X, H) be a hypothesis class, and D a distribution on X whose support consists of a single point x. Let the corresponding noisy label and comparison oracles . If there exist
, then no learner can correctly label x with probability more than
This lower bound holds as well across a wide range of distributions containing points with non-zero measure. Take, as an example, a distribution which samples uniformly from the unit ball with probability 1/2, and some disjoint point x with probability 1/2. Setting the error parameter low enough would force the learner to correctly label x, and since the adversary can pick a classifier such that the point cannot be inferred from comparisons, a similar lower bound holds. In order to avoid such examples, we will restrict our consideration to continuous distributions.
2.2 Finite Inference Dimension
With the lower bound out of the way, we prove that hypothesis classes with finite inference dimension are efficiently ARPU-learnable under Massart noise. Recall is the set of all oracles which satisfy Massart noise with parameter
is the class of all continuous distributions over X and a model is a pair
is a set of oracles
is a set of distributions over X. Note that in the ARPU-Learning model (Definition 1.3), given a hypothesis class (X, H) and a model
, an adversary chooses a distribution
and the “noisy” oracles
. The following is our learning algorithm.
Before proving the lemmas necessary to show the coverage of Step 2 from Algorithm 1, we will restate our theorem of the efficient learnability of hypothesis classes with finite inference dimension under Massart noise.
Theorem 2.3 (Restatement of Theorem 1.4). Let the hypothesis class , have inference dimension k with respect to comparison queries. Then, (X, H) is ARPU-learnable under model
in time
, uses only
unlabeled samples, and has a query complexity of
for small enough
See Algorithm 1. The proof of this theorem lies in the combination of Braverman and Mossel’s [19] approximate ordering with Kane et al.’s [4] inference based algorithm. The idea is as follows:
Step 1: Draw a sample , and sort it into an MLE order by the algorithm of Braverman and Mossel [19]. Draw another m points, and independently slot them into the ordering on S near their true position (again by an algorithm from [19]).
Step 2: From the m points, create a clean subset of points with correct labels and comparisons by selecting a chain of points separated by in the MLE order of S to build an inference LP. This LP correctly infers points with high probability by [19], and has large coverage due to the space’s finite inference dimension [4].
Step 3: Restrict D (by rejection sampling) to points un-inferred by the LP in step 2, and repeat steps 1 and 2 until coverage has reached
The main challenge of the proof then comes down to proving the correctness and coverage of Step 2. First, we need to show that points separated in are correctly ordered. Since our sample size n will not be exponential in
, we need to slightly modify Theorem 1.15 for this result.
Observation 2.4 (Point-wise Movement). Let S be a set with underlying order is an MLE order for S under noisy label and comparison oracles
, then with probability at least
as long as is polynomial in
Proof. Braverman and Mossel define a parameter during their proof as:
The requirement on size of of Theorem 1.15 then comes from the final equation of Lemma 28 [19]:
Increasing by a factor of
removes the need for exponential dependence on
, but increases the pointwise movement bound by the same factor.
Note that points in S separated by 2 max in an MLE order are thus correctly ordered with high
probability. As a result, picking a chain of points each separated by twice Equation (7) gives an entire set of points with correct comparisons with high probability.
However, using an MLE order itself is challenging. Recall from Section 1.3 that we compute the expected coverage of our learner by the probability that it infers an additionally drawn point. If we use an MLE order, we cannot directly appeal to the symmetry argument of [4], as adding an additional point to S might change the MLE order we have picked. To get around this, our learner is not built off of S itself, but , a set of additional points which we place into the order on S independently of each other. This independence allows us to directly appeal to the argument of [4].
Our method of finding a clean subset, however, is currently for S – we need to modify the method to find a subset of with correct labels and comparisons. We do this in two steps. First, we adopt a method from [19] for inserting points into a previously sorted set such that they cannot be too far away from their true position. This implies that if points in
are separated by enough points in S in the underlying order on
, we will be able to correctly compare them with high probability. Second, we show that because the underlying true order on
is uniform, there exists a chain of such points in
with constant probability from which we can build our cleaned set.
Lemma 2.5 (Slotting [19])be ordered sets with noisy label and comparison oracles
. Divide an MLE order
of size at least:
There exists an algorithm placing points in into
such that, with probability at least
, any pair of points separated by 4 blocks are in the correct order.
Proof. This lemma is a slight modification of part of [19, Theorem 30]. Assume some lies in the i-th block
in the true order. By Observation 2.4, with probability at least
must be bigger than all elements before
and smaller than all elements past
. To find which side of a block B x lies in, we measure whether x is greater than, or less than a majority of elements in the block. A standard Chernoff bound gives that the probability the majority is incorrect is at most
, and union bounding over blocks and
gives that all elements will be slotted up to an error of two blocks on either side. Note further that this slotting procedure may be performed by binary search, and thus uses at most O(log(b)m|B|) queries in total. Finally, since elements must be slotted within two blocks of their true position, any pair of elements separated by at least 4 full blocks must be in the correct order.
Since we can safely compare points separated by 4 blocks in the MLE order, and points slot within 2 blocks of their true position, points separated in the underlying order by 8 blocks can be correctly compared with high probability. It is left to show that there is a large enough chain of points in separated by 8 blocks in S.
be ordered sets with noisy label and comparison oracles
. Let the size of S satisfy:
Then with constant probability we can find a subset of 4k points from which can be labeled and compared correctly with probability
Proof. Consider the true order on the set
be the special point whose comparison to another point x is given by x’s label
. Lemma 2.5 provides an algorithm for determining the labels and comparisons of points in
separated by more than
elements in S and not within c of 0. Consider dividing the order restricted to S (denoted by
) up into b = 32k + 16 equal blocks
of size at least c. Since any two points in
which are separated by more than a block in
will be correctly ordered by Lemma 2.5 with probability
non-contiguous blocks can be adjacent to 0, it is sufficient to find a chain of non-contiguous blocks of size 4k + 2 that all contain a point in
. To simplify this, consider the set of every other block (
), and let Y be the random variable denoting the number of blocks in
without a point in
. To upper bound the value of Y , we bound its mean and variance and apply Chebyshev’s inequality. Note that since
are drawn i.i.d., the ordering on
is uniform at random. We can write Y as the sum of indicator variables
where
denotes the event that
does not have a point in
. Since the ordering is uniform, the probability that a point in
lies in any given block is
. Using this, we can bound the expectation of Y by:
and similarly the variance of Y by:
Here the second to last inequality follows from the assumption that (or equivalently that
). Noting that the number of blocks with a point from
is
, Chebyshev’s inequality then gives that a constant fraction of the blocks must have a point from
with constant probability:
Lemma 2.6 allows us to build a clean set of points with correct comparisons and labels. By feeding this set of points into an inference LP, we create a weak learner that infers a constant fraction of the space with constant probability.
Lemma 2.7 (Weak Learner). Let (X, H) have inference dimension k, and let the label and comparison oracles . Then there exists a constant
such that for any
, there exists a weak learner that
-reliably learns (X, H), has coverage
with probability
, makes at most
queries, and runs in time
Following Lemmas 2.5 and 2.6, we we will slot a second, i.i.d. drawn set of points into our MLE order where
. Then with constant probability we can find a subset of
which may be
correctly ordered and labeled with probability at least
We are now in position to apply the symmetry argument from [4] to show that this subset gives constant coverage with constant probability. The expected coverage is given by the probability that an additional, independently drawn point is inferred:
Since are drawn randomly, the right hand side is equivalent to the probability that any point in the sample can be inferred from the rest:
Recall that with probability at least we can find and, with probability
, correctly order and label a subset of 4k points from
. By Observation 1.13, at least 3k of these can be inferred from the rest, bounding the right hand side by:
where we have assumed . Then for any constant
which for small enough
Accounting for the fact that we have assumed our comparisons and labels are correct, our weak learner has coverage with probability at least
Query Complexity: Now, we compute the number of queries made by the weak learner. Let where
is the point-wise movement as defined in Observation 2.4. Using the same notation as [19], we let (setting
Using [19, Lemmas 31 and 32], the number of queries made in the sorting n points (which includes dynamic programming step on n points and slotting n points) and slotting additional points are
with probability . Since we do not want our number of queries to be probabilistic, if our learner does not complete after
queries, we stop and output all 0’s. This increases our error probability by
Time Complexity: Using an algorithm from [19, Theorem 30], we can sort n points with noisy comparisons in time where
with probability
. Since slotting a point in worst case takes O(n) time, we can slot O(k) points in time O(kn). This gives us the total time taken by the weak learner as
Therefore, the time complexity of the algorithm is . Once again taking the strategy of outputting all 0’s if the algorithm does not complete in time
, we lose another error factor of
, making the algorithm all together
With our weak learner in hand, all that is left for the proof of Theorem 2.3 is Step 3: stringing together copies of the weak learner through rejection sampling.
be reliability and usefullness parameters for our weak learner. Recall that Lemma 2.7 gives a
-reliable weak learner with coverage
with probability
. Applying this weak learner
then amplifies this probability to at least
Restricting to the distribution of un-inferred points via rejection sampling, we repeat the above process until our coverage reaches . Assume each repetition is successful, then after t steps our coverage is:
Setting is then sufficient to set the right hand side to
. However, each repetition in this process degrades the overall probability of usefulness. In order to get an overall guarantee of
, we must adjust our initial
Similarly, since we apply the weak learner times, we adjust our
Query Complexity: In total, we run our weak learner at most times, giving a query complexity of:
Sample Complexity: At each step of our algorithm, we restrict to the distribution of un-inferred points through rejection sampling. By itself, this poses a problem: what if we have inferred much of the space early and our algorithm continually rejects points? To combat this, we note that we can estimate the measure of remaining un-inferred points by how many samples we have to draw before finding one. Formally, if at any step we draw inferred points in a row, then by a Chernoff bound the coverage of our learner is
with probability at least
be the sample size as defined in Lemma 2.7. Since our algorithm
our algorithm stops after rejecting points in a row. This means that we can bound the total number of samples drawn by
Time Complexity: The time complexity of our algorithm has two main components: the complexity of finding an MLE order in the weak learner, and the complexity of rejection sampling. We already computed the time complexity of the weak learner in Lemma 2.7 as . Since,
we run our weak learner at most
It remains to compute the time complexity of rejection sampling. Recall that the we sample at most points total in our process. For each point, we run an LP in d + 1 variables with constraints detailed by our previous queries that round. Since the queries our weak learner uses in each round only involve
points, the time complexity of sampling is at most:
Since the total time complexity is order of the sum of sampling and sorting, we get an algorithm that runs in time
2.3 Average Inference Dimension
While inference dimension allows us to work over arbitrary continuous distributions, as a complexity parameter it is rather restricting, barring for instance the learning of linear separators in dimensions above two. To generalize to a broader range of classifiers, we will use the framework of average inference dimension introduced in [6]. In particular, we show that any hypothesis class and distribution with super-exponential average inference dimension may be efficiently learned under Massart noise. As a result, we provide the first computationally and query efficient learner for non-homogeneous linear separators over s-concave distributions with Massart noise.
Theorem 2.8 (Restatement of Theorem 1.6). Consider any hypothesis class (X, H) and corresponding class of distributions . Then, (X, H) is ARPU-learnable under model
in time
, uses only
unlabeled samples, and has a query complexity of
for small enough
Average inference dimension gives a high probability bound on the inference dimension of a finite sample. However, shifting our strategy to directly work with a finite samples introduces a new problem: since our algorithm corrects noise via extra helper points, we may not be able to learn the entire sample. Our first step will be to show that learning most of a finite sample in few queries with high probability is sufficient to learn the entire distribution.
Lemma 2.9. Let (X, H) be a hypothesis class, and a distribution over X. Let A be an active, inference based learner taking in finite samples
with the property that for sufficiently large n, A learns a
fraction of S with probability
, while querying at most an
fraction of the points. The expected coverage of A over the entirety of X is at least:
Proof. To find the expected coverage of A over the entire distribution based on samples S of size n, we look at the probability that an additional randomly drawn point is inferred:
We can bound the right hand term by looking at A applied to samples of size n + 1. In particular, if
is learned but not queried by A, then because A is an inference based learner, it must be the case that
infer
. Since the points of
are drawn i.i.d from
, the probability that A queries or learns any given point
is the same for all
. Because a
fraction of points are learned with probability
and only an
fraction of points are queried, the probability that a point is learned but not queried is at least
by a union bound, which gives the desired bound on A’s coverage.
Proof of Theorem 2.8. We will argue that the learner presented in Theorem 2.3 satisfies the properties of Lemma 2.9 for a large enough sample size. To prove this, we first examine learning a specific sample with small inference dimension. The coverage over all samples will then follow from the fact that almost all samples have small inference dimension due by Observation 1.14 [6] and our assumption on average inference dimension.
Because we are considering a fixed sample S, the weak learner draws uniformly without replacement from S (denoted ) rather than from the distribution itself. All required symmetry arguments still hold in this regime, as the order that points are pulled is still uniformly random. The expected coverage of our learner over S is thus the same as for X in Lemma 2.7 adjusted for the fact that we sample without replacement:
and hence
Assume for now that |S| is large enough that the subtracted term is negligible. To analyze the remaining coverage probability, assume that n satisfies the constraints of Lemma 2.7 with , and further that S has inference dimension k. Then by the arguments in Lemma 2.7, this probability over the sample itself and noisy oracles is constant. Further, as long as S is sufficiently large, we can get coverage
with probability
by applying the same argument restricted to the subset of un-inferred points
times. This argument only fails when there are no longer n remaining points for our weak learner to use, but as long as
, this will not affect our coverage. Since Lemma 2.9 also only allows the learner to query a
fraction of points, we set S to:
which also validates our assumption that is negligible (we lose less than
over all iterations). To apply Lemma 2.9, it is sufficient to have a learner A such that:
Because , S has inference dimension k with probability at least
by Observation 1.14 [6]. Combining this with the fact that our algorithm has a
probability of achieving
coverage when the inference dimension is k proves this claim.
Finally, by Lemma 2.9, our learner has expected coverage is over the entire space. To get the desired coverage probability, we run the algorithm over
samples, setting
to amend the degradation of correctness over repetition. Then by the same argument as Theorem 2.3, our query complexity is:
Sample and time complexity follow similarly to Theorem 2.3.
The Massart noise model does well to capture situations with adversarial bounded noise, but even in a realistic non-adversarial scenario, error may not be bounded away from 1/2. One might think, for instance, that label noise should be bounded as a function of the distance to the Bayes optimal classifier, reaching purely random labels on the decision boundary itself. Likewise, comparisons between arbitrarily close points should be difficult, with error approaching 1/2 as well. This motivates us to study the Tsybakov Low Noise condition, a popular instantiation of distance-based noise. However, learning in this unbounded regime is harder, as evidenced by polynomial query lower bounds [13, 15], and the lack of computationally efficient algorithms for the model. In order to ARPU-learn in this regime, we need to introduce more stringent restrictions than for Massart noise. First, instead of allowing any set system with finite inference dimension, we will only consider non-homogeneous linear separators. Second, we will either assume some margin , or that the distribution satisfies certain weak concentration and anti-concentration bounds, a property which implies our earlier assumption for Massart noise of super-exponential average inference dimension.
3.1 Finite Inference Dimension and Margin
In this section, we will consider ARPU-learning hyperplanes over any continuous distribution with finite inference dimension and margin. Note that in the GTNC model, introducing margin bounds the error on label queries away from 1/2. Thus our results should informally be viewed as saying the following: comparison queries with unbounded error exponentially improve query complexity over label queries with bounded error in the ARPU-learning model. Indeed, although we have picked a specific model of bounded label error in this case, trading for another model such as Massart noise on labels causes no significant change to our upper or lower bound.
As in the case of Massart noise, we will first show the gap in query complexity between label only and comparison ARPU-learning. Our previous method showed an infinite gap between the two regimes, but the assumption of a non-zero margin requires a different argument. In this case, we will show a family of examples in which comparisons provide an exponential improvement.
Lemma 3.1 (Restatement of Lemma 1.9). Let be the d-dimensional hypercube
modified to have a ball of radius
centered about each point. The query complexity of ARPU-learning
under model
is at least:
Proof. For simplicity, the adversary will pick the uniform distribution from , and the noiseless case from
. Further, by Yao’s minimax principle it is sufficient to show there is a distribution over hyperplanes in
for which no learner can achieve at least 3/4 coverage with perfect correctness
with greater than 3/4 probability. Let the adversary pick the uniform distribution over the hyperplanes which truncate corners of the hypercube, e.g.
Note that these hyperplanes have margin , so they lie in
, and that each one may be seen as selecting a single ball to be negative. Given any set strategy, the learner can only query points in
out of
balls. The probability that one of the balls the learner queries is the negative ball is at most 1/2. If the learner does not locate the negative ball, to have coverage 3/4 it must label half of the remaining space with no additional queries. However, any set strategy from the learner in this case will have an incorrect label with probability at least 1/2 since the negative ball is uniformly distributed over the remaining balls. Thus any learner that has 3/4 coverage with probability more than 3/4 must incorrectly label some point, violating the conditions of ARPU-learning.
By an argument based on minimal-ratio (margin normalized by the maximum function value) from [4], the inference dimension of the above hypothesis class is . We will prove that this implies a comparison based algorithm that only makes poly(d) queries.
Theorem 3.2 (Restatement of Theorem 1.8). Let and
have inference dimension k with respect to comparison queries. Then,
is ARPU-learnable under model
query complexity:
Where
See Algorithm 2. Unlike the Massart case, we can no longer directly rely on the sorting algorithm of [19], as the point-wise movement guarantees rely on bounded noise. Instead, we rely on the fact that we can, with high probability, check the level of noise of a drawn sample. If the sample is not too noisy, we can modify the bounds of [19] and apply the same technique. On the other hand, if the sample is very noisy, we use this to infer structural information about the sample and thus learn some fraction of the instance space. Informally, our algorithm follows a similar three step process to the Massart case:
Step 2a (high noise): If S measures as noisy, we identity a subset of points which are close with respect to the underlying hypothesis. Using additional randomly drawn points, we create an inference LP based on the structure of
to learn a fraction of the instance space.
Step 2b (low noise): If S measures as having only a small amount of noise, sort S into an MLE order, and apply the same learning strategy as for Massart.
Step 3: Restrict D (by rejection sampling) to points un-inferred by the LP in step 2a/b, and repeat steps 1 and 2a/b until coverage has reached
At the core of this technique is the ability to detect subsets with high levels of noise, and to certify that they are highly structured. With this in mind, we show that if comparisons on a subset of S look sufficiently random, then almost all points in this subset are clustered together in function value. Formally, we define a cluster as:
Definition 3.3 (Cluster). Let (X, H) be a set system. Given and a sample
with respect to h if
We will often omit “with respect to is the function underlying the Bayes optimal classifier.
We will detect clusters by a measure of randomness we term equitibility, the condition that every element is bigger than about half of the other elements.
Definition 3.4 (Equitability). Let S be a set with comparisons denoted by on each pair of elements. For an element
denote the number of elements
-equitable if
parameter:
i.e. the probability that measures less than
. Note that
In order to distinguish between steps 2a and 2b, we show that if a cluster exists, then with high probability there is a large equitable subset, and that vice versa, a large equitable subset implies the existence of a large cluster with high probability. Consider testing a sample of size 2c + m for equitability. Call the order on
induced by the underlying classifier the “true order.” To start, we examine a single such sample
and show that with high probability:
1. If is a cluster, then it is equitable
2. If the middle m elements of with respect to the true order is not a cluster, then
is not equitable.
Lemma 3.5. Consider a set denotes the middle m elements of
with respect to the true order. Then for
satisfies the following properties:
Proof. Proof of (1). For simplicity, let and assume that
is the true order of
. Recall that
is the number of elements that measure as less than
is the probability that measures less than
. We can view v(j) as a random variable given by
Thus v(j) is a Poisson binomial distribution with parameters be the bayes optimal classifier. By assumption, we have for all pairs that
, and that
. Combining these gives
Thus we are in position to apply the upper bound from the GTNC condition (Equation (3)), which gives for all pairs:
Since,
This allows us to upper and lower bound the distribution by the binomial distributions and
. In particular, for all valid
and values j, we have
Where random variables
This means that concentration lower bounds on and upper bounds on
transfer to v(j). Now we apply Chernoff bounds to
Union bounding over all values of j, the probability that there exists a coordinate outside these ranges is bounded by:
Thus our test satisfies the first condition.
Proof of (2). Assume the middle m = [i, j] points of S are not a -cluster. Since our set is or- dered, this implies
, and further that the middle point must be at least
far from either i, or j. Since our argument will be symmetric, assume this to be i without loss of generality. Our strategy will be to bound the random variable
and use an averaging argument to show that there exists a value We can decompose V (c) into
where the first term is always. Because each point left of i is at least
far away from the right half of
, we can bound the second term as for any v,
A Chernoff bound gives
Then an averaging argument shows that
We are not quite done with our cluster detection algorithm, as our goal will be to test for clusters sublinear in the size of our main sample S. Lemma 3.5 is enough to show that if such a cluster exists some subset will measure as equitable, but we need to prove that any equitable subset of S contains a cluster. For large enough c, this is true with high probability.
Corollary 3.6. Let S be a sample of size n, and . For all subsets
of size
satisfying:
the following guarantees hold:
1. If S contains a -cluster of size 2c + m, then at least one
is
-equitable with probability at least
2. For all , the middle m elements of
with respect to the true order, is a
with probability at least
Proof. Both statements follow from applying Lemma 3.5 to subsets
Proof of (1). By assumption, S contains at least one subset of size 2c + m which is a cluster. Applying statement (1) of Lemma 3.5 to
is equitable with probability at least
Proof of (2). We prove statement (2) by the contrapositive: with probability , all subsets
such that C is not a
-cluster are not equitable. This follows from statement (2) of Lemma 3.5 and union bounding over all
possible subsets.
We can now explain step 1 of our algorithm, cluster detection, in a bit more detail.
Step 1: Draw a sample corresponding to the desired cluster sizes for testing. For every subset
of size 2c + m, check whether
is
-equitable. By the contrapositive of Corollary 3.6 (1), if no such
is
-equitable, then no
-cluster exists in S. Similarly, by Corollary 3.6 (2) if
is
-equitable, then it contains a
With step 1 out of the way, we will prove that steps 2a and 2b provide reliable learners with good coverage as long as the cluster assumption from step 1 holds. Since our focus has been on clusters, we will begin by showing how to build the learner for step 2a. Recall that to apply the symmetry argument of [4] for Massart noise, we had to slot a set of extra points. We will adhere to a similar strategy for step 2a in which we slot an extra set of points and find a cluster there rather than in S itself. To find this cluster, our first goal will be to prove that additionally drawn points measure as equitable with if and only if they are in the same cluster as C.
Lemma 3.7. Let S be a -equitable set of size m + 2c satisfying the conditions of Corollary 3.6. Let C be the subset of S which is the
-cluster specified in Corollary 3.6, and let
be a set of independently drawn points. Further, choose
to satisfy:
The following guarantees hold with probability at least
Then GTNC allows us to bound
To show that , we assume the worst case – that all elements of S are smaller than x. Since
is a cluster, we can bound v(x) by the following Binomial:
The probability that is then given by a Chernoff bound as
We can bound the probability that by rehashing the same argument for
number of elements x is less than. Thus the probability that
-equitable is
Union bounding over completes the proof.
Proof of (2). Similar to the proof of statement (2) of Corollary 3.6, we prove the contrapositive: that all which are not
-clusters are not
-equitable with high probability. Assume
-cluster. Since
Since we have assumed , GTNC gives
Since is not a cluster, it must either be the case that
, or
. Assume the latter without loss of generality. We can bound v(x) by a Binomial:
Knowing that additionally drawn points which measure as equitable with S come from a cluster, we can feed them into an inference LP based on this assumption. However, to infer remaining points in the instance space, the LP must also know the label of the cluster we feed in. Since we are assuming our points have some margin , we can solve for the label of the cluster with high probability by majority vote.
Lemma 3.8 (Cluster Labeling). Assume that a set , consists entirely of one label and has margin
with respect to the decision boundary. The probability that this true label differs from the majority label measured by the oracle
is at most
Proof. This follows from applying a Chernoff bound to the fact that each point has at least a probability of being correct.
Finally, we need to show that the LP based upon the structure and label of clustered points has good coverage. We do this by an argument inspired by inference dimension: that given a -cluster C of large enough size, there exists a point in
such that the knowledge that
is a cluster is sufficient to infer the label of x. This will allow us to use the symmetry argument of [4] to show that step 2a has good coverage.
Figure 1: The above image illustrates the construction of sets which sandwich the cluster C.
Lemma 3.9. Let X be a set, and the set of hyperplanes with margin
with respect to X. Consider a query set Q containing a cluster query along with the standard label queries. Given a subset S, a cluster-query returns 1 if
-cluster, and 0 otherwise. Then for any
of size at least 24d log(d+1):
infers a point y if there is a solution to the following system of linear equations:
Informally, because -cluster and all points have margin
, it infers the labels not just of points in its convex hull, but in a d times expansion of the hull. We will show that a large enough cluster C must contain some point y s.t. C \ {y} infers y. Our strategy relies on the fact that if C does not infer y, adding y to C expands the volume of its convex hull by a multiplicative factor. Since we can upper bound the volume of the convex hull of C by the volume of the largest simplex times the size of a decomposition of C into simplices (a triangulation), this multiplicative volume expansion contradicts the upper bound for large enough C.
In order to prove that adding a point multiplicatively expands the volume of the convex hull, we will need to prove the existence of a certain affine linear function. In particular, if C and y are such that this system of equations has no solution, then there exists an affine function L such that:
ear combination of the inequalities and real linear combination of the equalities that sum to the contradiction by LP-duality. Since
and y do not appear in this contradiction, the linear combinations of Equation (8) and (9) must cancel. To see this explicitly, let the linear combination of 8 be denoted T, then the equality becomes:
Note that Equation (9) in a truly linear form is a set of equations
for
. The positive real linear combination of these terms is then of the form
for some . Since these two sums must cancel, we get that the
are in fact
. Summing the two equations then gives:
Now define , which remains an affine linear function. This sign only affects the left-hand term, and thus we get:
Noting that is at most
proves the claim.
Using the function L we can show how C expands in volume when adding an un-inferred point:
Proof of (11): Our strategy will be to sandwich the convex hull of C in the difference of two cones defined by L with apex y. For arbitrary points , let h(x, y) be the line passing through x and
be the plane given by
and
be the plane given by
. The cone which does not contain C is then defined by its apex
Likewise, we define the cone that contains both Cone() and ConvHull(C) as the cone with apex y and base
We refer to these cones respectively as Cone() and Cone(
). Note that
is similar to
and that Equation (10) bounds the ratio in volume of these cones:
Further, since C is sandwiched between the two cones we have and can bound the ratio in volume between the Convex Hull of C and the smaller cone:
Finally, because the Convex Hull of contains both Cone(
) and ConvHull(C), this allows us to
lower bound the expansion factor of including y into C:
Using Equation (11), we can build our contradiction on the volume of the convex hull for large enough C. For analysis, we denote the size of C by n. To start, we note a simple upper bound on the volume of the convex hull of any
where is the volume of the largest simplex with vertices in C. This follows from choosing any vertex
and noting that choosing every simplex which contains x is a triangulation of ConvHull(C). While this triangulation is certainly not optimal, it is sufficient for our purposes.
Since there exists some h s.t. no point in C can be inferred from the rest, every point added to C after the largest simplex multiplies the volume by . This gives a lower bound on the volume of ConvHull(C) of:
Together, these bounds give the equation:
Setting n > 24d log(d + 1) gives a contradiction.
With Lemmas 3.7, 3.8, and 3.9 in hand, we can now give a more detailed explanation of step 2a:
Step 2a (high noise): It is assumed by step 1 that we have detected an -equitable subset
. Draw an additional set of points
, and for each point test whether
-equitable. By Lemma 3.7, the points which measure as equitable with
make up a cluster. Using Lemma 3.8 to label these points, build an LP based on the labels and cluster structure. Applying Lemma 3.9 and the symmetry argument of [4] shows that this LP has good coverage.
It is left to show that step 2b has good coverage. Step 2b will follow a similar strategy to the Massart case, using points well-separated in an MLE ordering to build our LP. However, since we are still in the regime of unbounded error, we will need to exploit the fact that our sample has no large clusters to show that this LP infers correctly with high probability. Notice that a sample with no clusters consists mostly of pairs of points whose comparisons are bounded in error. With this in mind, we modify the pointwise movement bounds of [19] to differentiate between pairs of points with bounded and unbounded comparison error.
Definition 3.10. Let S be a set with a noisy comparison oracle . We call a comparison between points
-far if the probability that
returns the correct comparison is at least
. Otherwise we call the comparison
To prove a point-wise movement bound, we will follow exactly the strategy of [19]. First, we prove that it is unlikely that an ordering which disagrees on many far comparisons from the true order is an MLE ordering. Second, we use this to upper bound the total number of wrong far comparisons in any MLE order with high probability. Finally, we prove that as long as no large cluster exists, a single point cannot move too far without contradicting the upper bound on total far errors.
be a permutation which differs from the true order on
-close comparisons, and
-far comparisons. The probability that
is an MLE order is
Proof. To be an MLE order, must beat the true order on half or more of the comparisons on which they differ. We can bound this probability by the Poisson Binomial:
A Chernoff bound then gives the desired result.
Using this upper bound, we show that any order which disagrees with the true ordering on more than comparisons is not an MLE ordering with high probability.
Lemma 3.12 (Total Far Movement). The probability that an MLE order disagrees with the identity on -far comparisons, where
is
Proof. For a given permutation . By Lemma 3.11, the probability that
order is at most:
Finally, we show a bound on point-wise movement by proving that any point which moves more than from its true position creates
total far errors.
Lemma 3.13 (Point-wise Far Movement). Given a sample S of size n and , assume that the sample does not have a
-cluster of size
. Then with probability at least
no point moves by further than
in an MLE order, where
Proof. Assume without loss of generality that the true order on S is the identity 1, . . . , n. Denote by the event that
in an MLE order,
, and at most l elements from outside the range
. Note that if more than l of such elements map into [i, j] then the order must differ on at least
-far comparisons from the identity. This follows from the fact that each such element must shift m + l places towards [i, j], but has at maximum
-close comparisons in that direction, and that each comparison is counted at most twice.
For i to be in slot j in an MLE order, it must beat the identity on more than half of elements in between. Since we have assumed all but l of the elements between i and j in the order are from then this range must contain at least
incorrect comparisons with i. This further implies that at least
comparisons with i must be incorrect in the range [i, j + l + m]. By our assumption on cluster size, all but m of these comparisons are
-far, so we can bound the probability of
by the Poisson Binomial:
Combining our assumptions on with a Chernoff bound then gives:
Union bounding over pairs i, j then gives that if any point moves by more than in an MLE ordering, the total number of wrong
-far comparisons are more than
with probability
. By Lemma 3.12, the probability that this occurs is
, giving the desired result.
With a point-wise movement bound in hand, step 2b essentially follows the same strategy as Lemma 2.7 with a different set of parameters.
Step 2b (low noise): Draw an additional sample of m points, and use the labels and comparisons of all pairs of points separated by to build an inference LP. This LP correctly infers points with high probability by Lemma 3.13, and has large coverage due to the space’s finite inference dimension.
All that remains is step 3, which repeats steps 1 and 2 until reaching the desired coverage. Tying all of these together, we present the proof of Theorem 3.2: learning margin , finite inference dimension non-homogeneous linear separators with GTNC noise.
Proof. (Proof of Theorem 3.2)
Let S be the subsample described in step 1 of size n, and c and m the parameters defining the size of subsets we check for -equitability. Further, in the case that some subset tests as equitable, let
be the additionally drawn points. To begin, we set
such that if we measure an equitable subset
, points
s.t.
-equitable make up a
-cluster (see Lemma 3.7):
Note that this also satisfies the requirement on from Lemma 3.7. To satisfy Lemmas 3.6, 3.7, and 3.8, we set
Note that is a simplified (and somewhat larger) version of the parameter from Lemma 3.12 where
has been set to
. We must further set parameters
to satisfy Lemma 3.13:
Finally, we must select the sample size n itself. To employ the same slotting strategy as Theorem 2.3, we need blocks of size
. This gives the requirement on n:
To satisfy this condition, it is enough let n be:
where the additional factor in d ensures that satisfy the constraints of Lemmas 3.7 and 3.13.
We will now structure our analysis as in the 3 step informal explanation.
Step 1: Draw the sample , where in later iterations D is restricted to un-inferred points by rejection sampling. Check
-equitable subsets of size 2c + m.
Step 2a (high noise): Assume that at least one subset, , is equitable with true cluster C. Draw an additional set
and test for each
-equitable. With probability
can identify by Lemma 3.7 and correctly label by Lemma 3.8 at least
which are in a cluster. We build our learner based off of this cluster. Recall that the expected coverage of the learner is given by the probability that an additional point is inferred. To compute this, we first note that the probability an additional point lands inside the cluster is at least
. Assuming this occurs, Lemma 3.9 and the symmetry argument of [4] give the point a 3/4’s probability of being inferred. Together with our high probability assumptions, this gives an expected coverage of
for small enough
the probability that the coverage of our weak learner is
is at least
by the Markov inequality.
Step 2b (low noise): Assume instead that no subset was -equitable. By statement 1 of Corollary 3.6, this implies that no
-cluster of size 2c + m exists in S. Sort S into an MLE order. By Lemma 3.13, no point in S has moved by further than
from its true position with probability at least
the appropriate size to apply the argument from Lemma 2.6, so slotting O(k) extra points gives constant coverage with constant probability.
Step 3: Steps 1 and 2 build a weak learner which we must string together to get coverage Theorem 2.3. Our worst case per-step coverage is
with probability
. After repeating the learner t times, the coverage becomes:
Denoting the reliability and usefullness parameters again as and
, setting
is then sufficient to give this coverage with probability at least
Restricting to the distribution of un-inferred points via rejection sampling, repeating the above
times will have coverage with probability
, and correctness
Thus setting
of our weak learner to:
gives the desired coverage and error by union bounding over the number of applications.
Query Complexity: Now we compute the Query complexity of our algorithm. Because we check equitability for every subset, at each iteration our algorithm must make comparisons. This is dominated by the slotting complexity, which we upper bound as
for simplicity. The worst-case number of iterations for our algorithm is
giving a total query complexity of:
For sample complexity, we follow the same argument of Theorem 2.3, ending our algorithm if we reject too many samples in a row. Letting , the sample complexity is then:
Our time complexity, however, diverges from the Massart case due to our need to test all subsets for equitability. In particular, we check allsubsets, which is exponential in inference dimension and noise parameters, and quasi-polynomial in the error parameter
. Further, with unbounded error we cannot employ the sorting algorithm from [19], making sorting an exponentially expensive step as well.
As a direct corollary, we show that this gives us a query efficientalgorithm for the special case of TNC.
Corollary 3.14. Let the hypothesis class have inference dimension k. Then
is ARPUlearnable under model (TNC
) with query complexity:
As an example of an explicit concept class, consider the query complexity of half-spaces with fixed minimal-ratio (the ratio between the closest and furthest points from the decision boundary), a case studied in [4].
Example 3.15. Let be an instance space, and
the class of hyperplanes with margin
and minimal ratio
with respect to
is ARPU-learnable under model
with query complexity:
3.2 GTNC with Weak Distributional Conditions
Our algorithm for learning with GTNC noise introduced an additional restrictive condition on the set system: margin . We will show that this assumption and the assumption of finite inference dimension may be replaced with weak concentration and anti-concentration conditions on the distribution. In this case, however, it is difficult to show a gap between label only and comparison ARPU-learning for two reasons. The first is that learning in this regime in simply harder–it is the first case we show where comparisons do not provide an exponential improvement in the active PAC setting over its passive counterpart. The second is that in the membership query setting, label queries in the TNC model can give comparison like information, making it difficult to apply our lower bounding techniques. We will begin by proving this first statement by showing a lower bound polynomial in
for active PAC learning with labels and comparisons.
Lemma 3.16. Let , and
. The query complexity of actively PAC-learning
under model
is at least
Proof. The adversary begins by choosing the distribution over to be uniform over the square
We will use (a, b) to denote points in
. Consider two parallel hyperplanes
defined as:
We denote the region between the two hyperplanes by , and twice the region as
By Yao’s minimax principle it is enough to show that the adversary may pick a distribution over hyperplanes such that no learner can learn the labels with error with probability
. In particular, the adversary considers a uniform distribution over hyperplanes h and
. Note that any algorithm which correctly labels more than half of the points between h and
(i.e. at least
mass of S) can be seen as identifying the hyperplane h or
. We now show how to lower bound the number of label or comparison queries needed to identify the target hyperplane
Given a set of n query responses from the learner, we argue that the learner cannot succeed with probability greater than:
since it can do no better than simply picking the more likely hyperplane given the set of queries. Taking the maximum over all possible sets of query responses then gives a lower bound on the number of samples. In other words, to show that the learner must make at least n queries, it suffices to show that this maximum is less than 7/8:
Using Bayes theorem, we can rewrite these probabilities as:
Note in this case that query response which rolls together both the point or pair of points being queried and the value which the oracle returns, is dependent on
due to being in an active setting–the chosen point or pair is dependent on the previous responses
. We can now rewrite Equation (13) as:
To analyze this, note that each term in the product is simply the ratio of probabilities that a label query on some point x or comparison on pair of points (where x, y are determined by
) will return
. Then we can bound this product from above and below by looking at the maximum and minimum such ratio across all points and pairs in S. Recall that these probabilities are chosen by the adversary from a range defined by the GTNC parameters. For simplicity, when the ranges on a query for
overlap, we let the adversary choose the same probability, but otherwise always choose the lower bound
or comparison for h, as this will always have the larger ratio. For a point , the ratio for the correct label
is given by:
For comparisons, we only have to consider pairs , since the adversary will otherwise pick a ratio of 1. In this case, the maximum is given by the correct comparison with ratio:
Thus we can bound the product of the ratios from above by:
To bound the ratio from below, we look at the probability for the incorrect label or comparison. For labels, this is:
Likewise, the minimum ratio for comparisons is:
Thus we can also bound the product of the ratios from below as:
Recalling that due to the initial values of
satisfies this and in turn Equation (13), completing the proof.
Note that for notational simplicity the adversary has chosen a non-isotropic distribution, but the bound is easily modified to hold for a distribution in . Specifying to the Tsybakov Low Noise condition gives the following lower bound.
Corollary 3.17 (Restatement of Lemma 1.12). The query complexity of actively PAC-learning under model
is at least
Proof. Observe that for . By the mean value theorem,
Specifying to the TNC model from GTNC, we have , and thus that
for
, and
. Plugging this into Lemma 3.16 then gives the desired bound.
Note that this bound is tight with respect to for
, and not far off for
, as Hanneke and Yang [12] provide a label only active PAC-learning algorithm with
queries and
queries respectively. However, while comparison queries alone may not enough to exponentially improve the query complexity over passive PAC-learning (which is also polynomial in
]), we will show that they are sufficient for ARPU-learning.
Theorem 3.18 (Restatement of Theorem 1.10). The hypothesis class is ARPU-learnable under model
with query complexity:
for small enough
The margin condition is necessary for Lemmas 3.8 and 3.9–we cannot reliably label points or infer from clusters lying close to the decision boundary. If we were only interested in keeping our guarantee on coverage, it would be enough to set a fake margin such that anti-concentration gives that the set of points with such a margin has
probability mass. However, we also require that our algorithm is reliable, and thus with high probability cannot err on points close to the decision boundary. This suggests the following strategy: if a cluster is found in step 1, before using it for inference, test whether it is too close to the decision boundary. Because the error on our labels is proportional to their distance from the decision boundary, we can build a test similar to Lemma 3.5 to detect this by measuring the relative sizes of the subsets with different labels.
Lemma 3.19 (Margin Detection). Let C be a -cluster with respect to the hyperplane f of size at least
Further, let denote the difference in size between the sets
0}. With probability at least
Proof. Assume without loss of generality that the true label of the majority of points in C is 1.
Proof of (1): If there exists a point with
, then the entire entire cluster lies within margin
. By assumption
, so the probability that a point measures as 1 is at most
using Equation (1). The probability that more than
points label as 1 is then given by a Chernoff bound:
Since we have assumed the majority label is 1, the probability that more than label as 0 is upper bounded by this as well.
Proof of (2): Assume we have
. Since
by assumption, the probability that any point measures as 1 is at least
using Equation (1). The probability that less than
points label as 1 is then given by a Chernoff bound:
The idea is now to follow the structure of Theorems 3.2 and 2.8 with the one exception that we will check the closeness of every cluster to the decision boundary by checking whether a cluster measures as too close, we will avoid labeling the points, preserving the reliability of the algorithm.
Proof. (Proof of Theorem 3.18)
To ensure that our coverage is wide enough, we will need to set the margin parameter such that for any hyperplane, the probability mass of points within margin is at most
. By our anti-concentration bound, it is enough to let
Our goal is to learn the rest of the space up to error via Theorem 3.2, assuming for the moment that the modification from Lemma 3.19 will cause at most an overall loss of
coverage. Noting that our space has good average inference dimension, i.e.
], we will achieve this by applying Lemma 2.9. Thus we need to prove that Thereom 3.2 can be used to learn a
fraction of random samples S with probability at least
while querying only
To begin, we must set
and set the size of S such that the algorithm in Theorem 3.2 only queries an fraction of points. Letting N be the total number of points queried as given in Theorem 3.2, it is then sufficient for |S| to satisfy:
Note that due to the distributional conditions, the inference dimension k of our sample is O(d log(d) log(|S|)) with probability at least [6]. Applying the same argument from Theorem 2.8 then gives that the learner of Theorem 3.2 satisfies the conditions of Lemma 2.9. Thus to have coverage
with probability
and reliability
, it is sufficient to set our
to
and run the algorithm
from Theorem 3.2
We have ignored, up until now, the modification to Theorem 3.2 in the cluster step. If a subset measures as equitable, after slotting our extra points to obtain the , we use Lemma 3.19 to test the margin of C. If the cluster has margin at least
, the test passes with high probability. Likewise, if the cluster has margin less than
, the test fails with high probability. If the test fails, we skip the iteration of the weak learner.
How does this modification affect our reliability and coverage? A point can only be mislabeled if the test passes on a cluster with margin less than . Over all iterations of the learner, the probability of this occurring is less than
, so our reliability guarantee is maintained up to a constant. To analyze coverage, note that Lemma 3.9 only infers points within the d-convex hull of the cluster C. Thus, if C is
-cluster which does not have margin
, it infers points within at most a
margin. Since we set
such that this region has at most
probability mass, we lose at most
coverage for skipping clusters with margin less than
. We are left then with the loss in coverage caused by our test failing on a cluster with margin at least
. Since this only occurs with probability
by Lemma 3.19, for small enough
this only changes the constant on the coverage probability of our weak learner, and thus has no asymptotic affect.
The total query complexity is then given by the complexity for running Theorem 3.2 times with the appropriate parameters:
Since s-concave distributions satisfy the requisite distributional properties [3],
Corollary 3.20. The hypothesis class is ARPU-learnable under model
with query complexity:
When compared with the query complexity of the label only PAC-learning algorithm of [12], Corollary 3.20 only shows improvement for a small range of parameters . However, it is not clear to the authors that Hanneke and Yang’s algorithm can be extended to an ARPU learner without substantially increasing the query complexity with respect to dimension.
[1] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pages 337–344, 2005.
[2] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.
[3] Maria-Florina F Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.
[4] Daniel M Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355–366. IEEE, 2017.
[5] Daniel Kane, Shachar Lovett, and Shay Moran. Generalized comparison trees for point-location problems. In International Colloquium on Automata, Languages and Programming, 2018.
[6] Max Hopkins, Daniel M Kane, and Shachar Lovett. The power of comparisons for actively learning linear classifiers. arXiv preprint arXiv:1907.03816, 2019.
[7] Rui M Castro and Robert D Nowak. Upper and lower error bounds for active learning.
[8] Pascal Massart, Élodie Nédélec, et al. Risk bounds for statistical learning. The Annals of Statistics, 34 (5):2326–2366, 2006.
[9] Enno Mammen, Alexandre B Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
[10] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.
[11] Steve Hanneke et al. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
[12] Steve Hanneke and Liu Yang. Minimax analysis of active learning. The Journal of Machine Learning Research, 16(1):3487–3602, 2015.
[13] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds under tsybakov noise condition. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[14] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167–190, 2015.
[15] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431–2440, 2017.
[16] Ronald L Rivest and Robert H Sloan. Learning complicated concepts reliably and usefully. In AAAI, pages 635–640, 1988.
[17] Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.
[18] Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.
[19] Mark Braverman and Elchanan Mossel. Sorting from noisy information. arXiv preprint arXiv:0910.1191, 2009.
[20] Leslie G Valiant. A theory of the learnable. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 436–445. ACM, 1984.
[21] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.
[22] Benjamin Satzger, Markus Endres, and Werner Kiessling. A preference-based recommender system. In Proceedings of the 7th International Conference on E-Commerce and Web Technologies, EC-Web’06, pages 31–40, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-37743-3, 978-3-540-37743-6.
[23] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[24] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under tsybakov noise condition. In International Conference on Machine Learning, pages 365–373, 2013.
[25] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
[26] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pages 152–192, 2016.
[27] Tomáš Gavenčiak, Barbara Geissmann, and Johannes Lengler. Sorting by swaps with noisy comparisons. Algorithmica, 81(2):796–827, 2019.
[28] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimal sorting with persistent comparison errors. arXiv preprint arXiv:1804.07575, 2018.
[29] Rolf Klein, Rainer Penninger, Christian Sohler, and David P Woodruff. Tolerant algorithms. In European Symposium on Algorithms, pages 736–747. Springer, 2011.