b

DiscoverSearch
About
My stuff
Noise-tolerant, Reliable Active Classification with Comparison Queries
2020·arXiv
Abstract
Abstract

With the explosion of massive, widely available unlabeled data in the past years, finding label and time efficient, robust learning algorithms has become ever more important in theory and in practice. We study the paradigm of active learning, in which algorithms with access to large pools of data may adaptively choose what samples to label in the hope of exponentially increasing efficiency. By introducing comparisons, an additional type of query comparing two points, we provide the first time and query efficient algorithms for learning non-homogeneous linear separators robust to bounded (Massart) noise. We further provide algorithms for a generalization of the popular Tsybakov low noise condition, and show how comparisons provide a strong reliability guarantee that is often impractical or impossible with only labels - returning a classifier that makes no errors with high probability.

Due to the now ubiquitous presence of massive unlabeled datasets, recent years have seen an explosion in the search for computationally efficient, noise tolerant learning strategies that minimize the required amount of labeled data to learn a classifier. Active learning is a formalization of the PAC-learning paradigm for unlabeled data. In active learning, the learning algorithm has access both to either a stream or pool of unlabeled data, and an oracle which can label the data on request. The complexity of learning certain classes is then defined by their query complexity, the number of oracle calls required to almost learn the classifier with high probability. The goal in active learning is to adaptively choose data to send to the oracle in such a way that one uses much fewer queries than in the labeled case.

While active learning saw initial success in the noise-free regime with simple concept classes such as thresholds in one dimension, lower bounds [1] soon showed that important classes such as linear separators gave no improvement over PAC-learning, even in only two dimensions. However, subsequent work showed that slight tweaks to the model could overcome this barrier. Balcan and Long [2] showed that by assuming that the data was drawn from a log-concave distribution – a wide set of distributions including Gaussian distributions and uniform distributions over convex sets, learning homogeneous (through the origin) linear separators could be done in exponentially fewer queries than in the PAC model. Later, Balcan and Zhang [3] extended this to the more general class of s-concave distributions, a generalization of log-concavity that includes fat-tailed distributions as well. Rather than restricting the power of the adversary, Kane, Lovett, Moran, and Zhang [4] studied the effect on query complexity of empowering the learner. By allowing the learner to ask more complicated questions of the oracle, such as comparing two points, Kane et al. [4] showed that non-homogeneous linear separators in two-dimensions can be learned in exponentially fewer labeled samples than the PAC case. Later, Kane, Lovett, and Moran [5] extended this to higher dimensions using a complicated set of queries, and Hopkins, Kane, and Lovett [6] did the same by assuming weak concentration and anti-concentration on the distribution – conditions once again satisfied by s-concave distributions.

While query efficient algorithms in high dimensions are an important step towards the use of active learning on real world data, it is equally important that algorithms be computationally efficient and noise tolerant. In an early work, Castro and Nowak [7] provided query efficient algorithms for thresholding in one dimension in the presence of bounded (Massart [8]) and unbounded (Tsybakov [9]) noise under the uniform distribution on [0, 1]. Soon after, Balcan, Broder, and Zhang [10] extended these results to d-dimensional homogeneous hyperplanes over a uniform distribution on a ball. Years later, Hanneke [11] offered a more general analysis for Tsybakov noise based off of the distributional complexity measure the disagreement coefficient, and later with Yang provided a distribution-free analysis [12]. In another vein of work, Balcan and Long [2] provided an algorithm for learning d-dimensional homogeneous hyperplanes over nearly isotropic log-concave distributions with optimal query complexity for Tsybakov noise [13], a result which was later extended by Awasthi, Balcan, Haghtalab and Urner [14] to be computationally efficient for Massart noise when the distribution is restricted to uniform over the unit ball. Similarly, Balcan and Zhang [3] gave a computationally efficient algorithm for learning the more difficult adversarial noise model over s-concave distributions. Concurrently, Xu et al. [15] proposed using comparison queries as a sub-routine in previous algorithms to deal with noise in a computationally efficient manner, improving the overall query complexity along the way.

The comparison based methods of Xu et al. [15], however, do not carry over to the algorithmic technique proposed by Kane et al. [4] for learning non-homogeneous linear separators. Kane et al.’s technique is based upon logical inference. Viewing concept classes as the sign of an underlying family of functions, they build a learner via a linear program with constraints given by query solutions. As a result, the learners created by Kane et al.’s method actually fall into a stronger model than PAC-learning called Reliably and Probably Useful (RPU)-learning [16], variants of which have been studied more recently under a variety of names (e.g. KWIK learning [17], perfect selective classification [18], or confident learning [4]). In this model, the learner is not allowed to err, but may instead output “I don’t know” a small fraction of the time. While Kane et al.’s RPU-learner is computationally efficient, it is not tolerant to noise – the linear program is sensitive to errors in both labels and comparisons. This raises a natural question: can the inference based algorithms of Kane et al. be extended to noisy scenarios, and if so, does a strong reliability guarantee remain? In this work we answer these questions in the positive for Massart and Tsybakov noise. In both cases our algorithms satisfy a noisy version of RPU-learning: with high probability the learner makes no errors at all. Due to their similarity to RPU-learners, we call learners that satisfy this property Almost Reliable and Probably Useful (ARPU). Indeed, taking the limit of our reliability condition returns exactly the RPU model. Our work provides the first query and computationally efficient algorithm for PAC or ARPU-learning non-homogeneous linear separators in the presence of Massart noise over s-concave distributions, as well as more generally for hypothesis classes with finite inference dimension or small average inference dimension. In addition, we provide the first algorithm for ARPU-learning non-homogeneous linear separators under the Tsybakov Low Noise Condition.

Similar to how Xu et al. [15] use comparisons as a subroutine for correcting label errors, we use an approximate sorting scheme (modified from a seminal work from Braverman and Mossel [19] on sorting with noisy comparisons) to create a small set of points whose labels and comparisons are correct with high probability. We then feed this cleaned set into an inference LP, and repeat the process in a boosting style algorithm based off of the framework of [4]. By carefully curating the cleaned set at each step, we are able to use a symmetry argument from [4] to prove that our learners have good coverage, while the guarantees of [19] and the inference framework give reliability.

Our algorithms require the use of comparison queries, an addition which we show is necessary in many cases for active PAC and ARPU-learning. Along with recalling lower bounds from [6] which show comparisons are necessary for efficiently active learning non-homogeneous hyperplanes, we show that in the noiseless case it is impossible to ARPU-learn the uniform distribution over  S1in a finite number of label queries. Further, even with the addition of a margin assumption we show the existence of simple distributions which require a number of label queries that is exponential in dimension. Because Massart noise and certain instantiations of Tsybakov noise subsume the noiseless case, these results prove the existence of a large gap between labels and comparisons for noisy ARPU-learning.

Our paper proceeds as follows. In Sections 1.1, 1.2, and 1.3 we cover preliminaries, our main results, and our main techniques respectively. In Section 2 we present query and computationally efficient algorithms for ARPU-learning hypothesis classes with finite inference dimension or super exponential average inference dimension under the Massart noise model, as well as a lower bound for ARPU-learning  S1 using only labels. In Section 3 we present algorithms for ARPU-learning linear separators with margin and finite inference dimension or over distributions with weak distributional conditions under the Tsybakov Low Noise Condition, as well as a lower bound for ARPU-learning a corresponding distribution with margin using only labels

1.1 Preliminaries

1.1.1 Basic definitions

A hypothesis class is a pair (X, H), where X is a set, and H is a class of functions  h: X → R. Each function h ∈ His called a hypothesis. We refer to  CH = {sign(h): h ∈ H}as the associated concept class. For example, when H is the class of  Rd → Raffine functions, then the associated concept class  CHis the class of d-dimensional half-spaces.

We consider the binary classification problem, where we want to predict the binary label y for each instance x. We assume access to an underlying unknown distribution  DXover X and a label oracle  QL. Querying  QLwith unlabeled  x ∈ Xgenerates a label  QL(x), drawn from unknown distribution  P(QL(x)|x).Note that querying  QLon the same point again would generate the same answer. We use notation  DLto denote the joint distribution over examples x and labels from  QL:

image

1.1.2 PAC-Learning

Probably Approximately Correct (PAC) learning is a probabilistic framework due to Valiant [20] and Vapnik and Chervonenkis [21] for learning adversarially chosen classifiers and input distributions. In this model, given a set X and a set H of hypotheses  h : X → R, an adversary first chooses distribution  DL over X × Ywith the marginal distribution  DX over X. If Y = sign(h⋆(X)) for some h⋆ ∈ H, we call this realizable case learning. With no knowledge of the choice of distribution the learner draws labeled samples from  DLwith the goal of outputting c = sign(h) for some hypothesis  h ∈ Hwhich minimizes loss over  DL:

image

In the realizable case, a hypothesis class (X, H) is called PAC-learnable if  ∀ε, δ, there exists a learner A, where no matter the choice of the adversary, outputs a concept A(S) such that:

image

Here  n = n(ε, δ)is called the sample complexity, and must be  poly( 1ε, 1δ ) for (X, H)to be PAC-learnable.

1.1.3 RPU-Learning

Reliable and Probably Useful (RPU) learning is an alternative learning framework in which the learner is not allowed to make errors, but may instead respond “I don’t know”, notated by “⊥”. Introduced by Rivest and Sloan [16], RPU learning was later studied under the name of Perfect Selective Classification by El-Yaniv and Weiner [18], and confident learning by Kane, Lovett, Moran, and Zhang [4]. Since it is easy to make a reliable learner by simply always outputting “⊥”, our learner must be useful, and with high probability cannot output “⊥” more than a small fraction of the time. Let A be a reliable learner and A(S) be the concept returned by the learner A on training sample S, then we define the loss of A(S) as the measure of unlearned samples:

image

We will commonly refer to  1 − LDL(A(S)) as the coverage of A(S). Sample complexity and learnability are then defined analogously to PAC-learning. Note that any point which is not labeled “⊥” by an RPU-learner is labeled correctly.

1.1.4 Comparison Queries

Following the framework of [4], our learner will have access to more information than just the label of a point. We focus on one particularly natural additional query, the ability to compare points. A comparison query measures the relative distance of two points to the decision boundary. In other words, say that our goal is to identify photographs of diseased vs healthy patients. A comparison query asks: “which patient looks healthier?”. Formally, given an underlying function  h ∈ Hand two points  x1, x2, a comparison query asks which one of  h(x1), h(x2)is bigger. Equivalently:

image

Similar to our label oracle  QL, we define a comparison oracle  QC. Querying QCwith two points  x1, x2 ∈ Xgenerates a comparison result  QC(x1, x2), which is drawn from an unknown distribution  P(QC(x1, x2)|x1, x2).Along with their added theoretical power [4, 6], comparison queries are already used in practice in recommender systems [22] and ranking systems [19], and in some scenarios have better accuracy than label queries [15].

1.1.5 Inference Dimension

Inference dimension is a combinatorial complexity measure introduced by Kane et al. [4] to characterize the query complexity in active learning when the learner is allowed to ask a more complicated set of questions. Given a set of binary queries Q, let Q(S) denote the answers to all such queries on the sample  S. Let S ⊆ Xbe an unlabeled sample. For  x ∈ X and h ∈ H, let

image

denote the statement that answers to binary queries from Q on the sample S determine the label of x, when the learned concept is sign(h(x)), corresponding to an hypothesis h. We will often say for shorthand that S “infers” x, and sometimes drop the underlying classifier h. In this case the underlying function is assumed to be the Bayes optimal classifier. Inference dimension with respect to some query set Q is defined as follows.

Definition 1.1 (Inference dimension). The inference dimension of (X, H) is the minimal number k such that for every  S ⊆ X of size k, and every  h ∈ Hthere exists  x ∈ S such that

image

If no such k exists then the inference dimension of (X, H) is defined as  ∞.

Inference dimension is a worst case measure. Since we will be dealing with varying levels of distribution dependence, we will also take advantage of an average case version of inference dimension introduced in [6].

Definition 1.2 (Average Inference Dimension). We say  (DX, X, H)has average inference dimension g(n), if:

image

Average inference dimension is used to prove that the inference dimension of a finite sample drawn from DXcannot be too large with high probability. This allows us to build query efficient algorithms for hypothesis class with infinite inference dimension by proving that large finite samples do not take too many queries to learn with high probability.

1.1.6 Noisy Learning

Before we discuss our relaxation of RPU-learning, we formalize the presence of noise in our distributions. Given a hypothesis class (X, H), we assume the Bayes optimal classifier is some hypothesis  h⋆ ∈ Hwith decision boundary  h⋆(x) = 0. Note that  h⋆ itself can have non-zero error. To measure the noise in our model we define the conditional probability distributions  βL and βC:

image

Note that for all the noise models discussed below, querying  QLon the same point again (and similarly querying  QCwith the same pair of points again) would generate the same answer. This is a realistic model for the case where the oracle is a human expert who may err with some probability across different inputs, but will always return the same answer on the same input.

Massart Noise Massart, or bounded noise, is a well studied model of noise throughout statistics and learning theory [2, 8, 15]. Massart noise is a tractable and realistic generalization of the standard random classification noise model [23], where the oracle flips its response with probability p < 1/2. Similar to [14, 15], we say “noisy” oracles  QLand  QCsatisfy Massart noise with parameter  λ > 0if the conditional label and comparison distributions are such that

image

Equivalently, we say that  QL (resp. QC) satisfies Massart noise with parameter  λ, if an adversary constructs QL(resp.  QC) by first taking the “clean” oracle ¯QL(resp. ¯QC) and then flipping the result of the oracle with probability at most 12 − λ.

Tsybakov Low Noise Condition Massart error is restrictive in that the distributions  βLand  βCare bounded away from 12– in reality, this may not be the case as examples approach the decision boundary. The Tsybakov Low Noise Condition (TNC) [9] offers an alternative: the closer an example is to the decision boundary, the closer its error to 1/2. There is a natural extension of this intuition to comparison queries as well: comparisons made between arbitrarily close points should be arbitrarily noisy. A number of variants of TNC have been studied in the literature. Here we will follow the variant studied in [7, 24]. Let  h⋆be the Bayes optimal classifier. We say  QLsatisfies the Tsybakov Low Noise Condition with parameters, m < M,

image

In other words, far away from the decision boundary  βL(x)satisfies Massart noise, but approaches 1/2 at a polynomial rate as x approaches the decision boundary. Similarly,  QCsatisfies the Tsybakov Low Noise Condition with parameters,  m < M, ε0 > 0, and κ ≥ 1 (TNC(m, M, κ, ε0)) if ∀x1, x2:

image

Similar to the label case,  βC(x1, x2)satisfies Massart noise for pairs of points  x1, x2which differ greatly with respect to  h∗ and approaches 1/2 at a polynomial rate as  h∗(x1) − h∗(x2)approaches 0.

Generalized Tsybakov Low Noise Condition The Tsybakov Low Noise Condition upper and lower bounds correctness by a particular function of distance. We will consider the direct generalization of this model where these bounds are replaced with arbitrary monotone increasing functions  gL ≤ gU : [0, ε0] → [0, 1/2]. We say  QLsatisfies the Generalized Tsybakov Low Noise Condition with parameters  (gL, gU, ε0) if ∀x:

image

Similarly, we say  QCsatisfies the Generalized Tsybakov Low Noise Condition with parameters  (gL, gU, ε0) if

image

For notational convenience, we will sometimes write  gL(x) = gL(ε0)for  x > ε0. In addition, since we will often need to compose  gL and g−1U , we will use the simplified notation:

image

where c is some constant.

1.1.7 ARPU-Learning

RPU learning suffers from an inability to deal with noise. We introduce the learning framework Almost Reliable and Probably Useful Learning (ARPU-Learning), a relaxation of RPU-learning that allows for noise, but keeps stronger reliability guarantees than PAC-learning. Recall that given a distribution  DL over X × Y ,for a reliable learner A and sample S, we define the loss of A(S) as the measure of unlearned samples:

image

We will commonly refer to  1 − LDL(A(S))as the coverage of A(S). A model is a pair  (Q, DX)where Q is a set of oracles  (QL, QC)and  DXis a set of distributions over X. In ARPU-Learning, given a hypothesis class (X, H) and a model  (Q, DX), an adversary chooses a distribution  DX from DXand the “noisy” oracles (QL, QC) from Q, which induces a distribution ˜DL over X × Y given by:

image

Definition 1.3 (ARPU-Learnable). We say that a hypothesis class (X, H) is ARPU-learnable under model (Q, DX) if ∀δr, δu, ε > 0, there exists a learner A which is

image

Note that in both Equations (5) and (6) the probability is over the randomness of the algorithm, sample S, and noisy oracles  QL, QCchosen by the adversary. Also, in comparison to PAC learning, all point which are not labeled “⊥” by an ARPU-learner are labeled correctly with high probability and setting  δr = 0reduces exactly to RPU learning. Sample complexity and learnability are then defined equivalently to PAC-learning. Finally, we will refer to learners that satisfy condition (5) as  δu-useful, and learners that satisfy condition (6) as  δr-reliable. While the logical inference technique previously used to build RPU learners [4, 6] are very sensitive to noise, we show in later sections how to modify those techniques to build ARPU-learners.

1.1.8 Passive vs Active learning

PAC-learning traditionally is applied to supervised learning, where the learning algorithm receives pre-labeled samples. We call this paradigm passive learning. In contrast, active learning refers to the case where the learner receives unlabeled samples and may adaptively query a labeling or comparison oracle. Similar to the passive case, for active learning we study the query complexity as the minimum number of queries to learn some pair (X, H) in either the PAC, RPU or ARPU-learning model. In general, passive learners learn concept classes up to error  εin  Θ(1/ε)samples. We add to a long line of work [25, 7, 14] showing that active learning can achieve such learning in only polylog(1/ε)queries on important concept classes.

1.2 Our Results

In this work, we study ARPU-learning (Section 1.1.7) under two widely studied noise models: Massart Noise and the Generalized Tsybakov Low Noise Condition.

1.2.1 Notation

We use notation where X is the instance space, H is the set of hypothesis from  X → R, Hdis the class of linear separators in  Rd(corresponding to affine functions  h : Rd → R), and  Hd,γis the class of linear separators in  Rd with margin  γ from X. Since previous work [2, 14] refers to the class of homogeneous linear separators as simply “linear separators,” we will often refer to  Hdas “non-homogeneous linear separators” to differentiate our results. For noise models,  M(λ)is the set of all oracles which satisfy Massart noise with parameter  λ, GTNC(gL, gU, ε0)is the set of all oracles which satisfy the Generalized Tsybakov Low Noise Condition with parameters  (gL, gU, ε0), and TNC(m, M, κ, ε0)is the set of all oracles satisfying the Tsybakov Low Noise Condition with parameters (m, M, κ, ε0). A model is a pair  (Q, DX)where Q is a set of oracles (QL, QC) and DXis a set of distributions over X. For distributions over instance space  X or Rd,

1.  CXis the class of all continuous distributions over X,

2.  LCdis the class of all log-concave distribution on  Rd,

3.  SCdis the class of all s-concave distributions on  Rd for s ≥ − 12d+3,

4.  ISCdis the class of all isotropic s-concave distributions on  Rd for s ≥ − 12d+3,

5.  ACCd,c1,c2is the class of all continuous distributions D which satisfy the following concentration and anti-concentration inequalities:

image

6. For hypothesis class  (X, H), A(X,H),a,f(d)is the class of all continuous distributions  DXover X such that  (DX, X, H)has average inference dimension  g(n) ≤ 2−Ω�n1+af(d)�.

We will call an algorithm sample (respectively time) efficient if it uses  poly(d, 1ε, 1δr , 1δu )samples (respectively time), and query efficient if it uses  poly(d, log 1ε, log 1δr , log 1δu )queries. Finally, for some parameter n (e.g. dimension, error) and function  f : R → R, for the sake of readability we will often use the notation ˜O(f(n))to ignore multiplicative factors that are logarithmic in f(n).

1.2.2 Massart Noise

To begin, we show that under the Massart noise model, finite inference dimension (Definition 1.1) implies computationally efficient ARPU-learning with exponentially better query complexity than any passive PAClearner1. Recall M(λ)is the set of all oracles which satisfy Massart noise with parameter  λ, CXis the class of all continuous distributions over X, and a model is a pair  (Q, DX)where Q is a set of oracles  (QL, QC)and  DXis a set of distributions over X. Note that in the ARPU-Learning model (Definition 1.3), given a hypothesis class (X, H) and a model  (Q, DX), an adversary chooses a distribution  DXfrom  DXand the “noisy” oracles  (QL, QC) from Q.

Theorem 1.4 (Finite Inference Dimension  =⇒ARPU-Learning under Massart Noise). Let the hypothesis class  (X, H), X ⊆ Rd, have inference dimension k with respect to comparison queries. Then, (X, H) is ARPUlearnable under model  (M(λ), CX) in time poly�d, k, 1δr , 1ε, log( 1δu )� ˜O( 1λ5 ), uses only  poly�k, 1λ, 1ε, log( 1δr ), log( 1δu ))�

unlabeled samples, and has a query complexity of

image

for  δr ≤ 1/2.

To put this result into context, we note two lower bounds which together with Theorem 1.4 show a separation between passive and active learning, and label only and comparison based ARPU-learning. In the case of passive, comparison based PAC-learning, we recall the  Ω� 1ε�lower bound from [6]. For label only APRU-learning, we present a lower bound novel to this work:

Lemma 1.5. The query complexity of 1/4-reliably, 1/8-usefully ARPU learning  (S1, H2) with 1/2-coverageunder model  (M(λ), CX)is infinite:

image

Together, these bounds show that comparison based active learning provides not only an exponential improvement in query complexity over any passive PAC-learner, but also an infinite improvement over any active ARPU-learner using only labels. Further, Theorem 1.4 provides the first algorithm for learning noisy non-homogeneous linear separators in two dimensions which is time, sample, and query efficient in the sense of Section 1.2.1, since the inference dimension of  (R2, H2)is 5 [4]. If the instance space has bounded bit-complexity or minimal-ratio, the result also implies an efficient learner for higher dimensional

non-homogeneous linear separators.

Bounded bit-complexity and minimal-ratio, however, are assumptions that may not hold on real-world data. Instead, we will take a path inspired by the recent explosion of work in data science [25] that focuses on weakly restricting the distribution over data to beat lower bounds based off of improbable adversarial examples. While inference dimension itself is not applicable in this scenario, we will employ its average case variant, average inference dimension (Definition 1.2). In particular, we provide a computationally efficient algorithm for learning under Massart noise under the assumption that the hypothesis class and distribution have super-exponential average inference dimension, a fact true for non-homogeneous linear separators and comparison queries across a wide range of distributions [6]. Given a hypothesis class (X, H), recall  A(X,H),a,f(d)is the class of all continuous distributions  DX over X such that (DX, X, H)has average inference dimension  g(n) ≤ 2−Ω�n1+af(d)�for some a > 0and function of dimension f(d). Then,

Theorem 1.6 (Average Inference Dimension  =⇒ARPU-Learning under Massart Noise). Consider any hypothesis class  (X, H), X ⊆ Rd, and corresponding class of distributions  A(X,H),a,f(d). Then, (X, H) is ARPU-learnable under model  (M(λ), A(X,H),a,f(d))in time  poly�f(d), 1δr , 1ε, log( 1δu )� ˜O( 1λ5 ), uses only

poly�f(d), 1λ, log( 1ε), log( 1δr ), log( 1δu ))�unlabeled samples, and has a query complexity of

image

for small enough  δr.

To see the applicability of Theorem 2.8, we note that Hopkins et al. [6] proved that a wide range of distributions lie in  A(Rd,Hd),1,d log(d). In particular, following [6], we say two distributions  D, D′ over Rd areaffinely equivalent if there is an invertible affine map  f : Rd → Rd such that D(x) = D′(f(x)). Hopkins et al. [6] proved that distributions which may be affinely transformed to a distribution with anti-concentration and concentration (i.e. to a distribution in  ACCd,c1,c2) lie in  A(Rd,Hd),1,d log(d), a condition satisfied by s-concave distributions2. Recall that  SCdis the class of all s-concave distribution,  s ≥ − 12d+3, on  Rdand  Hdis the class of both homogeneous and non-homogeneous linear separators in  Rd. Then, as a direct corollary to Theorem 1.6, we have

Corollary 1.7. The hypothesis class  (Rd, Hd)is ARPU-learnable under model  (M(λ), SCd) in time

image

for small enough  δr.

Previous work showed a similar result for homogeneous linear separators over nearly isotropic log-concave distributions [26] and isotropic s-concave distributions [3] with label queries. However, their techniques cannot be extended to the non-homogeneous case due to a  poly( 1ε)lower bound on the query complexity of active label-only learners [6]. Thus it is only by leveraging the additional power of comparison queries that we extend efficient learning to non-homogeneous linear separators over (not necessarily isotropic) s-concave distributions.

1.2.3 Generalized Tsybakov Low Noise Condition

While Massart noise is a clean theoretical model, its assumption that the noise is bounded away from 1/2 is not necessarily reminiscent of the real world. This motivates us to study a variant of the Tsybakov Low Noise Condition, a model in which noise is unbounded as data approaches the Bayes optimal classifier. However, learning in this unbounded regime is harder, as evidenced by the polynomial query lower bounds of [12, 13, 15]. In order to ARPU-learn in this regime, we need to introduce several restrictions not present for our Massart algorithms. First, instead of allowing any hypothesis class with finite inference dimension, we will only consider (non-homogeneous) linear separators. Second, we will either assume some margin  γ, or that the distribution satisfies certain weak concentration and anti-concentration bounds. To begin, we consider learning hypothesis classes over any continuous distribution with finite inference dimension and margin. Recall  GTNC(gL, gU, ε0)is the set of all oracles which satisfy the Generalized Tsybakov Low Noise Condition with parameters  (gL, gU, ε0), Hd,γis the class of linear separators in  Rdwith margin  γfrom X, and  CXis the class of all continuous distributions over X.

Theorem 1.8 (Finite Inference Dimension and Margin  =⇒ARPU-Learning under GTNC). Let  X ⊆ Rdand  (X, Hd,γ)have inference dimension k with respect to comparison queries. Then for small enough  δr, (X, Hd,γ)is ARPU-learnable under model  (GTNC(gL, gU, ε0), CX)with query complexity:

image

Where

image

We prove in addition that while ARPU-learning may no longer be impossible using only labels when margin is introduced, it still suffers from query inefficiency due to the curse of dimensionality.

Lemma 1.9. Let X ∈ Rd be the d-dimensional hypercube  {0, 1}d modified to have a ball of radius 14√d centered

about each point. The query complexity of ARPU-learning  (X, Hd, 14√d )under model  (GTNC(gL, gU, 14√d), CX)is at least:

image

In the above example,  (X, Hd, 14√d )has inference dimension  ˜O(d)by a minimal-ratio argument from [4]. Theorem 1.4 thus gives an algorithm using only poly(d) queries, demonstrating the exponential gap in query complexity between label only and comparison based ARPU-learning with Tsybakov noise. Due to margin causing bounded error in label queries, another way to view this result is the statement that comparison queries with unbounded error exponenentially improve the query complexity of ARPU-learning using only labels with bounded error.

Similar to the case of Massart noise, we may drop the restrictive assumptions of finite inference dimension and margin by assuming weak distributional requirements. Unlike in the case of Massart, here we deal with the requirements directly rather than assuming average inference dimension. Recall  ACCd,c1,c2is the class of all continuous distributions D with the following properties:

1.  ∀α > 0, Prx∼D[||x|| > dα] ≤ c1α

2.  ∀α > 0, v ∈ Rd, ∥v∥ = 1, b ∈ Rd, Prx∼D[|⟨x, v⟩ + b| ≤ α] ≤ c2α

Theorem 1.10 (Concentration and Anti-Concentration  =⇒ARPU-learning under GTNC). For small enough  δr, the hypothesis class  (Rd, Hd)is ARPU-learnable under model  (GTNC(gL, gU, ε0), ACCd,c1,c2) withquery complexity:

image

where

image

Since isotropic s-concave distributions satisfy these conditions [3, 6], we get the immediate corollary for TNC noise under isotropic s-concave distributions. Recall that  ISCdis the class of all isotropic (0 mean, identity variance) s-concave distribution on  Rd, and Hdis the class of non-homogeneous linear separators in Rd.

Corollary 1.11. The hypothesis class  (Rd, Hd)is ARPU-learnable under model  (TNC(m, M, κ, ε0), ISCd)with query complexity:

image

Where

image

This result similarly extends previous work on homogeneous linear separators over isotropic log-concave distributions [2, 13] to the non-homogeneous case. In comparison to Hanneke and Yang’s [12] distribution free algorithm for label only PAC-learning, Corollary 1.11 provides an improved query complexity for  1 < κ < 1514,and more importantly provides the reliability guarantees of the ARPU-learning model.

Finally, note that unlike Theorems 1.4, 1.6, and 1.8, Corollary 1.11 has polynomial rather than polylogarithmic dependence on  ε−1. This is unavoidable, as we prove a lower bound also polynomial in  ε−1.

Lemma 1.12. The query complexity of actively PAC-learning  (R2, H2)under model  (TNC(m, M, κ, ε0), SC2)is at least

image

Thus the main advantage of comparisons in this regime is their added reliability.

1.3 Techniques

1.3.1 Inference Dimension

Our algorithms will follow the form of the learning technique for hypothesis classes with finite inference dimension (Definition 1.1) introduced in [4]. Drawing and querying a subsample S, Kane et al. build a weak learner by defining a Linear Program (LP) with constraints given by the query responses Q(S), and objective function defined by the input point to be labeled. Through a symmetry argument, Kane et al. are able to show that if S is large enough with respect to the inference dimension, the coverage of this weak learner will be at least 3/4. Since we will rely on this argument throughout our paper, we offer a brief description here.

The expected coverage of the learner may be viewed as the probability that a randomly drawn point from the distribution is inferred by the LP. Since our weak learner is built from some finite sample from the same distribution, symmetry gives that this is equivalent to the probability that any of |S| + 1 points can be inferred from the other |S|. Kane et al. then provide the following observation for |S| = n and inference dimension k which proves that setting n = 4k gives coverage at least 3/4.

Observation 1.13 (Observation 3.4 [4]). Let the hypothesis class (S, H), |S| = n, have inference dimension k for the set of binary queries Q. Then  ∀h ∈ H, there exists a subset  S′ ⊂ Sof size  n − k + 1such that

image

Inference dimension on its own, however, is restrictive. Using only comparisons and labels, the inference dimension of linear separators in three or more dimensions is infinite, which implies the existence of realizable distributions with  Ω( 1ε)query complexity [4]. To get around this barrier, we will introduce weak distributional assumptions and instead employ the framework of average inference dimension introduced in [6]. Average inference dimension (Definition 1.2) allows us to build algorithms for hypothesis classes with infinite inference dimension, as long as the distribution it is over is sufficiently nice. We will take advantage of a reduction from average to worst case inference dimension to prove such results:

Observation 1.14 (Observation 3.6 [6]). Let (D, X, H) have average inference dimension  g(n), and S ∼ Dn.Then (S, H) has inference dimension k with probability:

image

1.3.2 Noisy Sorting

The linear program used as a weak learner relies heavily on the correctness of Q(S), making noisy oracles a challenging problem. To retain correctness and reliability, we rely on using extra points outside of the linear program to help identify the true answers Q(S). This idea is not all together new. Contemporaneously with Kane et al., Xu, Zhang, Singh, Miller and Dubrawski [15] suggested using noisy comparisons as a sub-routine in older active learning algorithms to correct for noise in labels. However, as Xu et al. [15] point out, this technique does not work for Kane et al.’s [4] algorithm which requires corrected comparisons as well. Instead, we adopt and adapt a noisy sorting algorithm from Braverman and Mossel [19].

Braverman and Mossel [19] study the problem of recovering the best possible ranking from an ordered set with access to a noisy comparison oracle  QC. In particular, given a ground set S of size n, Braverman and Mossel aim to find an order  πthat minimizes the number of discrepancies with the measured comparisons QC(S), denoted by the order relation �<:

image

If the oracle  QCflips comparisons with probability exactly p < 1/2 and the true ordering has a uniform prior, Braverman and Mossel [19] note that this scoring function has a nice probabilistic interpretation: it is a Maximum Likelihood ordering

image

Braverman and Mossel [19] call finding such an ordering the Noisy Signal Aggregation (NSA) problem, and provide a randomized algorithm that uses only  Oλ(n log(n))comparisons for oracles satisfying Massart noise with parameter  λ. Further, they provide an important structural insight into MLE orderings: with high probability, no point in an MLE order has moved further than  Oλ(log(n))from its position in the true order.

Theorem 1.15 (Optimal Ranking [19]). Let S be a set of size n with underlying order 1, . . . , n and  σan MLE order for S under comparisons given by an oracle  QCsatisfying Massart noise with parameter  λ. Thenwith probability at least  1 − δ:

image

as long as  n or 1δ is at least exponential in  λ−1.

This pointwise movement allows us to determine with high probability comparisons between points that are well-separated throughout an MLE order. By using only such separated points to build our inference LP, our algorithms are almost reliable – a point can only be mislabeled if some well-separated comparison is wrong, a low probability event.

While Braverman and Mossel’s algorithm is query efficient and has a strong pointwise movement guarantee, its exponential time complexity in the error parameter is the main limiting factor in the computational efficiency of our algorithm for Massart noise. The existence of an efficient (polynomial in error) sorting scheme that retains some sub-linear (not necessarily logarithmic) point-wise movement bound under Massart noise would immediately imply computationally and query efficient algorithms for Massart noise for  λ−1poly-logarithmic in 1ε, rather than for  λ−1 = ˜O(log1/5(1/ε))as we require. Follow up works on Braverman and Massart’s algorithm [2729] made progress in this direction, providing algorithms with significantly improved time complexity, but only work for  λbounded from below by some constant. Providing an algorithm that remains efficient while  λgoes to 0 is an open problem.

1.3.3 Cluster Detection and Inference

Braverman and Mossel’s noisy sorting algorithm works well in the case of bounded error, but noise models with unbounded error require a different approach. The particular model we examine in this case, the Generalized Tsybakov Low Noise Condition, is a distance based error metric. This means that as points approach each other in function value, their comparisons “look random”. We can use this fact to detect clusters of points close in function value by testing whether comparisons between them look like they have been drawn at random. In particular, we define a natural measure of randomness that we call equitability:

Definition 1.16 (Equitability). Let S be a set with comparisons denoted by �<on each pair of elements. For an element  x ∈ S, let v(x)denote the number of elements  y ∈ S such that y�<x. We call S ε-equitable if

image

We prove a bi-directional equivalence between clusters and equitable sets: any cluster is equitable with high probability, and any equitable set contains a large cluster with high probability.

If a sample has no cluster, we will prove that a modified version of noisy sorting is sufficient to learn. On the other hand, if we detect a cluster, another approach is required. To handle this case we prove a novel structural lemma regarding the inference power of clusters, showing that any cluster of size  Ω(d log(d)) mustcontain a point that can be inferred from the rest.

2.1 Lower Bounds

In this section we provide two lower bounds: the first to separate comparisons from label only ARPU learning, and the second to explain our restriction to continuous distributions. Our label only lower bound uses the same distribution that shows an exponential gap in active PAC learning between labels and comparisons [1, 4], a circle, except in the case of ARPU learning, the gap is infinite.

Lemma 2.1 (Restatement of Lemma 1.5). The query complexity of 1/4-reliably, 1/8-usefully ARPU learning (S1, H2) with 1/2-coverage under model  (M(λ), CX)is infinite:

image

Proof. By Yao’s minimax principle it is enough to show that the adversary may pick a distribution over hyperplanes such that no learner can 1/4-reliably and 1/8-usefully learn with coverage 1/2. In particular, assume that the adversary picks a uniform distribution over all tangent hyperplanes to the circle. This may be equivalently thought of as the adversary picking a single point on the circle to be negative, and the rest to be positive. Note that the probability that a learner which queries a finite number of points finds the negative point is 0.

Let the learner fix an optimal strategy, querying whatever points they desire. With probability 1, the learner will always query the same set of points since they receive all positive labels. The learner is then left to label 1/2 the measure of the circle blind, since all points except a measure 0 set (the queried points) are equally likely to be the negative point. No matter which set the learner chooses, the probability that it mislabels a point is at least 1/2, violating the ARPU-learning requirement that the learner must must label at least 1/2 of the points with probability at least 5/8 while making no errors.

Note that this lower bound holds even in the noiseless case, which is strictly weaker than Massart as the adversary may simply choose no noise.

Second, we justify why our upper bounds are only for continuous distributions, as the inference dimension framework was initially developed for the worst case rather than distributional model. However, with the introduction of noise, we observe that learning up to arbitrary error is no longer possible over some distributions.

Observation 2.2. Let (X, H) be a hypothesis class, and D a distribution on X whose support consists of a single point x. Let the corresponding noisy label and comparison oracles  (QL, QC) ∈ M(λ). If there exist h, h′ ∈ H s.t. h(x) ̸= h′(x), then no learner can correctly label x with probability more than  1/2 + λ.

This lower bound holds as well across a wide range of distributions containing points with non-zero measure. Take, as an example, a distribution which samples uniformly from the unit ball with probability 1/2, and some disjoint point x with probability 1/2. Setting the error parameter low enough would force the learner to correctly label x, and since the adversary can pick a classifier such that the point cannot be inferred from comparisons, a similar lower bound holds. In order to avoid such examples, we will restrict our consideration to continuous distributions.

2.2 Finite Inference Dimension

With the lower bound out of the way, we prove that hypothesis classes with finite inference dimension are efficiently ARPU-learnable under Massart noise. Recall  M(λ)is the set of all oracles which satisfy Massart noise with parameter  λ, CXis the class of all continuous distributions over X and a model is a pair  (Z, DX) where Zis a set of oracles  (QL, QC) and DXis a set of distributions over X. Note that in the ARPU-Learning model (Definition 1.3), given a hypothesis class (X, H) and a model  (Z, DX), an adversary chooses a distribution  DX from DXand the “noisy” oracles  (QL, QC) from Z. The following is our learning algorithm.

image

Before proving the lemmas necessary to show the coverage of Step 2 from Algorithm 1, we will restate our theorem of the efficient learnability of hypothesis classes with finite inference dimension under Massart noise.

Theorem 2.3 (Restatement of Theorem 1.4). Let the hypothesis class  (X, H), X ⊆ Rd, have inference dimension k with respect to comparison queries. Then, (X, H) is ARPU-learnable under model  (M(λ), CX)in time  poly(d, k, 1δr , 1ε, log( 1δu ))˜O( 1λ5 ), uses only  poly(k, 1λ, 1ε, log( 1δr ), log( 1δu )))unlabeled samples, and has a query complexity of

image

for small enough  δr.

See Algorithm 1. The proof of this theorem lies in the combination of Braverman and Mossel’s [19] approximate ordering with Kane et al.’s [4] inference based algorithm. The idea is as follows:

Step 1: Draw a sample  S ∼ DnX, and sort it into an MLE order by the algorithm of Braverman and Mossel [19]. Draw another m points, and independently slot them into the ordering on S near their true position (again by an algorithm from [19]).

Step 2: From the m points, create a clean subset of points with correct labels and comparisons by selecting a chain of points separated by  Ωλ(log(n))in the MLE order of S to build an inference LP. This LP correctly infers points with high probability by [19], and has large coverage due to the space’s finite inference dimension [4].

Step 3: Restrict D (by rejection sampling) to points un-inferred by the LP in step 2, and repeat steps 1 and 2 until coverage has reached  1 − ε.

The main challenge of the proof then comes down to proving the correctness and coverage of Step 2. First, we need to show that points separated in  S by Ωλ(log(n))are correctly ordered. Since our sample size n will not be exponential in 1λ, we need to slightly modify Theorem 1.15 for this result.

Observation 2.4 (Point-wise Movement). Let S be a set with underlying order  1 . . . n. If σis an MLE order for S under noisy label and comparison oracles  (QL, QC) ∈ M(λ), then with probability at least  1 − δ:

image

as long as  n or 1/δis polynomial in  λ−1.

Proof. Braverman and Mossel define a parameter  m2during their proof as:

image

The requirement on size of  n or 1/δof Theorem 1.15 then comes from the final equation of Lemma 28 [19]:

image

Increasing  m2by a factor of log(1/λ)λ2removes the need for exponential dependence on  λ−1, but increases the pointwise movement bound by the same factor.

Note that points in S separated by 2 max i|σ(i) − i|in an MLE order are thus correctly ordered with high

probability. As a result, picking a chain of points each separated by twice Equation (7) gives an entire set of points with correct comparisons with high probability.

However, using an MLE order itself is challenging. Recall from Section 1.3 that we compute the expected coverage of our learner by the probability that it infers an additionally drawn point. If we use an MLE order, we cannot directly appeal to the symmetry argument of [4], as adding an additional point to S might change the MLE order we have picked. To get around this, our learner is not built off of S itself, but  S′, a set of additional points which we place into the order on S independently of each other. This independence allows us to directly appeal to the argument of [4].

Our method of finding a clean subset, however, is currently for S – we need to modify the method to find a subset of  S′ with correct labels and comparisons. We do this in two steps. First, we adopt a method from [19] for inserting points into a previously sorted set such that they cannot be too far away from their true position. This implies that if points in  S′ are separated by enough points in S in the underlying order on S ∪ S′, we will be able to correctly compare them with high probability. Second, we show that because the underlying true order on  S ∪ S′ is uniform, there exists a chain of such points in  S′ with constant probability from which we can build our cleaned set.

Lemma 2.5 (Slotting [19]). Let S of size n and S′ of size mbe ordered sets with noisy label and comparison oracles  (QL, QC) ∈ M(λ). Divide an MLE order  σ of S into b blocks Biof size at least:

image

There exists an algorithm placing points in  S′into  σsuch that, with probability at least  1 − δ, any pair of points separated by 4 blocks are in the correct order.

Proof. This lemma is a slight modification of part of [19, Theorem 30]. Assume some  x ∈ S′ lies in the i-th block  Biin the true order. By Observation 2.4, with probability at least  1 − δ, xmust be bigger than all elements before  Bi−1and smaller than all elements past  Bi+1. To find which side of a block B x lies in, we measure whether x is greater than, or less than a majority of elements in the block. A standard Chernoff bound gives that the probability the majority is incorrect is at most  e−λ2|B| ≤ δmn, and union bounding over blocks and  S′ gives that all elements will be slotted up to an error of two blocks on either side. Note further that this slotting procedure may be performed by binary search, and thus uses at most O(log(b)m|B|) queries in total. Finally, since elements must be slotted within two blocks of their true position, any pair of elements separated by at least 4 full blocks must be in the correct order.

Since we can safely compare points separated by 4 blocks in the MLE order, and points slot within 2 blocks of their true position, points separated in the underlying order by 8 blocks can be correctly compared with high probability. It is left to show that there is a large enough chain of points in  S′separated by 8 blocks in S.

Lemma 2.6. Let S of size n and S′ of size 32k + 16be ordered sets with noisy label and comparison oracles (QL, QC) ∈ M(λ). Let the size of S satisfy:

image

Then with constant probability we can find a subset of 4k points from  S′which can be labeled and compared correctly with probability  1 − δ.

Proof. Consider the true order  πon the set  S ∪ S′. Let 0be the special point whose comparison to another point x is given by x’s label  QL(x). Lemma 2.5 provides an algorithm for determining the labels and comparisons of points in  S′ separated by more than

image

elements in S and not within c of 0. Consider dividing the order  πrestricted to S (denoted by  πS) up into b = 32k + 16 equal blocks  Biof size at least c. Since any two points in  S′ which are separated by more than a block in  πSwill be correctly ordered by Lemma 2.5 with probability  1 − δ and only 2non-contiguous blocks can be adjacent to 0, it is sufficient to find a chain of non-contiguous blocks of size 4k + 2 that all contain a point in  S′. To simplify this, consider the set of every other block (Bodd = {B1, B3, ...}), and let Y be the random variable denoting the number of blocks in  Boddwithout a point in  S′. To upper bound the value of Y , we bound its mean and variance and apply Chebyshev’s inequality. Note that since  S and S′ are drawn i.i.d., the ordering on  S ∪ S′ is uniform at random. We can write Y as the sum of indicator variables  Y1 + Y3 + ...,where  Yidenotes the event that  Bidoes not have a point in  S′. Since the ordering is uniform, the probability that a point in  S′ lies in any given block is 1b. Using this, we can bound the expectation of Y by:

image

and similarly the variance of Y by:

image

Here the second to last inequality follows from the assumption that  b ≥ 48(or equivalently that  k ≥ 1). Noting that the number of blocks with a point from  S′is b2 − Y, Chebyshev’s inequality then gives that a constant fraction of the blocks must have a point from  S′ with constant probability:

image

Lemma 2.6 allows us to build a clean set of points with correct comparisons and labels. By feeding this set of points into an inference LP, we create a weak learner that infers a constant fraction of the space with constant probability.

Lemma 2.7 (Weak Learner). Let (X, H) have inference dimension k, and let the label and comparison oracles  QL, QC ∈ M(λ). Then there exists a constant  c1 > 0such that for any  1/2 > δr > 0, there exists a weak learner that  3δr-reliably learns (X, H), has coverage  c1with probability  ≥ c1, makes at most  qwl(δr)queries, and runs in time  poly(k, 1δr ) ˜O(λ−5), where

image

Following Lemmas 2.5 and 2.6, we we will slot a second, i.i.d. drawn set  S′of points into our MLE order where  |S′| = 32k + 16. Then with constant probability we can find a subset of  4k points in S′ which may be

correctly ordered and labeled with probability at least  1 − δr.

We are now in position to apply the symmetry argument from [4] to show that this subset gives constant coverage with constant probability. The expected coverage is given by the probability that an additional, independently drawn point  x ∼ DXis inferred:

image

Since  S′ and xare drawn randomly, the right hand side is equivalent to the probability that any point in the sample can be inferred from the rest:

image

Recall that with probability at least 59we can find and, with probability  1 − δr, correctly order and label a subset of 4k points from  S′. By Observation 1.13, at least 3k of these can be inferred from the rest, bounding the right hand side by:

image

where we have assumed  δr < 1/2. Then for any constant  c1 > 0 we have:

image

which for small enough  c1 gives:

image

Accounting for the fact that we have assumed our comparisons and labels are correct, our weak learner has coverage  > c1with probability at least  (1 − δr)2c1 > c1 for δr < 12.

Query Complexity: Now, we compute the number of queries made by the weak learner. Let  c3 = m2/ log nwhere  m2is the point-wise movement as defined in Observation 2.4. Using the same notation as [19], we let (setting  α = O

image

Using [19, Lemmas 31 and 32], the number of queries made in the sorting n points (which includes dynamic programming step on n points and slotting n points) and slotting additional  |S′| = 32k + 16points are

image

with probability  1 − δr. Since we do not want our number of queries to be probabilistic, if our learner does not complete after  qwl(δr)queries, we stop and output all 0’s. This increases our error probability by  δr.

Time Complexity: Using an algorithm from [19, Theorem 30], we can sort n points with noisy comparisons in time  nc4where  c4 = O(λ−5 log 1λ(1 + (log 1δr )( 1log n)))with probability  1 − δr. Since slotting a point in worst case takes O(n) time, we can slot O(k) points in time O(kn). This gives us the total time taken by the weak learner as

image

Therefore, the time complexity of the algorithm is  poly(k, 1δr )˜O( 1λ5 ). Once again taking the strategy of outputting all 0’s if the algorithm does not complete in time  Twl(δr), we lose another error factor of  δr, making the algorithm all together  3δr-reliable.

With our weak learner in hand, all that is left for the proof of Theorem 2.3 is Step 3: stringing together copies of the weak learner through rejection sampling.

Proof of Theorem 2.3. Let δwr and δwu be reliability and usefullness parameters for our weak learner. Recall that Lemma 2.7 gives a  3δwr-reliable weak learner with coverage  c1with probability  c1. Applying this weak learner  O(log(1/δwu ))then amplifies this probability to at least  1 − δwu .

Restricting to the distribution of un-inferred points via rejection sampling, we repeat the above process until our coverage reaches  1 − ε. Assume each repetition is successful, then after t steps our coverage is:

image

Setting  t to O(log(1/ε))is then sufficient to set the right hand side to  1 − ε. However, each repetition in this process degrades the overall probability of usefulness. In order to get an overall guarantee of  δu, we must adjust our initial  δwu to:

image

Similarly, since we apply the weak learner  O(log(1/ε) log(1/δwu ))times, we adjust our  δwr to

image

Query Complexity: In total, we run our weak learner at most  O�log� 1ε�log�1δwu��times, giving a query complexity of:

image

Sample Complexity: At each step of our algorithm, we restrict to the distribution of un-inferred points through rejection sampling. By itself, this poses a problem: what if we have inferred much of the space early and our algorithm continually rejects points? To combat this, we note that we can estimate the measure of remaining un-inferred points by how many samples we have to draw before finding one. Formally, if at any step we draw  2 log(1/δu)/εinferred points in a row, then by a Chernoff bound the coverage of our learner is 1 − εwith probability at least  1 − δu. Let nbe the sample size as defined in Lemma 2.7. Since our algorithm

image

our algorithm stops after rejecting  2 log(N/δu)/εpoints in a row. This means that we can bound the total number of samples drawn by

image

Time Complexity: The time complexity of our algorithm has two main components: the complexity of finding an MLE order in the weak learner, and the complexity of rejection sampling. We already computed the time complexity of the weak learner in Lemma 2.7 as  Twl(δr) = poly(k, log( 1δr ))˜O( 1λ5 ). Since,

we run our weak learner at most  O�log� 1ε�log�1δwu

image

It remains to compute the time complexity of rejection sampling. Recall that the we sample at most n(ε, δr, δu)points total in our process. For each point, we run an LP in d + 1 variables with constraints detailed by our previous queries that round. Since the queries our weak learner uses in each round only involve ˜O(n)points, the time complexity of sampling is at most:

image

Since the total time complexity is order of the sum of sampling and sorting, we get an algorithm that runs in time  poly(d, k, 1δr , 1ε, log( 1δu ))˜O( 1λ5 ).

2.3 Average Inference Dimension

While inference dimension allows us to work over arbitrary continuous distributions, as a complexity parameter it is rather restricting, barring for instance the learning of linear separators in dimensions above two. To generalize to a broader range of classifiers, we will use the framework of average inference dimension introduced in [6]. In particular, we show that any hypothesis class and distribution with super-exponential average inference dimension may be efficiently learned under Massart noise. As a result, we provide the first computationally and query efficient learner for non-homogeneous linear separators over s-concave distributions with Massart noise.

Theorem 2.8 (Restatement of Theorem 1.6). Consider any hypothesis class (X, H) and corresponding class of distributions  A(X,H),a,f(d). Then, (X, H) is ARPU-learnable under model  (M(λ), A(X,H),a,f(d))in time

poly(f(d), 1δu , 1ε, log( 1δr ))˜O( 1λ5 ), uses only  poly(f(d), 1λ, log( 1ε), log( 1δr ), log( 1δu )))unlabeled samples, and has a query complexity of

image

for small enough  δr.

Average inference dimension gives a high probability bound on the inference dimension of a finite sample. However, shifting our strategy to directly work with a finite samples introduces a new problem: since our algorithm corrects noise via extra helper points, we may not be able to learn the entire sample. Our first step will be to show that learning most of a finite sample in few queries with high probability is sufficient to learn the entire distribution.

Lemma 2.9. Let (X, H) be a hypothesis class, and  DXa distribution over X. Let A be an active, inference based learner taking in finite samples  S ∼ DnXwith the property that for sufficiently large n, A learns a (1 − ε1)fraction of S with probability  1 − δ, while querying at most an  ε2fraction of the points. The expected coverage of A over the entirety of X is at least:

image

Proof. To find the expected coverage of A over the entire distribution  DXbased on samples S of size n, we look at the probability that an additional randomly drawn point is inferred:

image

We can bound the right hand term by looking at A applied to samples  S′of size n + 1. In particular, if xn+1is learned but not queried by A, then because A is an inference based learner, it must be the case that {x1, . . . xn}infer  xn+1. Since the points of  S′are drawn i.i.d from  DX, the probability that A queries or learns any given point  xiis the same for all  1 ≤ i ≤ n + 1. Because a  1 − ε1fraction of points are learned with probability  1 − δand only an  ε2fraction of points are queried, the probability that a point is learned but not queried is at least  1 − δ − ε1 − ε2by a union bound, which gives the desired bound on A’s coverage.

Proof of Theorem 2.8. We will argue that the learner presented in Theorem 2.3 satisfies the properties of Lemma 2.9 for a large enough sample size. To prove this, we first examine learning a specific sample with small inference dimension. The coverage over all samples will then follow from the fact that almost all samples have small inference dimension due by Observation 1.14 [6] and our assumption on average inference dimension.

Because we are considering a fixed sample S, the weak learner draws uniformly without replacement from S (denoted  x ∼ S) rather than from the distribution itself. All required symmetry arguments still hold in this regime, as the order that points are pulled is still uniformly random. The expected coverage of our learner over S is thus the same as for X in Lemma 2.7 adjusted for the fact that we sample without replacement:

image

and hence

image

Assume for now that |S| is large enough that the subtracted term is negligible. To analyze the remaining coverage probability, assume that n satisfies the constraints of Lemma 2.7 with  k = ˜Θ(f(d)1/a log1/a(|S|)), and further that S has inference dimension k. Then by the arguments in Lemma 2.7, this probability over the sample itself and noisy oracles is constant. Further, as long as S is sufficiently large, we can get coverage 1 − εwith probability  1 − εby applying the same argument restricted to the subset of un-inferred points O�log2 � 1ε��times. This argument only fails when there are no longer n remaining points for our weak learner to use, but as long as  |S| = ω( nε ), this will not affect our coverage. Since Lemma 2.9 also only allows the learner to query a  εfraction of points, we set S to:

image

which also validates our assumption that Oλ(log(n))|S|is negligible (we lose less than  εover all iterations). To apply Lemma 2.9, it is sufficient to have a learner A such that:

image

Because  |S| > Ω� 1ε�, S has inference dimension k with probability at least  1 − εby Observation 1.14 [6]. Combining this with the fact that our algorithm has a  1 − εprobability of achieving  1 − 2εcoverage when the inference dimension is k proves this claim.

Finally, by Lemma 2.9, our learner has expected coverage is  ≥ 1 − 5εover the entire space. To get the desired coverage probability, we run the algorithm over  O(log(1/δu))samples, setting  δr to δr/ log(1/δu)to amend the degradation of correctness over repetition. Then by the same argument as Theorem 2.3, our query complexity is:

image

Sample and time complexity follow similarly to Theorem 2.3.

The Massart noise model does well to capture situations with adversarial bounded noise, but even in a realistic non-adversarial scenario, error may not be bounded away from 1/2. One might think, for instance, that label noise should be bounded as a function of the distance to the Bayes optimal classifier, reaching purely random labels on the decision boundary itself. Likewise, comparisons between arbitrarily close points should be difficult, with error approaching 1/2 as well. This motivates us to study the Tsybakov Low Noise condition, a popular instantiation of distance-based noise. However, learning in this unbounded regime is harder, as evidenced by polynomial query lower bounds [13, 15], and the lack of computationally efficient algorithms for the model. In order to ARPU-learn in this regime, we need to introduce more stringent restrictions than for Massart noise. First, instead of allowing any set system with finite inference dimension, we will only consider non-homogeneous linear separators. Second, we will either assume some margin  γ, or that the distribution satisfies certain weak concentration and anti-concentration bounds, a property which implies our earlier assumption for Massart noise of super-exponential average inference dimension.

3.1 Finite Inference Dimension and Margin

In this section, we will consider ARPU-learning hyperplanes over any continuous distribution with finite inference dimension and margin. Note that in the GTNC model, introducing margin bounds the error on label queries away from 1/2. Thus our results should informally be viewed as saying the following: comparison queries with unbounded error exponentially improve query complexity over label queries with bounded error in the ARPU-learning model. Indeed, although we have picked a specific model of bounded label error in this case, trading for another model such as Massart noise on labels causes no significant change to our upper or lower bound.

As in the case of Massart noise, we will first show the gap in query complexity between label only and comparison ARPU-learning. Our previous method showed an infinite gap between the two regimes, but the assumption of a non-zero margin requires a different argument. In this case, we will show a family of examples in which comparisons provide an exponential improvement.

Lemma 3.1 (Restatement of Lemma 1.9). Let  X ∈ Rdbe the d-dimensional hypercube  {0, 1}dmodified to have a ball of radius 14√dcentered about each point. The query complexity of ARPU-learning  (X, Hd, 14√d )under model  (GTNC(gL, gU, 14√d), CX)is at least:

image

Proof. For simplicity, the adversary will pick the uniform distribution from  CX, and the noiseless case from (GTNC(gL, gU, 14√d), CX). Further, by Yao’s minimax principle it is sufficient to show there is a distribution over hyperplanes in  Hd, 14√dfor which no learner can achieve at least 3/4 coverage with perfect correctness

with greater than 3/4 probability. Let the adversary pick the uniform distribution over the  2dhyperplanes which truncate corners of the hypercube, e.g.

image

Note that these hyperplanes have margin 14√d, so they lie in  Hd, 14√d, and that each one may be seen as selecting a single ball to be negative. Given any set strategy, the learner can only query points in  2d−1out of  2dballs. The probability that one of the balls the learner queries is the negative ball is at most 1/2. If the learner does not locate the negative ball, to have coverage 3/4 it must label half of the remaining space with no additional queries. However, any set strategy from the learner in this case will have an incorrect label with probability at least 1/2 since the negative ball is uniformly distributed over the remaining balls. Thus any learner that has 3/4 coverage with probability more than 3/4 must incorrectly label some point, violating the conditions of ARPU-learning.

By an argument based on minimal-ratio (margin normalized by the maximum function value) from [4], the inference dimension of the above hypothesis class is ˜O(d). We will prove that this implies a comparison based algorithm that only makes poly(d) queries.

Theorem 3.2 (Restatement of Theorem 1.8). Let  X ⊆ Rdand  (X, Hd,γ)have inference dimension k with respect to comparison queries. Then,  (X, Hd,γ)is ARPU-learnable under model  (GTNC(gL, gU, ε0), CX) withquery complexity:

image

Where

image

See Algorithm 2. Unlike the Massart case, we can no longer directly rely on the sorting algorithm of [19], as the point-wise movement guarantees rely on bounded noise. Instead, we rely on the fact that we can, with high probability, check the level of noise of a drawn sample. If the sample is not too noisy, we can modify the bounds of [19] and apply the same technique. On the other hand, if the sample is very noisy, we use this to infer structural information about the sample and thus learn some fraction of the instance space. Informally, our algorithm follows a similar three step process to the Massart case:

image

Step 2a (high noise): If S measures as noisy, we identity a subset  S′ ⊂ Sof points which are close with respect to the underlying hypothesis. Using additional randomly drawn points, we create an inference LP based on the structure of  S′ to learn a fraction of the instance space.

Step 2b (low noise): If S measures as having only a small amount of noise, sort S into an MLE order, and apply the same learning strategy as for Massart.

Step 3: Restrict D (by rejection sampling) to points un-inferred by the LP in step 2a/b, and repeat steps 1 and 2a/b until coverage has reached  1 − ε.

At the core of this technique is the ability to detect subsets with high levels of noise, and to certify that they are highly structured. With this in mind, we show that if comparisons on a subset of S look sufficiently random, then almost all points in this subset are clustered together in function value. Formally, we define a cluster as:

image

Definition 3.3 (Cluster). Let (X, H) be a set system. Given  h ∈ Hand a sample  S ⊆ X, S is an ε-clusterwith respect to h if

image

We will often omit “with respect to  h” when his the function underlying the Bayes optimal classifier.

We will detect clusters by a measure of randomness we term equitibility, the condition that every element is bigger than about half of the other elements.

Definition 3.4 (Equitability). Let S be a set with comparisons denoted by �<on each pair of elements. For an element  x ∈ S, let v(x)denote the number of elements  y ∈ S such that y�<x. We call S ε-equitable if

image

parameter:

image

i.e. the probability that  x1measures less than  x2. Note that  ηC(x1, x2) is either βC(x1, x2) or 1 − βC(x1, x2).

In order to distinguish between steps 2a and 2b, we show that if a cluster exists, then with high probability there is a large equitable subset, and that vice versa, a large equitable subset implies the existence of a large cluster with high probability. Consider testing a sample  S′of size 2c + m for equitability. Call the order on  S′ induced by the underlying classifier the “true order.” To start, we examine a single such sample S′ and show that with high probability:

1. If  S′ is a cluster, then it is equitable

2. If the middle m elements of  S′ with respect to the true order is not a cluster, then  S′ is not equitable.

Lemma 3.5. Consider a set  S′ of size 2c + m, where Cdenotes the middle m elements of  S′ with respect to the true order. Then for  ε ≤ gL(ε0), S′ satisfies the following properties:

image

Proof. Proof of (1). For simplicity, let  n = 2c + m − 1and assume that  x0, . . . , xnis the true order of  S′. Recall that  v(xj) = v(j)is the number of elements that measure as less than  xj and that

image

is the probability that  ximeasures less than  xj. We can view v(j) as a random variable given by

image

Thus v(j) is a Poisson binomial distribution with parameters  ηC(xi, xj). Let h⋆ be the bayes optimal classifier. By assumption, we have for all pairs that  |h⋆(xi) − h⋆(xj)| ≤ g−1U (ε/2), and that  ε ≤ gL(ε0). Combining these gives

image

Thus we are in position to apply the upper bound from the GTNC condition (Equation (3)), which gives for all pairs:

image

Since,  ηC(xi, xj) is either βC(xi, xj) and 1 − βC(xi, xj), we have

image

This allows us to upper and lower bound the distribution by the binomial distributions  Xu = Bin(n, 12 + ε/2),and  Xl = Bin(n, 12 − ε/2). In particular, for all valid  ηCand values j, we have

image

Where random variables  X, X′ satisfy X ≤ X′ if ∀i,

image

This means that concentration lower bounds on  Xland upper bounds on  Xu transfer to v(j). Now we apply Chernoff bounds to  Xl and Xu:

image

Union bounding over all values of j, the probability that there exists a coordinate outside these ranges is bounded by:

image

Thus our test satisfies the first condition.

Proof of (2). Assume the middle m = [i, j] points of S are not a  2g−1L (ε)-cluster. Since our set is or- dered, this implies  h⋆(j) − h⋆(i) > 2g−1L (ε), and further that the middle point must be at least  g−1L (ε)far from either i, or j. Since our argument will be symmetric, assume this to be i without loss of generality. Our strategy will be to bound the random variable

image

and use an averaging argument to show that there exists a value  1 ≤ x ≤ c s.t. v(x) < |S′|(1/2 − ε/4)We can decompose V (c) into

image

where the first term is always�c2�. Because each point left of i is at least  g−1L (ε) ≤ ε0far away from the right half of  |S′|, we can bound the second term as for any v,

image

A Chernoff bound gives

image

Then an averaging argument shows that

image

We are not quite done with our cluster detection algorithm, as our goal will be to test for clusters sublinear in the size of our main sample S. Lemma 3.5 is enough to show that if such a cluster exists some subset will measure as equitable, but we need to prove that any equitable subset of S contains a cluster. For large enough c, this is true with high probability.

Corollary 3.6. Let S be a sample of size n, and  ε ≤ gL(ε0)4. For all subsets  S′ ⊆ Sof size  |S′| = (2c + m)satisfying:

image

the following guarantees hold:

1. If S contains a  g−1U (ε/2)-cluster of size 2c + m, then at least one  S′is  ε-equitable with probability at least  1 − δ.

2. For all  ε-equitable S′, C, the middle m elements of  S′ with respect to the true order, is a  2g−1L (4ε)-clusterwith probability at least  1 − δ.

Proof. Both statements follow from applying Lemma 3.5 to subsets  S′ ⊂ S of size 2c + m.

Proof of (1). By assumption, S contains at least one subset  S′of size 2c + m which is a cluster. Applying statement (1) of Lemma 3.5 to  S′ gives that S′ is equitable with probability at least  1 − δ.

Proof of (2). We prove statement (2) by the contrapositive: with probability  1 − δ, all subsets  S′such that C is not a  2g−1L (4ε)-cluster are not equitable. This follows from statement (2) of Lemma 3.5 and union bounding over all� n|S′|�possible subsets.

We can now explain step 1 of our algorithm, cluster detection, in a bit more detail.

Step 1: Draw a sample  S ∼ DnX, and set c and mcorresponding to the desired cluster sizes for testing. For every subset  S′ ⊂ Sof size 2c + m, check whether  S′is  ε-equitable. By the contrapositive of Corollary 3.6 (1), if no such  S′is  ε-equitable, then no  g−1U (ε/2)-cluster exists in S. Similarly, by Corollary 3.6 (2) if  S′is ε-equitable, then it contains a  2g−1L (4ε)-cluster C of size m.

With step 1 out of the way, we will prove that steps 2a and 2b provide reliable learners with good coverage as long as the cluster assumption from step 1 holds. Since our focus has been on clusters, we will begin by showing how to build the learner for step 2a. Recall that to apply the symmetry argument of [4] for Massart noise, we had to slot a set of extra points. We will adhere to a similar strategy for step 2a in which we slot an extra set of points and find a cluster there rather than in S itself. To find this cluster, our first goal will be to prove that additionally drawn points measure as equitable with  S′ if and only if they are in the same cluster as C.

Lemma 3.7. Let S be a  ε-equitable set of size m + 2c satisfying the conditions of Corollary 3.6. Let C be the subset of S which is the  2g−1L (4ε)-cluster specified in Corollary 3.6, and let  S′be a set of independently drawn points. Further, choose  ε, mto satisfy:

image

The following guarantees hold  ∀x ∈ S′ with probability at least  1 − δ.

image

image

Then GTNC allows us to bound  ηC(x, y) for all y ∈ C:

image

To show that  v(x) ≤ |S|(1/2 + λ1), we assume the worst case – that all elements of S are smaller than x. Since  C ∪ {x}is a cluster, we can bound v(x) by the following Binomial:

image

The probability that  v(x) > |S|(1/2 + λ1)is then given by a Chernoff bound as

image

We can bound the probability that  v(x) < |S|(1/2 − λ1)by rehashing the same argument for  |S| − v(x), thenumber of elements x is less than. Thus the probability that  C ∪ {x} is not λ1-equitable is

image

Union bounding over  S′ completes the proof.

Proof of (2). Similar to the proof of statement (2) of Corollary 3.6, we prove the contrapositive: that all  C ∪ {x}which are not  g−1L (2λ1) + 2g−1L (4ε)-clusters are not  λ1-equitable with high probability. Assume C ∪ {x} is not a g−1L (2λ1) + 2g−1L (4ε)-cluster. Since  C is a 2g−1L (4ε)-cluster, ∀y ∈ C we have

image

Since we have assumed  g−1L (2λ1) < ε0, GTNC gives  ∀y ∈ C:

image

Since  C ∪ {x}is not a cluster, it must either be the case that  ∀y ∈ C, x > y, or  ∀y ∈ C, x < y. Assume the latter without loss of generality. We can bound v(x) by a Binomial:

image

Knowing that additionally drawn points which measure as equitable with S come from a cluster, we can feed them into an inference LP based on this assumption. However, to infer remaining points in the instance space, the LP must also know the label of the cluster we feed in. Since we are assuming our points have some margin  γ, we can solve for the label of the cluster with high probability by majority vote.

Lemma 3.8 (Cluster Labeling). Assume that a set  S, |S| ≥ 2 log(1/δ)gL(γ)2, consists entirely of one label and has margin  γwith respect to the decision boundary. The probability that this true label differs from the majority label measured by the oracle  QLis at most  δ.

Proof. This follows from applying a Chernoff bound to the fact that each point has at least a  1/2 + gL(γ)probability of being correct.

Finally, we need to show that the LP based upon the structure and label of clustered points has good coverage. We do this by an argument inspired by inference dimension: that given a  γ/d-cluster C of large enough size, there exists a point in  x ∈ Csuch that the knowledge that  C − {x}is a cluster is sufficient to infer the label of x. This will allow us to use the symmetry argument of [4] to show that step 2a has good coverage.

image

Figure 1: The above image illustrates the construction of sets  Cmin and Cmaxwhich sandwich the cluster C.

Lemma 3.9. Let X be a set, and  Hd,γthe set of hyperplanes with margin  γwith respect to X. Consider a query set Q containing a cluster query along with the standard label queries. Given a subset S, a cluster-query returns 1 if  S is a γ/d-cluster, and 0 otherwise. Then for any  γ/d-cluster C ⊆ Xof size at least 24d log(d+1):

image

Proof. A γ/d-cluster C = {x1, . . . , xn}infers a point y if there is a solution to the following system of linear equations:

image

Informally, because  C is a γ/d-cluster and all points have margin  γ, it infers the labels not just of points in its convex hull, but in a d times expansion of the hull. We will show that a large enough cluster C must contain some point y s.t. C \ {y} infers y. Our strategy relies on the fact that if C does not infer y, adding y to C expands the volume of its convex hull by a multiplicative factor. Since we can upper bound the volume of the convex hull of C by the volume of the largest simplex times the size of a decomposition of C into simplices (a triangulation), this multiplicative volume expansion contradicts the upper bound for large enough C.

In order to prove that adding a point multiplicatively expands the volume of the convex hull, we will need to prove the existence of a certain affine linear function. In particular, if C and y are such that this system of equations has no solution, then there exists an affine function L such that:

image

ear combination of the inequalities and real linear combination of the equalities that sum to the contradiction 1 ≤ 0by LP-duality. Since  ai, xi,and y do not appear in this contradiction, the linear combinations of Equation (8) and (9) must cancel. To see this explicitly, let the linear combination of 8 be denoted T, then the equality becomes:

image

Note that Equation (9) in a truly linear form is a set of  2dequations � aieifor  e ∈ {−1, 1}d. The positive real linear combination of these terms is then of the form

image

for some  b ∈ Rd. Since these two sums must cancel, we get that the  biare in fact  −T(xi). Summing the two equations then gives:

image

Now define  L = −T, which remains an affine linear function. This sign only affects the left-hand term, and thus we get:

image

Noting that  L(xmax) − L(xmin)is at most  2 maxi |L(xi)|proves the claim.

Using the function L we can show how C expands in volume when adding an un-inferred point:

image

Proof of (11): Our strategy will be to sandwich the convex hull of C in the difference of two cones defined by L with apex y. For arbitrary points  x ∈ C, let h(x, y) be the line passing through x and  y, N(xmin)be the plane given by  L(x) = L(xmin)and  N(xmax)be the plane given by  L(x) = L(xmax). The cone which does not contain C is then defined by its apex  y and base Cmax:

image

Likewise, we define the cone that contains both Cone(Cmax, y) and ConvHull(C) as the cone with apex y and base  Cmin:

image

We refer to these cones respectively as Cone(Cmax, y) and Cone(Cmin, y). Note that  Cmaxis similar to  Cminand that Equation (10) bounds the ratio in volume of these cones:

image

Further, since C is sandwiched between the two cones we have  ConvHull(C) ⊂ Cone(Cmin, y) − Cone(C2, y),and can bound the ratio in volume between the Convex Hull of C and the smaller cone:

image

Finally, because the Convex Hull of  C ∪ {y}contains both Cone(Cmax, y) and ConvHull(C), this allows us to

lower bound the expansion factor of including y into C:

image

Using Equation (11), we can build our contradiction on the volume of the convex hull for large enough C. For analysis, we denote the size of C by n. To start, we note a simple upper bound on the volume of the convex hull of any  n point set C ∈ Rd:

image

where  Vmaxis the volume of the largest simplex with vertices in C. This follows from choosing any vertex x ∈ Cand noting that choosing every simplex which contains x is a triangulation of ConvHull(C). While this triangulation is certainly not optimal, it is sufficient for our purposes.

Since there exists some h s.t. no point in C can be inferred from the rest, every point added to C after the largest simplex multiplies the volume by e2e2−1. This gives a lower bound on the volume of ConvHull(C) of:

image

Together, these bounds give the equation:

image

Setting n > 24d log(d + 1) gives a contradiction.

With Lemmas 3.7, 3.8, and 3.9 in hand, we can now give a more detailed explanation of step 2a:

Step 2a (high noise): It is assumed by step 1 that we have detected an  ε-equitable subset  S′. Draw an additional set of points  {x1, . . . , xm}, and for each point test whether  S′ ∪{xi} is λ1-equitable. By Lemma 3.7, the points which measure as equitable with  S′make up a cluster. Using Lemma 3.8 to label these points, build an LP based on the labels and cluster structure. Applying Lemma 3.9 and the symmetry argument of [4] shows that this LP has good coverage.

It is left to show that step 2b has good coverage. Step 2b will follow a similar strategy to the Massart case, using points well-separated in an MLE ordering to build our LP. However, since we are still in the regime of unbounded error, we will need to exploit the fact that our sample has no large clusters to show that this LP infers correctly with high probability. Notice that a sample with no clusters consists mostly of pairs of points whose comparisons are bounded in error. With this in mind, we modify the pointwise movement bounds of [19] to differentiate between pairs of points with bounded and unbounded comparison error.

Definition 3.10. Let S be a set with a noisy comparison oracle  QC. We call a comparison between points x, y ∈ S λ-far if the probability that  QCreturns the correct comparison is at least  1/2 + λ. Otherwise we call the comparison  λ-close.

To prove a point-wise movement bound, we will follow exactly the strategy of [19]. First, we prove that it is unlikely that an ordering which disagrees on many far comparisons from the true order is an MLE ordering. Second, we use this to upper bound the total number of wrong far comparisons in any MLE order with high probability. Finally, we prove that as long as no large cluster exists, a single point cannot move too far without contradicting the upper bound on total far errors.

Lemma 3.11. Let σbe a permutation which differs from the true order on  σc λ-close comparisons, and  σfλ-far comparisons. The probability that  σis an MLE order is

image

Proof. To be an MLE order,  σmust beat the true order on half or more of the comparisons on which they differ. We can bound this probability by the Poisson Binomial:

image

A Chernoff bound then gives the desired result.

Using this upper bound, we show that any order which disagrees with the true ordering on more than ˜Ω(n3/2)comparisons is not an MLE ordering with high probability.

Lemma 3.12 (Total Far Movement). The probability that an MLE order disagrees with the identity on c1n3/2 λ-far comparisons, where

image

is  ≤ δ

Proof. For a given permutation  σ, assume σf > c1n3/2. By Lemma 3.11, the probability that  σf is an MLEorder is at most:

image

Finally, we show a bound on point-wise movement by proving that any point which moves more than ˜Ω(n3/4)from its true position creates  ˜Ω(n3/2)total far errors.

Lemma 3.13 (Point-wise Far Movement). Given a sample S of size n and  λ ≤ gL(ε0), assume that the sample does not have a  g−1L (λ)-cluster of size  m. Let l = (2c1)1/2n3/4. Then with probability at least  1 − 2δ,no point moves by further than  c2m2in an MLE order, where

image

Proof. Assume without loss of generality that the true order on S is the identity 1, . . . , n. Denote by  Aijthe event that  i maps to σ(i) = jin an MLE order,  |i − j| > c2m2, and at most l elements from outside the range  [i − l − m, j + l + m] map into [i, j]. Note that if more than l of such elements map into [i, j] then the order must differ on at least l22 λ-far comparisons from the identity. This follows from the fact that each such element must shift m + l places towards [i, j], but has at maximum  m λ-close comparisons in that direction, and that each comparison is counted at most twice.

For i to be in slot j in an MLE order, it must beat the identity on more than half of elements in between. Since we have assumed all but l of the elements between i and j in the order are from  [i − l − m, j + l + m],then this range must contain at least  c2m2/2 − l − 1incorrect comparisons with i. This further implies that at least  c2m2/2 − 2l − m − 1comparisons with i must be incorrect in the range [i, j + l + m]. By our assumption on cluster size, all but m of these comparisons are  λ-far, so we can bound the probability of  Aijby the Poisson Binomial:

image

Combining our assumptions on  m2with a Chernoff bound then gives:

image

Union bounding over pairs i, j then gives that if any point moves by more than  c2m2in an MLE ordering, the total number of wrong  λ-far comparisons are more than  c1n3/2 with probability  1 − δ. By Lemma 3.12, the probability that this occurs is  ≤ δ, giving the desired result.

With a point-wise movement bound in hand, step 2b essentially follows the same strategy as Lemma 2.7 with a different set of parameters.

Step 2b (low noise): Draw an additional sample of m points, and use the labels and comparisons of all pairs of points separated by ˜Ω(n3/4) in Sto build an inference LP. This LP correctly infers points with high probability by Lemma 3.13, and has large coverage due to the space’s finite inference dimension.

All that remains is step 3, which repeats steps 1 and 2 until reaching the desired coverage. Tying all of these together, we present the proof of Theorem 3.2: learning margin  γ, finite inference dimension non-homogeneous linear separators with GTNC noise.

Proof. (Proof of Theorem 3.2)

Let S be the subsample described in step 1 of size n, and c and m the parameters defining the size of subsets we check for  εT-equitability. Further, in the case that some subset tests as equitable, let  S′ be the additionally drawn points. To begin, we set  εTsuch that if we measure an equitable subset  Seq ⊂ S, points  x ∈ S′s.t. Seq ∪ {x} is 2gU(2g−1L (4εT ))-equitable make up a  γ/d-cluster (see Lemma 3.7):

image

Note that this also satisfies the requirement on  εTfrom Lemma 3.7. To satisfy Lemmas 3.6, 3.7, and 3.8, we set  c, m, and |S′| to:

image

Note that  c1is a simplified (and somewhat larger) version of the parameter from Lemma 3.12 where  λhas been set to  (gL(g−1U (εT /2)). We must further set parameters  c2 and m2to satisfy Lemma 3.13:

image

Finally, we must select the sample size n itself. To employ the same slotting strategy as Theorem 2.3, we need  Ω(k)blocks of size  c2m2. This gives the requirement on n:

image

To satisfy this condition, it is enough let n be:

image

where the additional factor in d ensures that  m and m2satisfy the constraints of Lemmas 3.7 and 3.13.

We will now structure our analysis as in the 3 step informal explanation.

Step 1: Draw the sample  S ∼ DnX, where in later iterations D is restricted to un-inferred points by rejection sampling. Check  S for εT-equitable subsets of size 2c + m.

Step 2a (high noise): Assume that at least one subset,  Seq, is equitable with true cluster C. Draw an additional set  S′ and test for each  x ∈ S′ whether Seq ∪ x is gL(γ′)2-equitable. With probability  1 − O(δr), wecan identify by Lemma 3.7 and correctly label by Lemma 3.8 at least  96d log(d + 1) + 2 log(1/δr)gL(γ)2 points of S′

which are in a  γ/dcluster. We build our learner based off of this cluster. Recall that the expected coverage of the learner is given by the probability that an additional point is inferred. To compute this, we first note that the probability an additional point lands inside the cluster is at least  Ω(m/n). Assuming this occurs, Lemma 3.9 and the symmetry argument of [4] give the point a 3/4’s probability of being inferred. Together with our high probability assumptions, this gives an expected coverage of  Ω(m/n)for small enough  δr. Thus,the probability that the coverage of our weak learner is  Ω(m/n)is at least  Ω(m/n)by the Markov inequality.

Step 2b (low noise): Assume instead that no subset was  εT-equitable. By statement 1 of Corollary 3.6, this implies that no  g−1U (εT /2)-cluster of size 2c + m exists in S. Sort S into an MLE order. By Lemma 3.13, no point in S has moved by further than  c2m2from its true position with probability at least  1 − δr. S is ofthe appropriate size to apply the argument from Lemma 2.6, so slotting O(k) extra points gives constant coverage with constant probability.

Step 3: Steps 1 and 2 build a weak learner which we must string together to get coverage  1 − ε mirroringTheorem 2.3. Our worst case per-step coverage is  Ω(m/n)with probability  Ω(m/n). After repeating the learner t times, the coverage becomes:

image

Denoting the reliability and usefullness parameters again as  δwrand  δwu, setting  t = ˜O�n log(1/δwu )m �is then sufficient to give this coverage with probability at least  1 − δwu .

Restricting to the distribution of un-inferred points via rejection sampling, repeating the above  O�n log(1/ε)m �

times will have coverage  1−εwith probability  1−O�n log(1/ε)m δwu�, and correctness  1−O�n2 log(1/ε) log(1/δwu )m2 δwr�.Thus setting  δwr and δwu of our weak learner to:

image

gives the desired coverage and error by union bounding over the number of applications.

Query Complexity: Now we compute the Query complexity of our algorithm. Because we check equitability for every subset, at each iteration our algorithm must make  O(n2)comparisons. This is dominated by the slotting complexity, which we upper bound as ˜O(dn2)for simplicity. The worst-case number of iterations for our algorithm is  α log(α/δu),giving a total query complexity of:

image

For sample complexity, we follow the same argument of Theorem 2.3, ending our algorithm if we reject too many samples in a row. Letting  N = O�d log(d)nα log�αδu��, the sample complexity is then:

image

Our time complexity, however, diverges from the Massart case due to our need to test all subsets for equitability. In particular, we check all� n2c+m�subsets, which is exponential in inference dimension and noise parameters, and quasi-polynomial in the error parameter  δr. Further, with unbounded error we cannot employ the sorting algorithm from [19], making sorting an exponentially expensive step as well.

As a direct corollary, we show that this gives us a query efficient3 algorithm for the special case of TNC.

Corollary 3.14. Let the hypothesis class  (X, Hd,γ)have inference dimension k. Then  (X, Hd,γ)is ARPUlearnable under model (TNC(m, M, κ, ε0),CX) with query complexity:

image

As an example of an explicit concept class, consider the query complexity of half-spaces with fixed minimal-ratio (the ratio between the closest and furthest points from the decision boundary), a case studied in [4].

Example 3.15. Let  X ⊆ Rdbe an instance space, and  Hd,γ,ηthe class of hyperplanes with margin  γand minimal ratio  ηwith respect to  X. Then (X, Hd,γ,η)is ARPU-learnable under model  (TNC(m, M, κ, ε0), CX)with query complexity:

image

3.2 GTNC with Weak Distributional Conditions

Our algorithm for learning with GTNC noise introduced an additional restrictive condition on the set system: margin  γ. We will show that this assumption and the assumption of finite inference dimension may be replaced with weak concentration and anti-concentration conditions on the distribution. In this case, however, it is difficult to show a gap between label only and comparison ARPU-learning for two reasons. The first is that learning in this regime in simply harder–it is the first case we show where comparisons do not provide an exponential improvement in the active PAC setting over its passive counterpart. The second is that in the membership query setting, label queries in the TNC model can give comparison like information, making it difficult to apply our lower bounding techniques. We will begin by proving this first statement by showing a lower bound polynomial in  ε−1 for active PAC learning with labels and comparisons.

Lemma 3.16. Let  s = min�1, g−1L (1/8)�, and  c1 = maxa∈[2ε,s](8gL(4ε), 2(gL(a) − gL(a − 2ε))). The query complexity of actively PAC-learning  (R2, H2)under model  (GTNC(gL, gU, ε0), SC2)is at least

image

Proof. The adversary begins by choosing the distribution over  R2 to be uniform over the square  S = [0, s]2.We will use (a, b) to denote points in  R2. Consider two parallel hyperplanes  h, hεdefined as:

image

We denote the region between the two hyperplanes by  ∆ := {(a, b) ∈ S : 0 ≤ a ≤ 2ε}, and twice the region as

image

By Yao’s minimax principle it is enough to show that the adversary may pick a distribution over hyperplanes such that no learner can learn the labels with  < εerror with probability  ≥ 7/8. In particular, the adversary considers a uniform distribution over hyperplanes h and  hε. Note that any algorithm which correctly labels more than half of the points between h and  hε(i.e. at least  εmass of S) can be seen as identifying the hyperplane h or  hε. We now show how to lower bound the number of label or comparison queries needed to identify the target hyperplane  h or hε.

Given a set of n query responses  Q1, . . . , Qnfrom the learner, we argue that the learner cannot succeed with probability greater than:

image

since it can do no better than simply picking the more likely hyperplane given the set of queries. Taking the maximum over all possible sets of query responses then gives a lower bound on the number of samples. In other words, to show that the learner must make at least n queries, it suffices to show that this maximum is less than 7/8:

image

Using Bayes theorem, we can rewrite these probabilities as:

image

Note in this case that query response  Qi,which rolls together both the point or pair of points being queried and the value which the oracle returns, is dependent on  Qi−1, . . . , Q1due to being in an active setting–the chosen point or pair is dependent on the previous responses  Qi−1, . . . , Q1. We can now rewrite Equation (13) as:

image

To analyze this, note that each term in the product is simply the ratio of probabilities that a label query on some point x or comparison on pair of points  x, y ∈ S(where x, y are determined by  Qi−1, . . . , Q1) will return  Qi. Then we can bound this product from above and below by looking at the maximum and minimum such ratio across all points and pairs in S. Recall that these probabilities are chosen by the adversary from a range defined by the GTNC parameters. For simplicity, when the ranges on a query for  h and hεoverlap, we let the adversary choose the same probability, but otherwise always choose the lower bound  gL.

image

or comparison for h, as this will always have the larger ratio. For a point  (a, b) ∈ S, the ratio for the correct label  (Qi = +) for his given by:

image

For comparisons, we only have to consider pairs  (a1, b1), (a2, b2) ∈ 2∆, since the adversary will otherwise pick a ratio of 1. In this case, the maximum is given by the correct comparison with ratio:

image

Thus we can bound the product of the ratios from above by:

image

To bound the ratio from below, we look at the probability for the incorrect label or comparison. For labels, this is:

image

Likewise, the minimum ratio for comparisons is:

image

Thus we can also bound the product of the ratios from below as:

image

Recalling that  c1 < 1/2due to the initial values of  s and ε, setting n to:

image

satisfies this and in turn Equation (13), completing the proof.

Note that for notational simplicity the adversary has chosen a non-isotropic distribution, but the bound is easily modified to hold for a distribution in  ISC2. Specifying to the Tsybakov Low Noise condition gives the following lower bound.

Corollary 3.17 (Restatement of Lemma 1.12). The query complexity of actively PAC-learning  (R2, H2)under model  (TNC(m, M, κ, ε0), SC2)is at least

image

Proof. Observe that for  f(x) = mxκ−1, |∇f(x)| ≤ m(κ − 1) for all x ∈ [0, s]. By the mean value theorem,

image

Specifying to the TNC model from GTNC, we have  gL(x) = f(x), and thus that  gL(x) − gL(x − 2ε) =f(x) − f(x − 2ε) ≤ 2m(κ − 1)εfor  x ∈ [2ε, s], and  8gL(4ε) = Θ(εκ−1). Plugging this into Lemma 3.16 then gives the desired bound.

Note that this bound is tight with respect to  εfor  κ > 2, and not far off for  1 < κ < 2, as Hanneke and Yang [12] provide a label only active PAC-learning algorithm with ˜Od( 1ε)queries and  ˜Od(� 1ε�2−2/κ)queries respectively. However, while comparison queries alone may not enough to exponentially improve the query complexity over passive PAC-learning (which is also polynomial in  ε−1 [8]), we will show that they are sufficient for ARPU-learning.

Theorem 3.18 (Restatement of Theorem 1.10). The hypothesis class  (Rd, Hd)is ARPU-learnable under model  (GTNC(gL, gU, ε0), ACCd,c1,c2)with query complexity:

image

for small enough  δr, where

image

The margin condition is necessary for Lemmas 3.8 and 3.9–we cannot reliably label points or infer from clusters lying close to the decision boundary. If we were only interested in keeping our guarantee on coverage, it would be enough to set a fake margin  γsuch that anti-concentration gives that the set of points with such a margin has  O(ε)probability mass. However, we also require that our algorithm is reliable, and thus with high probability cannot err on points close to the decision boundary. This suggests the following strategy: if a cluster is found in step 1, before using it for inference, test whether it is too close to the decision boundary. Because the error on our labels is proportional to their distance from the decision boundary, we can build a test similar to Lemma 3.5 to detect this by measuring the relative sizes of the subsets with different labels.

Lemma 3.19 (Margin Detection). Let C be a  γ/d-cluster with respect to the hyperplane f of size at least

image

Further, let  LDif(C)denote the difference in size between the sets  {x ∈ C : QL(x) = 1} and {x ∈ C : QL(x) =0}. With probability at least  1 − δ:

image

Proof. Assume without loss of generality that the true label of the majority of points in C is 1.

Proof of (1): If there exists a point  x ∈ Cwith  f(x) < γ, then the entire entire cluster lies within margin  γ + γ/d < 2γ. By assumption  2γ < ε0, so the probability that a point measures as 1 is at most 1/2 + gU(2γ)using Equation (1). The probability that more than  (1/2 + 2gU(2γ))|C|points label as 1 is then given by a Chernoff bound:

image

Since we have assumed the majority label is 1, the probability that more than  (1/2 + 2gU(2γ))|C|label as 0 is upper bounded by this as well.

Proof of (2): Assume  ∀x ∈ Cwe have  f(x) < g−1L (4gU(2γ)). Since  g−1L (4gU(2γ)) < ε0by assumption, the probability that any point measures as 1 is at least  1/2 + 4gU(2γ)using Equation (1). The probability that less than  (1/2 + 2gU(2γ))|C|points label as 1 is then given by a Chernoff bound:

image

The idea is now to follow the structure of Theorems 3.2 and 2.8 with the one exception that we will check the closeness of every cluster to the decision boundary by checking whether  |LDif(C)| ≥ (1/2 + 2gU(2γ)). Ifa cluster measures as too close, we will avoid labeling the points, preserving the reliability of the algorithm.

Proof. (Proof of Theorem 3.18)

To ensure that our coverage is wide enough, we will need to set the margin parameter such that for any hyperplane, the probability mass of points within margin  2g−1L (4gU(2γ))is at most  ε/2. By our anti-concentration bound, it is enough to let  γ be:

image

Our goal is to learn the rest of the space up to  ε/2error via Theorem 3.2, assuming for the moment that the modification from Lemma 3.19 will cause at most an overall loss of  ε/2coverage. Noting that our space has good average inference dimension, i.e.  ACCd,c1,c2 ⊂ A(X,H),1 [6], we will achieve this by applying Lemma 2.9. Thus we need to prove that Thereom 3.2 can be used to learn a  (1 − ε/6)fraction of random samples S with probability at least  (1 − ε/6)while querying only  (1 − ε/6) points.

To begin, we must set  εT to detect γ/d-clusters:

image

and set the size of S such that the algorithm in Theorem 3.2 only queries an  ε/6fraction of points. Letting N be the total number of points queried as given in Theorem 3.2, it is then sufficient for |S| to satisfy:

image

Note that due to the distributional conditions, the inference dimension k of our sample is O(d log(d) log(|S|)) with probability at least  1 − ε/12[6]. Applying the same argument from Theorem 2.8 then gives that the learner of Theorem 3.2 satisfies the conditions of Lemma 2.9. Thus to have coverage  1 − ε/2with probability  1 − δuand reliability  1 − δr, it is sufficient to set our  δrto  O� δrlog(1/δu)�and run the algorithm

from Theorem 3.2  O(log(1/δu)) times.

We have ignored, up until now, the modification to Theorem 3.2 in the cluster step. If a subset measures as equitable, after slotting our extra points to obtain the  γ/d-cluster C, we use Lemma 3.19 to test the margin of C. If the cluster has margin at least  g−1L (4gU(2γ)), the test passes with high probability. Likewise, if the cluster has margin less than  γ, the test fails with high probability. If the test fails, we skip the iteration of the weak learner.

How does this modification affect our reliability and coverage? A point can only be mislabeled if the test passes on a cluster with margin less than  γ. Over all iterations of the learner, the probability of this occurring is less than  1 − δr, so our reliability guarantee is maintained up to a constant. To analyze coverage, note that Lemma 3.9 only infers points within the d-convex hull of the cluster C. Thus, if C is  γ/d-cluster which does not have margin  g−1L (4gU(2γ)), it infers points within at most a  2g−1L (4gU(2γ))margin. Since we set  γsuch that this region has at most  ε/2probability mass, we lose at most  ε/2coverage for skipping clusters with margin less than  g−1L (4gU(2γ)). We are left then with the loss in coverage caused by our test failing on a cluster with margin at least  g−1L (4gU(2γ)). Since this only occurs with probability  1 − δrby Lemma 3.19, for small enough  δrthis only changes the constant on the coverage probability of our weak learner, and thus has no asymptotic affect.

The total query complexity is then given by the complexity for running Theorem 3.2  O(log(1/δu))times with the appropriate parameters:

image

Since s-concave distributions satisfy the requisite distributional properties [3],

Corollary 3.20. The hypothesis class  (Rd, Hd)is ARPU-learnable under model  (TNC(m, M, κ, ε0), ISCd)with query complexity:

image

When compared with the query complexity of the label only PAC-learning algorithm of [12], Corollary 3.20 only shows improvement for a small range of parameters  1 < κ < 1514. However, it is not clear to the authors that Hanneke and Yang’s algorithm can be extended to an ARPU learner without substantially increasing the query complexity with respect to dimension.

[1] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pages 337–344, 2005.

[2] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.

[3] Maria-Florina F Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.

[4] Daniel M Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355–366. IEEE, 2017.

[5] Daniel Kane, Shachar Lovett, and Shay Moran. Generalized comparison trees for point-location problems. In International Colloquium on Automata, Languages and Programming, 2018.

[6] Max Hopkins, Daniel M Kane, and Shachar Lovett. The power of comparisons for actively learning linear classifiers. arXiv preprint arXiv:1907.03816, 2019.

[7] Rui M Castro and Robert D Nowak. Upper and lower error bounds for active learning.

[8] Pascal Massart, Élodie Nédélec, et al. Risk bounds for statistical learning. The Annals of Statistics, 34 (5):2326–2366, 2006.

[9] Enno Mammen, Alexandre B Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[10] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.

[11] Steve Hanneke et al. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

[12] Steve Hanneke and Liu Yang. Minimax analysis of active learning. The Journal of Machine Learning Research, 16(1):3487–3602, 2015.

[13] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds under tsybakov noise condition. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[14] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167–190, 2015.

[15] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431–2440, 2017.

[16] Ronald L Rivest and Robert H Sloan. Learning complicated concepts reliably and usefully. In AAAI, pages 635–640, 1988.

[17] Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.

[18] Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.

[19] Mark Braverman and Elchanan Mossel. Sorting from noisy information. arXiv preprint arXiv:0910.1191, 2009.

[20] Leslie G Valiant. A theory of the learnable. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 436–445. ACM, 1984.

[21] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.

[22] Benjamin Satzger, Markus Endres, and Werner Kiessling. A preference-based recommender system. In Proceedings of the 7th International Conference on E-Commerce and Web Technologies, EC-Web’06, pages 31–40, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-37743-3, 978-3-540-37743-6.

[23] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[24] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under tsybakov noise condition. In International Conference on Machine Learning, pages 365–373, 2013.

[25] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[26] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pages 152–192, 2016.

[27] Tomáš Gavenčiak, Barbara Geissmann, and Johannes Lengler. Sorting by swaps with noisy comparisons. Algorithmica, 81(2):796–827, 2019.

[28] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimal sorting with persistent comparison errors. arXiv preprint arXiv:1804.07575, 2018.

[29] Rolf Klein, Rainer Penninger, Christian Sohler, and David P Woodruff. Tolerant algorithms. In European Symposium on Algorithms, pages 736–747. Springer, 2011.


Designed for Accessibility and to further Open Science