Efficient, Effective and Well Justified Estimation of Active Nodes within a Cluster

2020·arXiv

Abstract

Abstract

Reliable and efficient estimation of the size of a dynamically changing cluster in an IoT network is critical in its nominal operation. Most previous estimation schemes worked with relatively smaller frame size and large number of rounds. Here we propose a new estimator named “Gaussian Estimator of Active Nodes,”(GEAN), that works with large enough frame size under which testing statistics is well approximated as a Gaussian variable, thereby requiring less number of frames, and thus less total number of channel slots to attain a desired accuracy in estimation. More specifically, the selection of the frame size is done according to Triangular Array Central Limit Theorem which also enables us to quantify the approximation error. Larger frame size helps the statistical average to converge faster to the ensemble mean of the estimator and the quantification of the approximation error helps to determine the number of rounds to keep up with the accuracy requirements. We present the analysis of our scheme under two different channel models i.e. {0, 1} and {0, 1, e}, whereas all previous schemes worked only under {0, 1} channel model. The overall performance of GEAN is better than the previously proposed schemes considering the number of slots required for estimation to achieve a given level of estimation accuracy.

I. INTRODUCTION

In a world pervaded with Internet of Things (IoT), small distance communications, now a days, carry more value than ever. A very common scenario is a set of agents communicating with a central access point (AP) and exchanging infomation. Typical examples would be some self powered chips attached to parked vehicles in a lot sending and receiving infomation from a common source, locally placed sensors accumulating information and sending to a base station or tags attached to the inventories in a supermarket sending and receiving signal to and from a reader helping the inventory control [1] [2] [3] [4] [5] [6] [7]. In all the above scenarios, knowing the number of active agents (AG) at a given time i.e. the nodes that are communicating with the AP, is of paramount importance. And that gives rise to a rich vein of reserch area that caters to solving the problem of estimating the number of active agents in the close vicinity of a common AP. In a very broad sense, such estimation calls for techniques that offer estimation reliability while ensuring minimum possible consumption of resources to do so. For a more concrete understanding of such estimation technicques and the respective reliability, the problem of estimating the number of active nodes communating to a central server can be formally stated as follows: for a given reliability requirement , a confidence interval a central node will have to estimate an unknown population of active agents t in its vicinity. The estimation has to maintain the minimum accuracy condition where is the estimated value of the actual number of active active agents t.

One of the early works in this trail was Unified Probabilistic Estimator (UPE) proposed in [8]. UPE estimation was based on the number of empty slots in a frame or the number of collision slots in the frame, where empty slots are the slots that have not been replied to by any of the AGs, singleton slots are the ones that have been replied to by exactly one of the AGs and collision slots are the slots that have been replied to by more than one AGs. UPE has larger variance which only meant more number of rounds required. An improved framed slotted Aloha protocol-based estimation called Enhanced Zero Based (EZB) estimator was proposed in [9]. EZB makes its estimation based on the total number of empty slots in a frame. The difference between EZB and UPE is, UPE makes an estimation of the population size in each frame and at the end averages out all the estimation results, whereas EZB finds the average of the number of 0s in each frame and finally makes the estimation based on this average value. First Non Empty Based (FNEB) estimator proposed in [10] is based on the average number of slots before the first 1 appears in a frame . Maximum Likelihood Estimator (MLE) proposed in [11] came with the motive to minimize power consumption by the AGs. The multireader estimation proposed in [12] assumes that any AG covered by several APs replies to only one of them. Collision Set Estimator (CSE), proposed in [13] uses maximum likelihood estimation to estimate the population size. CSE does not take accuracy requirements into account, hence cannot achieve required level of reliability. An algorithm to estimate the cardinality or RFID tags under multiple readers with overlaping regions was proposed in [14]. ART [15] estimates the population size based on the average run size of 1s in a frame where 1 stands for a nonempty slot. For each frame ART calculates the average run length of 1s. After n such rounds ART uses the average of these averages to estimate the population size by an invertible function. A tag identification technique was proposed by the same authors in [16]. In the two subsequent paragraphs we describe the major differences between our approach i.e. Gaussian Estimator of Active Nodes (GEAN) and the other existing schemes.

Firstly, all the other schemes worked with smaller frame sizes and relied on larger number of rounds to meet the accuracy requirements. Unlike them, we work with bigger frame size and smaller number of frames. With small frames sizes for any statistics to be approximated as Gaussian, the other schemes have to run large number of rounds. In contrast, GEAN selects a large enough frame size according to Triangular Array Central Limit Theorem [17] to ensure that the estimator is Gaussian distributed within a frame. So, our scheme does not have to rely on the number of rounds for its estimator to be approximated as Gaussian. The only reason that we need to play more than one round is to counter the impact of variances of the estimator. Larger frame size helps the convergence speed of the statistical average of the estimator to its ensemble mean. Another telling advantage a larger frame size gives is that, it decreases the probability of a slot to have more than one reply and increases the probabiliy with which you can trust a slot to have either no reply or exactly one reply, which leads to better estimation accuracy.

Secondly, unlike the other works who naively assumed their estimators to be Gaussian distributed, we quantified the approximation error of our estimator within a frame to Gaussian distribution with all the necessary details. Any estimator approximated as Gaussian incurs some approximation error. It’s important to quantify the approximation error to get the measure of the error you are making and how badly does that error impact the overall estimation accuracy. In this paper we have dedicated a subsection to talk about approximation error, its quantification and its impact in the overall estimation accuracy.

Precise selection of frame size along with accurate approximation of the estimator gives us advantages both in terms of saving the resources and achieving the desired estimation accuracy. Operationally speaking, such analysis give us advantages over other schemes both in terms of reliability and sparing use of resources. Some of the previously proposed schemes offers reliable estimation but uses up too many frame slots for estimation, where as some others are economical in terms of their use of the number of frame slots but fail to deliver on the promised reliability. We have presented relative comparison of the performance of GEAN with other works both in terms of reliability and respective usage of frame slots. And the results clearly shows that GEAN performs better than the other well known schemes if the overall package of reliability and usage of resources considered together.

Unlike other schemes we have presented our scheme under two different channel models i.e. {0, 1} and {0, 1, e}. Under {0, 1} channel model each empty slot is represented by a 0 and each non-empty slot is represented by a 1, whereas under {0, 1, e} channel model each empty slot is represented by a 0, each singleton slot is represented by a 1 and each collision slot is represented by an e. The motivation for {0, 1, e} is that, it achieves the desired accuracy in lesser number of time slots than {0, 1} channel models. This is because under {0, 1, e} each 1 is definite whereas under {0, 1} a 1 can actually be exactly 1 or a number greater than 1. This added assurance helps {0, 1, e} model to achieve the desired accuracy using lesser number of slots, or equivalently gives better accuracy at the expense of the same number of slots . It is important to note that, {0, 1, e} channel model has the added cost of distinguishig between a 1 and an e. For someone willing to pay more for a more accurate and assured estimation, {0, 1, e} is an automatic choice over {0, 1} channel model. Next we enumerate the prominent features of our appraoch.

1) The scheme uses large enough frame size selected according to Triangular Array Central Limit Theorem, that makes the estimator distribution within a frame a well justified Gaussian random variable.

2) We have analytically quantified the approximation error of our estimator to Gaussian and paid the required extra number of rounds to ensure that the approximation error does not impact the overall estimation accuracy.

3) Considering the overall package of the estimation accuracy and the number of slots required for estimation, GEAN performs better than previously proposed estimation schemes.

4) We have presented the analysis of GEAN under two different channel models i.e. {0, 1} and {0, 1, e} along with their respective performances.

It should be noted that a conference version of this work [18] has been published in the proceedings of IEEE International Conference on Communications (IEEE ICC 2018). The major difference between the two versions is that the conferecne version presents a brief analysis and limited results only on {0, 1} channel model. It does not cover {0, 1, e} channel model and the detailed derivations and discussions on the {0, 1} channel model that this journal version has. The next paragraph gives a brief outline about the organization of the rest of the paper.

Since the analysis of GEAN both under {0, 1} and {0, 1, e} channel models follow similar paths, sections II, III and IV combine together to set the rigorous analytical ground work of the paper under {0, 1} channel model, which we use as the framework when we extend our analysis to {0, 1, e} channel model in section V. Section VI talks about the algorithm the paper used to get an upper bound on the tag population size. Comparative performance of GEAN with some of the recently proposed schemes and conclusion are given in Sections VII and VIII respectively. At the end we have references and appendices containing all the proofs of the lemmas and theorems used in the paper.

II. FORMULATION OF THE PROBLEM

Before we jump into the mathematical formulation of the problem, it would serve well to introduce the protocol we used and give a brief outline of our approach. Expectedly a good number of works have proposed different techniques to solve the mentioned estimation problems. As a result there already have been many communication protocols in literature. Though the idea of our work is independent of any such protocol, to assess the worth of our method we have resorted to the protocol explained in the next paragraph which is one of the widely used protocol in literature [9] [10] [15].

We used the framed slotted Aloha protocol specified in C1G2 [standardized in EPCGlobal C1G2 RFID standard [19] and implemented in commercial RFID systems] as the MAClayer communication protocol. To begin the process AP broadcasts the frame size (f) and a random seed number (S) to all the AGs in its vicinity. Each of the AGs participate in the forthcoming frame with probability p, where p is the persistence probability, the probability that decides if an AG is going to remain active to participate in the forthcoming frame. Each individual AG has an ID and uses f and S values to evaluate a hash function h(f, S, ID). The value of the hash function is uniformly distributed in the range [1, f]. Each AG has a counter that has an initial value equal to the slot number that it has got evaluating the hash function. After each slot the AP sends out a termination signal to all the AGs and each AG decreases its counter value by 1. At any given point the AGs with counter value equal to 1, reply to the reader. The next paragraph gives a brief outline of our approach.

At the beginning of the estimation process GEAN runs a probe using the Flazolet Martin algorithm [20] to get a rough upper bound on the AG population size t. The critical parameters p, f and n are calculated based on the upper bound and the accuracy requirements , where n is the number of rounds (i.e. frames) required to meet the accuracy requirements . Using standardized framed slotted Aloha protocol, GEAN gets a reader sequence of either {0, 1} or {0, 1, e} depending on the channel model. GEAN uses

and as the estimators under {0, 1} and {0, 1, e} channel models respectively. Where is the number of non-empty slots, is the number of is the number of slots having exactly one reply and is the number of slots having more than one replies in the reader sequence. GEAN calculates the value of the respective estimator for each round and after n rounds of these measurements takes the average of all these values. This average is finally substituted for the true mean in the expected value equation of the respective estimator to estimate the AG population size by an inverse function. We have shown in this paper that the expected value function of GEAN under both channel models are invertible and analyzed the conditions under which and are asymptotically Gaussian distributed, while meeting the imposed estimation accuracy requirements. The required number of slots for estimation is defined as , where l = 1ms (i.e. time slots) is the C1G2 specified mandatory time delay between the end of a frame and the start of the next one [19], [21]. In the next two paragraphs, we will explain two important underlying assumptions of the protocol we used.

Firstly, in our above mentioned framework AP identifies the responses coming from the AGs and generate the sequence of {0, 1} or {0, 1, e}. Asking “to what extent can we trust that generated sequence?”would be a very legitimate question given the context. Because there can be numerous ways that can result in a false positive (a 0 getting detected as a 1) or a false negative (a 1 getting detected as a 0). For example, the range of the antenna of the detection device might be extended by metallic objects and result in a false positive, and similarly an erroneous communication link due to environmental or physical interferences may result in a false negative. There have been good number of works in literature addressing either type of miss detection [22] [23], and this part of the work is beyond the scope of our work. Essentially we are assuming the ideal scenario where each bit generated by the protocol can be trusted with probability 1, and such an assumption has been implicitly made by a predominant number of works in this trail [8] [9] [10] [15].

Secondly, in the original protocol each AG, given a frame size, selects one out of f slots to respond. As a result, we can think of each particular ID as producing a binary vector of dimension f, with one and only one of the f entries set as 1 and the rest set as 0. The final outcome due to the aggregated behaviors of all IDs is the binary sum of these f-dimensional binary vectors. Apparently, the analysis of this model is highly complicated. However, a much more analytically tractable model where instead of picking one out of f slots to reply an AG independently decides to reply in each slot of the frame regardless of its decision about previous or forthcoming slots, has been proven to produce the same statistical outcome as the original model [24]. Since then it has been a norm to use this statistically equivalent model [25] [10] [15]. This slot by slot independence implies that an AG may end up choosing none or more than one slots in a frame. But the expected value of the number of slots that an AG chooses to reply in a frame is still 1 [15]. Since our scheme works with multiple frames of bigger frame sizes than the other schemes, we can asymptotically expect to observe this expected number. Hence the analysis remains the same with or without the assumption of independence.

Having described our approach and its underlying assumptions, we are now well set to embark on the mathematical formulation of our work. Consistent with the above definitions and notations, equations (1) and (2) define GEAN estimator and its expected value equation respectively,

As we already mentioned in the preceeding paragraph that, to make the formal development of our work tractable we assumed independence of slots. Let Bernoullibe the random variable that represents the probability that the ith AG replies to the slot. So the value of is 1 with a probability and the value of is 0 with a probability , i.e,

Now we introduce to represent the random variable wheather jth slot of the frame is empty or non-empty. i.e. we define,

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 −0.6

Fig. 1. Expected value function of GEAN channel model . (f = 200, p = 1)

Fig. 2. Variance of GEAN estimator against model. (f = 200, p = 1)

It is straightforward to derive the following probability distribution of ,

where and are the probabilities that a slot empty and non-empty respectively.

Introduction of the following two indicators and makes our analysis easier,

So, (1) can be re-written as

It is straightforward to show that has the following PMF.

The first and second moments of can be given by,

Combination of (10) and (11) gives us the variance of ,

Let, and be the mean and variance of , respectively. Using (2), (7), (10) and (12) we have,

Now using the expressions from (5), equation (13) can be rewritten as,

We define,

which essentially is the average number of AGs per slot.

Lemma 1. given in (15) is a monotonically increasing function of r .

The proof to Lemma 1 is given in Appendix A.

III. GAUSSIAN APPROXIMATION OF THE ESTIMATOR

Our estimation of the AG population size has to maintain the accuracy requirement given by the condition,

Since we are using as our estimator to determine the value of , using (1) and (2), the condition in (17) can be written as,

Now to perform our estimation of the AG population size while maintaining the accuracy requirements given in (18), we need the following two things,

1) has to be an invertible function. 2) a well approximated PDF for . In the previous section we saw that is monotonic function and hence invertible. This section is particularly devoted to the analysis of the conditions under which can be well approximated by a Gaussian random variable within a frame. Since the probabilities given in (5) vary with the frame size, we require a special version of the CLT, that is Triangular Array CLT [17], to prove that follows Gaussian distribution. In other words, we resort to Lindeberg Feller Theorem [26]. The statement of Lindeberg Feller Theorem says, let be an array of independent random variables with and and , then distribution if the condition below holds for every ,

where is the indicator function of a subset A of the set X, and is defined as,

In our algorithm, it is easy to see that the set of random variables given in (9) are independent. From equations (10) and (13) we see that . Lindeberg Feller theorem requires the variable to have zero mean which is not the case with . To fulfill that requirement, we define a new set of variables such that,

Now using (9), the probability distribution of can be given by,

Using (10),(12) and (20) we have,

We define,

Now, are independent random variables with and . According to Lindeberg Feller Theorem, will be asymptotically if the condi- tion below holds for every ,

Substituting in (24), simple algebraic manipulations using (7), (12), (14) and (24) give us . So, we can sum up, if (25) holds.

In the condition given in (25) , the indicator function plays a pivotal role. For the variable we have the following 2 cases of the indicator function,

It is easy to see that if none of (26) and (27) holds, (25) not just converges to but actually becomes 0. We have shown in Appendix B that if the following condition holds, none of the (26) and (27) holds, or consequently (25) holds,

where k(r) is defined as

and the values for and are given by the following equations:

Figure 3 demonstrates and against different values of r. We now come down to a condition, if (28) holds, (25) strictly becomes 0. So, for given r and if we select a frame size large enough so that (28) holds, the distribution of the estimator can be approximated as .

A. Quality Considerations of Gaussian Approximation

The quality of the approximation depends on the value of the approximation error . To ensure that we satisfy the reliability requirements, we take this approximation error into account when we calculate the overall estimation error. Exactly speaking means, if the frame size is large enough to satisfy (28),

where, . Using the above equation we can write,

Now for the given reliability requirement , using (33) we have,

Which means that, if we approximate as a standard normal, to compensate for the approximation error our target reliability should be instead of . Using the fact that probability can not be greater than 1,

Equation (35) gives the maximum value of approximation error that GEAN can allow for a given reliability requirement .

B. Impact of the Approximation Error in Overall Estimation Accuracy

The given above quantifies the the approximation error of the estimator for a given set of parameters i.e. (f, t, r). For example for given if we operate on r = 0.84 and select frame size f = 200 the value of calculated from (28) becomes 0.08. For the same set up if we select f = 1000 then we get . In the first case we make more approximation error than in the second case. Equation(41) suggests that for f = 200 we have to run more rounds than for f = 1000. If we ignore the approximation error and select f = 200 for the above case, we are essentially aiming at a maximum achievable reliablility of 92% as calculated from (35), where as our required reliability is 95%. But for f = 1000 the maximum achievable reliablility is 97% which means the required 95% is achievable. So, its critical to exactly quantify the approximation error you make in terms of the estimation parameters of your algorithm to make sure you are aiming at a maximum reliability calculated from (35) which is greater than your required reliability. In the performance section of this paper we clearly presented the significant hand that approximation error plays in the overall estimation accuracy.

IV. SELECTION OF CRITICAL PARAMETERS

This section clarifies the steps to attain the optimum parameters for GEAN under {0, 1} channel model, described by Algorithm 1. Since all the parameters are functions of the AG population size t which is unknown, we have substituted for t in those equations.

A. Frame size f

For given r and we can have different pairs of p and f satisfying (16). Since , (16) should give us the maximum possible frame size when p = 1 for given r and . Using (16),

For a given r we will have a particular value of k(r) calculated from (29), to satisfy the Lindeberg Feller conditions. Hence,

we can get different pairs of f and satisfying (28). Plugging in maximum possible given by (35) in (28) should give us the minimum allowable frame size for given r. Using (28),

Because of the fact that we operate on a fixed value of r, any frame size in the range will have a corresponding , hence the pair will always satisfy Lindeberg Feller conditions.

Since corresponds to p = 1, any f in the range will have a corresponding for given r.

B. Number of rounds n

For a given frame size f, the required number of rounds required for the estimation of the tag population follows from the accuracy requirements specified in (18). For all the frame sizes in the permissible range is a monotonic function and with an approximation

Fig. 3. channel model

Fig. 4. channel model

error . So for any particular f in that range, using (18) and (33) we have,

P

Since , we have . Now taking the value of as the CDF of standard normal distribution we get the corresponding cut off points in the standard normal curve either side of the vertical axis. Let the symmetric cutoff points to the right and to the left be and respectively. Using (38) it is straightforward to find the value of to be , where is the error fucntion. Let and be the number of rounds required corresponding to and respectively. Because of the fact that the standard deviation gets scaled down times if we take n rounds of the measurements, solving the following two equations should give us the values for and ,

Since the two equations are not quite symmetric, the values of and might differ from each other. We will go by the higher value and take the ceiling if it is not an integer to ensure that we fulfill the minimum accuracy requirements. So, using (39) and (40) the required number of rounds for given frame size f, is given by

C. Selection of r

All other parameters were selected for a given value of r. Calculation of the maximum possible r that follows from the fact that we need to ensure that the following holds,

Combinig (37) and (42), we can calculate the biggest r for which (42) holds,

So, for given the range of the values of r that we operate on is . It is easy to see that, for any value of the range exists. Since r can be arbitrarily small in that range, the maximum frame size can be arbitrarily big and the condition is always satisfied. This implies we can estimate any arbitrary AG population size under GEAN {0, 1} channel model.

V. GEAN UNDER {0, 1, e} CHANNEL MODEL

This section of the paper introduces {0, 1, e} channel model and analyzes our scheme under this particular channel model. Though the major part of the analysis under {0, 1, e} will be presented with reference to that of {0, 1} channel model, in terms of idea and outcome {0, 1, e} is a substantial entity by itself.

Under this model a 0 represents an empty slot, a 1 represents a singleton slot and an e represents a collision slot. The Aloha protocol that we used for {0, 1} is the same protocol that we use for {0, 1, e} channel model to obtain the reader sequence. The estimator that we are using for {0, 1, e} model is where is the number of collision slots and is the number of singleton slots in a frame. The corresponding equations for (1) and (2) for this channel model can be given by, (44) and (45) respectively.

0 1 2 3 4 5 6 7 8 −0.4

Fig. 5. Expected value function of GEAN {0, 1, e} channel model . (f = 200, p = 1)

Fig. 6. Variance of GEAN estimator against model. (f = 200, p = 1)

It is straightforward to see that the random variable defined in (4), has the following probability distribution under {0, 1, e} channel model,

where and are the probabilities that a particular slot is an empty, a singleton and a collision slot respectively. Corresponding to the indicators defined in (6), we introduce the following two indicators for {0, 1, e} channel model,

Corresponding to (8), we define , and its easy to see that has the following probability distribution under {0, 1, e} channel model,

Derivation of the following mean and variance equations for under {0, 1, e} channel model is straightforward.

It is important to notice that {0, 1} model equation (7) holds for {0, 1, e} channel model as well. Using (7), (46), (49), and (50), we get the following mean and variance equations for under {0, 1, e} channel model.

We notice in Figure 5 that, there is a dip at the beginning and other than that the expectation curve is monotonically increasing. The monotonically increasing part guarantees us a distinct inverse that will help us estimate the actual number of AGs. But for the dip we instead have a singularity i.e. we will get more than one horizontal points for one vertical point. We either have to operate in the monotonically increasing region, or we have to find a special way out to select the actual value out of the multiple values suggested by the estimator in the singularity region. The following two lemmas shed more light on the matter. Let the corresponding values of t and r at the point where the dip of occurs be and , respectively.

Lemma 2. The local minimum of curve occurs at an AG population size or equivalently at , given frame size f and persistence probability p.

Lemma 3. is a convex function of t for , and for the rest of the t values the function is concave, given frame size f and persistence probability p.

The proofs of Lemma 2 and Lemma 3 are given in Appendix C and Appendix D respectively .

or equivalently such that (considering that the minimum possible AG population size is 1). That t corresponds to the point B in the curve and we represent the corresponding value of r by will go into the singularity if we violate the following condition,

We numerically got the value of . So, the summary from this section is that, if we have an r so that (53) holds, will be a monotonic function and hence invertible.

It is important to note that, we do not know the value of t rather we just have an upper-bound on t which is . The frame size we are selecting corresponds to not t. Since our frame size corresponds to we may end up selecting a bigger f than we should. As a result we may still end up operating in the singularity region of . The next lemma helps us to resolve this issue.

Lemma 4. In the case of a singularity, i.e. when we have to decide between two different corresponding to the same value of the estimator, the value to the right of the dip is always the value desired.

The proof to Lemma 4 is given in the Appendix E .

Now, corresponding to (21), the modified version of , has the following probability distribution under {0, 1, e} channel model,

Using (20), (49) and (50) we have,

It is obvious that under {0, 1, e} channel model, the indicator function in (25) has the following three cases,

Consequently, the corresponding equation for k given in (29), under {0, 1, e} channel model would be,

and the values for and are given by the following equations,

The proofs of the above equations are given in Appendix F. Based on Lindeberg Feller theorem applied in {0, 1} case, we can say, if we have a big enough frame size to satisfy (28) for given and a value of k calculated from (60), (25) holds and we have asymptotically under {0, 1, e} channel model. To ensure that we take the approximation error into account, like {0, 1} we have to maintain the target reliability instead of for {0, 1, e} channel model, and we still have that upper bound on the value of given by (35).

A. Critical parameters under {0, 1, e} channel model

We use the same equations (36) and (37) as under {0, 1} channel model, to calculate and for given and r, except for the fact that the value of k(r) we use in (37) is specific to {0, 1, e} channel model and is calculated from (60). The corresponding values of p and for each f in the range follow like they did under {0, 1} channel model.

Selection of the number of rounds n required to meet the accuracy requirement under {0, 1, e} is no different from that of {0, 1} channel model. For any f in the mentioned in Algorithm 2, we calculate n required for the given accuracy requirements from (41), as we did for the {0, 1} channel model.

The only difference is that the fucntions and we use in this case are the ones specific to {0, 1, e} channel model given by (51) and (52).

B. Selection of r

For {0, 1, e} channel model, the value of r cannot be less than because of the singularity considerations. Hence, under {0, 1, e} channel model the equation (43) for the maximum value of r gets modified to,

The value k(r) we use in (64) is the one specific to {0, 1, e} channel model.

If for a given the is smaller than , the range does not exist. That means the AG population size is not big enough to be estimated even at a value for given . Plugging in in (36) and (37) then solving the equations under the constraint , we have

Equation (65) implies that, for a given accuracy requirement we can estimate an AG population size only if the upper bound is greater than . For example, for i.e. 0.08, we can only estimate AG population sizes that have greater than 230 under {0, 1, e} channel model.

VI. UPPER BOUND ON THE AG POPULATION SIZE tm

Our algorithm requires an upper-bound on the AG population size which we obtain by using Flajolet and Martin’s probabilistic counting algorithm [20]. We do it before calculating the parameters p, f, n and r. Because, to calculate all these parameters we need . In Flajolet and Martin’s probabilistic counting algorithm, the AP keeps issuing one slot frames till it gets an empty slot. The persistence probability starts with a value 1 and keeps on decreasing following a geometric distribution in the ith frame). If the empty slot occurs in the jth frame then, is considered to be an upper bound on the existing AG poputation size t [20], [27]. Average of values obtained in large number of rounds assymptotically approaches t [20].

We want the upper bound to be as close to actual population size as possible. In [15] fair bit of analysis was done on how close is to t. Their result shows that for 99% reliability and 1% confidence interval is within , and for 90% reliability and 10% confidence interval , which is not particularly a very tight accuracy requirement, is within . This gives us a fair idea that for reasonable accuracy requirements is within . This is an assumption that we made in this paper that which is well supported by the findings in [15] and we take the average over 100 rounds to get that .

VII. PERFORMANCE EVALUATION

After the completion of the analytical analysis of GEAN we used MATLAB to get simulation results for GEAN. We compared the performance of GEAN with few other recent works. The comparison is divided into two chambers complementary to each other. In one chamber we have the acheieved reliability by different schemes and in the other one we have the cost the respective schemes have to pay in terms of the number of slots. To put things into perspective and ensure a fair comparison we presented GEAN both with and without the approximation error counted. In Figures 7 and 9 we see that for two different acuracy requirements GEAN and EZB achieve the required reliability and GEAN Without the Approximation Error Counted (GEAN-WAEC) falls a little short as expected. The remaining two schemes fall a good distance short in meeting the reliability requirements. Figures 8 and 10 show the required number of slots required by different schemes to achieve the reliabilities given in Figures 7 and 9 respectively. We see that if we ignore the approximation error like others, we require less number of slots for estimaiton than the other three schemes. Taking the approximation error into account GEAN requires more number of slots than ART, but delivers much better reliability than ART. On the other note, EZB and GEAN both deliver the promised reliability but GEAN outperforms EZB in terms of the number of slots required for estimation except for very small AG population sizes.

In Figures 8 and 10, since is random due to the randomness of , we presented the mean value line along with the standard deviation over 50 samples. The gradual descent of the GEAN curves, is due to the fact that for the smaller values of t we can not shoot for the bigger values of r due to restrictions incurred by the condition . As t gets bigger we can shoot for the larger values of r and Figure 2 suggests that the variance is smaller for the bigger values of r. That saves us in terms of the number of rounds. The fact that curve eventually gets saturated, happens because k(r) given by (29) has a very sharp increment for the bigger values of r. This trend cost us in terms of the frame size required to satisfy (28). So, even if t continues increasing we cannot operate beyond a certain value of r for a given accuracy requirements, hence is the saturation.

As a side note we present the performance of GEAN under {0, 1, e} channel model in Figure 11. Under {0, 1, e} each 1 is certain. Because of this added certainty, we generally need fewer number of slots to meet the same accuracy requirements under {0, 1, e} than under {0, 1} channel model. The only aberration noticable is because under {0, 1, e} channel model

Fig. 7. Actual reliability achieved by different tag estimation schemes against different tag population sizes

Fig. 8. Number of slots required by different tag estimation schemes against different tag population sizes .

for smaller values of t we can not operate on the smaller values of r because of the singularity issue. That in turn costs in terms of the number of slots required.

VIII. CONCLUSION

The key contribution of this paper is that it proposes a completely new and more effective technique for the estimation of an AG population size that communicate with the same AP. The technique is primarily based on the idea of a big enough frame size and precise approximation of the estimator to Gaussian distribution. These features provide us with advantages over other schemes both in terms of estimation accuracy and cost savings. The simulated results clearly demostrate the rigorous analytical foundation of our work under both {0, 1} and {0, 1, e} channel models.

200 400 600 800 1000 1200 1400 1600 1800 2000 60

Fig. 9. Actual reliability achieved by different tag estimation schemes against different tag population sizes

Fig. 10. Number of slots required by different tag estimation schemes against different tag population sizes .

Proof. Using equation (15), and the fact that for and can be appriximated as and that in turn can be reduced to applying Taylor series, we get,

Fig. 11. Number of slots required by GEAN under two different channel models against the tag population size t

In our algorithm we have and . So, the first and second derivatives of with respect to r will always be non-negative and negative respectively. Hence, is a monotonically increasing function of r.

APPENDIX B DERIVATION OF LINDEBERG FELLER CONDITIONS FOR GEAN UNDER {0, 1} CHANNEL MODEL First condition: From equation (26) we have the following,

Making the same approximation as we did in Appendix A, we can say (26) does not hold if the following is satisfied,

Second condition: From equation (27) we have the following,

Simple algebraic manipulations and aforementioned approximation gives us, (68) or equivalently (27) does not hold if the following is satisfied,

From the above two conditions we see that if we select the value of k such that , (26) and (27) will not hold or equivalently (25) will hold. That essentially means for the GEAN estimator under {0, 1} channel model is Gaussian distributed.

Proof. To find the local minimum we need to differentiate the expectation curve and set the derivative to 0. Solving that equation we will get the local minimum of the dip. Using (15),

Simple algebraic calculations give us,

Here, stands for the t value where the local minimum of the dip occurs (i.e. at point D in Figure:1). Now, since the value of we can approximate . This

will give us the following, . Which means the local minimum for the dip occurs at a value of which is supported by our simulation results. Substituting, in (16) gives us .

If we closely follow the equation we see that, the value of the common factoris negative . So, for the total value to be posive the part of the equation inside the third bracket will have to be negative. After algebraic manipulations and approximating we get

Or, equivalently, for is positive, indicating is a convex function of t and for the rest of the t values the curve is concave. Substituting in (16) gives us .

Proof. At point B in Figure 1, we know the corresponding frame size is , uning (16) and (53) we have,

Now from [15], we know that the upper bound on t is always less that 2t for and which are way less than the accuracy level we are dealing with. For the tighter accuracy requirements the upperbound is even lower. Since using (73) we get,

APPENDIX F DERIVATION OF LINDEBERG FELLER CONDITIONS FOR GEAN UNDER CHANNEL MODEL

First condition: From equation (57) we have the following,

Letting be represented by k, and inserting the expression for and we have,

We know, for and can be appriximated as and that in turn can be reduced to

applying Taylor series. Applying this we get,

Now we have a list of approximations to make. They are,

After all these approximations and using (16) we have,

(2 + + (2 + (1 + k) (1+ + 4+ (1 + 0

For the equation (57) to not hold, (82) must not hold. Simple algebraic manipulations give us that for (82) to not hold the value of k must be,

So, is the minimum value of k for which (57) does not hold. 2nd condition: From equation (58) we have the following,

letting be represented by k, and inserting the expression for and we have,

Like the first condition can be appriximated as and that in turn can be reduced to applying Taylor series. Applying this along with approximations made in (78), (79), (80), (81) and using (16) we have,

For the equation (58) to not hold, (85) must not hold. Simple algebraic manipulations give us that for (85) to not hold the value of k must be,

So, is the minimum value of k for which (58) does not hold. 3rd condition: From equation (59) we have the following,

> ϵ

⇒ − k

Like we did in the previous two conditions, can beappriximated as and that in turn can be reduced to applying Taylor series. Applying this along with approximations made in (78), (79), (80), (81) and using (16) we have,

For the equation (59) to not hold, (87) must not hold. Simple algebraic manipulations give us that for (87) to not hold the value of k must be,

So, is the minimum value of k for which (59) does not hold.

From the above three conditions we see that if we select the value of k such that all of (57), (58) and (59) will not hold or equivalently (25) will hold. That essentially means for the GEAN estimator under {0, 1, e} channel model is Gaussian distributed.

REFERENCES

[1] A. Nemmaluri, M. D. Corner, and P. Shenoy, “Sherlock: automatically locating objects for humans,” in Proceedings of the 6th international conference on Mobile systems, applications, and services. ACM, 2008, pp. 187–198.

[2] C.-H. Lee and C.-W. Chung, “Efficient storage scheme and query processing for supply chain management using RFID,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 291–302.

[3] F. Klaus, “Rfid handbook: Fundamentals and applications in contactless smart cards, radio frequency identification and near-field communication,” 2010.

[4] Y. Zhao, Z. Li, B. Hao, and J. Shi, “Sensor selection for tdoa- based localization in wireless sensor networks with non-line-of-sight condition,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9935–9950, 2019.

[5] Y. Zhang, R. Wang, M. S. Hossain, M. F. Alhamid, and M. Guizani, “Heterogeneous information network-based content caching in the internet of vehicles,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 10 216–10 226, 2019.

[6] S. Gurugopinath, P. C. Sofotasios, Y. Al-Hammadi, and S. Muhaidat, “Cache-aided non-orthogonal multiple access for 5g-enabled vehicular networks,” arXiv preprint arXiv:1906.06025, 2019.

[7] M. Zhang, P. H. J. Chong, and B.-C. Seet, “Performance analysis and boost for a mac protocol in vehicular networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 9, pp. 8721–8728, 2019.

[8] M. Kodialam and T. Nandagopal, “Fast and reliable estimation schemes in RFID systems,” in Proceedings of the 12th annual international conference on Mobile computing and networking. ACM, 2006, pp. 322–333.

[9] M. Kodialam, T. Nandagopal, and W. C. Lau, “Anonymous tracking using RFID tags,” in IEEE INFOCOM 2007-26th IEEE International Conference on Computer Communications. IEEE, 2007, pp. 1217– 1225.

[10] H. Han, B. Sheng, C. C. Tan, Q. Li, W. Mao, and S. Lu, “Counting RFID tags efficiently and anonymously,” in INFOCOM, 2010 Proceedings IEEE. IEEE, 2010, pp. 1–9.

[11] T. Li, S. Wu, S. Chen, and M. Yang, “Energy efficient algorithms for the RFID estimation problem,” in INFOCOM, 2010 Proceedings IEEE. IEEE, 2010, pp. 1–9.

[12] V. Shah-Mansouri and V. W. Wong, “Anonymous cardinality estimation in RFID systems with multiple readers,” in Global Telecommunications Conference, 2009. GLOBECOM 2009. IEEE. IEEE, 2009, pp. 1–6.

[13] A. Zanella, “Estimating collision set size in framed slotted aloha wireless networks and RFID systems,” IEEE Communications Letters, vol. 16, no. 3, pp. 300–303, 2012.

[14] V. Shah-Mansouri and V. W. Wong, “Cardinality estimation in RFID systems with multiple readers,” IEEE Transactions on Wireless Communications, vol. 10, no. 5, pp. 1458–1469, 2011.

[15] M. Shahzad and A. X. Liu, “Fast and accurate estimation of RFID tags,” IEEE/ACM Transactions on Networking, vol. 23, no. 1, pp. 241–254, 2015.

[16] ——, “Probabilistic optimal tree hopping for rfid identification,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 293– 304, 2013.

[17] A. Araujo and E. Gin´e, The central limit theorem for real and Banach valued random variables. Wiley New York, 1980, vol. 431.

[18] M. M. Hasan, S. Wei, and R. Vaidyanathan, “Estimation of RFID tag population size by Gaussian estimator,” in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–6.

[19] E. EPCglobal, “Radio-frequency identity protocols class-1 generation- 2 uhf RFID protocol for communications at 860 mhz–960 mhz version 1.0. 9,” K. Chiew et al./On False Authenticationsfor C1G2 Passive RFID Tags, vol. 65, 2004.

[20] P. Flajolet and G. N. Martin, “Probabilistic counting algorithms for data base applications,” Journal of computer and system sciences, vol. 31, no. 2, pp. 182–209, 1985.

[21] A. W. Reza, T. K. Geok, K. J. Chia, and K. Dimyati, “RFID transponder collision control algorithm,” Wireless Personal Communications, vol. 59, no. 4, pp. 689–711, 2011.

[22] G. Alfian, M. Syafrudin, B. Yoon, and J. Rhee, “False positive rfid detection using classification models,” Applied Sciences, vol. 9, no. 6, p. 1154, 2019.

[23] L. V. Massawe, J. D. Kinyua, and H. Vermaak, “Reducing false negative reads in rfid data streams using an adaptive sliding-window approach,” Sensors, vol. 12, no. 4, pp. 4187–4212, 2012.

[24] H. Vogt, “Efficient object identification with passive RFID tags,” in International Conference on Pervasive Computing. Springer, 2002, pp. 98–113.

[25] J.-R. Cha and J.-H. Kim, “Novel anti-collision algorithms for fast object identification in RFID system,” in 11th International Conference on Parallel and Distributed Systems (ICPADS’05), vol. 2. IEEE, 2005, pp. 63–67.

[26] B. M. Brown et al., “Martingale central limit theorems,” The Annals of Mathematical Statistics, vol. 42, no. 1, pp. 59–66, 1971.

[27] C. Qian, H. Ngan, Y. Liu, and L. M. Ni, “Cardinality estimation for large-scale RFID systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 9, pp. 1441–1454, 2011.

Designed for Accessibility and to further Open Science