Privacy [1] is defined as "a sweeping concept, encompassing (among other things) freedom of thought, control over one's body, solitude in one's home, control over personal information, freedom from surveillance, protection of one's reputation, and protection from searches and interrogations". The term describes an individual's anonymity and how safe they feel in a given setting, particularly on the Internet, where it is currently one of the most sensitive concerns. Crowdsourcing is at present the most popular way of collecting data directly from people for many research topics, and it is generally carried out through online sites or portals. However, the survey process raises some basic issues: (a) the survey system should be convincing enough to gain the participants' trust, (b) the processes after the survey should be effective enough to assure the researchers of the participants' truthfulness, (c) the research processes should be robust enough to guarantee that the research architecture does not leak, and (d) the survey system should still produce a good result, in the sense of giving insight into the problem, in spite of the noise in the data. This opens a large research field for statistical databases, where the leak of even a small amount of data may lead to personal identification, which may be a concern for that person in his or her personal life. De-identification was once considered a possible solution. However, it proved insufficient on its own: Latanya Sweeney showed that de-identified records can be re-identified. She identified the medical information of the then governor of Massachusetts, William Weld, by cross-checking publicly available records against the de-identified insurance data published by the state government in 1997. She also mailed the same data to the governor.
A probable solution to this problem was given by Latanya Sweeney in her k-anonymity model [2]. However, k-anonymization does not include any randomization, and attackers can still make inferences about data sets that may harm individuals. Hence the arrival of differential privacy (DP). According to Cynthia Dwork [3], in general, "The outcome of any analysis is essentially equally likely independent of whether any individual joins or refrains from joining the dataset." This implies that, if the dataset maintains the notion of DP, a data analyst learns essentially nothing new about a person from a dataset of sensitive data, whether or not that person is present in the dataset.
There are two types of differential privacy: centralized differential privacy (CDP) and local differential privacy (LDP). Each has its own pros and cons in terms of data collection, data storage, the presence of a trusted third-party data curator, etc. A detailed comparison between LDP and CDP is given in Section 2.
There are industry-standard LDP systems such as RAPPOR [4] and Apple's differential privacy [9]. For CDP there is the ESA architecture and its PROCHLO implementation. RAPPOR is fast, whereas PROCHLO is expensive to implement. In CDP, the centralized part makes the whole approach compatible with existing, standard software engineering processes, which in turn makes it possible to analyze the data with well-known industry-standard techniques (differentially private releases) [5]. So there is a clear gap between the LDP and CDP approaches, which is a promising research area.
There have been some approaches in this area, such as OUTIS [6] and amplification by shuffling [8]. OUTIS claims to provide a bridge between LDP and CDP: it provides an architecture for differential privacy that does not need a trusted third-party data curator, as CDP does, but still achieves the accuracy guarantees and algorithmic expressibility of the CDP approach, which offers the possibility of the "best of both worlds" of LDP and CDP. The amplification model [8] shows that any permutation-invariant algorithm that satisfies differential privacy in the LDP model also satisfies differential privacy in the CDP model.
Our approach is to keep the best characteristics of both LDP and CDP. Here,
• We take the reports from the industry-standard RAPPOR algorithm.
• We then calculate the tf-idf value, with respect to the sample size, for all positions where at least one 'on bit' has occurred, after sampling multiple times with different sample sizes.
• After sampling, we always get a constant contribution per sample size for a constant number of occurrences of 'on bit' in the reported prr and irr strings of RAPPOR.
• After that, in the calculation step, the weighted sum is calculated. The results are then stored in a separate database.
• This whole approach ensures that neither the curator of the database nor any attacker can directly identify the cohort, prr or irr.
Our approach makes several contributions:
• The database stored on the server side becomes smaller.
• We do not store anything other than the weighted sum.
• The analysis phase is faster.
• Our model identifies the major 'True Value' (the one with the most occurrences in the samples) every time.
Our overall approach is to give a centralized environment to the RAPPOR LDP model, which gives a more general, bigger picture and ensures differential privacy in a comparatively faster way than other centralized approaches such as OUTIS [6], the amplification model [8] and PROCHLO [5].
This paper is divided into six sections. After the introduction (Section 1), we discuss the background theory in the related work (Section 2); then, in the previous approaches (Section 3), we discuss the earlier approaches that helped us set our aim and reach the current result. The last three sections are devoted to the architecture (Section 4) and the results and analysis (Section 5) of our approach, and we conclude (Section 6) with our thoughts on future work. The acronyms used in this paper are described in Table 1:
Table 1: Name and description of all the acronyms used
Differential privacy guarantees the following two things:
• the output of the differentially private algorithm is stable, and
• only the aggregate (the 'forest') of samples is analyzed, which ensures that a change in the input of a single user has almost no effect on the output.
2.1 Differential Privacy
[3] "Differential privacy" describes a promise, made by a data holder or curator, to a data subject: "You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available." The mathematical definition, with the standard relaxation that adds $\delta$, is as follows:
A randomized algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-differential privacy if for all $S \subseteq \mathrm{Range}(\mathcal{A})$ and for all adjacent datasets $D, D'$ it holds that
$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta \qquad (1)$$
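As a small illustration of the definition, the sketch below computes the smallest $\epsilon$ for which binary randomized response satisfies $(\epsilon, 0)$-differential privacy. The function name `rr_epsilon` and the truth probability 0.75 are illustrative assumptions, not taken from the paper.

```python
import math

def rr_epsilon(p_truth: float) -> float:
    """Smallest epsilon such that binary randomized response is (epsilon, 0)-DP.

    p_truth is the probability of reporting the true bit (0 < p_truth < 1);
    1 - p_truth is the probability of reporting the flipped bit.
    """
    # Output distributions for the two adjacent inputs (true bit 1 vs. 0).
    dist_1 = {1: p_truth, 0: 1.0 - p_truth}
    dist_0 = {1: 1.0 - p_truth, 0: p_truth}
    worst = 0.0
    for out in (0, 1):
        for a, b in ((dist_1, dist_0), (dist_0, dist_1)):
            # The DP inequality with delta = 0 bounds this log-ratio by epsilon.
            worst = max(worst, math.log(a[out] / b[out]))
    return worst

eps = rr_epsilon(0.75)  # illustrative choice: tell the truth 3 times out of 4
```

With a truth probability of 0.75, the worst-case probability ratio between adjacent inputs is 3, so the mechanism is $(\ln 3, 0)$-differentially private.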
Table 2 explains the parameters in equation (1). From the above definition and [7], it is evident that differential privacy is a good measure of output stability in a randomized response technique when the input from any single user can change frequently.
Table 2: Description of all the parameters in the Differential Privacy
DP is of two types:
2.1.1 Local Differential Privacy (LDP)
This model depends entirely on the randomized response technique, introduced in 1965, in which the user's response depends on the outcome of a coin toss. The model was formally introduced by Kasiviswanathan et al. [7]. In this model the distribution of the aggregate (the 'forest') of data is assured to remain stable even when a user suddenly changes his or her response. This simple trust model, which needs no third-party assurance at all, is the main attraction of LDP and the reason it is the most widely adopted model in industrial implementations today.
2.1.2 Central Differential Privacy (CDP)
Here the trusted data curator has the main role to play. The curator is responsible for adding uncertainty in the form of random noise, which eventually leads to differential privacy. This whole process is applied to the answers to the untrusted data analyst's queries. Since the answer to a query only ever reveals a small fraction of the whole centralized dataset, this helps to establish differential privacy.
The comparison between CDP and LDP is given in Table 3.
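The curator's role can be sketched with the Laplace mechanism on a counting query. This is a common textbook choice and an assumption here, since the text does not name a specific noise distribution; the record layout and function names are also illustrative.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Curator-side answer to a counting query under epsilon-DP.

    A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
# Hypothetical raw records held only by the trusted curator.
records = [{"took_medicine": i % 3 == 0} for i in range(300)]
answers = [noisy_count(records, lambda r: r["took_medicine"], 1.0, rng)
           for _ in range(20000)]
mean_answer = sum(answers) / len(answers)
```

Each individual answer is perturbed, but the noise has zero mean, so repeated answers average out close to the true count of 100. In a real deployment every answer would also consume privacy budget.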
Table 3: Comparison between LDP and CDP
2.2 Randomized Response Technique
Introduced in 1965 by Stanley L. Warner [10], this is a necessary and effective technique for structured surveys, which helps participants answer freely about sensitive issues without being concerned about confidentiality. The whole model is based on the participants' truthfulness. It can be interpreted as follows:
Table 4: Description of all the parameters in the Randomized Response Technique
Table 4 describes the parameters. Consider the following example. We ask a number of cancer patients under clinical observation whether they have taken their medicine today. Each patient secretly rolls a die, the outcome of which is unknown to the interviewer, and answers 'YES' if it shows a 6 and answers truthfully otherwise. Let the number of 'YES' answers be 55 out of 100 participants. Then Y_A = 55/100 = 11/20 and p = 1/6, which gives EP = 42.42%; that is, the estimated true proportion of patients who took the medicine is 42.42%.
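The arithmetic of the example can be checked with a short sketch. Which estimator the example intends is an assumption here, since the formula itself is not reproduced above. Two common ones are shown: Warner's original estimator, where p is the probability of answering the sensitive question rather than its negation, and the forced-'YES' estimator that matches the die description. The function names are illustrative.

```python
def warner_estimate(yes_fraction: float, p: float) -> float:
    """Warner's estimator: with probability p answer the question,
    otherwise answer its negation. Requires p != 0.5."""
    return (yes_fraction + p - 1.0) / (2.0 * p - 1.0)

def forced_yes_estimate(yes_fraction: float, p_forced: float) -> float:
    """Forced-response estimator: with probability p_forced say YES,
    otherwise answer truthfully."""
    return (yes_fraction - p_forced) / (1.0 - p_forced)

y_a = 55 / 100   # observed fraction of YES answers
p = 1 / 6        # probability that the die shows a 6
warner = warner_estimate(y_a, p)
forced = forced_yes_estimate(y_a, p)
```

With exact arithmetic, Warner's estimator gives 0.425, which is close to the 42.42% quoted in the example, whereas the forced-'YES' estimator gives 0.46.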
3.1 RAPPOR
Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) [4] is an industry-standard open-source technology introduced by Google, whose main function is to collect raw data locally from client software, anonymously and with the guarantee of differential privacy. It applies randomized response in a novel manner through the introduction of Bloom filters. RAPPOR ensures that the untrusted analyst can analyze only the aggregate (the 'forest') of client-side strings, not the individual strings. It is a very good example of LDP. After the discussion of CDP and its implementation PROCHLO in the next two paragraphs, the difference between CDP and LDP will be clearer.
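A simplified sketch of the client-side RAPPOR encoding is given below: a Bloom filter, followed by the permanent randomized response (prr) and the instantaneous randomized response (irr). The hashing scheme, the parameter values and the function names are assumptions made for illustration. This is not the code of the Google implementation, which additionally memoizes the prr per client and value.

```python
import hashlib
import random

def bloom_bits(value: str, cohort: int, k: int = 32, h: int = 2) -> list:
    """Set up to h positions of a k-bit Bloom filter for `value` in a cohort."""
    bits = [0] * k
    for i in range(h):
        # Hypothetical hash construction; the cohort selects the hash family.
        digest = hashlib.sha256(f"{cohort}:{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % k] = 1
    return bits

def permanent_rr(bits, f: float, rng: random.Random) -> list:
    """PRR: each bit becomes 1 w.p. f/2, 0 w.p. f/2, and is kept w.p. 1 - f."""
    out = []
    for b in bits:
        u = rng.random()
        out.append(1 if u < f / 2 else 0 if u < f else b)
    return out

def instantaneous_rr(prr, p: float, q: float, rng: random.Random) -> list:
    """IRR: report 1 w.p. q where the PRR bit is 1, and w.p. p where it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr]

rng = random.Random(42)
b = bloom_bits("v1", cohort=7)
prr = permanent_rr(b, f=0.5, rng=rng)
irr = instantaneous_rr(prr, p=0.5, q=0.75, rng=rng)
```

Only the irr string leaves the client, so the analyst never sees the Bloom filter bits or the prr directly.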
3.2 PROCHLO implementation of ESA
This is a very good example of CDP. The ESA (Encode, Shuffle, Analyze) architecture is a three-step pipeline. PROCHLO [5] is the real-life implementation of ESA; it introduces the stash shuffle, a novel, scalable and efficient oblivious-shuffling algorithm that uses Intel's SGX architecture. In this architecture there is a central database of encrypted records, which can only be decoded by a specific analysis algorithm, determined by the corresponding decryption key. In the shuffling stage the data truly become part of the crowd, with the assurance that at least a threshold number of reported items exists for every data partition. The analyze step aggregates all the data partitions publicly while keeping anonymity intact. Differential privacy is maintained in every step of ESA [5]. The authors have also considered deep learning and implemented some test cases with success. Kasiviswanathan et al. [7] have stated that there exists an exponential separation between the sample complexity and accuracy of LDP and CDP algorithms. That is why LDP requires a huge amount of data to produce good population statistics. The gap between the CDP and LDP approaches is therefore mainly of four kinds:
• difference in the amount of data needed to produce good population distribution statistics,
• storage of data,
• difference in speed,
• last but not least, difference in approach.
ESA and its implementation PROCHLO [5], discussed above, are a very good example of CDP.
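The encode, shuffle and analyze pipeline described above can be illustrated with a toy sketch. It deliberately omits encryption, SGX and the stash shuffle; it only shows the order of the steps and the crowd threshold. All names and the sample data are hypothetical.

```python
import random
from collections import Counter

def encode(value: str, crowd_id: str) -> dict:
    # Encoder: keep only the value and the crowd ID it is grouped under.
    return {"crowd": crowd_id, "value": value}

def shuffle(reports, threshold: int, rng: random.Random) -> list:
    # Shuffler: permute the reports and drop crowds smaller than the threshold.
    reports = list(reports)
    rng.shuffle(reports)
    sizes = Counter(r["crowd"] for r in reports)
    return [r["value"] for r in reports if sizes[r["crowd"]] >= threshold]

def analyze(values) -> Counter:
    # Analyzer: sees only anonymous values that passed the threshold.
    return Counter(values)

rng = random.Random(1)
raw = ["v1"] * 5 + ["v2"] * 3 + ["v3"]  # made-up client values
histogram = analyze(shuffle([encode(v, crowd_id=v) for v in raw],
                            threshold=3, rng=rng))
```

The single "v3" report is dropped because its crowd is below the threshold, so a rare value cannot single out its sender.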
As discussed before, some models have been proposed in the recent literature, such as OUTIS [6] and amplification by shuffling [8]. They have their own pros and cons.
3.3 Amplification Approach
In [8], Erlingsson et al. propose a powerful amplification technique: any permutation-invariant algorithm satisfying $\epsilon$-local differential privacy will satisfy $(O(\epsilon \sqrt{\log(1/\delta)/n}), \delta)$-central differential privacy. However, their analysis assumes a static population, which is not realistic. They also ignore the implications of timing or traffic channels, which should have been considered. This is a significant drawback in comparison to the ESA architecture.
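To get a feeling for the amplification bound, the sketch below evaluates the expression inside the O(.) for given values of the local $\epsilon$, $\delta$ and the number of users n. The constant hidden by the O(.) notation is ignored, so the result indicates only the order of magnitude. The function name and the sample values are illustrative assumptions.

```python
import math

def amplified_epsilon_order(eps_local: float, delta: float, n: int) -> float:
    """Order of the central epsilon after shuffling, up to the hidden constant."""
    return eps_local * math.sqrt(math.log(1.0 / delta) / n)

# Illustrative values: local epsilon 1, delta = e^-4, 100 users.
order = amplified_epsilon_order(1.0, math.exp(-4), 100)
```

For a fixed local $\epsilon$ and $\delta$, the central privacy parameter shrinks proportionally to $1/\sqrt{n}$ as the number of users grows.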
3.4 OUTIS
OUTIS replaces the single trusted data curator with two untrusted, non-colluding servers, the CSP and the AS, so the third-party involvement of the CDP model is reduced. It allows analysts to author logical programs, which support differential privacy by restricting access to the sensitive data.
The main cons of this model are -
• Aggregation operators, multi-table queries and matrix multiplication do not work in this environment.
• The servers are semi-honest, which means they follow the protocols honestly, but their contents and computation can be observed by an adversary. (Security in this setting is achieved by linear homomorphic encryption and Yao's garbled circuits.)
• The privacy engine is not a strong point of this model.
• There is too much workload on the AS.
• It starts off with a total privacy budget agreed upon by all the data owners, so the privacy provided may not be sufficient.
3.5 Initial Approaches
Our first approach was to build an environment in which both RAPPOR and ESA are implemented and one is chosen according to the query requirement. But for that approach we would have had to implement the expensive ESA architecture in parallel with the RAPPOR algorithm. Also, the query would have to be known beforehand. That is why we did not implement this idea.
Our second approach was to use a Convolutional Neural Network layer on top of the RAPPOR reports, which would analyse the reports, taking combinations of prr and irr [4] as training data in sets of samples. After training, the model could not detect the true values with more than 2% accuracy, which was very poor. The main reason for this failure was overfitting caused by the inbuilt noise from randomized response sampling in the prr and irr combinations.
Our target from the beginning has been to build an analysis technique that follows CDP and is also easy to understand, robust, fast and cheap to implement. These points have been discussed in the previous sections. That is why our third approach follows the well-known TF-IDF technique (mainly used in the information retrieval field for relevance decision making) [12]. It is very easy to implement, and because it captures the probabilistic relevance of 'terms' (in our case the 'on bit' and 'off bit' in the prr and irr strings) in a document as well as in a corpus of documents, the analysis became very easy. The details of our proposed architecture are discussed in Section 4.
Our aim is to build a CDP system that is cheap, fast, robust and less complex. After two failed attempts, we succeeded in achieving at least part of what we aimed for. The following four subsections explain the methodology of our architecture, the data collection, and how these parts together build the system.
4.1 Tf-Idf
Tf-Idf stands for term frequency-inverse document frequency [11]. It is mostly used to retrieve the "probability-weighted amount of information", mainly in feature extraction for machine learning, automatic term extraction in computational terminology, information theory, relevance decision making, etc. Term frequency measures the "local relevance" [12] of a term in a specific document of a corpus. It therefore provides a direct estimate of the probability of occurrence of a specific text or word, after normalization with respect to the scope of calculation. Inverse document frequency, on the other hand, measures relevance across the whole corpus. The formula to calculate TF is the following:
Where N is the total number of documents in the corpus and the denominator counts the documents in which the term t appears. If the term is not in the corpus, the IDF term would be undefined, which is why it is adjusted by adding 1 to the denominator.
Therefore the total calculation is:
The parameters used in the calculation of TF-IDF are listed in Table 5.
Table 5: Description of all the parameters in the TF-IDF calculation
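A minimal sketch of a standard TF-IDF computation that is consistent with the description above is given below: term frequency is normalized by document length, and inverse document frequency has 1 added to the denominator. The natural logarithm, the function names and the toy corpus of bit strings are assumptions, and the exact form used in the paper may differ.

```python
import math
from collections import Counter

def tf(term: str, document: list) -> float:
    """Term frequency: occurrences of `term` normalized by document length."""
    return Counter(document)[term] / len(document)

def idf(term: str, corpus: list) -> float:
    """Inverse document frequency with 1 added to the denominator."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing))

def tf_idf(term: str, document: list, corpus: list) -> float:
    return tf(term, document) * idf(term, corpus)

# Toy corpus: each "document" is a short bit string split into characters.
corpus = [list("1100"), list("0000"), list("0100"), list("0000")]
score = tf_idf("1", corpus[0], corpus)
```

Here the term "1" plays the role of an 'on bit': it appears twice in the first string and in two of the four strings overall. Note that with the added 1, a term present in almost every document receives a zero or negative IDF.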
4.2 Methodology
In our proposed ARA analysis model the actions are divided into three steps. In the first step, the Sampling step, we sample the reports taken from RAPPOR and calculate the TF-IDF value for the 'prr' and 'irr' strings with the following formula:
Here N is the count of 'on bits' in the string and S is the sample size.
After sampling from the reports many times (taking 100, 1000, 10000, 20000 and 25000 samples at a time), we could draw the following two important conclusions:
• The 'on bits' always make a constant contribution to the TF-IDF value, determined by the number of 'on bits' in the string, for the whole sample size.
• As the size of our string is 32 bits, we also deduced that there are never more than 17 'on bits' in a string.
The list of constants with respect to the number of 'on bits' in a string is given in Table 6.
These constant values are kept secret; only the database curator or the analyst has access to them. Also, each constant is 1.1 times larger whenever the number of positions decreases by one.
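The geometric relation between the constants can be sketched as follows. The base value is a placeholder, since the actual constants are listed in Table 6 and kept secret. The reading that the constant for k - 1 'on bits' is 1.1 times the constant for k 'on bits' is an interpretation of the statement above, and the function names are illustrative.

```python
def constant_table(base: float, max_on_bits: int = 17, ratio: float = 1.1) -> dict:
    """Map number of 'on bits' -> constant, growing by `ratio` per bit fewer."""
    return {k: base * ratio ** (max_on_bits - k)
            for k in range(1, max_on_bits + 1)}

def on_bits(report: str) -> int:
    # Count the '1' characters in a bit string such as a prr or irr report.
    return report.count("1")

table = constant_table(base=1.0)  # base=1.0 is a placeholder, not from Table 6
c = table[on_bits("00000000000000000000000000000111")]
```

Under this reading, a 32-bit string with three 'on bits' is mapped to the placeholder base multiplied by 1.1 fourteen times.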
Table 6: List of Constant contribution with respect to the number of 'on bit' in a string
In the second step, the Weighted Sum calculation step, we simply calculate the weighted sum for each report, consisting of the cohort value, prr, irr and true value, using the following formula (the parameters used in the weighted sum calculation step are described in Table 7):
Table 7: Description of parameters used in weighted sum calculation step
For experimental convenience, we used 10 true values, v1, v2, v3, ..., v10, and the cohort range is 0 to 63. The calculated weighted sums are then stored in a centralized database. Note that we store only two attributes: the constant values and the weighted sums. Therefore the stored data do not directly identify the reports. Security is not harmed, as there are many combinations of 'on bit' counts in prr and irr with respect to the true values and cohorts.
The third and last step is the Analysis Phase. Here we take the testing samples of reports, calculate their weighted sums using the same formulas as above, match them against the central database and generate a report. The data flow in the proposed system is shown in Fig. 1.
4.3 Data
We collected the data by cloning Google's RAPPOR implementation from GitHub [13] and running it more than a hundred times. We built our own datasets from the reports generated by these runs. For convenience we used only ten true values (v1, v2, v3, ..., v10), whereas RAPPOR uses one hundred true values (v1, v2, v3, ..., v100). The distribution of the input reports is shown in Figure 2. As we can see, the reports already follow an exponential distribution, so sufficient noise is already built into the reports. As it is an open-source repository, we do not need the authors' permission to use the code.
4.4 System Specification
Experiments were run on a desktop computer with an Intel(R) Core(TM) i7-7700 CPU running at 3.60 GHz with 8192 KB of cache memory, on 64-bit Ubuntu 18.04 Linux with a 1000 GB HDD. It took around 1.5 hours to complete one test set. We wrote all our code in RStudio, using R version 3.5.1 (2018-07-02), "Feather Spray", on the platform x86_64-conda_cos6-linux-gnu (64-bit). All our code has been uploaded here [14].
5.1 Result
To test our model we took 1000 sample reports at a time, each consisting of the cohort, prr and irr strings. We then tested each set of reports 100 times against our model. We ran 40 sets of tests in total. After obtaining the count of occurrences of a true value in the matching process described in the flow chart, the percentage of achievement is simply the count expressed as a percentage of the sample size. The summary of the test results is given in Table 8.
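The percentage-of-achievement calculation described above can be sketched as below. The list of matched true values is made-up illustration data, not a result from the experiments, and the function name is hypothetical.

```python
from collections import Counter

def achievement(matched_true_values: list, sample_size: int) -> dict:
    """Percentage of the sample attributed to each matched true value."""
    counts = Counter(matched_true_values)
    return {v: 100.0 * c / sample_size for v, c in counts.items()}

# Made-up outcome of matching 1000 reports against the central database.
matches = ["v1"] * 520 + ["v2"] * 300 + ["v3"] * 180
pct = achievement(matches, sample_size=1000)
major = max(pct, key=pct.get)  # the major true value in this sample
```

The true value with the highest percentage is reported as the major component of the sample.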
5.2 Analysis
From the tests it is evident that our model was able, every time, to detect the major component among the samples collected from multiple clients at a time. Although the percentage of achievement is an issue, our model is fast. Also, the percentage of achievement and the sample size do not depend on each other, as shown in Figure 3.
5.3 Comparison
We have outlined a comparison of our approach against the existing central approaches ESA, OUTIS and the Amplification Model in Table 9. However, a direct comparison is not possible, given that some of the approaches are hardware-based, with implementations that are expensive and unavailable, and some are purely theoretical, with no existing implementation.
Privacy has been a long-established issue for decades. Although differential privacy has made a significant contribution in this area, the achievements of local differential privacy are still better in terms of industry adoption, speed, utility, cost, etc. Our model focuses on these issues. The strengths of our model are:
• It is fast in the analysis phase.
• The Centralized Database size is much smaller.
• The database does not store the reports for long, only for the time needed to calculate the weights.
• It accurately identifies the major true value every time.
• Simple probabilistic approach towards analysis, which is not as complex as OUTIS or PROCHLO.
• It maintains RAPPOR’s differential privacy promises.
The main drawback is that the level of achievement is not more than 52.28% on average. Our model is also unable to detect the second major component accurately. Utility and flexibility should be improved as well.
The drawbacks are the motivation for our future work. Nevertheless, it is a simple approach towards centralized differential privacy, with a less complex framework and accessible computation. It is therefore a good contribution towards central differential privacy.
Conflict of interest: The authors declare that they have no conflict of interest.
[1] Solove, D. J. (2008). Understanding privacy (Vol. 173). Cambridge, MA: Harvard university press.
[2] Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570.
[3] Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211-407.
[4] Erlingsson, Ú., Pihur, V., & Korolova, A. (2014, November). Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security (pp. 1054-1067). ACM.
[5] Bittau, A., Erlingsson, Ú., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., ... & Seefeld, B. (2017, October). Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles (pp. 441-459). ACM.
[6] Chowdhury, A. R., Wang, C., He, X., Machanavajjhala, A., & Jha, S. (2019). Outis: Crypto-Assisted Differential Privacy on Untrusted Servers. arXiv preprint arXiv:1902.07756.
[7] Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., & Smith, A. (2011). What can we learn privately?. SIAM Journal on Computing, 40(3), 793-826.
[8] Erlingsson, Ú., Feldman, V., Mironov, I., Raghunathan, A., Talwar, K., & Thakurta, A. (2019). Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 2468-2479). Society for Industrial and Applied Mathematics.
[9] Tang, J., Korolova, A., Bai, X., Wang, X., & Wang, X. (2017). Privacy loss in Apple’s implementation of differential privacy on MacOS 10.12. arXiv preprint arXiv:1709.02753.
[10] Warner, S. L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), 63-69. Taylor & Francis.
[11] Aizawa, A. (2003). An information-theoretic perspective of tf–idf measures. Information Processing and Management, 39, 45-65.
[12] Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Interpreting TF-IDF Weights as Making Relevance Decisions. ACM Transactions on Information Systems, 26(3), Article 13.
[13] https://github.com/google/rappor
[14] https://github.com/Suvixx/ARA-Aggregated-RAPPOR-and-Analysis-for-Centralized-Differential-Privacy
Figure 1: Data flow in the ARA architecture
Figure 2: Distribution of the Input reports arranged from RAPPOR
Figure 3: An Analysis between sample size and percentage of achievement
Table 8: Summary of the Test Results
Table 9: Comparison with other models