One critical challenge for applying today’s Artificial Intelligence (AI) technologies to real-world applications is the common existence of data silos across different organizations. Due to legal, privacy and other practical constraints, data from different organizations cannot be easily integrated. The implementation of user privacy laws such as GDPR [1] has made sharing data among different organizations more challenging. Collaborative learning has emerged to be an attractive solution to the data silo and privacy problem. While distributed learning (DL) frameworks [2] originally aims at parallelizing computing power and distributes data identically across multiple servers, federated learning (FL) [3] focuses on data locality, non-IID distribution and privacy. In most of the existing collaborative learning frameworks, data are distributed by samples thus share the same set of attributes. However, a different scenario is cross-organizational collaborative learning problems where parties share the same users but have different set of features. For example, a local bank and a local retail company in the same city may have large overlap in user base and it is beneficial for these parties to build collaborative learning models with their respective features. FL is further categorized into horizontal (sample-partitioned) FL, vertical (feature-partitioned) FL and federated transfer learning (FTL) in [4].
Feature-partitioned collaborative learning problems have been studied in the setting of both DL [5–7] and FL [4, 8–10]. However, existing architectures have not sufficiently addressed the communication problem especially in communication-sensitive scenarios where data are geographically distributed and data locality and privacy are of paramount significance (i.e., in a FL setting). In these approaches, per-iteration communication and computations are often required, since the update of algorithm parameters needs contributions from all parties. In addition, to prevent data leakage, privacy-preserving techniques, such as Homomorphic Encryption (HE) [11], Secure Multi-party Computation (SMPC) [12] are typically applied to transmitted data [4, 10, 9], adding expensive communication overhead to the architectures. Differential Privacy (DP) is also a commonly-adopted approach, but such approaches suffer from precision loss [6, 7]. In sample-partitioned FL[3], it is demonstrated experimentally that multiple local updates can be performed with federated averaging (FedAvg), reducing the number of communication round effectively. Whether it is feasible to perform such multiple local update strategy over distributed features is not clear.
In this paper, we propose a collaborative learning framework for distributed features named Federated stochastic block coordinate descent (FedBCD), where parties only share a single value per sample instead of model parameters or raw data for each communication, and can continuously perform local model updates (in either a parallel or sequential manner) without per-iteration communication. In the proposed framework, all raw data and model parameters stay local, and each party does not learn other parties’ data or model parameters either before or after the training. There is no loss in performance of the collaborative model as compared to the model trained in a centralized manner. We demonstrate that the communication cost can be significantly reduced by adopting FedBCD. Compared with the existing distributed (stochastic) coordinate descent methods [13–16], we show for the first time that when the number of local updates, mini-batch size and learning rates are selected appropriately, the FedBCD converges to a accuracy with rounds of communications despite performing multiple local updates using staled information. We then perform comprehensive evaluation of FedBCD against several alternative protocols, including a sequential local update strategy, FedSeq with both the original Stochastic Gradient Descent (SGD) and Proximal gradient descent. For evaluation, we apply the proposed algorithms to multiple complex models include logistic regression and convolutional neural networks, and on multiple datasets, including the privacy-sensitive medical dataset MIMIC-III [17], MNIST [18] and NUS-WIDE [19] dataset. Finally, we implemented the algorithm for federated transfer learning (FTL) [10] to tackle problems with few labeled data and insufficient user overlaps.
Traditional distributed learning adopts a parameter server architecture [2] to enable a large amount of computation nodes to train a shared model by aggregating locally-computed updates. The issue of privacy in DL framework is considered in [20]. FL [3] adopted a FedAvg algorithm which runs Stochastic Gradient Descent (SGD) for multiple local updates in parallel to achieve better communication efficiency. The authors of [21] studied the FedAvg algorithm under the parallel restarted SGD framework and analyzed the convergence rate and communication savings under IID settings. In [22], the convergence of the FedAvg algorithm under non-IID settings was investigated. All the work above consider the sample-partitioned scenario.
Feature-partitioned learning architectures have been developed for models including trees [9], linear and logistic regression [4, 8, 5, 7], and neural networks [10, 6]. Distributed Coordinate Descent [13] used balanced partitions and decoupled computation at the individual coordinate in each partition; Distributed Block Coordinate Descent [14] assumes feature partitioning is given and performs synchronous block updates of variables which is suited for MapReduce systems with high communication cost settings. These approaches require synchronization at every iteration. Asynchronous BCD [15] and Asynchronous ADMM algorithms [16] tries to tame various kinds of asynchronicity using strategies such as small stepsize, and careful update scheduling, and the design objective is to ensure that the algorithm can still behave reasonably under non-ideal computing environment. Our approach tries to address the expensive communication overhead problem in FL scenario by systematically adopting BCD with sufficient number of local updates guided by theoretical convergence guarantees.
Suppose K data parties collaboratively train a machine learning model based on N data samples and the feature vector are distributed among K parties , where is the feature dimension of party k. Without loss of generality, we assume one party holds the labels and it is party K. Let us denote the data set as , for , , and (where denotes the set ). Then the collaborative training problem can be formulated as
where denotes the training parameters of the kth party; and denotes the loss function and regularizer and is the hyperparatemer; For a wide range of models such as linear and logistic regression, and support vector machines, the loss function has the following form:
The objective is for each party k to find its without sharing its data or parameter to other parties.
If a mini-batch of data is sampled, the stochastic partial gradient w.r.t. is given by
Where is the number of samples in S. To compute locally, each party sends to party K, who then calculates and sends to other parties, and finally all parties can compute gradient updates with equation (4). See Algorithm 1.
For an arbitrary loss function, let us define the collection of information needed to compute
where is a function summarizing the information required from party q to k, the stochastic gradients (3) can be computed as the following:
Therefore, the overall stochastic gradient is given as
A direct approach to optimize (1) is to use the vanilla stochastic gradient descent (SGD) algorithm given below
The federated implementation of the above SGD iteration is given in Algorithm 1. Algorithm 1 requires communication of intermediate results at every iteration. This could be very inefficient, especially when K is large or the task is communication heavy. For a task of form in equation (2), the number of communications per round is , but for an arbitrary task, the number of communications per round can be if it requires pair-wise communication. We note that since Algorithm 1 has the same iteration as the vanilla SGD algorithm, it converges with a rate of regardless of the choice of K [23]. Since each iteration requires one round of communication among all the parties, T rounds of communication is required to achieve an error of
In the parallel version of our proposed algorithm, called FedBCD-p, at each iteration, each party performs Q > 1 consecutive local gradient updates in parallel, before communicating the intermediate results among each other; see Algorithm 2. Such “multi-local-step" strategy is strongly motivated by our practical implementation (to be shown in our Experiments Section), where we found that performing multiple local steps can significantly reduce overall communication cost. Further, such a strategy also resembles the FedAvg algorithm [3], where each party performs multiple local steps before aggregation.
At each iteration the kth feature is updated using the direction (where S is a mini-batch of data points)
Because is the intermediate information obtained from the most recent synchronization, it may contain staled information so it may no longer be an unbiased estimate of the true partial gradient . On the other hand, during the Q local updates no inter-party communication is required. Therefore, one could expect that there will be some interesting tradeoff between communication efficiency and computational efficiency. In the same spirit, a sequential version of the algorithm allows the parties to update their local ’s sequentially, while each update consists of Q local updates without inter-party communication, termed FedBCD-s. (Algorithm 3).
Here we aim to find out whether one party can learn other party’s data from collections of messages exchanged () during training. Whereas previous research studied data leakage from exposing complete set of model parameters or gradients [24–26], in our protocol model parameters are kept private, and only the intermediate results (such as inner product of model parameters and feature) are exposed.
The security definition is that for any party k with undisclosed dataset and training parameters following FedBCD, there exists infinite solutions for that yield the same set of contributions . That is, one can not determine party k’s data uniquely from its exchanged messages of regardless the value of M, the total number of iterations.
Such a security definition is inline with prior security definitions proposed in privacy-preserving machine learning and secure multiple computation (SMC), such as [27–29, 4, 30]. Under this security definition, when some prior knowledge about the data is known, an adversary may be able to eliminate some alternative solutions or certain derived statistical information may be revealed
(a) (b) Figure 1: Illustration of a 2-party collaborative learning framework (a) with neural network(NN)-based local model. (b) FedBCD-s and FedBCD-p algorithms
[28, 29] but it is still impossible to infer the exact raw data ("deep leakage"). However, this practical and heuristic security model provides flexible tradeoff between privacy and efficiency and allows much more efficient solutions. Our problem is particularly challenging in that the observations by other parties are serial outputs from FedBCD algorithm and are all correlated based on equation (11). Although it is easy to show security to send one round of due to protection from the reduced dimensionality, it is unclear whether raw data will be leaked after thousands or millions of rounds of iterative communications.
Theorem 1 For K-party collaborative learning framework following (2) with , the FedBCD Algorithm is secured for party k if k’s feature dimension is larger than 2, i.e.,
The security proof can be readily extended to collaborative systems where parties have arbitrary local sub-models (such as neural networks) but connect at the final prediction layer with loss function (2) (see Figure 1(a)). Let be the local transformation on and is unknown to parties other than k. We choose to be the identity map, i.e. , then the problem reduces to Theorem 1.
In this section, we perform convergence analysis of the FedBCD algorithm. Our analysis will be focused on the parallel version Algorithm 2 and the sequential version can be analyzed use similar techniques.
To facilitate the proof, we define the following notations: Let r denote the iteration index, in which each iteration one round of local update is performed; Let denote the latest iteration before r in which synchronization has been performed, and the intermediate information ’s are exchanged. Let denote the “local vector" that node k uses to compute its local gradient at iteration r, that is
where denotes a vector v with its kth element replaced by w. Note that by Algorithm 2, each node k always updates the kth element of , while the information about is obtained by the most recent synchronization step. Further, we use the “global" variable to collect the most updated parameters at each iteration of each node, where denotes the jth element of
Note that is only a sequence of “virtual" variables, it is never explicitly formed in the algorithm.
Assumptions
A1: Lipschitz Gradient. Assume that the loss function satisfies the following:
A2: Uniform Sampling. For simplicity, assume that the data sample is partitioned into B mini-batches , each with size S; at a given iteration, S is sampled uniformly from these mini-batches.
Our main result is shown below. The detailed analysis is relegated to the Supplemental Material.
Theorem 2 Under assumption A1, A2, when the step size in Algorithm 2 satisfies
where denotes the global minimum of problem (1).
Remark 1. It is non-trivial to find an unbiased estimator for the local stochastic gradient . This is because after each synchronization step, each agent k performs Q deterministic steps based on the same data set S, while fixing all the rest of the variable blocks at . This is significantly different from FedAvg-type algorithms, where at each iteration a new mini-batch is sampled at each node.
Remark 2. If we pick , and , then with any fixed K the convergence speed of the algorithm is (in terms of the speed of shrinking the averaged gradient norm squared). To the best of our knowledge, it is the first time that such an rate has been proven for any algorithms with multiple local steps designed for the feature-partitioned collaboratively learning problem.
Remark 3. Compared with the existing distributed stochastic coordinate descent methods [13–16], our results are different. It shows that, despite using stochastic gradients and performing multiple local updates using staled information, only communication rounds are requires (out of total T iterations) to achieves rate. We are not aware of any of such guarantees for other distributed coordinate descent methods.
Remark 4. If we consider the impact of the number of nodes K and pick , and S = Q = , then the convergence speed of the algorithm is . This indicates that the proposed algorithm has a slow down w.r.t the number of parties involved. However, in practice, this factor is mild assuming that the total number of parties involved in a feature-partitioned collaboratively learning problem is usually not large.
Datasets and Models
MIMIC-III. MIMIC-III (Medical Information Mart for Intensive Care) [17] is a large database comprising information related to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, survival data, and more. Following the data processing procedures of [31], we compile a subset of the MIMIC-III database containing more than 31 million clinical events that correspond to 17 clinical variables and get the final training and test sets of 17,903 and 3,236 ICU stays, respectively. For each variable we compute six different sample statistic features on seven different subsequences of a given time series, obtaining features. We focus on the in-hospital mortality prediction task based on the first 48 hours of an ICU stay with area under the receiver operating characteristic (AUC-ROC) being the main metric. We partition each sample vertically by its clinical features. In a practical situation, clinical variables may come from different hospitals or different departments in the same hospital and can not share their data due to the patients personal privacy. This task is refered to as MIMIC-LR.
MNIST. We partition each MNIST [18] image with shape vertically into two parts (each part has shape ). Each party uses a local CNN model to learn feature representation from raw image input. The local CNN model for each party consists of two 3x3 convolution layers that each has 64 channels and are followed by a fully connected layer with 256 units. Then, the two feature representations produced by the two local CNN models respectively are fed into the logistic regression model with 512 parameters for a binary classification task. We refer this task as MNIST-CNN.
(d) Figure 2: Comparison of AUC (Left) and training loss (Right) in MIMIC-III dataset with varying local iterations, denoted by Q.
Table 1: Number of communication rounds to reach a target AUC-ROC for FedBCD-p, FedBCD-s and FedSGD on MIMIC-LR and MNIST-CNN respectively.
NUS-WIDE. The NUS-WIDE dataset [19] consists of 634 low-level images features extracted from Flickr images as well as their associated tags and ground truth labels. We put 634 low-level image features on party B and 1000 textual tag features with ground truth labels on party A. The objective is to perform a federated transfer learning (FTL) task studied in [10]. FTL aims to predict labels to unlabeled images of party B through transfer learning form A to B. Each party utilizes a neural network having one hidden layer with 64 units to learn feature representation from their raw inputs. Then, the feature representations of both sides are fed into the final federated layer to perform federated transfer learning. This task is refered to as NUS-FTL.
Default-Credit. The Default-Credit consists of credit card records including user demographics, history of payments, and bill statements, etc., with a user’s default payments as labels. We separate the features in a way that the demographic features are on one side, separated from the payment and balance features. This segregation is common in industry applications such as retail and car rental leveraging banking data for user credibility prediction and customer segmentation. In our experiments, party A has labels and 18 features including six months of payment and bill balance data, whereas party B has 15 features including user profile data such as education, marriage . We perform a FTL task as described above but with homomorphic encryption applied. We refer to this task as Credit-FTL.
For all experiments, we adopt a decay learning rate strategy with , where is optimized for each experiment. We fix the batch size to 64 and 256 for MIMIC-LR and MNIST-CNN respectively.
Results and Discussion
FedBCD-p vs FedBCD-s. We first study the impact of varying local iterations on the communication efficiency of both FedBCD-p and FedBCD-s algorithms based on MIMIC-LR and MNIST-CNN (Figure 2). We observe similar convergence for FedBCD-s and FedBCD-p for various values of Q. However, for the same communication round, the running time of FedBCD-s doubles that of FedBCD-p due to sequential execution. As the number of local iteration increases, we observe that the required number of communication rounds reduce dramatically (Table 1). Therefore, by reasonably increasing the number of local iteration, we can take advantage of the parallelism on participants and save the overall communication costs by reducing the number of total communication rounds required.
Impact of Q. Theorem 2 suggests that as Q grows the required number of communication rounds may first decrease and then increase again, and eventually the algorithm may not converge to optimal solution. To further investigate the relationship between the convergence rate and the local iteration Q, we evaluate FedBCD-p algorithm on NUS-FTL with a large range of Q. The results are shown in Figure 3 and Figure 4(a), which illustrate that FedBCD-p achieves the best AUC with the least number of communication rounds when Q = 15. For each target AUC, there exists an optimal Q. This manifests that one needs to carefully select Q to achieve the best communication efficiency, as suggested by Theorem 2.
Figure 3: Comparison of AUC (Left) and training loss (Right) in NUS-WIDE dataset with varying local iterations, denoted by L.
Figure 4: The relationship between communication rounds and varying local iterations denoted by Q for three target AUC (a), and the comparison between FedBCD-p and FedPBCD-p for large local iterations (b).
Figure 4(b) shows that for very large local iteration Q = 25, 50 and 100, the FedBCD-p cannot converge to the AUC of 83.7%. This phenomenon is also supported by Theorem 2, where if Q is too large the right hand side of (13) may not go to zero. Next we further address this issue by making the algorithm less sensitive in choosing Q.
Proximal Gradient Descent. [32] proposed adding a proximal term to the local objective function to alleviate potential divergence when local iteration is large. Here, we explore this idea to our scenario. We rewrite (12) as follows:
where is the gradient of the proximal term , which exploits the initial model to limit the impact of local updates by restricting the locally updated model to be close to . We denote the proximal version of FedBCD-p as FedPBCD-p. We then apply FedPBCD-p with to NUS-FTL for Q = 25, 50 and 100 respectively. Figure 4(b) illustrates that if Q is too large, FedBCD-p fails to converge to optimal solutions whereas the FedPBCD-p converges faster and is able to reach at a higher test AUC than corresponding FedBCD-p does.
Increasing number of Parties. In this section, we increase the number of parties to five and seventeen and conduct experiments for MIMIC-LR task. We partition data by clinical variables with each party having all the related features of the same variable. We adopt a decay learning rate strategy with according to Theorem 2. The results are shown in Figure 5. We can see that the proposed method still performs well when we increase the local iterations for multiple parties. As we increase the number of parties to five and seventeen, FedBCD-p is slightly slower than the two-party case, but the impact of node K is very mild, which verifies the theoretical analysis in Remark 3.
Figure 5: Comparison of AUC in MIMIC-III dataset with varying local iterations (denoted by Q) and number of parties (denoted by K). (Left) FedBCD-p; (Right) FedBCD-s
Implementation with HE. In this section, we investigate the efficiency of FedBCD-p algorithm with homomorphic encryption (HE) applied. Using HE to protect transmitted information ensures higher security but it is extremely computationally expensive to perform computations on encrypted data. In such a scenario, carefully selecting Q may reduce communication rounds but may also introduce computational overhead because the total number of local iterations may increase (number of communication rounds). We integrated the FedBCD-p algorithm into the current FTL implementation on FATE1 and simulate two-party learning on two machines with Intel Xeon Gold model with 20 cores, 80G memory and 1T hard disk. The experimental results are summarized in Table 2. It shows that FedBCD-p with larger Q costs less communication rounds and total training time to reach a specific AUC with a mild increase in computation time but more than 70 percents reduction in communication round from FedSGD to Q = 10.
Table 2: Number of communication rounds and training time to reach target AUC 70%, 75% and 80% respectively for FedSGD versus FedBCD-p. R, comp. and comm. denote communication rounds, computation time (mins), and communication time (mins) respectively.
In this paper, we propose a new collaboratively learning framework for distributed features based on block coordinate gradient descent in which parties perform more than one local update of gradients before communication. Our approach significantly reduces the number of communication rounds and the total communication overhead. We theoretically prove that the algorithm achieves global convergence with a decay learning rate and proper choice of Q. The approach is supported by our extensive experimental evaluations. We also show that adding proximal term can further enhance convergence at large value of Q. Future work may include investigating and further improving the communication efficiency of such approaches for more complex and asynchronized collaborative systems.
[1] EU. REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Available at: https://eur-lex. europa. eu/legal-content/EN/TXT, 2016.
[2] Dean, J., G. Corrado, R. Monga, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231. 2012.
[3] McMahan, H. B., E. Moore, D. Ramage, et al. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.
[4] Yang, Q., Y. Liu, T. Chen, et al. Federated machine learning: Concept and applications. CoRR, abs/1902.04885, 2019.
[5] Gratton, C., V. D., R. Arablouei, et al. Distributed ridge regression with feature partitioning. pages 1423–1427. 2018.
[6] Hu, Y., D. Niu, J. Yang, et al. Fdml: A collaborative machine learning framework for distributed features. In KDD. 2019.
[7] Hu, Y., P. Liu, L. Kong, et al. Learning privately over distributed features: An ADMM sharing approach. CoRR, abs/1907.07735, 2019.
[8] Hardy, S., W. Henecka, H. Ivey-Law, et al. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. CoRR, abs/1711.10677, 2017.
[9] Cheng, K., T. Fan, Y. Jin, et al. Secureboost: A lossless federated learning framework. CoRR, abs/1901.08755, 2019.
[10] Liu, Y., T. Chen, Q. Yang. Secure federated transfer learning. CoRR, abs/1812.03337, 2018.
[11] Rivest, R. L., L. Adleman, M. L. Dertouzos. On data banks and privacy homomorphisms. Foundations of Secure Computation, Academia Press, pages 169–179, 1978.
[12] Yao, A. C. Protocols for secure computations. In SFCS, pages 160–164. 1982.
[13] Richtárik, P., M. Takáˇc. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res., 17(1):2657–2681, 2016.
[14] Mahajan, D., S. S. Keerthi, S. Sundararajan. A distributed block coordinate descent method for training l1regularized linear classifiers. J. Mach. Learn. Res., 18(1):3167–3201, 2017.
[15] Peng, Z., Y. Xu, M. Yan, et al. Arock: an algorithmic framework for asynchronous parallel coordinate updates. 2015. Preprint, arXiv:1506.02396.
[16] Niu, F., B. Recht, C. Re, et al. Hogwild!: A lock-free approach to parallelizing stochastic gradient dea distributed, asynchronous and incremental algorithm for nonconvex optimization: An admm based approachvia a scent. 2011. Online at arXiv:1106.57320v2.
[17] Johnson, A. E., T. J. Pollard, L. Shen, et al. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
[18] LeCun, Y., C. Cortes. MNIST handwritten digit database. 2010.
[19] Chua, T.-S., J. Tang, R. Hong, et al. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR. 2009.
[20] Shokri, R., V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1310–1321. ACM, New York, NY, USA, 2015.
[21] Yu, H., S. Yang, S. Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pages 5693–5700. 2019.
[22] Li, X., K. Huang, W. Yang, et al. On the convergence of FedAvg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
[23] Ghadimi, S., G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[24] Zhu, L., Z. Liu, S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems 32, pages 14774–14784. Curran Associates, Inc., 2019.
[25] Hitaj, B., G. Ateniese, F. Perez-Cruz. Deep models under the gan: Information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 603–618. ACM, New York, NY, USA, 2017.
[26] Melis, L., C. Song, E. D. Cristofaro, et al. Exploiting unintended feature leakage in collaborative learning. 2019 IEEE Symposium on Security and Privacy (SP), pages 691–706, 2018.
[27] Li, Q., Z. Wen, B. He. Practical federated gradient boosting decision trees. In 34th AAAI Conference on Artificial Intelligence (AAAI-20). 2020.
[28] Du, W., Y. Han, S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In M. Berry, U. Dayal, C. Kamath, D. Skillicorn, eds., SIAM Proceedings Series, pages 222–233. 2004.
[29] Vaidya, J., C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 639–644. ACM, New York, NY, USA, 2002.
[30] Mangasarian, O. L., E. W. Wild, G. M. Fung. Privacy-preserving classification of vertically partitioned data via random kernels. ACM Trans. Knowl. Discov. Data, 2(3):12:1–12:16, 2008.
[31] Harutyunyan, H., H. Khachatrian, D. C. Kale, et al. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
[32] Li, T., A. K. Sahu, M. Zaheer, et al. Federated optimization for heterogeneous networks. arXiv preprint arXiv:1812.06127, 2019.
Lemma 1 For any vector . There exists infinite many of non-identity orthogonal matrix
Proof. First we construct a orthogonal satisfying (16) and (17) for
Then we complete the proof by generalizing the construction for an arbitrary With , we construct in the following way
where is any non-identity orthogonal matrix with
Condition (16) is satisfied since
and condition (17) is satisfied trivially, i.e.,
For any arbitrary , we apply the Householder transformation to "rotate" it to the basis vector , i.e.,
where P is the Householder transformation operator such as
Therefore from defined in (19) we can construct U by
Finally, we verifies that U satisfies condition (16)) and (17)):
where (34) holds since from (25) we have
A.1 Proof of Theorem 1
Proof: We first show that the conclusion holds for the case when k < K. With initial weight shows that we can find infinite number of non-identity matrix that
Let denotes the ith sample of the data set sampled at jth iteration. We show that for any that yields observations , we can construct another set of data
where U is chosen to satisfy condition (38). Let be observations generated by be weight variables with
We show in the following that for
It is easy to verify (41) for j = 0, i.e.,
From equation (11), we define
Now assuming that condition (41) and (42) hold for
We first show (42) holds for j = J + 1.
where (48) follows from (44) and (45). Note if Q local updates are performed, locally we have
where denotes the qth local update of jth iteration. it is thus easy to show that
Next we show (41) holds for j = J + 1.
Finally we complete the proof by showing the conclusion holds for k = K. Similarly for any , we construct a different solution as follows:
where is a non-identity orthogonal matrix satisfying (39) for k = K. Therefore we have
To begin with, let us first introduce some notations.
Notations. For notation simplicity, we divide the total data set [N] into , each with . Let r denote the iteration index, where each iteration one round of local update is performed among all the nodes; Let
denotes the updated parameters at each iteration of each node. Note that is only a sequence of “virtual" variable that helps the proof, but it is never explicitly formed in the algorithm.
Let denote the latest iteration before r with global communication, so and
Also we denote the full gradient of the loss function as . Further denote the stochastic gradient as
We also need the following important definition. Let us denote the variables updated in the inner loops, using mini-batch . That is, we have the following:
Further define
Note that if is sampled at , then the quantities are not realized in the algorithm. They are introduced only for analytical purposes.
By using these notations, Algorithm 2 can be written in the compact form as in Algorithm 4. Further, the uniform sampling assumption A2 implies the following:
A2’: Unbiased Gradient with Bounded Variance. Suppose Assumption A2 holds. Then we have the following:
where the expectation is taken on the choice of i at iteration , and conditioning all the past histories of the algorithm up to iteration ; Further, the following holds:
where the expectation is taken over the random selection of the data point at iteration is the variance introduced by sampling from a single data point.
To begin our proof, we first present a lemma that bounds the difference of the gradients evaluated at the global vector and the local vector
Lemma 2 Bounded difference of gradient. Under Assumption A1, A2, the difference between the partial gradients evaluated at is bounded by the following:
where is some constant related to the variances of the sampling process.
B.1 Proof of Lemma 2
Proof. First we take the expectation and apply assumption A1,A2
Note that
Notice that and substitute (64) to (63), by taking expectation over , we obtain
where in (i) we used the definition of ; in (ii) we use the fact that ; in the last inequality is true because summing from to is always less that summing from . Next we sum over K, we have
This completed the proof of this result.
B.2 Proof of Theorem 2
Proof. First apply the Lipschitz condition of L, we have
Next we apply the update step in Algorithm 4 and take the expectation
Also we have
Substitute (69) into (68), choose and apply Lemma 2, we have
Average over T and reorganize the terms, let
The proof is completed.