36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2406.02959","publisher":"arxiv","paperJSON":{"title":"Adversarial Moment-Matching Distillation of Large Language Models","paperID":"2406.02959","avgLineHeight":10.89,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Large language models (LLMs) like GPT-4 [","element":"span"},{"href":"#id-0","referenceIndex":1,"text":"1","element":"a"},{"text":"] and LLaMA [","element":"span"},{"href":"#id-1","referenceIndex":32,"text":"32","element":"a"},{"text":"] have revolutionized natural language processing, significantly enhancing the quality of text generation across various tasks. This success is largely due to the extensive scale of training data and the substantial increase in model parameters [","element":"span"},{"href":"#id-2","referenceIndex":17,"text":"17","element":"a"},{"text":"]. However, the high computational and memory requirements of these models present significant challenges for practical deployment. To address these issues, knowledge distillation (KD) [","element":"span"},{"href":"#id-3","referenceIndex":15,"text":"15","element":"a"},{"text":"] has emerged as a key technique. KD involves transferring knowledge from a large, complex teacher model to a smaller, more efficient student model, thereby maintaining high performance while reducing resource demands. 
Most distillation methods for auto-regressive text generation models, including LLMs, employ metrics of probability distribution distance, such as Kullback-Leibler (KL) divergence [","element":"span"},{"href":"#id-4","referenceIndex":18,"text":"18","element":"a"},{"text":"] and reverse KL divergence [","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"], aiming to align the token-level probability distributions between the teacher and student models.","element":"span"}],[{"text":"The distribution matching-based distillation methods can be viewed as behavior cloning on a decision-making problem from the perspective of imitation learning [","element":"span"},{"href":"#id-6","referenceIndex":21,"text":"21","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"]. Based on this concept, early works based on the teacher-generated outputs [","element":"span"},{"href":"#id-4","referenceIndex":18,"text":"18","element":"a"},{"text":"] or a supervised dataset [","element":"span"},{"href":"#id-8","referenceIndex":27,"text":"27","element":"a"},{"text":"] can be viewed as an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"off-policy ","element":"span"},{"text":"approach. 
Recent works further incorporate an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on-policy ","element":"span"},{"text":"approach, training the student on its self-generated outputs [","element":"span"},{"href":"#id-6","referenceIndex":21,"text":"21","element":"a"},{"text":"], using KL-based divergence [","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":19,"text":"19","element":"a"},{"text":"] and total variation (TV) distance [","element":"span"},{"href":"#id-10","referenceIndex":35,"text":"35","element":"a"},{"text":"]. Accordingly, such distribution matching-based methods face the sub-optimality problem. The objective functions aimed at aligning the probability distributions between the teacher and student models can be straightforward but cannot fully capture the goal of distilling language knowledge. First, intuitively, the correct output for an input can vary, and thus behavior cloning cannot capture the full knowledge of a teacher. Besides, there is no standardized definition for the quality of a generated output given an input, which makes it difficult to define the objective of knowledge distillation. This","element":"span"}],[{"style":{"width":"93%"},"width":1490,"height":590,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/1-0.png","element":"img"}],[{"id":"id-12","text":"Figure 1: The comparison between the distribution-matching-based distillation and the action-value ","element":"figcaption","subtype":"caption"},{"text":"moment-matching distillation is outlined. 
","element":"figcaption","subtype":"caption"},{"style":{"height":13.59},"width":156.19,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/1-1.png","element":"img","alt":" πθ and π∗","inline":true,"padRight":true},{"text":"denote the student policy and the teacher policy, respectively. For both on-policy (using student-generated outputs) and off-policy (using teacher-generated outputs) perspectives, our approach optimizes moment-matching of action-value functions (","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Q","element":"figcaption","subtype":"caption"},{"text":"-functions) instead of minimizing the distribution distance measured by ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"M ","element":"figcaption","subtype":"caption"},{"text":"= KL, RKL, TV, etc.","element":"figcaption","subtype":"caption"}],[{"text":"imposes a significant limitation on the generalization performance of the student model through distillation.","element":"span"}],[{"text":"To address the aforementioned issues, we employ a reinforcement learning (RL) formulation for the auto-regressive text generation problem and utilize the definition of imitation gap to describe the high-level goal of knowledge distillation. Additionally, we address the imitation gap for KD by matching moments of the action-value function, which reflects the quality of token-level predictions for the entire output. In addressing the action-value function, we adopt the approach of Swamy et al. [","element":"span"},{"href":"#id-11","referenceIndex":30,"text":"30","element":"a"},{"text":"], considering a two-player minimax game between the language policy and the action-value functions, aiming to minimize an upper bound of the moment-matching objective. For this purpose, we introduce an adversarial training algorithm based on the policy gradient to jointly optimize the on-/off-policy objectives. 
Figure ","element":"span"},{"href":"#id-12","text":"1 ","element":"a"},{"text":"illustrates the overall approach.","element":"span"}],[{"text":"We evaluate our approach on both the instruction-following dataset and three task-specific datasets for text summarization, machine translation, and commonsense reasoning. Results demonstrate that the proposed adversarial moment-matching approach effectively optimizes the moment-matching distance of the imitation gap and outperforms state-of-the-art KD methods and a range of distribution-matching-based methods. The code and implementation are released at ","element":"span"},{"href":"https://github.com/jiachenwestlake/MMKD","text":"https://github.com/ ","element":"a"},{"href":"https://github.com/jiachenwestlake/MMKD","text":"jiachenwestlake/MMKD","element":"a"},{"text":".","element":"span"}]]},{"heading":"2 Related Work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Distillation of large language models. ","element":"span"},{"text":"There has been an increasing interest in knowledge distillation (KD) of auto-regressive LMs, especially concerning large language models (LLMs) [","element":"span"},{"href":"#id-13","referenceIndex":37,"text":"37","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":38,"text":"38","element":"a"},{"text":"]. This process effectively transfers elicited knowledge from teacher LLMs to smaller student models, aiming to compress the large size of neural network parameters and make LLMs more efficient. Sequence-level KD (SeqKD) [","element":"span"},{"href":"#id-4","referenceIndex":18,"text":"18","element":"a"},{"text":"] is a variation of supervised fine-tuning (SFT) in KD. It can be viewed as the simplest method for distillation of black-box LLMs by fine-tuning the student model with teacher-generated outputs. 
This method has been extensively used for LLMs and has achieved success [","element":"span"},{"href":"#id-15","referenceIndex":31,"text":"31","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":5,"text":"5","element":"a"},{"text":"]. In contrast, distillation of white-box LLMs can make full use of internal information of the teacher model, such as logits [","element":"span"},{"href":"#id-8","referenceIndex":27,"text":"27","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":35,"text":"35","element":"a"},{"text":"] and hidden states [","element":"span"},{"href":"#id-17","referenceIndex":20,"text":"20","element":"a"},{"text":"], for distribution alignment, making it more effective and efficient for KD. However, unlike previous work that explicitly clones the distribution of teacher LLMs into student models, this work learns an auxiliary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-value function to guide KD.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Distillation via distribution matching. ","element":"span"},{"text":"The most promising results in the distillation of white-box LLMs are achieved by minimizing the divergence between the probability distributions of the teacher and student models. 
Kullback-Leibler (KL) divergence, reverse Kullback-Leibler (RKL) divergence, and Jensen–Shannon (JS) divergence are three widely used KD objectives for auto-regressive LMs ","element":"span"},{"text":"[","element":"span"},{"href":"#id-10","referenceIndex":35,"text":"35","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":19,"text":"19","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":37,"text":"37","element":"a"},{"text":"]. Wen et al. [","element":"span"},{"href":"#id-10","referenceIndex":35,"text":"35","element":"a"},{"text":"] have shown the equivalent formulations of sequence-level KL, RKL, JS divergences, and the step-wise terms. Additionally, they also present the strong performance of step-wise total variation (TV) distance for KD, which can upper bound the sequence-level term. As a result, most recent works focus on on-policy approaches for KD [","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"] and combine the real-time-generated outputs by students (on-policy) with the real-time-generated outputs by teachers (or from supervised datasets) (off-policy). Following this line, Gu et al. [","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"] further propose a policy gradient-based method to address the high variance issues of RKL-based methods, while Ko et al. [","element":"span"},{"href":"#id-9","referenceIndex":19,"text":"19","element":"a"},{"text":"] propose a more efficient and effective method using a skew KL divergence loss and an adaptive off-policy approach. 
We also focus on a combination of on-policy and off-policy objectives for KD, but we introduce a more sophisticated moment-matching approach instead of directly using the well-studied distribution-matching metrics such as KL, RKL, JS divergences, and TV distance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Distillation via reinforcement learning. ","element":"span"},{"text":"In a common formulation of RL in text generation [","element":"span"},{"href":"#id-18","referenceIndex":40,"text":"40","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":23,"text":"23","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":14,"text":"14","element":"a"},{"text":"], an auto-regressive model can be viewed as a language policy, making decisions on the next token (action) based on the currently generated sequence (state). From this perspective, KD corresponds to behavior cloning in imitation learning [","element":"span"},{"href":"#id-4","referenceIndex":18,"text":"18","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":6,"text":"6","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"]. For imitation learning in text generation, early works such as SeqGAN ","element":"span"},{"href":"#id-18","referenceIndex":40,"text":"[40] ","element":"a"},{"text":"and TextGAIL ","element":"span"},{"href":"#id-22","referenceIndex":36,"text":"[36] ","element":"a"},{"text":"utilize a generative adversarial framework to balance between the reward model, optimized by discriminating generated/real-world text, and the language policy, optimized by policy gradient-based methods using the reward model. 
Existing work on KD via imitation learning includes ImitKD [","element":"span"},{"href":"#id-6","referenceIndex":21,"text":"21","element":"a"},{"text":"], which optimizes the student policy by learning from demonstrations of the teacher model. RL-based distillation can also be especially relevant for leveraging the feedback from the teacher to train student models [","element":"span"},{"href":"#id-23","referenceIndex":3,"text":"3","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"7","element":"a"},{"text":"], in which teacher models are used to generate the feedback data for training a reward model. We build our method upon an RL-based imitation learning framework. However, unlike previous work [","element":"span"},{"href":"#id-4","referenceIndex":18,"text":"18","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"13","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"], we propose an adversarial moment-matching approach to enhance behavior cloning.","element":"span"}]]},{"heading":"3 Method","paragraphs":[[{"id":"id-27","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Notations and Definitions","element":"span"}],[{"text":"In this section, we consider the text generation task as a decision-making process and give a corresponding reinforcement learning (RL) formulation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Text generation. ","element":"span"},{"text":"Given an input ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x","element":"span"},{"text":", the auto-regressive generation task in our work aims to generate a sequence of tokens as the output ","element":"span"},{"style":{"height":16},"width":357.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/2-0.png","element":"img","alt":" (y1, . . 
. , yT ), where yt","inline":true,"padRight":true},{"text":"comes from a vocabulary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V","element":"span"},{"text":". For simplicity, we define ","element":"span"},{"style":{"height":16},"width":337.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/2-1.png","element":"img","alt":" y = (y0, y1, . . . , yT )","inline":true,"padRight":true},{"text":"as the full input-output sequence, where ","element":"span"},{"style":{"height":10.4},"width":118.55,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/2-2.png","element":"img","alt":" y0 = x","inline":true},{"text":". The generator is modeled by a conditional probability distribution ","element":"span"},{"style":{"height":18.48},"width":448,"height":46.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02959/images/2-3.png","element":"img","alt":" pθ(y|x) = ΠTt=1pθ(yt|y