Self-correction has emerged as a popular approach to improve responses from large language models (LLMs) by refining them using LLMs during inference (Bai et al., 2022; Madaan et al., 2023), under the hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022). Extensive studies on self-correction have been conducted in various tasks, including arithmetic reasoning, code generation, and question answering (Gao et al., 2023; Shinn et al., 2023). The simplest approach of self-correction prompts LLMs to provide feedback on their own responses and refine the responses
Figure 1: Self-correction in three stages: initial response generation, feedback, and refinement.
using the feedback (Huang et al., 2024). As in Figure 1, self-correction has also been studied using additional information for improving feedback, including external tools such as code interpreters (Chen et al., 2024c; Gou et al., 2024), external knowledge retrieved via web search (Gao et al., 2023; Jiang et al., 2023b), or fine-tuning (Welleck et al., 2023; Ye et al., 2023). However, recent studies also report negative results indicating that LLMs cannot self-correct (Huang et al., 2024; Gou et al., 2024) or even self-detect (Chen and Shu, 2024; Tyen et al., 2024; Hong et al., 2024; Kamoi et al., 2024) their own mistakes at least in certain conditions. These conflicting observations indicate that further analysis of self-correction is needed.
In this work, we provide a critical survey to investigate the conditions required for successful self-correction. First, our analysis finds that prior studies often do not define their research questions in detail. As a result, many papers fail to provide appropriate experiments to evaluate the research questions they implicitly target. To address this issue, we establish categories of research questions in self-correction research (§3.1) and discuss frameworks that should be used for verifying each research question (§3.2). Finally, we provide a checklist for designing appropriate experiments (§8).
Next, we analyze prior work to identify when LLMs can self-correct their mistakes, using the new definitions of the research questions. Our analysis highlights that the bottleneck is in the feedback generation (§7). Specifically, (1) no prior work shows successful self-correction with feedback from prompted LLMs in general tasks (§4), (2) self-correction works well in tasks where reliable external feedback is available (§5), (3) large-scale fine-tuning enables self-correction (§6), and (4) some tasks have properties exceptionally suitable for self-correction (§4). In summary, our analysis identifies the properties required for successful self-correction as follows:
[RQ1] When can LLMs self-correct based solely
on the inherent capabilities of LLMs?
• In general tasks, no prior work shows reliable evidence of successful self-correction with in-context learning. (§4)
• In tasks with properties that are exceptionally favorable for self-correction (e.g., responses are decomposable), self-correction is effective even with in-context learning. (§4)
[RQ2] When can LLMs self-correct the best-possible initial responses with external feedback?
• Self-correction is effective in tasks where reliable external feedback is available. (§5)
• Fine-tuning is effective with large training data but unexplored for small training data. (§6)
[RQ3] When are the final outputs of self-correction
better than other approaches?
• Self-correction is often not compared with sufficiently strong baselines, and it is still unclear whether it is better than other approaches. (§3.3)
This survey is organized as follows. Section 2 provides an overview of self-correction. Section 3 introduces a new approach to classify research questions and frameworks in self-correction research. Sections 4, 5, and 6 analyze prior work in self-correction with in-context learning, external information, and fine-tuning, respectively. Section 7 summarizes our findings from the analysis. Section 8 provides a checklist for self-correction research. Section 9 explains differences from other surveys. Section 10 provides studies related to self-correction. Section 11 provides future directions.
The term “self-correction” is used in a wide range of scenarios, from a strict definition in which LLMs refine their own responses by themselves (Madaan et al., 2023; Huang et al., 2024) to broader concepts that also involve feedback from external tools or knowledge (Shinn et al., 2023; Gou et al., 2024). In this work, we define self-correction as a framework that refines responses from LLMs by using LLMs during inference, possibly using external tools or knowledge. As shown in Table 1, Figure 2, and Figure 3, self-correction has been studied in various frameworks with different sources of feedback.
2.1 Frameworks Explicit Feedback vs. Direct Refinement. Self-
correction often consists of three stages including feedback generation (Kim et al., 2023; Madaan et al., 2023; Shinn et al., 2023; Huang et al., 2024): • Initial Response Generation is a stage of generating initial responses from an LLM.
• Feedback model generates feedback given the original input and initial response. This stage may use external tools or knowledge.
• Refinement model generates a refined response,
given the input, initial response, and feedback. Direct refinement is another approach that refines responses without generating feedback explicitly (Saunders et al., 2022; Bai et al., 2022; Welleck et al., 2023; Akyurek et al., 2023).
Post-hoc vs. Generation-time. Post-hoc correction refines responses after they are generated (Pan et al., 2024). Generation-time correction or step-level correction (Paul et al., 2024; Jiang et al., 2023b) improves step-by-step reasoning by providing feedback on intermediate reasoning steps. Posthoc correction is more flexible and applicable to broader tasks, although generation-time correction is popular for reasoning tasks (Pan et al., 2024).
Same-model vs. Cross-model. Cross-model correction generates feedback or refines the responses using models different from the model that generates initial responses. Cross-model correction has been mostly studied in the settings of correcting mistakes of large proprietary LLMs using small fine-tuned models (Welleck et al., 2023; Akyurek et al., 2023; Paul et al., 2024) or multi-agent debate of multiple models with similar capabilities (Liang et al., 2023; Li et al., 2023; Cohen et al., 2023b; Du et al., 2023; Zhang et al., 2023a; Chen et al., 2024b; Chan et al., 2024; Wang et al., 2024).
2.2 Sources of Feedback
Intrinsic (§4). Intrinsic self-correction prompts LLMs to generate feedback on their own responses. Prompting strategies include simple zero-shot or
Table 1: Representative studies in self-correction of LLMs. Gray color represents unrealistic settings. : Weak prompts for generating initial responses. FB: Feedback models for cross-model correction.
few-shot prompts (Madaan et al., 2023; Kim et al., 2023), decomposing the responses (Dhuliawala et al., 2023), and evaluating confidence (Varshney et al., 2023; Jiang et al., 2023b; Wu et al., 2024).
External Information (§5). Self-correction often relies on external information, including external tools such as code executors (Jiang et al., 2023a; Gou et al., 2024; Chen et al., 2024c; Stengel- Eskin et al., 2024), symbolic reasoners (Pan et al., 2023), proof assistant (First et al., 2023), or taskspecific metrics (Xu et al., 2023), external knowledge from search engines (Jiang et al., 2023b; Gao et al., 2023; Zhao et al., 2023), Wikipedia (Yu et al., 2023; Zhao et al., 2023), or other corpora (Peng et al., 2023; Zhao et al., 2023), oracle information such as ground-truth answers (Kim et al., 2023; Shinn et al., 2023), human feedback (Chen et al., 2024a), or stronger models (Zhang et al., 2024).
Fine-tuning (§6). Models fine-tuned for self-correction are another source of feedback, which are trained via supervised fine-tuning (Welleck et al., 2023; Ye et al., 2023; First et al., 2023; Paul et al., 2024; Han et al., 2024) or reinforcement learning (Le et al., 2022; Akyurek et al., 2023).
2.3 Tasks
Self-correction has been studied in Reasoning: arithmetic reasoning (Madaan et al., 2023; Nathani et al., 2023; Gou et al., 2024), code generation (Jiang et al., 2023a; Charalambous et al., 2023;
Figure 2: LLM self-correction frameworks, categorized by information used for generating feedback and whether they use best-possible initial responses (§3.2). This figure illustrates representative architectures.
Figure 3: Taxonomy of LLM self-correction, categorized by information used for generating feedback and whether they use best-possible initial responses (fair or unfair). Refer to Section 3.2 for the definitions.
Gou et al., 2024; Chen et al., 2024c; Olausson et al., 2024), proof generation (First et al., 2023), logical reasoning (Pan et al., 2023), Knowledge: closed-book QA (Shinn et al., 2023; Gao et al., 2023; Jiang et al., 2023b; Gou et al., 2024), Contextbased Generation: dialogue generation (Madaan et al., 2023; Peng et al., 2023), text summarization (Saunders et al., 2022), Open-ended Generation: conditional text generation (Ye et al., 2023; Schick et al., 2023), story generation (Yang et al., 2022b), detoxification (Schick et al., 2021; Bai et al., 2022; Gou et al., 2024; Phute et al., 2024), Others: machine translation (Chen et al., 2023b; Raunak et al., 2023; Ki and Carpuat, 2024), information retrieval (Gero et al., 2023), visual instruction following (Lee et al., 2023), and prompt optimization (Pryzant et al., 2023; Mehrabi et al., 2023; Yang et al., 2024).
2.4 Differences from Related Approaches
In this work, we define self-consistency (Wang et al., 2023) or generate-and-rank (Shen et al.,
Table 2: Research questions that prior studies implicitly target by claiming they are verified or
Table 3: Requirements for experiments to verify each research question in Section 3.1.
2021; Weng et al., 2023) to be different from self-correction because these approaches do not refine responses and assume that LLMs generate correct answers with a reasonable probability. We discuss these methods in Section 3.3 as strong baselines.
We provide a new approach to classify research questions and frameworks in self-correction.
3.1 RQs in Self-Correction Research
Prior studies often simply state their research questions as whether LLMs can self-correct their mistakes (e.g., Kim et al., 2023; Madaan et al., 2023). However, we claim that research questions in self-correction research should be defined in more detail. We identify the following research questions implicitly targeted in prior studies, as in Table 2.
RQ Group A. The capability of self-correcting their best-possible initial responses.
• [RQ1] Can LLMs self-correct their best-possible initial responses based solely on the inherent capabilities?
• [RQ2] Can LLMs self-correct their best-possible initial responses assisted by external feedback?
RQ Group B. The performance of the final outputs of self-correction. • [RQ3] Are the final outputs from self-correction
better than other methods? We define the best-possible initial responses as initial responses generated with best effort, using information that self-correction modules can access, such as external tools, knowledge, or fine-tuning.
Requirements. Experiments for verifying these research questions need to satisfy different requirements, as in Table 3. Evaluation: RQ1 and RQ2 only require to show that self-correction improves performance from the initial responses. RQ3 requires comparison with strong baselines (§3.3). Initial Responses: RQ1 should be evaluated on frameworks that refine responses using the same model without additional information. RQ2 needs to be evaluated on frameworks that use the best-possible initial responses. RQ3 only focuses on the final performance, so it is not necessary to use strong initial responses.
Confusion in Prior Work. Some prior studies implicitly target different research questions in a single work without clearly distinguishing them. As in Table 2, Kim et al. (2023) target RQ1 for arithmetic reasoning by comparing self-corrected responses only with initial responses, but they target RQ3 for MiniWoB++ by comparing self-correction with baseline methods. Similarly, Gou et al. (2024) target RQ2 for arithmetic reasoning but target RQ3 for detoxification.
3.2 Frameworks for Verifying RQs
Prior work often categorizes self-correction frameworks based on approaches for generating feedback (§2). However, we point out that they also need to be categorized by the quality of initial responses
because the frameworks we need to use for verifying different research questions vary by whether they use the best-possible initial responses (§3.1).
We propose categories of (same-model) self-correction that correspond to different research questions (§3.1), as shown in Figure 2.
• Realistic: Can be used in real-world applications.
– Fair: Using best-possible initial responses – Unfair: Using sub-optimal initial responses
• Unrealistic: Using information that is not accessible in real-world applications.
In this work, we focus on categorizing self-correction frameworks that do not involve multiple language models with different architectures. Cross-model correction uses different models for initial response generation and self-correction, so it is unsuitable for evaluating whether LLMs can improve their own initial responses [RQ1, RQ2]. However, we note that it can be used to evaluate [RQ3] whether the final responses from self-correction are better than other methods.
Realistic vs. Unrealistic. Some prior studies propose unrealistic self-correction, which cannot be implemented in real-world applications, by using oracle information such as ground-truth answers (Kim et al., 2023; Shinn et al., 2023). These methods cannot be used to verify any research questions.
Fair vs. Unfair. Realistic frameworks can be categorized by whether they use the best-possible initial responses. Fair self-correction represents frameworks that refine the best-possible initial responses. (1) Intrinsic self-correction (Huang et al., 2024) uses the same model and information for initial response generation and self-correction. This framework can assess [RQ1] whether LLMs can self-correct based solely on their inherent capabilities. (2) Fair-asymmetric self-correction uses additional information for self-correction, but also uses information to improve initial response generation as much as possible. For example, self-correction with code interpreters (Chen et al., 2024c; Gou et al., 2024) is not intrinsic but fair because we cannot easily use code interpreters to directly improve the initial response generation. This framework can evaluate [RQ2] whether LLMs can self-correct the best-possible initial responses using external feedback. Unfair self-correction (or unfairasymmetric self-correction) represents frameworks that are practical but do not use the best-possible initial responses. For example, methods that use search engines only for self-correction (Gao et al., 2023; Yu et al., 2023) are unfair because they can use search engines to directly improve the initial response generation. Unfair self-correction can evaluate [RQ3] whether the final responses from self-correction outperform other methods but cannot evaluate [RQ2] whether self-correction can improve the best-possible initial responses.
3.3 Strong Baselines
Self-correction involves multiple LLM calls for generating feedback and refinement. Therefore, to claim that [RQ3] the performance of the final outputs from self-correction frameworks is better than other approaches, it should be compared with sufficiently strong baselines, possibly relying on additional computation. Many studies do not compare their methods with strong baselines, although some studies pointed out this issue by comparing self-correction with self-consistency (Gou et al., 2024; Huang et al., 2024) or pass@k in code generation (Zhang et al., 2023b; Olausson et al., 2024).
Self-Consistency (Wang et al., 2023) generates multiple responses for the same input and takes the majority vote of the final answers in reasoning tasks. The idea of selecting good responses using consistency between multiple responses has been extended to other tasks such as text generation (Manakul et al., 2023; Elaraby et al., 2023; Chen et al., 2023c) and code generation (Shi et al., 2022).
Generate-and-Rank is an approach to generating multiple responses and selecting the best response using verifiers. Post-hoc approach ranks responses using self-evaluation (Weng et al., 2023; Zhang et al., 2023d), confidence (Manakul et al., 2023), fine-tuned verifiers (Cobbe et al., 2021; Shen et al., 2021; Lightman et al., 2024), or verifiers with external tools (Shi et al., 2022; Chen et al., 2023a; Ni et al., 2023). Feedback-guided decoding generates multiple responses and selects the best response for each reasoning step using generation probability (Hao et al., 2023; Tyen et al., 2024), prompted self-evaluation (Jung et al., 2022; Creswell and Shanahan, 2022; Xie et al., 2023; Yao et al., 2023; Miao et al., 2024), or fine-tuned verifiers (Uesato et al., 2022; Tafjord et al., 2022; Yang et al., 2022a; Asai et al., 2024).
Several studies propose intrinsic self-correction methods, which self-correct responses from LLMs by prompting themselves to generate feedback and refine the responses. Bai et al. (2022) propose self-correcting harmful responses from LLMs by prompting themselves. Self-Refine (Madaan et al., 2023) and RCI Prompting (Kim et al., 2023) iteratively prompt LLMs to self-correct their own responses in tasks such as arithmetic reasoning.
Negative Results. However, recent studies report that intrinsic self-correction does not improve or even degrade the performance in tasks such as arithmetic reasoning, closed-book QA (Huang et al., 2024; Gou et al., 2024), code generation (Gou et al., 2024; Olausson et al., 2024), plan genera-
Table 4: Unfair settings in prior studies of self-correction with prompting, over-evaluating self-correction.
tion (Valmeekam et al., 2023), and graph coloring (Stechly et al., 2023). Several studies claim that a bottleneck is in the feedback generation, and it is difficult to generate reliable feedback on their responses only by prompting themselves (Gou et al., 2024; Huang et al., 2024; Olausson et al., 2024).
Unfair Settings. The conflicting positive and negative results motivate us to analyze when LLMs can self-correct only by prompting themselves. Specifically, we assess whether prior studies satisfy the requirements to verify that [RQ1] LLMs can self-correct their responses based solely on their inherent capabilities. As in Table 4, we find that many studies use either oracle information in the self-correction processes (unrealistic frameworks) or weak prompts for generating initial responses (unfair settings), which over-evaluate self-correction. Consequently, we conclude that no major work shows successful self-correction of responses from LLMs using feedback generated by
prompting themselves under fair settings in general
tasks. Oracle Information: RCI Prompting (Kim et al., 2023) uses ground-truth answers and does not apply self-correction when the initial responses are correct, which unfairly ignores mistakes of incorrectly updating correct responses. Reflexion (Shinn et al., 2023) generates feedback by using an exact match between the generated and ground-truth answers, which cannot be accessed in real-world applications. Weak Initial Responses: Bai et al. (2022) propose self-correction for detoxifying harmful responses, but initial response generation is not instructed to generate harmless responses. Although detecting harmful contents using LLMs is a reasonable approach, this setting is not the self-correction from best-possible initial responses. Self-Refine (Madaan et al., 2023) uses instructions or few-shot examples that do not correctly correspond to the target task only for initial response generation, while using appropriate instructions for self-correction, as shown in Table 9 and 10. These settings evaluate improvement from weak initial responses, which over-evaluate self-correction.
Tasks in which Self-Correction is Exceptionally Effective. Although prior studies show that intrinsic self-correction is generally difficult, some tasks have properties that enable self-correction. First, tasks with decomposable responses, whose feedback generation can be reduced into easier tasks, allow to generate reasonable feedback only by prompting LLMs. For example, CoVe (Dhuli- awala et al., 2023) targets tasks of generating multiple answers, such as Who are some politicians who were born in Boston? The feedback generation can be decomposed into easier sub-tasks of verifying each answer. Second, verifiable tasks, in which the correctness of the responses can be verified easily without external information, are suitable for self-correction. For example, tree-of-thought (Yao et al., 2023) is a generate-and-rank method for verifiable tasks, such as Game of 24, a task to find arithmetic operations to obtain 24 using provided four integers. We expect self-correction to work well on these tasks because the situation is identical to using the oracle information to generate feedback (e.g., the target answer is 24). These tasks are exceptionally suitable for self-correction, but many real-world tasks do not satisfy these properties.
Given the observation that feedback generation is a bottleneck of self-correction (§4), improving feedback using external tools or knowledge is a promising direction. External tools used for self-correction include code interpreters for code generation tasks (Chen et al., 2024c; Gou et al., 2024) and symbolic reasoners for logical reasoning tasks (Pan et al., 2023). A popular source of knowledge is search engines, which are often used with queries generated from initial responses to retrieve information for validating their correctness (Gao et al.,
Table 5: Self-correction with external tools or knowledge (with in-context learning).
2023; Jiang et al., 2023b). Prior work widely agrees that self-correction can improve LLM responses when reliable external tools or knowledge suitable for improving feedback are available [RQ2].
Unfair self-correction with external information. Although using external tools or knowledge is effective, we raise caution that the way of using external tools or knowledge influences the research questions we can verify (§3.1). As shown in Table 5, some prior studies (Gao et al., 2023; Yu et al., 2023; Zhao et al., 2023) use external knowledge only for self-correction, while they can also directly use external knowledge to improve initial responses. For example, RARR (Gao et al., 2023) uses external knowledge to detect mistakes in initial responses, while it does not use any external knowledge when generating initial responses. These methods are reasonable when only focusing
on [RQ3] the performance of final responses, but it is not fair to use them for evaluating [RQ2] whether
self-correction can improve from the best-possi-ble initial responses. In contrast, using code interpreters for self-correction (Gou et al., 2024; Chen et al., 2024c) can be regarded as using best-possible initial responses because there is no easy way to improve the initial response generation directly.
Prior work shows that fine-tuning LLMs for generating feedback or refining responses improves the self-correction capability. A common approach fine-tunes feedback models to generate reference feedback given initial responses and fine-tunes refinement models to generate reference answers given the initial responses and reference feedback (Ye et al., 2023; Lee et al., 2023; Saunders et al., 2022). Frameworks: The first approach fine-tunes the same model to correct its own responses. Most studies fine-tune models for all stages: initial responses, feedback, and refinement (Saunders et al., 2022; Ye et al., 2023; Lee et al., 2023). Another approach corrects responses from larger models using smaller fine-tuned models. This cross-model correction approach often instructs the larger models to refine their own responses using feedback by fine-tuned models (Yang et al., 2022b; Welleck et al., 2023; Akyurek et al., 2023; Paul et al., 2024), which can be viewed as using the small fine-tuned models as external feedback. Training Strategies: A popular approach is supervised fine-tuning, which fine-tunes self-correction modules on human-annotated feedback (Saunders et al., 2022), feedback from stronger models (Ye et al., 2023), or synthetic negative responses (Paul et al., 2024). To avoid the cost of collecting human feedback, self-corrective learning (Welleck et al., 2023) selects model-generated feedback that successfully refines responses as training data and RL4L (Akyurek et al., 2023) uses reinforcement-learning. External Tools: Some works fine-tune models to refine responses given feedback from external tools. SelfEdit (Zhang et al., 2023b) uses the results on test cases evaluated by code executors for code generation, and Baldur (First et al., 2023) uses proof assistants for improving proof generation.
Large Training Data for SFT of Feedback. As shown in Table 6, many methods with supervised fine-tuning for feedback generation rely on training data with more than 100K instances. These methods require large-scale human annotations to be implemented on state-of-the-art models. We expect future research to explore approaches that do not require large-scale human annotations (§11).
Unfair Fine-tuning. Some studies (Welleck et al., 2023) apply stronger fine-tuning for self-correction models than initial response generation models, which do not use best-possible initial responses in the available resources (§3.2). These approaches can be used to evaluate [RQ3] the performance of the final responses to compare with other methods but cannot be used to evaluate [RQ2] the improvement from best-possible initial responses.
Table 6: Self-correction with supervised fine-tuning. Most methods require large training datasets. “–” represents no fine-tuning.
Bottleneck is in Feedback Generation. Prior studies widely agree that LLMs can refine their responses given reliable feedback (§5). However, generating reliable feedback on their own responses is still observed to be challenging for LLMs without using additional information (§4), although self-correction is motivated by the hypothesis that recognizing errors is easier than avoiding them (Saun- ders et al., 2022). We suggest that self-correction research analyze the quality of generated feedback in more detail, not only evaluate refined responses.
Tasks Suitable for Self-Correction. Our analysis identifies the properties of tasks that are suitable for self-correction under different conditions.
• Intrinsic Self-Correction (§4)
– Tasks whose verification tasks are much easier than the original tasks (e.g., tasks whose responses are decomposable or verifiable)
• Self-Correction with External Information (§5)
– Tasks for which external tools that provide reliable feedback exist (e.g., code generation)
– Tasks for which responses can be utilized to obtain useful information that is difficult to obtain before generating initial responses (e.g., generate queries from responses to retrieve documents for verifying information)
• Self-Correction with Fine-tuning (§6)
– Self-correction works in many tasks when large train data for self-correction is available
– Tasks that can use reinforcement learning or self-corrective learning (Welleck et al., 2023), i.e., tasks whose responses can be easily evaluated given ground-truth answers
Our analysis shows that many studies do not clearly define their research questions and fail to conduct appropriate experiments (§3.1, 4). To tackle these issues, we provide a checklist for self-correction research that provides requirements for designing appropriate experiments for verifying target RQs and recommended experiments for comprehensive analysis. Table 7 provides a checklist for verifying different RQs identified in Section 3.1. Table 8 provides a checklist for reporting negative results.
Pan et al. (2024) provide a comprehensive survey on broad studies related to self-correction. Our work specifically focuses on (inference-time) self-correction and provides a more detailed and critical analysis of prior work. Huang et al. (2024) analyze problems in evaluation settings of self-correction research, which motivates our work. They focus on analyzing a few papers on intrinsic self-correction in reasoning tasks. We provide a more comprehensive analysis of self-correction with in-context learning, external tools, and fine-tuning.
Self-Detection of mistakes in LLM responses using LLMs (possibly with external information) has been studied in various domains, including misinformation detection (Zhang et al., 2023c; Chern et al., 2023; Chen and Shu, 2024; Mishra et al., 2024), context-faithfulness (Wang et al., 2020; Dur- mus et al., 2020; Scialom et al., 2021), harmful content detection (Rauh et al., 2022), and bias detection (Blodgett et al., 2020; Feng et al., 2023).
Editing Human-Written Text by using language models has been studied in various domains,
Table 8: Checklist for reporting negative results of self-correction.
including information update (Shah et al., 2020; Iv et al., 2022; Schick et al., 2023), grammatical error correction (Ng et al., 2014; Lichtarge et al., 2019), factual error correction (Cao et al., 2020; Thorne and Vlachos, 2021), and code repair (Gupta et al., 2017; Mesbah et al., 2019; Bader et al., 2019; Chen et al., 2021; Yasunaga and Liang, 2020, 2021).
Self-Training or self-improvement is an approach to train models using their responses. Some studies use self-evaluation or self-correction for creating training data (Bai et al., 2022; Gulcehre et al., 2023) or use self-evaluation as training signals (Pang et al., 2024). Another approach improves the reasoning of LLMs using LLM-generated reasoning by selecting high-quality outputs using ground-truth final answers (Zelikman et al., 2022) or self-consistency (Huang et al., 2023). As another direction, Meng et al. (2022) use sentences generated by LLMs with high confidence for training classifiers.
Improving Feedback. Prior studies indicate that it is difficult for LLMs to generate feedback on their own responses with in-context learning (§4, 7). However, most studies in intrinsic self-correction use simple prompts, and there is room for improvement. A possible direction to improve feedback is to apply (reference-free and point-wise) LLMbased evaluation metrics. Recent approaches include using human-written evaluation criteria (Chi- ang and Lee, 2023; Liu et al., 2023) and decomposing responses (Saha et al., 2023; Min et al., 2023). Another direction is to evaluate the confidence in their responses using generation probabilities (Varshney et al., 2023; Jiang et al., 2023b) or generating new questions from their answers to evaluate logical consistency (Jung et al., 2022; Tafjord et al., 2022; Wu et al., 2024).
Unexplored Tasks. Difficulty of self-evaluation differs from task to task (§4), while many studies assume that verification is consistently easier than generation (e.g., Saunders et al., 2022). We expect that there are unexplored tasks in which intrinsic self-correction works well. For example, LLMbased evaluation is often studied in open-ended text generation, such as dialogue generation (Fu et al., 2023; Liu et al., 2023), suggesting that reasonable model-based feedback is available.
Fine-tuning with Small Training Data. Finetuning of feedback generation often relies on large training data, which requires large-scale human annotations (§6). We expect future work to explore self-correction with smaller training data. Although reinforcement learning (Akyurek et al., 2023) or self-corrective learning (Welleck et al., 2023) do not require human feedback, they require reasonable reward functions for evaluating LLM responses, which are not available in many tasks. For example, RL4F (Akyurek et al., 2023) uses ROUGE as a reward function for text summarization and action planning, which is sub-optimal.
Pre-training for Improving Self-Correction. Prior studies show that large-scale fine-tuning on reference feedback improves the self-correction capability of LLMs (§6). This observation suggests that the current approach or datasets for pre-training LLMs are insufficient to make LLMs acquire self-correction capability. We expect future work to explore pre-training strategies to improve the intrinsic self-correction capability of LLMs.
This work was supported by a Cisco Research Grant.
Afra Feyza Akyurek, Ekin Akyurek, Ashwin Kalyan, Peter Clark, Derry Tanti Wijaya, and Niket Tandon. 2023. RL4F: Generating natural language feedback with reinforcement learning for repairing model outputs. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7716–7733, Toronto, Canada. Association for Computational Linguistics.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang., 3(OOPSLA).
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine
Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli TranJohnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technol- ogy) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.
Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi Kit Cheung. 2020. Factual error correc- tion for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6251–6258, Online. Association for Computational Linguistics.
Chi-Min Chan, Weize Chen, Yusheng Su, Jianx- uan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. Chateval: Towards better LLM-based evaluators through multi-agent de- bate. In The Twelfth International Conference on Learning Representations.
Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, and Lucas C. Cordeiro. 2023. A new era in software security: Towards self-healing software via large language models and formal verification. arXiv preprint arXiv:2305.14752.
Angelica Chen, Jérémy Scheurer, Jon Ander Cam- pos, Tomasz Korbak, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. 2024a. Learning from natural language feed- back. Transactions on Machine Learning Research.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023a. Codet: Code generation with gen- erated tests. In The Eleventh International Conference on Learning Representations.
Canyu Chen and Kai Shu. 2024. Can LLM- generated misinformation be detected? In The Twelfth International Conference on Learning Representations.
Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. 2024b. Reconcile: Roundtable conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007.
Pinzhen Chen, Zhicheng Guo, Barry Haddow, and Kenneth Heafield. 2023b. Iterative translation refinement with large language models. arXiv preprint arXiv:2306.03856.
Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. 2023c. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024c. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations.
Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. 2021. Sequencer: Sequence- to-sequence learning for end-to-end program re- pair. IEEE Transactions on Software Engineering, 47(9):1943–1959.
I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu. 2023. Factool: Factuality detection in generative ai – a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.
Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to hu- man evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavar- ian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023a. LM vs LM: Detecting factual errors via cross examination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12621– 12640, Singapore. Association for Computational Linguistics.
Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023b. LM vs LM: Detecting factual errors via cross examination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12621– 12640, Singapore. Association for Computational Linguistics.
Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271.
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation frame- work for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.
Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, and Yuxuan Wang. 2023. Halo: Estimation and
reduction of hallucinations in open-source weak large language models. arXiv preprint arXiv:2308.11764.
Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. From pretraining data to language models to downstream tasks: Track- ing the trails of political biases leading to un- fair NLP models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11737–11762, Toronto, Canada. Association for Computational Linguistics.
Emily First, Markus Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, page 1229–1241, New York, NY, USA. Association for Computing Machinery.
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Luyu Gao, Zhuyun Dai, Panupong Pasupat, An- thony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, DaCheng Juan, and Kelvin Guu. 2023. RARR: Researching and revising what language mod- els say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.
Zelalem Gero, Chandan Singh, Hao Cheng, Tris- tan Naumann, Michel Galley, Jianfeng Gao, and Hoifung Poon. 2023. Self-verification improves few-shot clinical information extraction. In ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH).
Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large language models can self- correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srini- vasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. Deepfix: Fixing com- mon c language errors by deep learning. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).
Haixia Han, Jiaqing Liang, Jie Shi, Qianyu He, and Yanghua Xiao. 2024. Small language model can self-correct. arXiv preprint arXiv:2401.07301.
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Rea- soning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, Singapore. Association for Computational Linguistics.
Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. 2024. A closer look at the self-verification abilities of large language models in logical reasoning. arXiv preprint arXiv:2311.07954.
Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068, Singapore. Association for Computational Linguistics.
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations.
Robert Iv, Alexandre Passos, Sameer Singh, and Ming-Wei Chang. 2022. FRUIT: Faithfully re- flecting updated information in text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies, pages 3670–3686, Seattle, United States. Association for Computational Linguistics.
Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023a. Selfevolve: A code evolution framework via large language models. arXiv preprint arXiv:2306.02907.
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023b. Ac- tive retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, Singapore. Association for Computational Linguistics.
Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Log- ically consistent reasoning with recursive expla- nations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1266–1279, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, and Rui Zhang. 2024. Evaluating llms at detecting errors in llm responses. arXiv preprint arXiv:2404.03602.
Dayeon Ki and Marine Carpuat. 2024. Guiding large language models to post-edit machine translation with error annotations. arXiv preprint arXiv:2404.07851.
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. In Advances in Neural Information Processing Systems, volume 36, pages 39648– 39677. Curran Associates, Inc.
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pre- trained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, pages 21314–21328. Curran Associates, Inc.
Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. 2023. Volcano: mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362.
Ruosen Li, Teerth Patel, and Xinya Du. 2023. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, and Simon Tong. 2019. Corpora generation for grammatical error correction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3291–3301, Minneapolis, Minnesota. Association for Computational Linguistics.
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534–46594. Curran Associates, Inc.
Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large lan- guage models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore. Association for Computational Linguistics.
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, KaiWei Chang, Aram Galstyan, and Rahul Gupta. 2023. Flirt: Feedback loop in-context red teaming. arXiv preprint arXiv:2308.04265.
Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards zero-shot language understand- ing. In Advances in Neural Information Processing Systems, volume 35, pages 462–477. Curran Associates, Inc.
Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian. 2019. Deep- delta: learning to repair compilation errors. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, page 925–936, New York, NY, USA. Association for Computing Machinery.
Ning Miao, Yee Whye Teh, and Tom Rainforth. 2024. Selfcheck: Using LLMs to zero-shot check their own step-by-step reasoning. In The Twelfth International Conference on Learning Representations.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evalu- ation of factual precision in long form text gen- eration. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
Abhika Mishra, Akari Asai, Vidhisha Balachan- dran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. 2024. Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855.
Deepak Nathani, David Wang, Liangming Pan, and William Wang. 2023. MAF: Multi-aspect feed- back for improving reasoning in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6591–6616, Singapore. Association for Computational Linguistics.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Chris- tian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.
Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-Tau Yih, Sida Wang, and Xi Victoria Lin. 2023. LEVER: Learning to verify language-to-code generation with execution. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 26106–26128. PMLR.
Theo X. Olausson, Jeevana Priya Inala, Cheng- long Wang, Jianfeng Gao, and Armando SolarLezama. 2024. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations.
Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore. Association for Computational Linguistics.
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2024. Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies. Transactions of the Association for Computational Linguistics, 12:484–506.
Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. 2024. Language model self-improvement by reinforcement learning con- templation. In The Twelfth International Conference on Learning Representations.
Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2024. REFINER: Reasoning feedback on intermediate representations. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1100–1126, St. Julian’s, Malta. Association for Computational Linguistics.
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. 2024. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308.
Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chen- guang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with “gradient descent” and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore. Association for Computational Linguistics.
Maribeth Rauh, John F J Mellor, Jonathan Ue- sato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, William Isaac, and Lisa Anne Hendricks. 2022. Characteristics of harmful text: Towards rigorous benchmarking of language models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Vikas Raunak, Amr Sharaf, Yiren Wang, Hany Awadalla, and Arul Menezes. 2023. Leveraging GPT-4 for automatic translation post-editing. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12009–12024, Singapore. Association for Computational Linguistics.
Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. 2023. Branch-solve-merge improves large language
model evaluation and generation. arXiv preprint arXiv:2310.15123.
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424.
Timo Schick, Jane A. Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2023. PEER: A collabo- rative language model. In The Eleventh International Conference on Learning Representations.
Thomas Scialom, Paul-Alexis Dray, Sylvain Lam- prier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestE- val: Summarization asks for fact-based eval- uation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Darsh Shah, Tal Schuster, and Regina Barzilay. 2020. Automatic fact-guided sentence modifi- cation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8791–8798.
Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2269–2279, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I. Wang. 2022. Nat- ural language to code translation with execution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3533–3546, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Re- flexion: language agents with verbal reinforce- ment learning. In Advances in Neural Information Processing Systems, volume 36, pages 8634–8652. Curran Associates, Inc.
Kaya Stechly, Matthew Marquez, and Subbarao Kambhampati. 2023. GPT-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning problems. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Elias Stengel-Eskin, Archiki Prasad, and Mohit Bansal. 2024. Regal: Refactoring programs to discover generalizable abstractions. arXiv preprint arXiv:2401.16467.
Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2022. Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2078–2093, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
James Thorne and Andreas Vlachos. 2021. Evidence-based factual error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3298–3309, Online. Association for Computational Linguistics.
Gladys Tyen, Hassan Mansoor, Victor C˘arbune, Pe- ter Chen, and Tony Mak. 2024. Llms cannot find reasoning errors, but can correct them! arXiv preprint arXiv:2311.08516.
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
Karthik Valmeekam, Matthew Marquez, and Sub- barao Kambhampati. 2023. Investigating the effectiveness of self-critiquing in LLMs solving planning tasks. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in
time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. arXiv preprint arXiv:2307.03987.
Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evalu- ate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. 2024. Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? arXiv preprint arXiv:2402.18272.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought rea- soning in language models. In The Eleventh International Conference on Learning Representations.
Sean Welleck, Ximing Lu, Peter West, Faeze Brah- man, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2023. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations.
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. 2023. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore. Association for Computational Linguistics.
Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, and Meng Jiang. 2024. Large language models can self-correct with minimal effort. arXiv preprint arXiv:2405.14092.
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. 2023. Self-evaluation guided beam search for reason- ing. In Thirty-seventh Conference on Neural Information Processing Systems.
Wenda Xu, Danqing Wang, Liangming Pan, Zhen- qiao Song, Markus Freitag, William Yang Wang,
and Lei Li. 2023. INSTRUCTSCORE: Towards explainable text generation evaluation with au- tomatic feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanx- iao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024. Large language models as optimiz- ers. In The Twelfth International Conference on Learning Representations.
Kaiyu Yang, Jia Deng, and Danqi Chen. 2022a. Generating natural language proofs with verifier- guided search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 89–105, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. 2022b. Re3: Generating longer sto- ries with recursive reprompting and revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4393–4479, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc.
Michihiro Yasunaga and Percy Liang. 2020. Graph- based, self-supervised program repair from di- agnostic feedback. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10799–10808. PMLR.
Michihiro Yasunaga and Percy Liang. 2021. Break- it-fix-it: Unsupervised learning for program re- pair. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11941–11952. PMLR.
Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sung- dong Kim, Hyeonbin Hwang, and Minjoon Seo. 2023. Selfee: Iterative self-revising llm empow- ered by self-feedback generation. Blog post.
Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. 2023. Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, volume 35, pages 15476–15488. Curran Associates, Inc.
Jintian Zhang, Xin Xu, and Shumin Deng. 2023a. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124.
Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023b. Self-edit: Fault-aware code editor for code generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 769–787, Toronto, Canada. Association for Computational Linguistics.
Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. 2023c. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534.
Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-Tau Yih, Daniel Fried, and Sida Wang. 2023d. Coder reviewer reranking for code generation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 41832–41846. PMLR.
Yunxiang Zhang, Muhammad Khalifa, Lajanu- gen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, and Lu Wang. 2024. Small language models need strong verifiers to self-correct reasoning. arXiv preprint arXiv:2404.17140.
Ruochen Zhao, Xingxuan Li, Shafiq Joty, Cheng- wei Qin, and Lidong Bing. 2023. Verify-and- edit: A knowledge-enhanced chain-of-thought framework. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5823–5840, Toronto, Canada. Association for Computational Linguistics.
Table 9: Prompts for Dialogue Response Generation used in Self-Refine (Madaan et al., 2023). Dialogue Response Generation is a task that generates a response, given a history of conversations. Prompts used by Madaan et al. (2023) for generating initial responses instruct to generate responses that are not interesting and not very engaging, which are contradicting to the task goal. They unfairly instruct the models to generate initial responses that have problems intentionally, over-evaluating self-correction performance. Prompts for generating initial responses: https://github.com/madaan/ self-refine/blob/main/src/responsegen/task_init.py and feedback: https://github.com/ madaan/self-refine/blob/main/src/responsegen/feedback.py. Few-shot examples for generating initial responses: https://github.com/madaan/self-refine/blob/main/data/prompt/ responsegen/init.jsonl and feedback: https://github.com/madaan/self-refine/blob/main/ data/prompt/responsegen/feedback.jsonl.
Table 10: Few-shot examples in prompts for the Sentiment Reversal task (positive to negative) used in Self-Refine (Madaan et al., 2023). Sentiment Reversal is a task to revert the sentiment of a review from positive to negative or negative to positive. Few-shot examples for generating initial responses include examples in settings different from the target task (positive to negative), while all few-shot examples for refinement are positive to negative. The few-shot examples used by
Madaan et al. (2023) for generating initial responses unfairly have different properties from the tar-
get task. Prompts for initial responses: https://github.com/madaan/self-refine/blob/main/src/
sentiment_reversal/task_init.py and refinement: https://github.com/madaan/self-refine/