36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2406.02106","publisher":"arxiv","paperJSON":{"title":"MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset","paperID":"2406.02106","avgLineHeight":13.55,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"To enable Large Language Models (LLMs) to function as conscious agents with generalizable reasoning capabilities, it is crucial that they possess the ability to comprehend ","element":"span"},{"style":{"fontStyle":"italic"},"text":"situational changes (transitions) in distribution ","element":"span"},{"text":"triggered by environmental factors or actions from other agents. Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling infinitely many possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. Addressing these gaps, we propose a novel formulation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with distributional changes ","element":"span"},{"text":"as a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"three-step discriminative process","element":"span"},{"text":", termed ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"MetAphysical ReaSoning","element":"span"},{"text":". We then introduce the first-ever benchmark, ","element":"span"},{"text":"M","element":"span"},{"text":"ARS","element":"span"},{"text":", comprising three tasks corresponding to each step. These tasks systematically assess LLMs’ capabilities in reasoning about the plausibility of (i) changes in actions, (ii) states caused by changed actions, and (iii) situational transitions driven by changes in action. Extensive evaluations with 20 (L)LMs of varying sizes and methods indicate that all three tasks in this process pose significant challenges, even after fine-tuning. 
","element":"span"},{"text":"Further analyses reveal potential causes for the underperformance of LLMs and demonstrate that pre-training on large-scale conceptualization taxonomies can potentially enhance LMs’ metaphysical reasoning capabilities. ","element":"span"},{"text":"Our data and models are publicly accessible at ","element":"span"},{"href":"https://github.com/HKUST-KnowComp/MARS","text":"https://github.com/HKUST- ","element":"a"},{"href":"https://github.com/HKUST-KnowComp/MARS","text":"KnowComp/MARS","element":"a"},{"text":".","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Recent advances in LLMs have demonstrated superior performance in a variety of reasoning tasks (","element":"span"},{"href":"#id-0","referenceIndex":42,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":42,"text":"2023b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"Chan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":37,"text":"Ko et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":37,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"Qin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":32,"text":"Jain et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":32,"text":"2023","element":"a"},{"text":"). 
However, to truly achieve conscious processing (","element":"span"},{"href":"#id-5","referenceIndex":2,"text":"Andreas","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":2,"text":"2022","element":"a"},{"text":"), the integration of System II reasoning ability (","element":"span"},{"href":"#id-6","referenceIndex":58,"text":"Sloman","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":58,"text":"1996","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":35,"text":"Kahneman","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":35,"text":"2011","element":"a"},{"text":") is essential, as","element":"span"}],[{"id":"id-36","style":{"width":"99%"},"width":872,"height":1165,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/0-0.png","element":"img"}],[{"text":"Figure 1: Examples of changes to an event under our formulation. After changes occur, events may become ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"metaphysical ","element":"figcaption","subtype":"caption"},{"text":"as components are abstracted into high-level concepts, while some remain plausible in reality.","element":"figcaption","subtype":"caption"}],[{"text":"it enables LLMs to perform out-of-distribution generalization when they encounter unfamiliar scenarios (","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"Bengio et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"2021","element":"a"},{"text":"). 
Among the components that make up System II reasoning, a critical one is the ability to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reason with situational changes in distribution","element":"span"},{"text":", triggered by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"environmental factors ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"actions by themselves or other agents","element":"span"},{"text":", when dealing with non-stationarities (","element":"span"},{"href":"#id-9","referenceIndex":8,"text":"Bengio","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":8,"text":"2017","element":"a"},{"text":"). It serves as the core ability in planning tasks (","element":"span"},{"href":"#id-10","referenceIndex":30,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":30,"text":"2024","element":"a"},{"text":"), which can be achieved by dynamically recombining existing concepts in the given environment or action and learning from the resultant situational changes (","element":"span"},{"href":"#id-11","referenceIndex":38,"text":"Lake and Baroni","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":38,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":5,"text":"Bahdanau et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":5,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":17,"text":"de Vries et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":17,"text":"2019","element":"a"},{"text":"). 
For instance, in the event that “PersonX is driving a car on a sunny day,” a change in the weather from sunny to rainy could cause a different outcome, such as “PersonX becomes more cautious and drives slower.” This illustrates that a change in weather conditions can lead to a change in the driver’s behavior, which represents an environmental change that triggers situational changes within the distribution of weather conditions.","element":"span"}],[{"text":"Though this ability is fundamental, its exploration has been limited by several factors. First, the scope for change within an event is vast, with numerous components capable of altering in a wide variety of ways. This results in an overwhelmingly large number of potential changes that are impossible to fully cover with existing knowledge bases. Second, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution ","element":"span"},{"text":"lacks a clear formulation due to its complexity. Unlike one-step inference reasoning tasks (","element":"span"},{"href":"#id-14","referenceIndex":56,"text":"Sap ","element":"a"},{"href":"#id-14","referenceIndex":56,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":56,"text":"2019","element":"a"},{"text":"), changes in action may lead to implausible events that cannot occur in reality, thus terminating the reasoning process. Such changes require extra care when designing evaluation protocols. Lastly, there is a lack of a reliable evaluation benchmark. 
Existing benchmarks (","element":"span"},{"href":"#id-15","referenceIndex":64,"text":"Valmeekam ","element":"a"},{"href":"#id-15","referenceIndex":64,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":64,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"He et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"2023b","element":"a"},{"text":") typically focus on a limited number of changes within a few scenarios, thus limiting the coverage of formed distributions. The changes in actions and states are also formulated under planning or logical tasks, which neglect transitions (consequences) caused by changes.","element":"span"}],[{"text":"To address these gaps, we take a step forward by formally defining ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution ","element":"span"},{"text":"as a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"three-step discriminative process","element":"span"},{"text":". We start by defining seven categories of changes, each corresponding to different components within an event. To semantically cover more changes in a unified manner, we propose implementing changes by altering each component within the event using their abstractions or numerical variations. This approach creates a hierarchical distribution of various changes, with the abstracted ones offering a more generalized coverage. Inspired by ","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"Bengio ","element":"a"},{"href":"#id-8","referenceIndex":9,"text":"et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"2021","element":"a"},{"text":"), we formulate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution ","element":"span"},{"text":"as sequentially tasking the model to: (1) assess the plausibility of a potential change in a given event that describes an action, (2) evaluate the plausibility of an inferential state resulting from the modified action, and (3) determine the necessary change in an action to convert an implausible inferential state into a plausible one. We refer to this process as ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"metaphysical reasoning","element":"span"},{"text":"–a term we adopt to describe a mode of reasoning that deals ","element":"span"},{"text":"with highly improbable or abstract scenarios distinct from its traditional philosophical meaning or counterfactual reasoning (see Appendix ","element":"span"},{"text":"A","element":"span"},{"text":")–as it also requires models to distinguish implausible actions, states, and transitions that exist only in this abstract “metaphysical” realm, indicating their rare occurrence in reality (","element":"span"},{"href":"#id-17","referenceIndex":27,"text":"Heidegger","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":27,"text":"2014","element":"a"},{"text":").","element":"span"}],[{"text":"We then construct the first evaluation benchmark, ","element":"span"},{"text":"M","element":"span"},{"text":"ARS","element":"span"},{"text":", featuring 355K annotated data across three tasks corresponding to each step. 
It is constructed by sequentially instructing an LLM to extract events from Wikitext (","element":"span"},{"href":"#id-18","referenceIndex":45,"text":"Merity et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":45,"text":"2017","element":"a"},{"text":") and BookCorpus (","element":"span"},{"href":"#id-19","referenceIndex":84,"text":"Zhu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":84,"text":"2015","element":"a"},{"text":"), identify mutable components within each event, generate abstractions and numerical variations for those components, create a metaphysical inference state based on the changes, and generate the necessary modifications to make the metaphysical inference plausible in reality. Large-scale human annotations are then conducted to provide labels of evaluation data entries and verify the quality of our benchmark. Extensive experiments with over 20 (L)LMs demonstrate that all three tasks in this process present significant challenges, even for LMs after fine-tuning. Further analyses reveal potential reasons for such underperformance and identify possible solutions for enhancing the metaphysical reasoning abilities of language models.","element":"span"}]]},{"heading":"2 Backgrounds and Related Works","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Reasoning about Changes in Distribution. ","element":"span"},{"text":"Enabling LMs to understand distributional changes due to localized causal interventions, particularly in semantic spaces, has long been a crucial objective in the pursuit of conscious machine intelligence (","element":"span"},{"href":"#id-20","referenceIndex":10,"text":"Bengio et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":10,"text":"2019","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"2021","element":"a"},{"text":"). 
Previous work has mainly explored this within the context of discriminating changes between actions and states with methods such as commonsense knowledge injection (","element":"span"},{"href":"#id-21","referenceIndex":60,"text":"Tandon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":60,"text":"2018","element":"a"},{"text":"), event calculus (","element":"span"},{"href":"#id-22","referenceIndex":7,"text":"Basina ","element":"a"},{"href":"#id-22","referenceIndex":7,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":7,"text":"2022","element":"a"},{"text":"), and fuzzy reasoning (","element":"span"},{"href":"#id-23","referenceIndex":82,"text":"Zhang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":82,"text":"2013","element":"a"},{"text":"). Other studies aim to benchmark this reasoning process through logical reasoning tasks (","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"He ","element":"a"},{"href":"#id-16","referenceIndex":26,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"2023b","element":"a"},{"text":") and planning tasks (","element":"span"},{"href":"#id-15","referenceIndex":64,"text":"Valmeekam et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":64,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":75,"text":"Wu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":75,"text":"2021","element":"a"},{"text":"). However, these studies only cover changes in limited formats and scenarios and also overlook the significance of representing changes as a distribution in relation to different variables in actions. This omission restricts the out-of-distribution generalizability of the resulting LMs when facing unfamiliar scenarios. 
Moreover, previous evaluations do not cover transitions caused by changes, making subsequent evaluations around reasoning with changes incomplete.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Benchmarking LLMs. ","element":"span"},{"text":"The advent of LLMs (","element":"span"},{"href":"#id-25","referenceIndex":47,"text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":47,"text":"2022","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":48,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":63,"text":"Touvron et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":63,"text":"2023b","element":"a"},{"text":",","element":"span"},{"href":"#id-28","referenceIndex":62,"text":"a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-29","referenceIndex":55,"text":"Reid ","element":"a"},{"href":"#id-29","referenceIndex":55,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":55,"text":"2024","element":"a"},{"text":") has sparked various studies investigating LLMs’ potential in a variety of tasks (","element":"span"},{"href":"#id-30","referenceIndex":15,"text":"Chen ","element":"a"},{"href":"#id-30","referenceIndex":15,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":15,"text":"2024b","element":"a"},{"text":",","element":"span"},{"href":"#id-31","referenceIndex":14,"text":"a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":80,"text":"Yuan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":80,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"Chan et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-1","referenceIndex":13,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":32,"text":"Jain et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":32,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"Qin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"2023","element":"a"},{"text":"). These studies have significantly contributed to our understanding of LLMs by evaluating their performance across diverse tasks, using different scales of parameters and prompting methods (","element":"span"},{"href":"#id-33","referenceIndex":52,"text":"Qiao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":52,"text":"2023","element":"a"},{"text":"). However, there is an absence of a comprehensive benchmark for assessing the ability of (L)LMs to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reason with changes in distribution","element":"span"},{"text":". This inspires us to formally define it and introduce the first benchmark that evaluates such reasoning capabilities of (L)LMs.","element":"span"}]]},{"heading":"3 Definitions of Changes in Event and Metaphysical Reasoning","paragraphs":[[{"text":"Modeling changes within an event is inherently complex due to the infinite number of changes that can occur. For simplicity, we only consider events that represent an action and study changes between their inferential states. Given an event ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":", we first define seven types of changes that could transpire within ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e","element":"span"},{"text":". 
These changes are represented as components of the event, including its subject ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", verb ","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":", object ","element":"span"},{"style":{"fontStyle":"italic"},"text":"o","element":"span"},{"text":", temporal quantifier ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", spatial quantifier ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":", numerical properties ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", and sub-events ","element":"span"},{"style":{"fontStyle":"italic"},"text":"se","element":"span"},{"text":". The original event is denoted as a function of these seven components, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, v, o, t, l, n, se","element":"span"},{"text":")","element":"span"},{"text":". A change in the event can be represented by altering one of its components, for instance, ","element":"span"},{"style":{"height":17.6},"width":538.42,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-0.png","element":"img","alt":" e′ = f(s′, v, o, t, l, n, se) if the","inline":true,"padRight":true},{"text":"change impacts the subject ","element":"span"},{"style":{"height":8.4},"width":42.63,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-1.png","element":"img","alt":" s′.","inline":true}],[{"text":"To effectively model the distribution of changes across different types of components, we leverage two types of hierarchical formulations. 
Specifically, for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, v, o, se","element":"span"},{"text":", we define changes in these components as conceptualizing their original instance into three concepts with progressively increasing abstractness (","element":"span"},{"href":"#id-34","referenceIndex":23,"text":"Giunchiglia and Walsh","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":23,"text":"1992","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":61,"text":"Tenenbaum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":61,"text":"2011","element":"a"},{"text":"). For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, l, n","element":"span"},{"text":", we define their changes as modifications from their original values to three distinct numerical or spatial values with progressively increasing units. This brings a hierarchical structure to changes of a certain component, forming a distribution that gradually covers more possible changes. Abstracted components, as high-level concepts, can semantically represent a broader range of combinations for altering an event. Some running examples of how changes impact an action are shown in Figure ","element":"span"},{"href":"#id-36","text":"1","element":"a"},{"text":". We then propose a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"three-step discriminative process","element":"span"},{"text":", which we term ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Metaphysical Reasoning ","element":"span"},{"text":"(see Appendix ","element":"span"},{"text":"A","element":"span"},{"text":"), to formulate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution","element":"span"},{"text":". 
The three steps, as shown in Figure ","element":"span"},{"href":"#id-37","text":"2","element":"a"},{"text":", are:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(1) Metaphysical Event Discrimination: ","element":"span"},{"text":"The first step answers the question, “Will the change happen in reality?” It aims to determine the plausibility of a change based on a given event, as alterations in components may lead to implausible events that defy reality. We refer to such an event, which rarely occurs in reality due to these changes, as a ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"metaphysical event","element":"span"},{"text":". The goal of the first task is to discriminate whether the modified event ","element":"span"},{"style":{"height":10.4},"width":129.65,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-2.png","element":"img","alt":" e′, con-","inline":true,"padRight":true},{"text":"ditioned on the original event e with a single altered component ","element":"span"},{"style":{"height":17.6},"width":383.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-3.png","element":"img","alt":" c ∈ (s, v, o, t, l, n, se)","inline":true},{"text":", is metaphysical or not by making a binary prediction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(2) Metaphysical Inference Discrimination: ","element":"span"},{"text":"Considering that distributional changes occur in nonstationary environments, a conscious agent should be able to predict the potential outcomes of the modified event for future reasoning scenarios. Therefore, the second step aims to answer the question, “What will the altered event result in?” Similarly, we term the inferences of an event that rarely occurs in reality as ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"metaphysical inference","element":"span"},{"text":". 
The objective of the second task is to determine whether an inferential state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", triggered by the altered event ","element":"span"},{"style":{"height":8.4},"width":36.32,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-4.png","element":"img","alt":"e′","inline":true},{"text":", is metaphysical or not by predicting a binary answer. Note that ","element":"span"},{"style":{"height":8.4},"width":36.32,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-5.png","element":"img","alt":" e′","inline":true,"padRight":true},{"text":"could be either metaphysical or not, as inferences in both cases can be evaluated.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(3) Metaphysical Transition Reasoning: ","element":"span"},{"text":"Finally, since some inferences remain metaphysical, a conscious agent should be able to plan what change is necessary to make such an inference plausible in reality. This completes the reasoning chain by covering the feasibility, consequence, and motivation of distributional changes. 
Thus, the last task answers the question, “What change is needed to make a metaphysical inference plausible?” We refer to this as ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"metaphysical transition reasoning ","element":"span"},{"text":"and set the objective to determine whether another change, denoted as ","element":"span"},{"style":{"height":8.4},"width":34.88,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-6.png","element":"img","alt":" c′","inline":true},{"text":", can make a metaphysical inference ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"plausible in relation to a changed event ","element":"span"},{"style":{"height":16.4},"width":187.26,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-7.png","element":"img","alt":" e′ by mak-","inline":true,"padRight":true},{"text":"ing a binary prediction regarding ","element":"span"},{"style":{"height":8.4},"width":41.06,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/2-8.png","element":"img","alt":" c′.","inline":true}],[{"id":"id-37","style":{"width":"99%"},"width":1813,"height":495,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-0.png","element":"img"}],[{"text":"Figure 2: The three steps in metaphysical reasoning. Our motivation is that, by completing all steps sequentially, a conscious agent could answer: (1) Will the change occur in reality? (2) What will the change cause? 
(3) What change can make a ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"metaphysical ","element":"figcaption","subtype":"caption"},{"text":"(desired) inference plausible?","element":"figcaption","subtype":"caption"}]]},{"heading":"4","paragraphs":[[{"text":"We then introduce our sequential pipeline for curating the ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark. An overview of our curation pipeline is shown in Appendix Figure ","element":"span"},{"href":"#id-38","text":"5","element":"a"},{"text":". To guarantee a comprehensive coverage of events across various domains and topics, we source original text from two publicly available large corpora: Wikitext (","element":"span"},{"href":"#id-18","referenceIndex":45,"text":"Merity et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":45,"text":"2017","element":"a"},{"text":") and BookCorpus (","element":"span"},{"href":"#id-19","referenceIndex":84,"text":"Zhu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":84,"text":"2015","element":"a"},{"text":"). 
We filter out noisy text that includes hashtags and hyperlinks and segment long text into sentences with no more than 200 tokens to facilitate future processing.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Text Decomposition and Extraction","element":"span"}],[{"text":"We first perform text decomposition (","element":"span"},{"href":"#id-39","referenceIndex":78,"text":"Ye et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":78,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":33,"text":"Jhamtani et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":33,"text":"2023","element":"a"},{"text":") to break down lengthy text into semantically complete short events, which are then used for fine-grained component extraction. To enable large-scale processing, we use ChatGPT (","element":"span"},{"href":"#id-25","referenceIndex":47,"text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":47,"text":"2022","element":"a"},{"text":"), a powerful LLM with strong text understanding abilities, as the core processor for all stages. 
For each stage, we guide it with a few-shot prompt (","element":"span"},{"href":"#id-41","referenceIndex":73,"text":"West et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":73,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":12,"text":"Brown et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":12,"text":"2020","element":"a"},{"text":") by creating task-specific explanations and exemplars (detailed prompts are in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":"):","element":"span"}],[{"style":{"width":"92%"},"width":806,"height":291,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-1.png","element":"img"}],[{"text":"To perform text decomposition, ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"clarifies the goal to ChatGPT, which involves extracting semantically complete actions from the given text. ","element":"span"},{"style":{"height":13.42},"width":232.06,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-2.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.82},"width":253.87,"height":34.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-3.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"are filled with 10 pairs of human-crafted examples, each containing several action events extracted from text sampled from Wikitext and BookCorpus. 
ChatGPT is expected to learn from these examples and use them as a guide to extract action events (","element":"span"},{"style":{"height":17.95},"width":319.91,"height":44.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-4.png","element":"img","alt":"","inline":true},{"text":") from the final input text (","element":"span"},{"style":{"height":13.42},"width":188.78,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-5.png","element":"img","alt":"","inline":true},{"text":"). For component extraction, we adjust ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"to define the task of extracting the seven components from a given event. We populate ","element":"span"},{"style":{"height":13.42},"width":232.06,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-6.png","element":"img","alt":"","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.82},"width":253.88,"height":34.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-7.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"with 10 pairs of events and seven comma-separated lists of components extracted from the event, each corresponding to one type of components defined in §","element":"span"},{"text":"3","element":"span"},{"text":". ChatGPT then extracts seven lists of components for the final given event (","element":"span"},{"style":{"height":13.42},"width":188.77,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-8.png","element":"img","alt":"","inline":true},{"text":"). 
If any type of component is absent, “None” will be generated instead.","element":"span"}],[{"id":"id-44","style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Component Abstraction and Variation","element":"span"}],[{"text":"The next step is designed to implement changes within the event by altering its components, extracted from the previous step, by generating their abstractions or numerical variations. Following ","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"Wang et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"2024b","element":"a"},{"text":"), we guide ChatGPT by modifying ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"with the objective of generating abstract concepts for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, v, o, se ","element":"span"},{"text":"and numerical variations for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, l, n ","element":"span"},{"text":"within a specified event. For each ","element":"span"},{"style":{"height":16},"width":714.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-9.png","element":"img","alt":" and pair, we","inline":true,"padRight":true},{"text":"populate the input with a specific event and one of its components. The output consists of three human-authored component abstractions or numerical variations that align with the event’s context. Subsequently, ChatGPT is tasked with generating three abstractions or numerical variations for the final pair of the given event and a component within the event (","element":"span"},{"style":{"height":13.42},"width":188.78,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/3-10.png","element":"img","alt":"","inline":true},{"text":"). 
Replacing the original components in the event with their generated changes forms changed event candidates for the metaphysical event discrimination task.","element":"span"}],[{"id":"id-48","style":{"width":"99%"},"width":1799,"height":443,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-0.png","element":"img"}],[{"text":"Table 1: Statistics of the M","element":"figcaption","subtype":"caption"},{"text":"ARS ","element":"figcaption","subtype":"caption"},{"text":"benchmark in comparison against other benchmarks. Meta. refers to three tasks in M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":". Expert. refers to expert verification results.","element":"figcaption","subtype":"caption"}],[{"id":"id-45","style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference Generation","element":"span"}],[{"text":"We then collect inferential states of the modified events by similarly instructing ChatGPT to autonomously generate them. For each altered event, we prompt ChatGPT to separately generate one plausible inference and one metaphysical inference. We first modify ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"to generate a state that could potentially be caused by the altered event, and populate ","element":"span"},{"style":{"height":13.42},"width":232.06,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-1.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"with 10 modified events and ","element":"span"},{"style":{"height":13.82},"width":253.88,"height":34.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-2.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"with 10 corresponding plausible inferences authored by human experts. 
ChatGPT is then requested to generate an additional plausible state inference for the given changed event (","element":"span"},{"style":{"height":13.42},"width":188.78,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-3.png","element":"img","alt":"","inline":true},{"text":"). Next, we adjust ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"to generate a metaphysical state that is infrequently caused by the changed event in reality, yet remains contextually relevant. We replace ","element":"span"},{"style":{"height":13.82},"width":253.88,"height":34.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-4.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"with 10 metaphysical inferences and then collect a metaphysical inference from ChatGPT. This, along with the generated plausible inference, forms two candidate data entries for each changed event in the metaphysical inference discrimination task.","element":"span"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"4.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Metaphysical Transition Generation","element":"span"}],[{"text":"Given that half of the inferential states generated in the previous step remain metaphysical, we then collect the additional changes necessary to transform these states into plausible real-world inferences. We adjust the ","element":"span"},{"style":{"fontWeight":"bold"},"text":" ","element":"span"},{"text":"to describe such required changes and populate ","element":"span"},{"style":{"height":13.42},"width":232.06,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-5.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"with 10 pairs of modified events and their corresponding metaphysical inferences. 
","element":"span"},{"style":{"height":13.82},"width":253.88,"height":34.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-6.png","element":"img","alt":" ","inline":true,"padRight":true},{"text":"are filled with 10 corresponding human-authored changes in events that can render the inferences plausible. Subsequently, ChatGPT generates the required change for the final pair of the modified event and its metaphysical inference (","element":"span"},{"style":{"height":13.42},"width":188.78,"height":33.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-7.png","element":"img","alt":"","inline":true},{"text":"). Note that the generated change still needs to be one","element":"span"}],[{"id":"id-50","style":{"width":"79%"},"width":700,"height":699,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/4-8.png","element":"img"}],[{"text":"Figure 3: Hypernym distribution of the top 5,000 popular component variations.","element":"figcaption","subtype":"caption"}],[{"text":"of the seven types we defined in §","element":"span"},{"text":"3","element":"span"},{"text":". We collect one additional change for each metaphysical inference and use it as a candidate data entry for the last task. However, we discard event and inference pairs that ChatGPT deems impossible to render plausible, even with an additional change.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Human Annotations","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Annotation: ","element":"span"},{"text":"Finally, we carry out large-scale human annotations to label candidate data for each task via Amazon Mechanical Turk (AMT). 
We provide detailed instructions with examples to qualified workers and task them with annotating (1) the plausibility of the changed events generated in §","element":"span"},{"href":"#id-44","text":"4.2","element":"a"},{"text":", (2) the plausibility of the plausible/metaphysical inferences produced in §","element":"span"},{"href":"#id-45","text":"4.3","element":"a"},{"text":", and (3) the plausibility of the transitions generated in §","element":"span"},{"href":"#id-46","text":"4.4","element":"a"},{"text":". We collect five votes for each entry and use the majority vote as the final label. The overall inter-annotator agreement (IAA) is 81% in terms of pairwise agreement, and the Fleiss Kappa (","element":"span"},{"href":"#id-47","referenceIndex":21,"text":"Fleiss","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":21,"text":"1971","element":"a"},{"text":") is 0.56, indicating sufficient agreement (see Appendix ","element":"span"},{"text":"D","element":"span"},{"text":"). ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Expert Verification: ","element":"span"},{"text":"To verify the quality of our ","element":"span"},{"text":"collected labels, we recruit three postgraduate students with rich experience in NLP to perform a second round of annotation. Each of them is asked to annotate a sample of 100 data entries for each task, following the same instructions provided to the AMT annotators. 
Results in Table ","element":"span"},{"href":"#id-48","text":"1 ","element":"a"},{"text":"show that, on average, 93.67% of the labels collected from human annotation align with the experts’ votes, demonstrating the reliability of our collected labels.","element":"span"}]]},{"heading":"5 Evaluations and Analysis","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"5.1","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-48","text":"1 ","element":"a"},{"text":"presents statistics of the M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark, which comprises a total of 355,617 annotated data entries distributed across three tasks. We partition the annotated data into training, development, and testing splits following an 8:1:1 ratio, ensuring there is no overlap of text and events between the different splits to preserve the evaluation’s generalizability. On average, 1.04 tokens are generated to describe changes in action for the metaphysical event and transition discrimination tasks, while 10.4 tokens are used for inferences in the metaphysical inference discrimination task. To the best of our knowledge, we are the first to propose such a triad of tasks within a single benchmark. To compare M","element":"span"},{"text":"ARS ","element":"span"},{"text":"with other datasets, we select those with analogous task objectives for each task and compare them individually. We find that M","element":"span"},{"text":"ARS ","element":"span"},{"text":"tends to be significantly larger than other benchmarks, covering a broader range of events and providing training sets for evaluating the performance of fine-tuned models. 
To further illustrate the diverse coverage of events and changes in M","element":"span"},{"text":"ARS","element":"span"},{"text":", we match each component variation against hypernyms in Probase (","element":"span"},{"href":"#id-49","referenceIndex":76,"text":"Wu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":76,"text":"2012","element":"a"},{"text":") and plot their distribution according to their number of occurrences in Figure ","element":"span"},{"href":"#id-50","text":"3","element":"a"},{"text":". Our results indicate that M","element":"span"},{"text":"ARS ","element":"span"},{"text":"covers over 170,000 hypernyms in Probase, spanning broad categories such as event, activity, concept, unit, etc.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Main Evaluations on","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Task Setup and Model Selections","element":"span"}],[{"text":"We then experiment with a selection of (L)LMs to investigate their performance on our curated M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark. ","element":"span"},{"text":"Accuracy, AUC, and Macro-F1 scores are used as evaluation metrics.","element":"span"}],[{"text":"The evaluations of different models are categorized into three types: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Z","element":"span"},{"style":{"fontWeight":"bold"},"text":"ERO","element":"span"},{"style":{"fontWeight":"bold"},"text":"-","element":"span"},{"style":{"fontWeight":"bold"},"text":"SHOT","element":"span"},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"We first evaluate several (L)LMs in a zero-shot manner. 
For small-sized Pre-Trained Language Mod","element":"span"},{"text":"els (PTLMs), we evaluate DeBERTa-v3 (","element":"span"},{"href":"#id-51","referenceIndex":25,"text":"He et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":25,"text":"2023a","element":"a"},{"text":"), GPT2 (","element":"span"},{"href":"#id-52","referenceIndex":54,"text":"Radford et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":54,"text":"2019","element":"a"},{"text":"), CAR (","element":"span"},{"href":"#id-53","referenceIndex":66,"text":"Wang ","element":"a"},{"href":"#id-53","referenceIndex":66,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":66,"text":"2023a","element":"a"},{"text":"), CANDLE (","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"2024b","element":"a"},{"text":"), and VERA (","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"2023a","element":"a"},{"text":"), following the design of zero-shot question answering (","element":"span"},{"href":"#id-55","referenceIndex":44,"text":"Ma et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":44,"text":"2021","element":"a"},{"text":"). 
For LLMs, we evaluate LLaMa2, LLaMa3, LLaMa3.1 (","element":"span"},{"href":"#id-28","referenceIndex":62,"text":"Touvron et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":62,"text":"2023a","element":"a"},{"text":",","element":"span"},{"href":"#id-27","referenceIndex":63,"text":"b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"Dubey et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"2024","element":"a"},{"text":"), Gemma (","element":"span"},{"href":"#id-57","referenceIndex":46,"text":"Mesnard et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":46,"text":"2024","element":"a"},{"text":"), Falcon (","element":"span"},{"href":"#id-58","referenceIndex":1,"text":"Al","element":"a"},{"href":"#id-58","referenceIndex":1,"text":"mazrouei et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":1,"text":"2023","element":"a"},{"text":"), and Mistral (","element":"span"},{"href":"#id-59","referenceIndex":34,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":34,"text":"2023","element":"a"},{"text":") using direct zero-shot prompting (","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"Qin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":53,"text":"2023","element":"a"},{"text":"). ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) F","element":"span"},{"style":{"fontWeight":"bold"},"text":"INETUNING","element":"span"},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"We then assess the performance of (L)LMs when fine-tuned on the training set of M","element":"span"},{"text":"ARS","element":"span"},{"text":". For PTLMs, we fine-tune DeBERTa, GPT2-xl, and VERA. 
For LLMs, we fine-tune LLaMa2, LLaMa3, Gemma, and Mistral using LoRA (","element":"span"},{"href":"#id-60","referenceIndex":28,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":28,"text":"2022","element":"a"},{"text":"). ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) LLM API: ","element":"span"},{"text":"Finally, we evaluate the performance of GPT-4 (","element":"span"},{"href":"#id-26","referenceIndex":48,"text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":48,"text":"2023","element":"a"},{"text":") and GPT-4o-mini (","element":"span"},{"href":"#id-61","referenceIndex":49,"text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":49,"text":"2024","element":"a"},{"text":"), which represent proprietary LLMs, under zero-shot, five-shot, Chain-of-Thought prompting (COT; ","element":"span"},{"href":"#id-62","referenceIndex":72,"text":"Wei et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":72,"text":"2022","element":"a"},{"text":"), and Self-Consistent COT (SC-COT; ","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"2023c","element":"a"},{"text":") settings. 
For LLaMa3.1-70B and GPT-4o-mini, we also test their performances with RAG (","element":"span"},{"href":"#id-64","referenceIndex":22,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":22,"text":"2023","element":"a"},{"text":"), Multiagent Calibration (","element":"span"},{"href":"#id-65","referenceIndex":77,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-65","referenceIndex":77,"text":"2024","element":"a"},{"text":"), and Self Reflection (","element":"span"},{"href":"#id-66","referenceIndex":51,"text":"Pan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":51,"text":"2024","element":"a"},{"text":"). Please find implementation details in Appendix ","element":"span"},{"text":"C","element":"span"},{"text":", multi-task fine-tuning experiments in Appendix ","element":"span"},{"href":"#id-67","text":"E.1","element":"a"},{"text":", and few-shot fine-tuning experiments in Appendix ","element":"span"},{"href":"#id-68","text":"E.2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Results and Analysis","element":"span"}],[{"text":"Evaluation results are reported in Table ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":". From the results, we observe that: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Most models exhibit subpar performance under the zero-shot setting. ","element":"span"},{"text":"Among PTLMs, only VERA delivers acceptable results across all three tasks, while the rest significantly underperform. Though models fine-tuned on commonsense knowledge and conceptualizations, such as CAR and CANDLE, show some improvement compared to their DeBERTa-v3-Large backbone, these performances are still unsatisfactory, even falling below the level of majority voting. 
For LLMs, improving training paradigms and increasing the number of parameters can indeed help achieve better performance. Nevertheless, all models perform poorly across all tasks in M","element":"span"},{"text":"ARS","element":"span"},{"text":", emphasizing the difficulty of our tasks. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) Fine-tuning only offers limited benefits. ","element":"span"},{"text":"With fine-tuning, all models improve significantly. For example, DeBERTa-Large’s accuracy increases by 16.18%, 21.84%, and 22.2% on the three tasks, respectively. However, the best results for all tasks are","element":"span"}],[{"id":"id-69","style":{"width":"98%"},"width":1797,"height":1857,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/6-0.png","element":"img"}],[{"text":"Table 2: Evaluation results (%) of various language models on the testing sets of M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":". The best performances within each method are underlined and the best among all methods are ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"bold-faced","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"still capped at around 74%, indicating a shared difficulty and significant room for future enhancements. One potential reason for this is that, since we split the data according to the source of text in Wikitext and BookCorpus, the distributions of different splits may differ significantly, as their domains and topics can diverge. We also discuss the reasons for PTLMs’ strong performance compared to LLMs after fine-tuning in Appendix ","element":"span"},{"href":"#id-70","text":"E.3","element":"a"},{"text":". 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) The GPT series models underperform compared to other LLMs, and COT does not consistently aid performance. ","element":"span"},{"text":"Surprisingly, GPT series models fall short when compared to open LLMs, such as LLaMa-3-70B. One possi","element":"span"},{"text":"ble explanation is that negative examples in M","element":"span"},{"text":"ARS ","element":"span"},{"text":"are sourced from ChatGPT’s generation and are identified through subsequent human annotation. This makes them challenging to discriminate, as these negative examples contradict GPT’s internal knowledge. Advanced prompting methods offer only limited improvement in performance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Analysis","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Transferring from Conceptualization","element":"span"}],[{"text":"Improving the performance of LLMs on M","element":"span"},{"text":"ARS ","element":"span"},{"text":"requires extensive fine-tuning on large-scale human-annotated data, making it non-trivial. Since we observe that approximately 80% of action changes","element":"span"}],[{"id":"id-73","style":{"width":"98%"},"width":1797,"height":712,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/7-0.png","element":"img"}],[{"text":"Table 3: Evaluation results (%) of transferring knowledge from CANDLE to aid M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":". 
The best performance within each method is underlined and the best among all methods is ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"bold-faced","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"are executed by modifying a component along with its abstracted concepts (see Table ","element":"span"},{"href":"#id-71","text":"5","element":"a"},{"text":"), we first study whether exposing LLMs to more conceptualizations and abstract knowledge can enhance their metaphysical reasoning capabilities. ","element":"span"},{"text":"For this purpose, we select CANDLE (","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"2024b","element":"a"},{"text":") as the knowledge source, which is an automatically constructed knowledge base containing 382K conceptualizations of events and abstract inferential knowledge. ","element":"span"},{"text":"We first convert event-conceptualization pairs into the task format of metaphysical event discrimination and reformat commonsense inferential knowledge to align with the objectives of the metaphysical inference and transition discrimination tasks. More details are in Appendix ","element":"span"},{"href":"#id-72","text":"C.2","element":"a"},{"text":".","element":"span"}],[{"text":"Three backbone models are then fine-tuned separately on CANDLE and M","element":"span"},{"text":"ARS","element":"span"},{"text":". Another group is sequentially fine-tuned on CANDLE and then on M","element":"span"},{"text":"ARS","element":"span"},{"text":". All models are then evaluated on the testing set of M","element":"span"},{"text":"ARS","element":"span"},{"text":", with the results reported in Table ","element":"span"},{"href":"#id-73","text":"3","element":"a"},{"text":". 
","element":"span"},{"text":"From the results, a significant improvement is observed across all tasks when the models are sequentially fine-tuned on CANDLE and M","element":"span"},{"text":"ARS","element":"span"},{"text":", compared to solely fine-tuning on CANDLE or M","element":"span"},{"text":"ARS","element":"span"},{"text":".","element":"span"}],[{"text":"These findings indicate that the transfer of conceptualizations and abstract knowledge from CANDLE effectively enhances the performance of LMs in metaphysical reasoning tasks. Since CANDLE is constructed by distilling from an LLM without human labor, this opens up a scalable and cost-efficient approach to improving the metaphysical reasoning capabilities of LLMs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Impact of Component Types","element":"span"}],[{"text":"We then analyze the performance of LLMs on each component type to understand the reasons for their subpar performance. We select LLaMa-3-8B as the representative model and compare its accuracy on each component type when fine-tuned on M","element":"span"},{"text":"ARS ","element":"span"},{"text":"and CANDLE + M","element":"span"},{"text":"ARS","element":"span"},{"text":". The results are illustrated in Figure ","element":"span"},{"href":"#id-74","text":"4","element":"a"},{"text":". We observe that while pre-training the model on CANDLE consistently enhances performance, LLaMa3 still struggles when reasoning with changes in spatial quantifiers, temporal quantifiers, and numerical properties. 
This is in line with recent studies that demonstrate weaknesses in temporal and numerical reasoning for LLMs (","element":"span"},{"href":"#id-75","referenceIndex":59,"text":"Tan ","element":"a"},{"href":"#id-75","referenceIndex":59,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-75","referenceIndex":59,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-76","referenceIndex":57,"text":"Shi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-76","referenceIndex":57,"text":"2023","element":"a"},{"text":"). Another possible reason is that since CANDLE only contains conceptualizations for subjects, verbs, objects, and sub-events in social events, pre-training models on it cannot provide benefits for the aforementioned aspects of change. Moreover, we only observe limited improvement for the metaphysical event discrimination task. Future work could focus on how to further enhance LLMs’ metaphysical reasoning capabilities in these weaker dimensions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Error Analysis of GPT-Series Models","element":"span"}],[{"text":"Finally, we select GPT4 as a representative model and conduct a manual analysis to identify the causes of errors by categorizing the mistakes found in its COT responses. 
We sample 150 COT responses from each task, all of which yield predictions inconsistent with the human-annotated labels, and classify the errors as follows: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Hallucinations","element":"span"},{"text":": 41.7% of errors are caused by factual or metaphysical hallucinations","element":"span"}],[{"id":"id-74","style":{"width":"99%"},"width":1815,"height":354,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/8-0.png","element":"img"}],[{"text":"Figure 4: Performances by component types of fine-tuned LLaMa3-8B on three tasks of M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"by GPT4, where it creates a context that accommodates changes in actions and inferences that are not mentioned in the original text. For instance, in the event “The poet enjoys writing poems about western festivals,” GPT4 incorrectly interprets the poet as Du Fu. This leads to a conflict when reasoning about his life and the subsequent inference “He was famous in the west,” resulting in faulty reasoning. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) Confusion between Concepts and Hypernyms","element":"span"},{"text":": 36.3% of errors are attributed to GPT4’s tendency to perceive abstract components within changed actions as hypernyms that fulfill the change, without considering all potential entities within the original concept. For instance, in a modified event, “He jumps down from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"very high altitude ","element":"span"},{"text":"and lands peacefully,” GPT4 interprets ","element":"span"},{"style":{"fontStyle":"italic"},"text":"very high altitude ","element":"span"},{"text":"as a diving platform, deeming it plausible. 
However, this concept could also encompass high buildings, which would not be suitable for the event. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) Internal Conflict","element":"span"},{"text":": 17.7% of errors are attributed to internal conflicts within GPT4’s reasoning rationales, as well as inconsistencies between the binary predictions made and the corresponding reasoning rationales. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(4) Annotation Error","element":"span"},{"text":": 4.3% of errors are erroneously identified because of incorrect labels, potentially caused by spamming or a misunderstanding of the task by human annotators.","element":"span"}]]},{"heading":"6 Conclusions","paragraphs":[[{"text":"In conclusion, this paper proposes ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Metaphysical Reasoning ","element":"span"},{"text":"to delineate the process of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution ","element":"span"},{"text":"and constructs","element":"span"}],[{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"as the associated evaluation benchmark in a non-trivial manner. Our experiments show the challenge of our task, which advanced prompting and fine-tuning cannot easily solve. Analysis reveals why LMs struggle with metaphysical reasoning and suggests a possible improvement. 
We hope to illuminate the path toward achieving conscious processing in LLMs through System II reasoning by effectively comprehending changes in distribution.","element":"span"}]]},{"heading":"Limitations","paragraphs":[[{"text":"Though we consider our work to be a fundamental step towards understanding the capabilities of LMs in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"reasoning with changes in distribution","element":"span"},{"text":", we acknowledge several limitations that cannot be covered within a single work. Here, we discuss some important limitations that future work can address:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(1) Include more types of changes in our current formulation. ","element":"span"},{"text":"In our work, we primarily focus on seven types of changes, covering the subject, verb, object, spatial quantifier, temporal quantifier, numerical properties, and sub-events of the event. While these seven types encompass most of the potential changes, there are other uncovered components within an event that can be impacted by changes, such as adjectives, adverbs, and prepositional phrases. Nevertheless, our flexible and automated benchmark curation pipeline, empowered by an LLM, allows future research to extend the benchmark to cover a broader range of component types.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(2) Reliance on LLMs for benchmark curation. ","element":"span"},{"text":"Our data construction process relies significantly on ChatGPT, an expensive and proprietary language model used for data collection, as well as human annotation for data verification. In Appendix ","element":"span"},{"href":"#id-77","text":"B.3","element":"a"},{"text":", we discuss the feasibility of using open-source LLMs as a replacement for ChatGPT to reduce cost and promote reproducibility. 
Future research could also consider utilizing robust open-source language models (","element":"span"},{"href":"#id-29","referenceIndex":55,"text":"Reid et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":55,"text":"2024","element":"a"},{"text":") and general statement plausibility estimators (","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"Liu ","element":"a"},{"href":"#id-54","referenceIndex":41,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"2023a","element":"a"},{"text":") to replace these methods.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(3) Solution and Downstream Applications of Metaphysical Reasoning. ","element":"span"},{"text":"While this paper establishes a comprehensive evaluation benchmark for metaphysical reasoning, we leave to future work the exploration of practical solutions that help LLMs solve metaphysical reasoning tasks, as well as the potential benefits of metaphysical reasoning for downstream tasks. These tasks ","element":"span"},{"text":"may include planning (","element":"span"},{"href":"#id-78","referenceIndex":81,"text":"Yuan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":81,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-79","referenceIndex":50,"text":"Ouyang ","element":"a"},{"href":"#id-79","referenceIndex":50,"text":"and Li","element":"a"},{"text":", ","element":"span"},{"href":"#id-79","referenceIndex":50,"text":"2023","element":"a"},{"text":") or reasoning with changes (","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"He et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"2023b","element":"a"},{"text":").","element":"span"}]]},{"heading":"Ethics Statement","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Offensive Content Elimination. 
","element":"span"},{"text":"Our benchmark curation pipeline, which involves generating content with ChatGPT, necessitates stringent measures to ensure the absence of offensive content in both the prompts and the generated responses. For this purpose, we apply two strategies to eliminate offensive content. First, we use the highest level of Azure AI Content Safety Filter to filter out any content that contains personal privacy, promotes violence, racial discrimination, hate speech, sexual content, or self-harm. If any such unsafe content is detected in the prompts or generated responses, it automatically triggers a system failure, which prevents the inclusion of such data in our dataset. Second, we manually inspect a random sample of 500 data entries from three tasks in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"for offensive content. Based on our annotations, we have not detected any offensive content. We thus believe that our dataset is safe and will not yield any negative societal impact.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Licenses. ","element":"span"},{"text":"We will share our code and models under the MIT license, thereby granting other researchers free access to our assets for research purposes. Other datasets used in this paper, including Wikitext and Bookcorpus, are shared under the CC-SA license, permitting us to use them for research. As for language models, we access all open-source LMs via the Huggingface Hub (","element":"span"},{"href":"#id-80","referenceIndex":74,"text":"Wolf ","element":"a"},{"href":"#id-80","referenceIndex":74,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":74,"text":"2020","element":"a"},{"text":"). All associated licenses permit user access for research purposes, and we have agreed and committed to follow all terms of use.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Annotations. 
","element":"span"},{"text":"We conduct large scale human annotations on the Amazon Mechanical Turk (AMT) platform. We invite annotation workers from the US, Europe, and India due to their proficiency in English. The annotators are paid on average at an hourly rate of 19 USD, which is comparable to the minimum wages in the US. The selection of these annotators is solely based on their performance on the evaluation set, and we do not collect any personal information about the participants from AMT. For expert verifications, we have secured IRB approval and support from our institution’s department, which allows us to invite expert graduate students to validate the quality of our data. They all agree to participate voluntarily without being compensated. We have made concerted efforts to eliminate offensive content, thereby ensuring that no annotators are offended.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We thank the anonymous reviewers and the area chair for their constructive comments. The authors of this paper were supported by the ITSP Platform Research Project (ITS/189/23FP) from the ITC of Hong Kong, SAR, China, as well as the AoE (AoE/E-601/24-N), the RIF (R6021-20), and the GRF (16205322) from the RGC of Hong Kong, SAR, China.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-58","text":"Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Al- ","element":"span"},{"text":"shamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2311.16867","text":"The falcon series of open language ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2311.16867","text":"models","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2311.16867.","element":"span"}],[{"id":"id-5","text":"Jacob Andreas. 2022. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2022.FINDINGS-EMNLP.423","text":"Language models as agent mod- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2022.FINDINGS-EMNLP.423","text":"els","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022","element":"span"},{"text":", pages 5769–5779. Association for Computational Linguistics.","element":"span"}],[{"id":"id-107","text":"Anthropic. 2024. ","element":"span"},{"href":"https://www.anthropic.com/news/claude-3-family","text":"Introducing the next generation of ","element":"a"},{"href":"https://www.anthropic.com/news/claude-3-family","text":"claude","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Anthropic Announcements","element":"span"},{"text":".","element":"span"}],[{"id":"id-81","text":"Aristotle Aristotle and Aristotle. 1933. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Metaphysics","element":"span"},{"text":", volume 1. Harvard University Press Cambridge, MA.","element":"span"}],[{"id":"id-12","text":"Dzmitry ","element":"span"},{"text":"Bahdanau, ","element":"span"},{"text":"Shikhar ","element":"span"},{"text":"Murty, ","element":"span"},{"text":"Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron C. Courville. 2019. ","element":"span"},{"href":"https://openreview.net/forum?id=HkezXnA9YX","text":"Systematic gener- ","element":"a"},{"href":"https://openreview.net/forum?id=HkezXnA9YX","text":"alization: What is required and can it be learned? 
","element":"a"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019","element":"span"},{"text":". OpenReview.net.","element":"span"}],[{"id":"id-104","text":"Jiaxin Bai, Xin Liu, Weiqi Wang, Chen Luo, and ","element":"span"},{"text":"Yangqiu Song. 2023. ","element":"span"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/6174c67b136621f3f2e4a6b1d3286f6b-Abstract-Conference.html","text":"Complex query answering on ","element":"a"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/6174c67b136621f3f2e4a6b1d3286f6b-Abstract-Conference.html","text":"eventuality knowledge graph with implicit logical ","element":"a"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/6174c67b136621f3f2e4a6b1d3286f6b-Abstract-Conference.html","text":"constraints","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023","element":"span"},{"text":".","element":"span"}],[{"id":"id-22","text":"Nena Basina, Theodore Patkos, and Dimitris Plex- ","element":"span"},{"text":"ousakis. 2022. ","element":"span"},{"href":"https://doi.org/10.1007/978-3-030-93547-4_20","text":"ECAVI: an assistant for reasoning ","element":"a"},{"href":"https://doi.org/10.1007/978-3-030-93547-4_20","text":"about actions and change with the event calculus","element":"a"},{"text":". In Dimitris Karagiannis, Moonkun Lee, Knut Hinkelmann, and Wilfrid Utz, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Domain-Specific Conceptual Modeling - Concepts, Methods and ADOxx Tools","element":"span"},{"text":", pages 457–477. Springer.","element":"span"}],[{"id":"id-9","text":"Yoshua Bengio. 2017. 
","element":"span"},{"href":"https://arxiv.org/abs/1709.08568","text":"The consciousness prior","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1709.08568.","element":"span"}],[{"id":"id-8","text":"Yoshua Bengio, Yann LeCun, and Geoffrey E. Hin- ","element":"span"},{"text":"ton. 2021. ","element":"span"},{"href":"https://doi.org/10.1145/3448250","text":"Deep learning for AI","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Commun. ACM","element":"span"},{"text":", 64(7):58–65.","element":"span"}],[{"id":"id-20","text":"Yoshua Bengio et al. 2019. From system 1 deep learning ","element":"span"},{"text":"to system 2 deep learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Information Processing Systems","element":"span"},{"text":".","element":"span"}],[{"id":"id-82","text":"Henri Bergson. 1999. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An introduction to metaphysics","element":"span"},{"text":". Hackett Publishing Company.","element":"span"}],[{"id":"id-42","text":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie ","element":"span"},{"text":"Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. ","element":"span"},{"href":"https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html","text":"Language models are few-shot learners","element":"a"},{"text":". 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual","element":"span"},{"text":".","element":"span"}],[{"id":"id-1","text":"Chunkit Chan, Jiayang Cheng, Weiqi Wang, Yuxin ","element":"span"},{"text":"Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. 2024. ","element":"span"},{"href":"https://aclanthology.org/2024.findings-eacl.47","text":"Exploring the potential of ChatGPT on sentence ","element":"a"},{"href":"https://aclanthology.org/2024.findings-eacl.47","text":"level relations: A focus on temporal, causal, and ","element":"a"},{"href":"https://aclanthology.org/2024.findings-eacl.47","text":"discourse relations","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, March 17-22, 2024","element":"span"},{"text":", pages 684–721. Association for Computational Linguistics.","element":"span"}],[{"id":"id-31","text":"Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. ","element":"span"},{"text":"2024a. ","element":"span"},{"href":"https://doi.org/10.1609/AAAI.V38I16.29728","text":"Benchmarking large language models in ","element":"a"},{"href":"https://doi.org/10.1609/AAAI.V38I16.29728","text":"retrieval-augmented generation","element":"a"},{"text":". ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada","element":"span"},{"text":", pages 17754–17762. 
AAAI Press.","element":"span"}],[{"id":"id-30","text":"Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, and ","element":"span"},{"text":"Zhendong Mao. 2024b. ","element":"span"},{"href":"https://doi.org/10.1609/AAAI.V38I16.29734","text":"Benchmarking large ","element":"a"},{"href":"https://doi.org/10.1609/AAAI.V38I16.29734","text":"language models on controllable generation under ","element":"a"},{"href":"https://doi.org/10.1609/AAAI.V38I16.29734","text":"diversified instructions","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada","element":"span"},{"text":", pages 17808–17816. AAAI Press.","element":"span"}],[{"text":"Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau ","element":"span"},{"text":"Yih, and Peter Clark. 2018. ","element":"span"},{"href":"https://doi.org/10.18653/V1/N18-1144","text":"Tracking state changes in ","element":"a"},{"href":"https://doi.org/10.18653/V1/N18-1144","text":"procedural text: a challenge dataset and models for ","element":"a"},{"href":"https://doi.org/10.18653/V1/N18-1144","text":"process paragraph comprehension","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers)","element":"span"},{"text":", pages 1595–1604. Association for Computational Linguistics.","element":"span"}],[{"id":"id-13","text":"Harm de Vries, Dzmitry Bahdanau, Shikhar Murty, ","element":"span"},{"text":"Aaron C. 
Courville, and Philippe Beaudoin. 2019. ","element":"span"},{"href":"https://vigilworkshop.github.io/static/papers/28.pdf","text":"CLOSURE: assessing systematic generalization of ","element":"a"},{"href":"https://vigilworkshop.github.io/static/papers/28.pdf","text":"CLEVR models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Visually Grounded Interaction and Language (ViGIL), NeurIPS 2019 Workshop, Vancouver, Canada, December 13, 2019","element":"span"},{"text":".","element":"span"}],[{"id":"id-56","text":"Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, ","element":"span"},{"text":"$3c","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2407.21783","text":"The llama 3 herd of models","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2407.21783.","element":"span"}],[{"id":"id-100","text":"Tianqing Fang, Weiqi Wang, Sehyun Choi, Shibo Hao, ","element":"span"},{"text":"Hongming Zhang, Yangqiu Song, and Bin He. 2021a. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2021.EMNLP-MAIN.705","text":"Benchmarking commonsense knowledge base pop- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2021.EMNLP-MAIN.705","text":"ulation with an effective evaluation dataset","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021","element":"span"},{"text":", pages 8949–8964. Association for Computational Linguistics.","element":"span"}],[{"id":"id-101","text":"Tianqing Fang, ","element":"span"},{"text":"Hongming Zhang, ","element":"span"},{"text":"Weiqi Wang, Yangqiu Song, and Bin He. 2021b. 
","element":"span"},{"href":"https://doi.org/10.1145/3442381.3450117","text":"DISCOS: bridg- ","element":"a"},{"href":"https://doi.org/10.1145/3442381.3450117","text":"ing the gap between discourse knowledge and com- ","element":"a"},{"href":"https://doi.org/10.1145/3442381.3450117","text":"monsense knowledge","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021","element":"span"},{"text":", pages 2648–2659. ACM / IW3C2.","element":"span"}],[{"id":"id-47","text":"Joseph L Fleiss. 1971. Measuring nominal scale agree- ","element":"span"},{"text":"ment among many raters. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Psychological bulletin","element":"span"},{"text":", 76(5):378.","element":"span"}],[{"id":"id-64","text":"Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, ","element":"span"},{"text":"Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2312.10997","text":"Retrieval- ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2312.10997","text":"augmented generation for large language models: A ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2312.10997","text":"survey","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2312.10997.","element":"span"}],[{"id":"id-34","text":"Fausto Giunchiglia and Toby Walsh. 1992. A theory of ","element":"span"},{"text":"abstraction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artificial intelligence","element":"span"},{"text":", 57(2-3):323–389.","element":"span"}],[{"id":"id-87","text":"Mutian He, Tianqing Fang, Weiqi Wang, and Yangqiu ","element":"span"},{"text":"Song. 2024. 
","element":"span"},{"href":"https://doi.org/10.1016/J.ARTINT.2024.104149","text":"Acquiring and modeling abstract com- ","element":"a"},{"href":"https://doi.org/10.1016/J.ARTINT.2024.104149","text":"monsense knowledge via conceptualization","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artif. Intell.","element":"span"},{"text":", 333:104149.","element":"span"}],[{"id":"id-51","text":"Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023a. ","element":"span"},{"href":"https://openreview.net/pdf?id=sE7-XhLxHA","text":"Debertav3: Improving deberta using electra-style ","element":"a"},{"href":"https://openreview.net/pdf?id=sE7-XhLxHA","text":"pre-training with gradient-disentangled embedding ","element":"a"},{"href":"https://openreview.net/pdf?id=sE7-XhLxHA","text":"sharing","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023","element":"span"},{"text":". OpenReview.net.","element":"span"}],[{"id":"id-16","text":"Weinan He, Canming Huang, Zhanhao Xiao, and Yong- ","element":"span"},{"text":"mei Liu. 2023b. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.255","text":"Exploring the capacity of pretrained ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.255","text":"language models for reasoning about actions and ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.255","text":"change","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 4629–4643. Association for Computational Linguistics.","element":"span"}],[{"id":"id-17","text":"Martin Heidegger. 2014. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Introduction to metaphysics","element":"span"},{"text":". Yale University Press.","element":"span"}],[{"id":"id-60","text":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan ","element":"span"},{"text":"Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. ","element":"span"},{"href":"https://openreview.net/forum?id=nZeVKeeFYf9","text":"Lora: Low-rank adaptation of ","element":"a"},{"href":"https://openreview.net/forum?id=nZeVKeeFYf9","text":"large language models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022","element":"span"},{"text":". OpenReview.net.","element":"span"}],[{"id":"id-84","text":"Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, ","element":"span"},{"text":"Patrick Ng, and Zhiguo Wang. 2024. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-ACL.743","text":"Propagation ","element":"a"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-ACL.743","text":"and pitfalls: Reasoning-based assessment of knowl- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-ACL.743","text":"edge editing through counterfactual tasks","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024","element":"span"},{"text":", pages 12503–12525. Association for Computational Linguistics.","element":"span"}],[{"id":"id-10","text":"Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei ","element":"span"},{"text":"Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. 
","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2402.02716","text":"Understand- ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2402.02716","text":"ing the planning of LLM agents: A survey","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2402.02716.","element":"span"}],[{"id":"id-102","text":"Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, ","element":"span"},{"text":"Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. ","element":"span"},{"href":"https://doi.org/10.1609/AAAI.V35I7.16792","text":"(comet-) atomic 2020: On sym- ","element":"a"},{"href":"https://doi.org/10.1609/AAAI.V35I7.16792","text":"bolic and neural commonsense knowledge graphs","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021","element":"span"},{"text":", pages 6384–6392. AAAI Press.","element":"span"}],[{"id":"id-4","text":"Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sri- ","element":"span"},{"text":"parna Saha, Adam Jatowt, and Sandipan Dandapat. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.418","text":"Do language models have a common sense ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.418","text":"regarding time? revisiting temporal commonsense ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.418","text":"reasoning in the era of large language models","element":"a"},{"text":". 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 6750–6774. Association for Computational Linguistics.","element":"span"}],[{"id":"id-40","text":"Harsh Jhamtani, Hao Fang, Patrick Xia, Eran Levy, ","element":"span"},{"text":"Jacob Andreas, and Benjamin Van Durme. 2023. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2305.08677","text":"Natural ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2305.08677","text":"language decomposition and interpretation of ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2305.08677","text":"complex utterances","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2305.08677.","element":"span"}],[{"id":"id-59","text":"Albert Q. Jiang, Alexandre Sablayrolles, Arthur ","element":"span"},{"text":"Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2310.06825","text":"Mistral ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2310.06825","text":"7b","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2310.06825.","element":"span"}],[{"id":"id-7","text":"Daniel Kahneman. 2011. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thinking, fast and slow","element":"span"},{"text":". Macmillan.","element":"span"}],[{"id":"id-94","text":"Diederik P. Kingma and Jimmy Ba. 2015. 
","element":"span"},{"href":"http://arxiv.org/abs/1412.6980","text":"Adam: A ","element":"a"},{"href":"http://arxiv.org/abs/1412.6980","text":"method for stochastic optimization","element":"a"},{"text":". ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings","element":"span"},{"text":".","element":"span"}],[{"id":"id-2","text":"Dohwan Ko, Ji Soo Lee, Woo-Young Kang, Byungseok ","element":"span"},{"text":"Roh, and Hyunwoo Kim. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.261","text":"Large language mod- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.261","text":"els are temporal and causal reasoners for video ques- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.261","text":"tion answering","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 4300–4316. Association for Computational Linguistics.","element":"span"}],[{"id":"id-11","text":"Brenden M. Lake and Marco Baroni. 2018. ","element":"span"},{"href":"http://proceedings.mlr.press/v80/lake18a.html","text":"General- ","element":"a"},{"href":"http://proceedings.mlr.press/v80/lake18a.html","text":"ization without systematicity: On the compositional ","element":"a"},{"href":"http://proceedings.mlr.press/v80/lake18a.html","text":"skills of sequence-to-sequence recurrent networks","element":"a"},{"text":". 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018","element":"span"},{"text":", volume 80 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 2879–2888. PMLR.","element":"span"}],[{"id":"id-103","text":"Chunyang Li, Weiqi Wang, Tianshi Zheng, and Yangqiu ","element":"span"},{"text":"Song. 2025. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2502.16169","text":"Patterns over principles: The fragility of ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2502.16169","text":"inductive reasoning in llms under noisy observations","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2502.16169.","element":"span"}],[{"id":"id-83","text":"Jiaxuan Li, Lang Yu, and Allyson Ettinger. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-SHORT.70","text":"Coun- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-SHORT.70","text":"terfactual reasoning: Testing language models’ under- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-SHORT.70","text":"standing of hypothetical scenarios","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 804–815. Association for Computational Linguistics.","element":"span"}],[{"id":"id-54","text":"Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. ","element":"span"},{"text":"Smith, Yejin Choi, and Hannaneh Hajishirzi. 
2023a.","element":"span"}],[{"text":"Vera: ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.81","text":"A general-purpose plausibility estimation ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.81","text":"model for commonsense statements","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 1264–1287. Association for Computational Linguistics.","element":"span"}],[{"id":"id-0","text":"Xiao Liu, Da Yin, Chen Zhang, Yansong Feng, and ","element":"span"},{"text":"Dongyan Zhao. 2023b. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.574","text":"The magic of IF: investigat- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.574","text":"ing causal reasoning abilities in large language mod- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.574","text":"els of code","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 9009–9022. Association for Computational Linguistics.","element":"span"}],[{"id":"id-92","text":"Ilya Loshchilov and Frank Hutter. 2019. ","element":"span"},{"href":"https://openreview.net/forum?id=Bkg6RiCqY7","text":"Decoupled ","element":"a"},{"href":"https://openreview.net/forum?id=Bkg6RiCqY7","text":"weight decay regularization","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019","element":"span"},{"text":". 
OpenReview.net.","element":"span"}],[{"id":"id-55","text":"Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan ","element":"span"},{"text":"Bisk, Eric Nyberg, and Alessandro Oltramari. 2021. ","element":"span"},{"href":"https://doi.org/10.1609/AAAI.V35I15.17593","text":"Knowledge-driven data construction for zero-shot ","element":"a"},{"href":"https://doi.org/10.1609/AAAI.V35I15.17593","text":"evaluation in commonsense question answering","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021","element":"span"},{"text":", pages 13507–13515. AAAI Press.","element":"span"}],[{"id":"id-18","text":"Stephen Merity, Caiming Xiong, James Bradbury, and ","element":"span"},{"text":"Richard Socher. 2017. ","element":"span"},{"href":"https://openreview.net/forum?id=Byj72udxe","text":"Pointer sentinel mixture models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings","element":"span"},{"text":". OpenReview.net.","element":"span"}],[{"id":"id-57","text":"Thomas Mesnard, Cassidy Hardin, Robert Dadashi, ","element":"span"},{"text":"Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. 
Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, et al. 2024. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2403.08295","text":"Gemma: Open models based on Gemini research and technology","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2403.08295.","element":"span"}],[{"id":"id-25","text":"OpenAI. 2022. ","element":"span"},{"href":"https://openai.com/blog/chatgpt","text":"ChatGPT: Optimizing language models ","element":"a"},{"href":"https://openai.com/blog/chatgpt","text":"for dialogue","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OpenAI","element":"span"},{"text":".","element":"span"}],[{"id":"id-26","text":"OpenAI. 2023. ","element":"span"},{"href":"https://doi.org/10.48550/arXiv.2303.08774","text":"GPT-4 technical report","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2303.08774.","element":"span"}],[{"id":"id-61","text":"OpenAI. 2024. ","element":"span"},{"href":"https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/","text":"GPT-4o mini: advancing cost-efficient ","element":"a"},{"href":"https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/","text":"intelligence","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OpenAI","element":"span"},{"text":".","element":"span"}],[{"id":"id-79","text":"Siqi Ouyang and Lei Li. 2023. 
","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.205","text":"Autoplan: Automatic ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.205","text":"planning of interactive decision-making tasks with ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.205","text":"large language models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 3114–3128. Association for Computational Linguistics.","element":"span"}],[{"id":"id-66","text":"Liangming Pan, Michael Saxon, Wenda Xu, Deepak ","element":"span"},{"text":"Nathani, Xinyi Wang, and William Yang Wang. 2024. ","element":"span"},{"href":"https://doi.org/10.1162/TACL_A_00660","text":"Automatically correcting large language models: ","element":"a"},{"style":{"fontStyle":"italic"},"text":"Sur- ","element":"span"},{"href":"https://doi.org/10.1162/TACL_A_00660","style":{"fontStyle":"italic"},"text":"veying the Landscape of Diverse Automated Correc- ","element":"a"},{"href":"https://doi.org/10.1162/TACL_A_00660","style":{"fontStyle":"italic"},"text":"tion Strategies","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Trans. Assoc. Comput. Linguistics","element":"span"},{"text":", 12:484–506.","element":"span"}],[{"id":"id-33","text":"Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, ","element":"span"},{"text":"Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.294","text":"Reasoning with language ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.294","text":"model prompting: A survey","element":"a"},{"text":". 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 5368–5393. Association for Computational Linguistics.","element":"span"}],[{"id":"id-3","text":"Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao ","element":"span"},{"text":"Chen, Michihiro Yasunaga, and Diyi Yang. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.EMNLP-MAIN.85","text":"Is ChatGPT a general-purpose natural language processing task solver? ","element":"a"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 1339–1384. Association for Computational Linguistics.","element":"span"}],[{"id":"id-52","text":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, ","element":"span"},{"text":"Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OpenAI blog","element":"span"},{"text":", 1(8):9.","element":"span"}],[{"id":"id-29","text":"Machel Reid, Nikolay Savinov, Denis Teplyashin, ","element":"span"},{"text":"Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. 
Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, James Molloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy, Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, Zhen Yang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand, Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, Pranav Shyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, et al. 2024. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2403.05530","text":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2403.05530.","element":"span"}],[{"id":"id-14","text":"Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ","element":"span"},{"href":"https://doi.org/10.1609/AAAI.V33I01.33013027","text":"ATOMIC: an atlas of machine commonsense for if-then reasoning","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019","element":"span"},{"text":", pages 3027–3035. 
AAAI Press.","element":"span"}],[{"id":"id-76","text":"Freda Shi, Xinyun Chen, Kanishka Misra, Nathan ","element":"span"},{"text":"Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. ","element":"span"},{"href":"https://proceedings.mlr.press/v202/shi23a.html","text":"Large language models can ","element":"a"},{"href":"https://proceedings.mlr.press/v202/shi23a.html","text":"be easily distracted by irrelevant context","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA","element":"span"},{"text":", volume 202 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 31210–31227. PMLR.","element":"span"}],[{"id":"id-6","text":"Steven A Sloman. 1996. The empirical case for two systems of reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Psychological Bulletin","element":"span"},{"text":", 119(1):3.","element":"span"}],[{"id":"id-75","text":"Qingyu Tan, Hwee Tou Ng, and Lidong Bing. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.828","text":"Towards benchmarking and improving the temporal ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.828","text":"reasoning capability of large language models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 14820–14835. Association for Computational Linguistics.","element":"span"}],[{"id":"id-21","text":"Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, ","element":"span"},{"text":"Antoine Bosselut, and Peter Clark. 2018. 
","element":"span"},{"href":"https://doi.org/10.18653/V1/D18-1006","text":"Reasoning ","element":"a"},{"href":"https://doi.org/10.18653/V1/D18-1006","text":"about actions and state changes by injecting com- ","element":"a"},{"href":"https://doi.org/10.18653/V1/D18-1006","text":"monsense knowledge","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018","element":"span"},{"text":", pages 57–66. Association for Computational Linguistics.","element":"span"}],[{"id":"id-35","text":"Joshua B Tenenbaum, Charles Kemp, Thomas L Grif- ","element":"span"},{"text":"fiths, and Noah D Goodman. 2011. How to grow a mind: Statistics, structure, and abstraction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"science","element":"span"},{"text":", 331(6022):1279–1285.","element":"span"}],[{"id":"id-28","text":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier ","element":"span"},{"text":"Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2302.13971","text":"Llama: Open ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2302.13971","text":"and efficient foundation language models","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2302.13971.","element":"span"}],[{"id":"id-27","text":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- ","element":"span"},{"text":"bert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian CantonFerrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Moly-","element":"span"}],[{"text":"bog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2307.09288","text":"Llama 2: Open foundation and ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2307.09288","text":"fine-tuned chat models","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2307.09288.","element":"span"}],[{"id":"id-15","text":"Karthik Valmeekam, Matthew Marquez, Alberto Olmo ","element":"span"},{"text":"Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. 
","element":"span"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/7a92bcdede88c7afd108072faf5485c8-Abstract-Datasets_and_Benchmarks.html","text":"Planbench: An extensible benchmark ","element":"a"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/7a92bcdede88c7afd108072faf5485c8-Abstract-Datasets_and_Benchmarks.html","text":"for evaluating large language models on planning ","element":"a"},{"href":"http://papers.nips.cc/paper_files/paper/2023/hash/7a92bcdede88c7afd108072faf5485c8-Abstract-Datasets_and_Benchmarks.html","text":"and reasoning about change","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023","element":"span"},{"text":".","element":"span"}],[{"id":"id-85","text":"Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, ","element":"span"},{"text":"Wenju Xu, Sheikh Sarwar, Chen Luo, Yang Laurence Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, and Yangqiu Song. 2024a. EcomScriptBench: A multi-task benchmark for e-commerce script planning via step-wise intention-driven product association. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":".","element":"span"}],[{"id":"id-53","text":"Weiqi Wang, Tianqing Fang, Wenxuan Ding, Baixuan ","element":"span"},{"text":"Xu, Xin Liu, Yangqiu Song, and Antoine Bosselut. 2023a. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.902","text":"CAR: conceptualization-augmented reasoner ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.902","text":"for zero-shot commonsense question answering","element":"a"},{"text":". 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023","element":"span"},{"text":", pages 13520–13545. Association for Computational Linguistics.","element":"span"}],[{"id":"id-43","text":"Weiqi Wang, Tianqing Fang, Chunyang Li, Haochen ","element":"span"},{"text":"Shi, Wenxuan Ding, Baixuan Xu, Zhaowei Wang, Jiaxin Bai, Xin Liu, Jiayang Cheng, Chunkit Chan, and Yangqiu Song. 2024b. ","element":"span"},{"href":"https://arxiv.org/abs/2401.07286","text":"CANDLE: iterative conceptualization and instantiation distillation from large language models for commonsense reasoning","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024","element":"span"},{"text":". Association for Computational Linguistics.","element":"span"}],[{"id":"id-89","text":"Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan ","element":"span"},{"text":"Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, and Yangqiu Song. 2024c. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2406.10885","text":"On the role of entity and event level conceptualization in generalizable reasoning: A survey of tasks, methods, applications, and future directions","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2406.10885.","element":"span"}],[{"id":"id-88","text":"Weiqi Wang, Tianqing Fang, Baixuan Xu, Chun ","element":"span"},{"text":"Yi Louis Bo, Yangqiu Song, and Lei Chen. 2023b. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.733","text":"CAT: A contextualized conceptualization and instan- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.733","text":"tiation framework for commonsense reasoning","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 13111–13140. Association for Computational Linguistics.","element":"span"}],[{"id":"id-63","text":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. ","element":"span"},{"text":"Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023c. ","element":"span"},{"href":"https://openreview.net/pdf?id=1PL1NIMMrw","text":"Self-consistency ","element":"a"},{"href":"https://openreview.net/pdf?id=1PL1NIMMrw","text":"improves chain of thought reasoning in language ","element":"a"},{"href":"https://openreview.net/pdf?id=1PL1NIMMrw","text":"models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023","element":"span"},{"text":". OpenReview.net.","element":"span"}],[{"text":"Zhaowei Wang, Haochen Shi, Weiqi Wang, Tianqing ","element":"span"},{"text":"Fang, Hongming Zhang, Sehyun Choi, Xin Liu, and Yangqiu Song. 2024d. 
","element":"span"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-NAACL.252","text":"Abspyramid: Benchmarking ","element":"a"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-NAACL.252","text":"the abstraction ability of language models with a uni- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2024.FINDINGS-NAACL.252","text":"fied entailment graph","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024","element":"span"},{"text":", pages 3991–4010. Association for Computational Linguistics.","element":"span"}],[{"id":"id-62","text":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten ","element":"span"},{"text":"Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. ","element":"span"},{"href":"http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html","text":"Chain-of-thought prompting ","element":"a"},{"href":"http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html","text":"elicits reasoning in large language models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022","element":"span"},{"text":".","element":"span"}],[{"id":"id-41","text":"Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. ","element":"span"},{"text":"Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. 2022. 
","element":"span"},{"href":"https://doi.org/10.18653/V1/2022.NAACL-MAIN.341","text":"Symbolic ","element":"a"},{"href":"https://doi.org/10.18653/V1/2022.NAACL-MAIN.341","text":"knowledge distillation: from general language mod- ","element":"a"},{"href":"https://doi.org/10.18653/V1/2022.NAACL-MAIN.341","text":"els to commonsense models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022","element":"span"},{"text":", pages 4602– 4625. Association for Computational Linguistics.","element":"span"}],[{"id":"id-80","text":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien ","element":"span"},{"text":"Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2020.EMNLP-DEMOS.6","text":"Transformers: ","element":"a"},{"href":"https://doi.org/10.18653/V1/2020.EMNLP-DEMOS.6","text":"State-of-the-art natural language processing","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020","element":"span"},{"text":", pages 38–45. Association for Computational Linguistics.","element":"span"}],[{"id":"id-24","text":"Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum, ","element":"span"},{"text":"and Chuang Gan. 2021. 
","element":"span"},{"href":"https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/5ef059938ba799aaa845e1c2e8a762bd-Abstract-round2.html","text":"STAR: A benchmark for sit- ","element":"a"},{"href":"https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/5ef059938ba799aaa845e1c2e8a762bd-Abstract-round2.html","text":"uated reasoning in real-world videos","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual","element":"span"},{"text":".","element":"span"}],[{"id":"id-49","text":"Wentao Wu, ","element":"span"},{"text":"Hongsong Li, ","element":"span"},{"text":"Haixun Wang, ","element":"span"},{"text":"and Kenny Qili Zhu. 2012. ","element":"span"},{"href":"https://doi.org/10.1145/2213836.2213891","text":"Probase: a probabilistic tax- ","element":"a"},{"href":"https://doi.org/10.1145/2213836.2213891","text":"onomy for text understanding","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012","element":"span"},{"text":", pages 481–492. ACM.","element":"span"}],[{"id":"id-65","text":"Ruixin Yang, Dheeraj Rajagopal, Shirley Anugrah Hay- ","element":"span"},{"text":"ati, Bin Hu, and Dongyeop Kang. 2024. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2404.09127","text":"Confidence ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2404.09127","text":"calibration and rationalization for llms via multi- ","element":"a"},{"href":"https://doi.org/10.48550/ARXIV.2404.09127","text":"agent deliberation","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2404.09127.","element":"span"}],[{"id":"id-39","text":"Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei ","element":"span"},{"text":"Huang, and Yongbin Li. 2023. ","element":"span"},{"href":"https://doi.org/10.1145/3539618.3591708","text":"Large language mod- ","element":"a"},{"href":"https://doi.org/10.1145/3539618.3591708","text":"els are versatile decomposers: Decomposing evi- ","element":"a"},{"href":"https://doi.org/10.1145/3539618.3591708","text":"dence and questions for table-based reasoning","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023","element":"span"},{"text":", pages 174–184. ACM.","element":"span"}],[{"id":"id-86","text":"Changlong Yu, Weiqi Wang, Xin Liu, Jiaxin Bai, ","element":"span"},{"text":"Yangqiu Song, Zheng Li, Yifan Gao, Tianyu Cao, and Bing Yin. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.76","text":"Folkscope: Intention knowledge ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.76","text":"graph construction for e-commerce commonsense ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.FINDINGS-ACL.76","text":"discovery","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 1173–1191. Association for Computational Linguistics.","element":"span"}],[{"id":"id-32","text":"Chenhan Yuan, Qianqian Xie, Jimin Huang, and Sophia ","element":"span"},{"text":"Ananiadou. 2024. 
","element":"span"},{"href":"https://doi.org/10.1145/3589334.3645376","text":"Back to the future: Towards ex- ","element":"a"},{"href":"https://doi.org/10.1145/3589334.3645376","text":"plainable temporal reasoning with large language ","element":"a"},{"href":"https://doi.org/10.1145/3589334.3645376","text":"models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024","element":"span"},{"text":", pages 1963–1974. ACM.","element":"span"}],[{"id":"id-78","text":"Siyu Yuan, Jiangjie Chen, Ziquan Fu, Xuyang Ge, So- ","element":"span"},{"text":"ham Shah, Charles Robert Jankowski, Yanghua Xiao, and Deqing Yang. 2023. ","element":"span"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.236","text":"Distilling script knowledge ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.236","text":"from large language models for constrained language ","element":"a"},{"href":"https://doi.org/10.18653/V1/2023.ACL-LONG.236","text":"planning","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023","element":"span"},{"text":", pages 4303–4325. Association for Computational Linguistics.","element":"span"}],[{"id":"id-23","text":"Youzhi Zhang, Xudong Luo, and Yuping Shen. 2013. ","element":"span"},{"href":"https://doi.org/10.1002/INT.21602","text":"A fuzzy reasoning model for action and change in ","element":"a"},{"href":"https://doi.org/10.1002/INT.21602","text":"timed domains","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Int. J. Intell. 
Syst.","element":"span"},{"text":", 28(8):787–805.","element":"span"}],[{"id":"id-93","text":"Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan ","element":"span"},{"text":"Ye, Zheyan Luo, and Yongqiang Ma. 2024. ","element":"span"},{"href":"https://doi.org/10.48550/ARXIV.2403.13372","text":"Llamafactory: Unified efficient fine-tuning of 100+ language models","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2403.13372.","element":"span"}],[{"id":"id-19","text":"Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan ","element":"span"},{"text":"Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. ","element":"span"},{"href":"https://doi.org/10.1109/ICCV.2015.11","text":"Aligning books and movies: ","element":"a"},{"href":"https://doi.org/10.1109/ICCV.2015.11","text":"Towards story-like visual explanations by watching ","element":"a"},{"href":"https://doi.org/10.1109/ICCV.2015.11","text":"movies and reading books","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015","element":"span"},{"text":", pages 19–27. IEEE Computer Society.","element":"span"}]]},{"heading":"Appendices A Differentiation from Philosophical Metaphysics and Counterfactual Reasoning","paragraphs":[[{"text":"In this work, we use the term “metaphysical” to describe a specific mode of reasoning that deals with highly improbable or abstract scenarios, distinct from both its traditional philosophical meaning and the concept of counterfactual reasoning. 
Philosophically, “metaphysics” refers to the study of the fundamental nature of reality, encompassing questions about existence, causality, and the nature of being (","element":"span"},{"href":"#id-81","referenceIndex":4,"text":"Aristotle and Aristotle","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":4,"text":"1933","element":"a"},{"text":"; ","element":"span"},{"href":"#id-82","referenceIndex":11,"text":"Bergson","element":"a"},{"text":", ","element":"span"},{"href":"#id-82","referenceIndex":11,"text":"1999","element":"a"},{"text":"). While this classical usage involves conceptual analysis and abstract thought, our focus diverges significantly. We adopt “metaphysical” to signify reasoning that examines transitions between plausible and highly improbable states, emphasizing the logical structure and abstracted nature of these transitions rather than ontological or existential inquiries.","element":"span"}],[{"text":"This distinction is important because our framework does not engage with the philosophical debates about the nature of reality or existence. Instead, it concentrates on how LLMs process and adapt to scenarios that are rare or abstract yet logically consistent. For example, while metaphysical reasoning in our context might involve reasoning about a scenario where “a civilization survives for 100,000 years,” it does not explore the metaphysical nature of time, existence, or causality in a philosophical sense.","element":"span"}],[{"text":"Furthermore, our concept of metaphysical reasoning is distinct from counterfactual reasoning. 
Counterfactual reasoning involves evaluating “what if” scenarios that diverge from known realities but remain bounded by plausible causal relationships (","element":"span"},{"href":"#id-83","referenceIndex":40,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":40,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-84","referenceIndex":29,"text":"Hua et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":29,"text":"2024","element":"a"},{"text":"). For example, a counterfactual might consider, “What if Caesar had lost the battle of Pharsalus?”–a scenario grounded in historical plausibility. In contrast, metaphysical reasoning in our framework extends beyond plausibility to explore scenarios that are structurally coherent but unlikely or abstract, such as “What if Caesar ruled for a millennium?” Here, the focus is not on causal plausibility but on the ability to evaluate transitions to rare, abstract, or highly improbable states.","element":"span"}],[{"text":"This differentiation between “metaphysical” in our framework, metaphysics in philosophy, and ","element":"span"},{"text":"counterfactual reasoning underscores the novel challenges our benchmarks aim to address. By pushing LLMs to reason about transitions into abstract or improbable scenarios, we aim to probe and enhance their capabilities for adaptive, out-of-distribution reasoning – a necessary step toward achieving generalizable System II reasoning.","element":"span"}]]},{"heading":"B MARS Benchmark Curation Details","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"B.1","element":"span"}],[{"text":"An overview of our benchmark construction pipeline is shown in Figure ","element":"span"},{"href":"#id-38","text":"5","element":"a"},{"text":". 
We first present our prompts used in each step for sequentially instructing ChatGPT to generate candidate data for ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"(","element":"span"},{"href":"#id-85","referenceIndex":65,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-85","referenceIndex":65,"text":"2024a","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Text Decomposition and Event","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Component Extraction ","element":"span"},{"text":"To decompose a lengthy text from the source corpora into several action events, we use the following prompt to instruct ChatGPT.","element":"span"}],[{"style":{"width":"80%"},"width":706,"height":1603,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/15-0.png","element":"img"}],[{"id":"id-38","style":{"width":"99%"},"width":1800,"height":342,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/16-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Transition: ","element":"span"},{"text":"more than 10 years ","element":"span"},{"text":"-> less than 10 days ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(temporal quantifier)","element":"span"},{"text":"MARS","element":"span"}],[{"text":"Figure 5: An overview of our benchmark curation pipeline with running examples.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"79%"},"width":698,"height":736,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/16-1.png","element":"img"}],[{"text":"We then use the following prompt to extract seven types of components from the decomposed 
events.","element":"span"}],[{"style":{"width":"80%"},"width":703,"height":1170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/16-2.png","element":"img"}],[{"style":{"width":"80%"},"width":702,"height":1278,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/16-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Component Abstraction and Variation ","element":"span"},{"text":"For each type of component, we customize the prompt according to the nature of the component and whether the changes are implemented via abstraction or numerical variation. Here, we take the subject category with its abstraction as an example.","element":"span"}],[{"style":{"width":"79%"},"width":699,"height":465,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/16-4.png","element":"img"}],[{"style":{"width":"79%"},"width":698,"height":553,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/17-0.png","element":"img"}],[{"text":"Note that leveraging LLM to perform contextualized abstraction (","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":67,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-86","referenceIndex":79,"text":"Yu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-86","referenceIndex":79,"text":"2023","element":"a"},{"text":") has been shown to result in better quality, larger coverage, and stronger downstream benefits compared to previous conceptualization methods (","element":"span"},{"href":"#id-87","referenceIndex":24,"text":"He et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-87","referenceIndex":24,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-88","referenceIndex":69,"text":"Wang et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-88","referenceIndex":69,"text":"2023b","element":"a"},{"text":", ","element":"span"},{"href":"#id-89","referenceIndex":68,"text":"2024c","element":"a"},{"text":"), such as retrieving from a pre-defined concept taxonomy or human annotation. ","element":"span"},{"text":"Our knowledge distillation-based method is justifiable and enables large-scale benchmark construction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference Generation","element":"span"}],[{"text":"We use different prompts to collect plausible inferential states and metaphysical inferential states for each changed action event. Here, we provide the prompt for generating a metaphysical inference as an example.","element":"span"}],[{"style":{"width":"80%"},"width":703,"height":1224,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/17-1.png","element":"img"}],[{"style":{"width":"79%"},"width":699,"height":193,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/17-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.1.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Metaphysical Transition Generation","element":"span"}],[{"text":"Finally, we use the prompt below to collect the change needed to transition a metaphysical inference into a plausible one.","element":"span"}],[{"style":{"width":"80%"},"width":706,"height":1332,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/17-3.png","element":"img"}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Main Evaluations on","element":"span"}],[{"text":"To evaluate LLMs on three tasks in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS","element":"span"},{"text":", we show our evaluating prompts in zero-shot scenario in Table 
","element":"span"},{"href":"#id-90","text":"6","element":"a"},{"text":". Note that we are aware that LLMs may not be familiar with the word “metaphysical.” Therefore, we also experimented with replacing the word with “implausible,” and the best performances from both types of prompts are reported. These models are consistent across all models’ evaluations for fair comparison.","element":"span"}],[{"text":"For few-shot evaluations, few shot examples are added after task descriptions and before the prompted test entry. The exemplars are randomly sampled for each different test entry. For COT prompting, we specifically ask LLMs to “think step","element":"span"}],[{"id":"id-91","style":{"width":"85%"},"width":1559,"height":211,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/18-0.png","element":"img"}],[{"text":"Table 4: Annotation results of evaluation data curated with different LLMs as backbones. Plaus. refers to plausible ","element":"figcaption","subtype":"caption"},{"id":"id-71","text":"event/inference/transition rate and Expert. refers to ratio of data accepted by expert annotators.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"80%"},"width":1459,"height":495,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/18-1.png","element":"img"}],[{"text":"Table 5: Number of unique components by type in annotated splits of M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":". #Avg. refers to the average number of unique identified/modified component per event.","element":"figcaption","subtype":"caption"}],[{"text":"by step and generate a short rationale to support your reasoning.” Then, we ask it to give an answer based on its generated rationale. 
The sampling temperature ","element":"span"},{"style":{"height":8},"width":23,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/18-2.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"is set to 0.1 by default, and 5 COT responses are sampled with ","element":"span"},{"style":{"height":8},"width":23,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/18-3.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"set to 0.7 in the SC-COT setting.","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"B.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Leveraging Open-sourced LLM for Benchmark Curation","element":"span"}],[{"text":"In this paper, we use proprietary LLMs and human annotation for data construction, which can be expensive and labor-intensive. However, this approach best serves data quality, which is crucial for an evaluation benchmark. Prior to our data collection, we tested a wide variety of LLMs, and ChatGPT outperformed almost all of them. Therefore, we opted to use it for data construction. Nevertheless, ","element":"span"},{"text":"with the recent advancements in state-of-the-art LLMs, we have found that ","element":"span"},{"text":"meta-llama/Llama-3.1-405B-Instruct ","element":"span"},{"text":"and ","element":"span"},{"text":"GPT-4o ","element":"span"},{"text":"also achieve satisfactory performance within our data collection framework. We sampled 500 original data entries and employed similar prompts and data collection processes to gather metaphysical reasoning evaluation data entries. We then asked expert annotators to rate the plausibility of the obtained data. The results are shown in Table ","element":"span"},{"href":"#id-91","text":"4","element":"a"},{"text":". 
We observe that LLAMA3.1-405B can achieve comparable performance to ChatGPT in terms of plausible data (evaluation data that reflects reality rather than metaphysics, similar to the majority vote results in Table 2) and expert acceptance rates. ","element":"span"},{"text":"Additionally, we find that GPT-4o can even improve the data collection process, resulting in higher quality data. Thus, we believe this represents a compromise between data quality, reproducibility, and cost. It would also be feasible for data collectors to use LLAMA3.1 in the future for collecting metaphysical data, although leveraging proprietary LLMs can be more reliable to some extent.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional Statistics on","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-71","text":"5 ","element":"a"},{"text":"presents detailed statistics on the number of unique identified and modified components by type in the annotated splits of each task. The majority (approximately 80%) of the components focus on the subject, verb, and object, while the remainder (around 20%) concentrate on temporal quantifiers, spatial quantifiers, numerical properties, and sub-events. On average, each annotated event in M","element":"span"},{"text":"ARS ","element":"span"},{"text":"features 8.15 identified components for changes and 7.81 transitions.","element":"span"}]]},{"heading":"C Implementation Details","paragraphs":[[{"text":"This section provides further implementation details for the main evaluations and subsequent analyses.","element":"span"}],[{"text":"For all experiments, we use the Huggingface","element":"span"},{"text":"1 ","element":"span"},{"text":"Library (","element":"span"},{"href":"#id-80","referenceIndex":74,"text":"Wolf et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":74,"text":"2020","element":"a"},{"text":") to build all models. 
For each LLM, we conduct experiments with","element":"span"}],[{"id":"id-90","style":{"width":"98%"},"width":1799,"height":918,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/19-0.png","element":"img"}],[{"text":"Table 6: Prompts used for evaluating LLMs across three tasks in ","element":"figcaption","subtype":"caption"},{"text":"M","element":"figcaption","subtype":"caption"},{"text":"ARS ","element":"figcaption","subtype":"caption"},{"text":"in zero-shot scenario. ME., MI., and MT. stand for the three tasks, respectively.","element":"figcaption","subtype":"caption"}],[{"text":"both its instruction fine-tuned version (if any) and the original version. ","element":"span"},{"text":"The version achieving the higher performance is included in the reported results. For LLaMa2, the model code is ","element":"span"},{"text":"meta-llama/Llama-2-7b/13b/70b(-chat)-hf","element":"span"},{"text":". For ","element":"span"},{"text":"LLaMa3, ","element":"span"},{"text":"the ","element":"span"},{"text":"model ","element":"span"},{"text":"code ","element":"span"},{"text":"is ","element":"span"},{"text":"meta-llama/Meta-Llama-3-8B/70B(-Instruct)","element":"span"},{"text":". For ","element":"span"},{"text":"Mistral, ","element":"span"},{"text":"we ","element":"span"},{"text":"use ","element":"span"},{"text":"mistralai/Mistral-7B(-Instruct)-v0.3","element":"span"},{"text":".","element":"span"}],[{"text":"For ChatGPT and GPT4, we access them through Microsoft Azure APIs","element":"span"},{"text":"2","element":"span"},{"text":". The code of the accessed version for ChatGPT is ","element":"span"},{"text":"gpt-35-turbo","element":"span"},{"text":", and for GPT4 is ","element":"span"},{"text":"gpt-4","element":"span"},{"text":". ","element":"span"},{"text":"Both models are of the version dated ","element":"span"},{"text":"2024-02-01","element":"span"},{"text":". 
The maximum generation length is set to 50 tokens in zero-shot and few-shot settings, while for COT and SC-COT evaluations, the maximum generation length is set at 200 tokens.","element":"span"}],[{"text":"All experiments are conducted on eight NVIDIA-V100 (32G) GPUs, with 8E disk space, 48 CPU cores, and 1T memory. Each experiment is repeated three times with different random seeds, and the average performances are reported. The variance across all experiments remains below 0.08, which is considered extremely small. Due to space constraints, we omit reporting this variance.","element":"span"}],[{"id":"id-97","style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Main Evaluations on","element":"span"}],[{"text":"First, we add random voting and majority voting as two additional baselines to reveal the characteristics of the ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark.","element":"span"}],[{"text":"To evaluate PTLMs in a zero-shot manner, we adopt the evaluation pipeline used for zero-shot question answering (","element":"span"},{"href":"#id-55","referenceIndex":44,"text":"Ma et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":44,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-53","referenceIndex":66,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":66,"text":"2023a","element":"a"},{"text":"). Specifically, we convert each discrimination data entry into two declarative statements, which serve as natural language assertions corresponding to “yes” or “no” options. 
","element":"span"},{"text":"For instance, when determining whether an event is metaphysical, we generate two assertions: “The event ","element":"span"},{"text":" ","element":"span"},{"text":"is metaphysical as it’s unlikely to occur in reality,” and “The event ","element":"span"},{"text":" ","element":"span"},{"text":"is not metaphysical; it’s plausible in reality.” The models are then tasked with computing the loss of each assertion. The assertion with the lowest loss is considered as the model’s prediction. This approach allows any PTLM to be evaluated under classification tasks with an arbitrary number of options or even type classification based on a single assertion. We use the open code library","element":"span"},{"text":"3 ","element":"span"},{"text":"as our code base and follow the default hyperparameter settings. For VERA, we follow the exact same implementation","element":"span"},{"text":"4 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"Liu ","element":"a"},{"href":"#id-54","referenceIndex":41,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":41,"text":"2023a","element":"a"},{"text":"). The accessed backbone model is ","element":"span"},{"text":"liujch1998/vera","element":"span"},{"text":", and all other hyperparameter settings follow the default implementation.","element":"span"}],[{"text":"For fine-tuning PTLMs, we connect each PTLM backbone with five fully connected classification layers. The entire model is then fine-tuned using a classification objective with cross-entropy loss. We employ a default setting of a learning rate of 5e-6 and a batch size of 64. 
The models are optimized using an AdamW optimizer (","element":"span"},{"href":"#id-92","referenceIndex":43,"text":"Loshchilov ","element":"a"},{"href":"#id-92","referenceIndex":43,"text":"and Hutter","element":"a"},{"text":", ","element":"span"},{"href":"#id-92","referenceIndex":43,"text":"2019","element":"a"},{"text":"), with the model’s performance evaluated every 50 steps. We set the maximum sequence lengths for the tokenizers to 70 for all three discriminative subtasks. Early stopping is also implemented to select the best checkpoint when the highest validation accuracy is achieved. To ensure convergence, we train all models with five epochs.","element":"span"}],[{"text":"For evaluating LLMs in a zero-shot manner, we transform the input for each task into assertions using natural language prompts, as illustrated in Table ","element":"span"},{"href":"#id-90","text":"6","element":"a"},{"text":". The models are then prompted to determine the plausibility of the provided assertions by answering yes or no questions. We parse their responses using pre-defined rules to derive binary predictions. When generating each token, we consider the top 10 tokens with the highest probabilities.","element":"span"}],[{"text":"For fine-tuning LLMs, we use LoRA for fine-tuning, and the LoRA rank and ","element":"span"},{"style":{"height":8.4},"width":28,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/20-0.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"are set to 16 and 32, respectively. We adopt the open code library from LlamaFactory","element":"span"},{"text":"5 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-93","referenceIndex":83,"text":"Zheng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-93","referenceIndex":83,"text":"2024","element":"a"},{"text":") for model training and evaluation. 
We similarly use an Adam (","element":"span"},{"href":"#id-94","referenceIndex":36,"text":"Kingma and Ba","element":"a"},{"text":", ","element":"span"},{"href":"#id-94","referenceIndex":36,"text":"2015","element":"a"},{"text":") optimizer with a learning rate of 5e-5 and a batch size of 8. The maximum sequence length for the tokenizer is set at 300. All models are fine-tuned over three epochs, selecting the checkpoint with the highest accuracy on the validation set.","element":"span"}],[{"text":"Finally, for evaluating proprietary LLMs, such as ChatGPT and GPT4, we similarly prompt them as with open LLMs. Detailed prompts are explained in Appendix ","element":"span"},{"href":"#id-95","text":"B.2","element":"a"},{"text":".","element":"span"}],[{"text":"We also include full evaluation results (with more baselines and models included) in Table ","element":"span"},{"href":"#id-96","text":"8","element":"a"},{"text":". Specifically, for RAG (","element":"span"},{"href":"#id-64","referenceIndex":22,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":22,"text":"2023","element":"a"},{"text":"), we reformulate the traditional paradigm of retrieval-augmented generation for our task by asking an LLM to first identify important concepts from the evaluation data entry, retrieve relevant knowledge from an abstract knowledge base containing information about the concepts, and merge them into the evaluation prompt for making the final prediction on metaphysical reasoning tasks. This approach aligns with the design of our M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark and ","element":"span"},{"text":"provides insights into which method offers more benefits when comparing retrieval to fine-tuning conceptual knowledge into LLMs.","element":"span"}],[{"text":"For Multi-Agent Calibration, we adopt the multi-agent deliberation design from ","element":"span"},{"href":"#id-65","referenceIndex":77,"text":"Yang et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-65","referenceIndex":77,"text":"2024","element":"a"},{"text":"), which is a multi-agent confidence calibration system for multiple-choice QA. In this setting, we set up two LLMs. The first LLM generates the initial chain-of-thought response and prediction for each task. The second LLM is prompted with the first LLM’s chain-of-thought response and is asked to analyze the differences. Its reasoning rationale regarding these differences, particularly in the metaphysical realm, is then provided as feedback to the first LLM. The first LLM incorporates this feedback and is asked to regenerate the chain-of-thought rationale and final prediction. This loop continues until the second LLM agrees with the first LLM.","element":"span"}],[{"text":"For Self-Reflection (","element":"span"},{"href":"#id-66","referenceIndex":51,"text":"Pan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":51,"text":"2024","element":"a"},{"text":"), we adopt a straightforward approach to rectify LLM errors by using feedback provided by the LLM itself (self-reflection). In this setting, we first ask an LLM to generate a chain-of-thought response explaining the rationale behind a given metaphysical data entry. We then prompt it for a new round, deliberately asking it to analyze the correctness of its rationale and answer. 
This feedback is merged back into the original prompt and first response to generate a refined response after self-reflection.","element":"span"}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Improving Metaphysical Reasoning via Transferring from Conceptualization Taxonomy","element":"span"}],[{"text":"In this section, we elaborate further on how we transform CANDLE into the format of three tasks in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"for large-scale pre-training in improving LMs’ metaphysical reasoning abilities.","element":"span"}],[{"text":"CANDLE’s data is primarily divided into two sections. The first section comprises conceptualizations of instances or events, which can be reformatted into metaphysical event discrimination. Each data entry in CANDLE represents a conceptualization of an abstracted instance within an event or the abstraction of an entire event. Following our definition in Section ","element":"span"},{"text":"3","element":"span"},{"text":", we interpret each conceptualization as a change in the event. For each data entry, replacing the original instance with its conceptualization forms a plausible change that could occur in reality. Subsequently, we randomly select negative conceptualizations for an event from conceptualizations of other events that do not share any ","element":"span"},{"text":"common words with the anchor event. These negative conceptualizations form metaphysical events. Three models are then pre-trained on four million events, with a balanced ratio of plausible events and metaphysical events. 
The hyperparameters for fine-tuning all models remain consistent with the implementation details described above in Appendix ","element":"span"},{"href":"#id-97","text":"C.1","element":"a"},{"text":".","element":"span"}],[{"text":"The second part contains the commonsense inferential knowledge of abstracted events, which can be interpreted as inferential states of the modified events. To synchronize with our task structure, we exclusively select relations that imply a state in the inferential knowledge. We obtain negative inference samples in a similar manner by sampling from inference tails of events without common keywords. Subsequently, we pre-train models for both the metaphysical inference discrimination task and the metaphysical transition reasoning task. These models are trained to determine whether the inference is plausible or metaphysical in relation to the altered event. As CANDLE does not include transitions, this approach serves as the most accurate simulation of the metaphysical transition reasoning task. It’s also important to note that CANDLE is exclusively predicated on social events, covering only subject, object, and sub-events as types of abstraction changes. In contrast, ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"contains a significantly wider array of events, incorporates more types of changes, and also evaluates (L)LMs’ capabilities in discerning what additional change is requisite to instigate a transition. These features make ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"distinct from tasks in CANDLE.","element":"span"}]]},{"heading":"D Annotation Details","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Worker Selection Protocol","element":"span"}],[{"text":"To ensure the high quality of our human annotation, we implement strict quality control measures. 
Initially, we invite to our qualification rounds only workers who meet the following criteria: 1) a minimum of 1K HITs approved, and 2) an approval rate of at least 95%. We select workers separately for each task and conduct three qualification rounds per task to identify those with satisfactory performance. In each qualification round, we create a qualification test suite that includes both easy and challenging questions, each with a gold label from the authors. Workers are required to complete a minimum of 20 questions. To qualify, they must achieve an accuracy rate of at least 80% on the qualification test. After our selection ","element":"span"},{"text":"process, we chose 36, 24, and 32 workers for the three tasks, respectively, from a pool of 481, 377, and 409 unique annotators. On average, our worker selection rate stands at 7.26%. Following the qualification rounds, workers are required to complete another instruction round. This round contains complex questions selected by the authors, and workers are required to briefly explain the answer to each question. The authors will then double-check the explanations provided by the annotators and disqualify those with a poor understanding.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Annotation Interface","element":"span"}],[{"text":"For each task, we provide workers with comprehensive task explanations in layman’s terms to enhance their understanding. We also offer detailed definitions and several examples of each choice to help annotators understand how to make decisions. Each entry requires the worker to annotate using a four-point Likert scale. Workers are asked to rate the plausibility of the given question using this scale, where 1 signifies strong agreement and 4 indicates strong disagreement. We consider annotations with a value of 1 or 2 as plausible and those with a value of 3 or 4 as implausible. 
A snapshot of our annotation instructions, along with a snapshot showing the question released to the worker, are shown in Figure ","element":"span"},{"href":"#id-98","text":"6 ","element":"a"},{"text":"and Figure ","element":"span"},{"href":"#id-99","text":"7","element":"a"},{"text":". To ensure comprehension, we require annotators to confirm that they have thoroughly read the instructions by ticking a checkbox before starting the annotation task. We also manually monitor the performance of the annotators throughout the annotation process and provide feedback based on common errors. Spammers or underperforming workers will be disqualified. The overall inter-annotator agreement (IAA) stands at 81% in terms of pairwise agreement, and the Fleiss kappa (","element":"span"},{"href":"#id-47","referenceIndex":21,"text":"Fleiss","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":21,"text":"1971","element":"a"},{"text":") is 0.56. These statistics are generally comparable to or slightly higher than those of other high-quality dataset construction works (","element":"span"},{"href":"#id-14","referenceIndex":56,"text":"Sap et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":56,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-100","referenceIndex":19,"text":"Fang ","element":"a"},{"href":"#id-100","referenceIndex":19,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-100","referenceIndex":19,"text":"2021a","element":"a"},{"text":",","element":"span"},{"href":"#id-101","referenceIndex":20,"text":"b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-102","referenceIndex":31,"text":"Hwang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-102","referenceIndex":31,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-103","referenceIndex":39,"text":"Li et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-103","referenceIndex":39,"text":"2025","element":"a"},{"text":"; ","element":"span"},{"href":"#id-104","referenceIndex":6,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-104","referenceIndex":6,"text":"2023","element":"a"},{"text":"), which indicates that the annotators are close to achieving a strong internal agreement.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Expert Verification","element":"span"}],[{"text":"Finally, we enlist the help of three postgraduate students, each with extensive experience in NLP research, to validate the annotations. These students are given the same instructions as those provided to the crowd-sourcing workers and are asked to verify a sample of 100 annotations for each task. The ","element":"span"},{"text":"high level of consistency between our expert annotators and the AMT annotators, as demonstrated in Table ","element":"span"},{"href":"#id-48","text":"1","element":"a"},{"text":", suggests that our AMT annotation is of high quality.","element":"span"}]]},{"heading":"E Additional Experiments and Analysis","paragraphs":[[{"text":"In this section, we include additional analytical experiments to provide better support for our claims in M","element":"span"},{"text":"ARS","element":"span"},{"text":".","element":"span"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Multi-task Fine-tuning on","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Setup","element":"span"}],[{"text":"To achieve conscious processing, an ideal language model should be capable of performing three tasks uniformly and sequentially. However, fine-tuning each task separately contradicts this objective, as it results in a model that can only perform one task after one training. 
Therefore, in this section, we investigate the possibility of enabling a language model to master all tasks simultaneously through multi-task fine-tuning. Given that all three tasks are binary classification tasks, we adopt a straightforward approach. The language model is trained using a randomly shuffled combination of training data from all three tasks. We expect the model to learn all tasks jointly in this way. The best checkpoint is chosen based on achieving the highest accuracy on the validation sets of all three tasks. After training, the model performance is evaluated separately on the testing sets of each task. All training details remain consistent with those explained in Appendix ","element":"span"},{"href":"#id-97","text":"C.1","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Results and Analysis","element":"span"}],[{"text":"The results are presented in Table ","element":"span"},{"href":"#id-105","text":"9","element":"a"},{"text":". Upon analyzing these results, we observe that LLMs fine-tuned in a multi-task setting generally outperform those simply fine-tuned on the respective training data for each task. This observation is interesting as it suggests that training the model uniformly across all three tasks can enhance the entire process simultaneously, thereby improving reasoning with changes in distribution. This implies that LLMs can potentially mimic human learners, who are better equipped to reason about changes by collectively understanding the feasibility, consequence, and necessity of such changes. Such a phenomenon indirectly indicates that the tasks in our formulation are indeed interconnected and collectively form a reasoning pipeline. However, it’s important to note that this improvement is only marginal. 
LLMs still ","element":"span"},{"text":"exhibit limited metaphysical reasoning ability, particularly in the metaphysical event discrimination task. More advanced methods are still required to enable LLMs to achieve metaphysical reasoning.","element":"span"}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Few-shot Fine-tuning on","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Setup","element":"span"}],[{"text":"From the main evaluation results in Table ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", it is evident that fine-tuning consistently enhances the performance of all models on","element":"span"}],[{"style":{"width":"100%"},"width":877,"height":1076,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/22-0.png","element":"img"}],[{"text":"The results are reported in Table ","element":"span"},{"href":"#id-106","text":"10","element":"a"},{"text":". From these results, we observe that training the model with a few-shot training data sample generally has a negative impact across all tasks in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS","element":"span"},{"text":". However, this impact is not significant, and on rare occasions, the sampled training data even leads to superior results compared to training on the full sets. When the training data is reduced to different ratios (80%, 60%, 40%, and 20%), the performance of the models is not significantly affected. This suggests that the models are capable of learning from a small amount of training data and that performance is not significantly influenced by the size of the training data. In other words, annotating more data for training does not necessarily result in better performance, indicating that our task cannot be simply resolved by increasing training data. 
Future research can explore more advanced reasoning paradigms or training methods to further enhance the capabilities of LLMs in metaphysical reasoning.","element":"span"}],[{"id":"id-108","style":{"width":"81%"},"width":1481,"height":330,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/23-0.png","element":"img"}],[{"text":"Table 7: Evaluation results (%) of GPT-4o on M","element":"figcaption","subtype":"caption"},{"text":"ARS ","element":"figcaption","subtype":"caption"},{"text":"constructed with different backbone LLMs.","element":"figcaption","subtype":"caption"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"E.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Fine-tuned PTLMs vs. Fine-tuned LLMs","element":"span"}],[{"text":"To explain why fine-tuned PTLMs perform better than fine-tuned LLMs, we first hypothesize that PTLMs converge faster on the training data due to their smaller number of parameters and their full fine-tuning paradigm (compared to LoRA when fine-tuning LLMs). This results in better fine-tuned performance than LLMs. Although LLMs have lower performance, they exhibit stronger generalizability to other tasks. We fine-tune a DeBERTa-v3 model with 25% and 50% of the training data and report its performance in Table ","element":"span"},{"href":"#id-106","text":"10","element":"a"},{"text":". From the results, we observe that when we reduce the training data for PTLMs, they are hardly comparable to fine-tuned LLMs. However, the last 50% of randomly sampled data brings significant improvements. While we cannot determine the exact reason due to the black-box nature of these language models, we believe that PTLMs fit the distribution of the training data or human annotations more quickly, resulting in better outcomes on human-annotated evaluation sets. LLMs are more likely to learn how to make correct inferences rather than simply fitting the data. 
Another possible reason is that we use LoRA to fine-tune LLMs due to limited computational resources; fully fine-tuning LLMs might further enhance their performance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inherent Bias in M","element":"span"},{"style":{"fontWeight":"bold"},"text":"ARS ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Construction","element":"span"}],[{"text":"One concern regarding the M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark is the potential bias introduced by using GPT-series models, specifically ChatGPT, for dataset construction. Our approach to constructing M","element":"span"},{"text":"ARS ","element":"span"},{"text":"was guided by the need to balance scalability with quality. In pilot studies evaluating metaphysical reasoning across various models, GPT-series models consistently demonstrated the highest levels of creativity and reliability. Based on these findings, we selected GPT as the primary backbone for data generation. Constructing M","element":"span"},{"text":"ARS","element":"span"},{"text":", however, required extensive manual annotation, as LLMs often fail to provide ","element":"span"},{"text":"accurate labels for complex reasoning tasks. This manual verification process made it impractical to create multiple versions of M","element":"span"},{"text":"ARS ","element":"span"},{"text":"using different backbone LLMs due to the expensive human labor required. Thus, to address concerns about potential biases arising from reliance on ChatGPT, we conducted additional experiments by constructing two smaller versions of the M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark. 
These alternative benchmarks utilized data generated from two different LLMs, Claude-3.5-sonnet (","element":"span"},{"href":"#id-107","referenceIndex":3,"text":"Anthropic","element":"a"},{"text":", ","element":"span"},{"href":"#id-107","referenceIndex":3,"text":"2024","element":"a"},{"text":") and LLAMA 3.1-70B (","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"Dubey et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"2024","element":"a"},{"text":"), in each step, to obtain 200 evaluation data entries per task in M","element":"span"},{"text":"ARS","element":"span"},{"text":". All samples underwent expert annotation to collect ground-truth labels. We then evaluate GPT-4’s zero-shot and few-shot performance on these alternative benchmarks alongside the original M","element":"span"},{"text":"ARS","element":"span"},{"text":".","element":"span"}],[{"text":"The results are shown in Table ","element":"span"},{"href":"#id-108","text":"7","element":"a"},{"text":". We observe that using different LLMs as backbones for M","element":"span"},{"text":"ARS ","element":"span"},{"text":"construction results in similar performance by GPT-4 across zero-shot and few-shot settings. Overall, the difficulty of the M","element":"span"},{"text":"ARS ","element":"span"},{"text":"benchmark remains robust and consistent, irrespective of the backbone LLM used during dataset generation. These experiments demonstrate that the reliance on ChatGPT for the original M","element":"span"},{"text":"ARS ","element":"span"},{"text":"construction does not compromise the benchmark’s validity or difficulty. 
The results reinforce the reliability of MARS as a comprehensive test of metaphysical reasoning, with its complexity surpassing any potential biases introduced by the specific LLM used in data collection.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Binary Task Design in M","element":"span"},{"style":{"fontWeight":"bold"},"text":"ARS","element":"span"}],[{"text":"In M","element":"span"},{"text":"ARS","element":"span"},{"text":", all tasks are designed as a binary prediction task to facilitate automated and easy label collection and evaluation. Here, we discuss the reason and some pilot analysis behind such task design by considering other task formulations, including multiple-choice, open-ended generation, and binary evaluation.","element":"span"}],[{"text":"Multiple-choice tasks, while structured and amenable to automated evaluation, posed signif- ","element":"span"},{"text":"icant challenges in collecting high-quality negative (distractor) options. Relying on human annotators to create distractors proved labor-intensive and impractical for scaling, as it required drafting multiple plausible but incorrect options for each question. As a result, we adopted open-ended generation and binary evaluation, ultimately choosing a generate-then-annotate paradigm. 
This approach involved two stages: first, evaluating the performance of LLMs in generating metaphysical cases during the generation phase; second, annotating the generated cases with binary labels (correct/incorrect).","element":"span"}],[{"text":"To complement the binary evaluation results, we also included human annotation results for ChatGPT’s performance in generating metaphysical data, as indicated in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Majority ","element":"span"},{"text":"row of Table ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", which can be regarded as following a generative task paradigm. The results demonstrate that, even when the task is framed as a generation task, ChatGPT struggles with metaphysical reasoning. The low proportion of human-annotated correct generations highlights the difficulty of reasoning about metaphysical changes, regardless of task formulation. While binary evaluation offers clear performance metrics and scalability advantages, the generation task provides complementary insights into the model’s creative and reasoning capabilities. Together, these observations underscore the importance of improving LLMs’ ability to reason about distributional and situational changes, which is crucial for advancing their metaphysical reasoning capabilities.","element":"span"}]]},{"heading":"F Case Studies","paragraphs":[[{"text":"In this section, we present some examples for each of the three tasks in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"to help readers better understand our benchmark. The examples are displayed in Table ","element":"span"},{"href":"#id-109","text":"11","element":"a"},{"text":". 
We observe that examples in ","element":"span"},{"text":"M","element":"span"},{"text":"ARS ","element":"span"},{"text":"typically require careful reasoning and consideration of the plausibility of occurrences in reality or the metaphysical realm to make the correct discrimination.","element":"span"}],[{"id":"id-96","style":{"width":"87%"},"width":1581,"height":1914,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/25-0.png","element":"img"}],[{"text":"Table 8: Full evaluation results (%) of various language models on the testing sets of M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":". The best performances within each method are underlined and the best among all methods are ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"bold-faced","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-105","style":{"width":"96%"},"width":1757,"height":1058,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/26-0.png","element":"img"}],[{"text":"Table 9: Evaluation results (%) of LLMs fine-tuned on","element":"figcaption","subtype":"caption"}],[{"id":"id-106","style":{"width":"98%"},"width":1797,"height":1383,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/26-1.png","element":"img"}],[{"text":"Table 10: Evaluation results (%) of LLMs fine-tuned on ","element":"figcaption","subtype":"caption"},{"text":"M","element":"figcaption","subtype":"caption"},{"text":"ARS ","element":"figcaption","subtype":"caption"},{"text":"under the few-shot setting. 
Training data refers to the ratio of sampled training data from the full training sets of ","element":"figcaption","subtype":"caption"},{"text":"M","element":"figcaption","subtype":"caption"},{"text":"ARS","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-98","style":{"width":"99%"},"width":1815,"height":190,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/27-0.png","element":"img"}],[{"text":"Hi! Welcome to our main round HITs. Thanks for contributing to our HIT! ","element":"span"},{"text":"Please read the following instructions carefully before starting the survey. Please don't spam our HITs as there are pre-defined answers. If your performance is too poor we will disqualify you. ","element":"span"},{"text":"In this survey, you will be given some events and their inferential inferences in the format of if... then... For each sentence, your task is to determine whether you think it is plausible and commonly appears in our normal life (in the reality) or it's a metaphysical inference that is implausible and unlikely to happen in our real world. If you cannot understand the sentence as there are fatal logic, wordings, or grammar mistakes, please select the implausible option. Note that for each sentence, there is a pre-defined answer. Please answer carefully! Too low correctness rate will lead to the disqualification of the HITs.","element":"span"}],[{"style":{"width":"99%"},"width":1808,"height":1548,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/27-1.png","element":"img"}],[{"text":"Figure 6: Our annotation instruction for the workers at the metaphysical inference discrimination task. 
Workers are provided with both task explanations and detailed examples.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"2%"},"width":41,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/27-2.png","element":"img"}],[{"text":"If a person sneezes, then they will immediately transform into a unicorn.","element":"span"}],[{"text":"If a person eats a sandwich, then they will become invisible for 24 ","element":"span"},{"id":"id-99","text":"hours.","element":"span"}],[{"text":"Figure 7: An example of a question that has been released to the worker. Workers are asked to annotate on a four-point Likert scale.","element":"figcaption","subtype":"caption"}],[{"id":"id-109","style":{"width":"99%"},"width":1815,"height":1348,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02106/images/28-0.png","element":"img"}],[{"text":"Table 11: Case studies of the three tasks in the ","element":"figcaption","subtype":"caption"},{"text":"M","element":"figcaption","subtype":"caption"},{"text":"ARS ","element":"figcaption","subtype":"caption"},{"text":"benchmark. ME, MI, and MT refer to the three tasks in metaphysical reasoning, respectively. ","element":"figcaption","subtype":"caption"},{"text":"P. ","element":"figcaption","subtype":"caption"},{"text":"refers to plausible in reality and ","element":"figcaption","subtype":"caption"},{"text":"M. ","element":"figcaption","subtype":"caption"},{"text":"refers to metaphysical. The original component before the change/transition is marked in ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"(grey)","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]