Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

2 weeks ago·arXiv

Abstract

Large Language Models (LLMs), whether closed-weights ones like GPT-3.5/4, Claude, and Gemini or open-weights ones like LLaMa 2/3, Mistral, Mixtral, and the more recent Dbrx or Command R+, are often described as instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while often providing nonsensical "reasoning"-like explanations akin to confabulations to justify and back up the validity of their clearly failed responses, making them sound plausible. Various standard interventions in an attempt to get the right solution, like various types of enhanced prompting, or urging the models to reconsider the wrong solutions again by multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks.

1 Introduction

In the recent breakthroughs on transferable learning that were achieved in various classical domains of machine learning like visual recognition [1] or language understanding [2, 3, 4], large language models (LLMs) have played a very prominent role. Auto-regressive language modelling by next token prediction using causal mask losses was among the first successful approaches to self-supervised learning that was also scalable, both in terms of available web-scale text data and model scale [4]. The generic form and scalability of this type of learning then made it possible to push towards training scales not achievable before with conventional supervised label-based learning, and provided a glimpse of what happens at those larger scales. Scaling laws derived via training experiments on much smaller scales (2-3 orders of magnitude in model and data scale) made it possible to predict the properties and functions of models trained on larger scales [5, 6].

Using standardized benchmarks for language-based tasks, those predictions were confirmed by training very large language models like closed GPT-3/4 [4, 5, 7] or open-weights models like Llama [8, 9] or Mistral [10], and observing high benchmark scores across a broad range of task types, which matched the predictions of strong function and transfer to novel task types and scenarios. LLMs excelled especially in few-shot and zero-shot tasks [11], outperforming the previous state-of-the-art (SOTA) by extremely large margins and showing so-called emergent functions like in-context learning not encouraged explicitly during pre-training, providing further hints of generalization and transfer capabilities of a very different kind from those that were possible in the machine learning field before.

There were, however, observations made by various works that questioned the claimed strong out-of-distribution generalization, transfer and reasoning capabilities attributed to LLMs [12]. These works pointed out various function failures that were seemingly incompatible with postulated strong capabilities as measured by standardized benchmarks [13, 14, 15, 16]. However, it has also been noted that observed failures can frequently be addressed through simple adjustments to the prompts, by repeated execution and evaluation using majority voting, or by requesting the model to perform self-verification [17, 18, 19, 20, 21]. This raises the question of whether there is a real deficit in core model capabilities or rather a minor issue that can be easily resolved by more careful, but simple, model handling.

To shed light on this current situation, we introduce here a simple, short conventional problem that is formulated in concise natural language and can be solved easily by humans. The original problem formulation, of which we will present various versions in our investigation, is as follows: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?". The problem features a fictional female person (as hinted by the "she" pronoun) called Alice, providing clear statements about her number of brothers and sisters, and asking a clear question to determine the number of sisters a brother of Alice has. The problem has a light quiz style and is arguably no challenge for most adult humans, and probably not even a hard problem to solve via common sense reasoning if posed to children above a certain age.

We posed varying versions of this simple problem (which in the following we will refer to as the "Alice In Wonderland problem", AIW problem) to various SOTA LLMs that claim strong reasoning capabilities. We selected closed ones like GPT-3.5/4/4o (OpenAI), Claude 3 Opus (Anthropic [22]), Gemini (Google DeepMind [23]), and open-weights ones like Llama 2/3 (Meta), Mistral and Mixtral (Mistral AI), including the very recent Dbrx by Mosaic [24] and Command R+ by Cohere [25] (which are stated in numerous announcements to lead the open-weights models as of April 2024, according to open LLM leaderboards). We analyse the response statistics and observe strong collapse of reasoning and inability to answer the simple question as formulated above across most of the tested models, despite claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4, which occasionally manage to provide correct responses backed up with correct reasoning, as evident in the structured step-by-step explanations those models deliver together with the solution. However, Claude 3 Opus and GPT-4 still show frequent failures to solve this simple problem across trials. Following the relational logic of the problem, we formulated a harder form, where both Claude 3 Opus and GPT-4o collapse to a success rate of almost 0.

This breakdown can be considered to be dramatic not only because it happens on such a seemingly simple problem, but also because models tend to express strong overconfidence in reporting their wrong solutions as correct, while often providing confabulations to additionally explain the provided final answer, mimicking reasoning-like tone but containing nonsensical arguments as backup for the equally nonsensical, wrong final answers. Those confabulated explanations may mislead readers into thinking that there might be sound reasoning behind the wrong answers, or at least stir confusion, as they often sound plausible while being entirely or partly off. The breakdown appears dramatic also because when attempting to fix the failures via various usual interventions like enhanced prompting or by explicitly pointing the models to committed mistakes and requesting to reconsider the responses, the models keep producing more nonsense, often in lengthier and sometimes more entertaining form, leading stubbornly to the same wrong final answers. We provide examples of such failures in various SOTA models that claim strong reasoning.

In the face of these observations and findings, we conclude that the capabilities of the current generation of SOTA LLMs to perform even simple reasoning on common sense tasks are heavily compromised and that current language model benchmarks, especially those aiming at measuring reasoning capabilities, do not properly reflect such weaknesses. We discuss how this can be consistent with the observed strong performance on a range of complex real world tasks (e.g., graduate exams) and finish with a call to the scientific community to address the challenge of re-assessing the reasoning capabilities of LLMs together by building reasoning task benchmarks that are able to spot such various deficits and thus show the path for improvement of the current unsatisfactory state.

2 Methods & Experiment Setup

2.1 Finding a simple common sense reasoning task that breaks the models

Following a number of works that attempt to test capabilities and identify weak spots of current LLMs, we were looking for a simple problem setting that would allow us to test a model’s ability to perform basic reasoning with a common sense character. We looked into tasks presented to elementary school students of age 7-10 on the level of math Olympiads composed for young children. Solving such tasks does not require sophisticated knowledge, but already contains challenges that are not entirely trivial and have to be solved by carefully applying various forms of basic reasoning.

Taking inspiration from those problems and aiming for even simpler settings, we arrived at a very simple problem template that can be easily solved using common sense reasoning but is not entirely straightforward, of the following form: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?". The problem - we call it here "AIW problem" - has a simple common sense solution which assumes all sisters and brothers in the problem setting share the same parents. The correct response C - number of sisters - is easily obtained by calculating M + 1 (Alice and her sisters), which immediately gives the number of sisters Alice’s brother has.
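The template and its solution can be written down in a few lines. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, and the singular/plural handling is an assumption about how the template is instantiated. It computes the answer once with the shortcut C = M + 1 and once by explicitly building the family and counting, under the stated assumption that all siblings share the same parents.

```python
def _count(n: int, noun: str) -> str:
    """Render '1 sister' / '6 sisters' (pluralization is an assumption)."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the AIW template with concrete values for N and M."""
    return (
        f"Alice has {_count(n_brothers, 'brother')} and she also has "
        f"{_count(m_sisters, 'sister')}. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(m_sisters: int) -> int:
    """Shortcut solution: Alice's M sisters plus Alice herself."""
    return m_sisters + 1


def aiw_answer_by_counting(n_brothers: int, m_sisters: int) -> int:
    """Same answer, obtained by listing the family and counting females."""
    if n_brothers < 1:
        raise ValueError("the question presupposes at least one brother")
    family = [("Alice", "F")]
    family += [(f"sister_{i}", "F") for i in range(1, m_sisters + 1)]
    family += [(f"brother_{i}", "M") for i in range(1, n_brothers + 1)]
    brother = "brother_1"
    # Every female family member other than the brother himself is his sister.
    return sum(1 for name, gender in family if gender == "F" and name != brother)


if __name__ == "__main__":
    print(aiw_prompt(4, 1))
    print(aiw_answer(1), aiw_answer_by_counting(4, 1))
```

The counting variant makes explicit why N never enters the answer: the number of brothers only has to be at least one for the question to be well posed.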

Initially, we assumed that the AIW problem would pose no challenge for most of the current state-of-the-art LLMs. In our initial experiments, we started by looking at the problem instance with N = 4 and M = 1, with the correct response being C = 2 sisters, and we noticed to our surprise that, on the contrary, most state-of-the-art models struggle severely to provide correct responses when confronted with the problem. We also noticed during early experimentation that, depending on the choice of N and M and also the ordering of brothers and sisters in the sentence, the rate of correct responses may vary substantially.

Often, models seemed to rely on attempting to execute various basic arithmetical operations on the numbers mentioned in the problem to arrive at a final answer. The probability of getting a correct response was thus influenced by how likely executing a few arbitrary simple calculations, like additions or multiplications that included numbers presented in the problem text, might also accidentally result in a correct answer, although the way of arriving there had nothing to do with correct reasoning (see Appendix G for a selection of such examples). After rounds of experimentation with instances of the AIW problem, varying numbers and ordering, we arrived at four different problem instances, or problem variations, which we term "AIW Variation" 1-4. Each variation has a particular combination of N, M numbers with a corresponding correct answer C, and ordering of brothers-sisters in the AIW problem as presented above: AIW Variation 1, N = 3, M = 6, C = 7; AIW Variation 2, N = 4, M = 2, C = 3; AIW Variation 3, N = 1, M = 4, C = 5; AIW Variation 4, N = 4, M = 1, C = 2. When choosing variations, we aimed at reducing the probability of arriving at a correct answer by chance by executing an arbitrarily chosen simple one-step arithmetical operation on numbers in the problem (only Variation 3 allows that; see Suppl. Sec. B for the full version of the prompts constructed from those variations).
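The claim that only Variation 3 can be solved by an accidental one-step calculation can be checked mechanically. The sketch below is an illustration under an assumption: "one-step arithmetical operation" is taken to mean a single binary operation (+, -, *, or exact division) applied to N and M in either order. The names `VARIATIONS` and `accidental_ops` are hypothetical.

```python
import operator

# AIW Variations 1-4 as listed above: id -> (N brothers, M sisters)
VARIATIONS = {1: (3, 6), 2: (4, 2), 3: (1, 4), 4: (4, 1)}

# Assumed set of "simple one-step" operations on the two numbers.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def accidental_ops(n: int, m: int) -> list:
    """Return the operations on (N, M) that happen to yield C = M + 1."""
    target = m + 1
    hits = set()
    for a, b in ((n, m), (m, n)):  # try both operand orders
        for symbol, fn in OPS.items():
            if fn(a, b) == target:
                hits.add(symbol)
        # Exact integer division only, to stay within whole numbers.
        if b != 0 and a % b == 0 and a // b == target:
            hits.add("/")
    return sorted(hits)


if __name__ == "__main__":
    for vid, (n, m) in VARIATIONS.items():
        print(f"Variation {vid}: N={n}, M={m}, C={m + 1}, "
              f"accidental ops: {accidental_ops(n, m)}")
```

Under this reading, Variation 3 is hit by addition (1 + 4 = 5), while the other three variations are not reachable by any single operation in the assumed set.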

We then used this set of the AIW problem variations to conduct our main experiments, which allowed us to arrive at central conclusions of our investigation regarding basic reasoning capabilities of the studied LLMs.

2.2 Evaluating model responses

To perform evaluations of model performance, we were confronted with the question of how to parse and extract the final answer from the responses provided by the models when confronted with the input containing the AIW problem formulation. On the one hand, it should be possible to deal with any response provided by the model as a solution, while on the other hand it should be possible to extract a clear final answer as a number to be compared with the correct answer, such that for each model response a decision can be made on whether the provided response was right or wrong. To keep the parsing procedure simple (without involving, for instance, another suitable LLM prompted to extract the relevant part of the response), we chose to add the following passage to the problem prompt: "provide the final answer in following form: "### Answer: "". We observed that all models we chose to test were able to follow such an instruction, providing a response that could be easily parsed.
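A parser for responses that follow the "### Answer: " instruction can be very small. The following is a minimal sketch, not the authors' parsing code: the regular expression, the choice to take the last occurrence of the marker, and the function names are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches the requested marker followed by an integer, e.g. "### Answer: 7".
ANSWER_RE = re.compile(r"###\s*Answer:\s*(-?\d+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[int]:
    """Return the integer after the last '### Answer:' marker, or None."""
    matches = ANSWER_RE.findall(response)
    # Assumption: if a model repeats the marker, the last one is its final answer.
    return int(matches[-1]) if matches else None


def is_correct(response: str, correct: int) -> bool:
    """A response counts as correct only if a parsed number equals C."""
    return extract_answer(response) == correct


if __name__ == "__main__":
    reply = "Alice has 6 sisters, so her brother has 6 + 1 sisters.\n### Answer: 7"
    print(extract_answer(reply), is_correct(reply, 7))
```

A response that spells the number out in words, or omits the marker, is treated as unparsed and therefore incorrect in this sketch; a real evaluation might inspect such cases separately.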

The presented prompt extension makes it possible to extract for each prompting trial whether a model has provided a correct answer to the AIW problem posed in the input. We can interpret then any number n of collected responses as executing n trials given a particular prompt for a given model (n - number of Bernoulli trials), observing in each $i$-th trial a Bernoulli variable $X_i \in \{0, 1\}$. We interpret the number of correct responses $X = \sum_{i=1}^{n} X_i$ as random variable following a Beta-Binomial distribution with unknown probability p of correct response that we also treat as random variable that comes from a Beta distribution, i.e. $p \sim \mathrm{Beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are parameters of the Beta distribution. To obtain plots showing correct response ratios, we would like to estimate Beta distribution underlying p, and for that, we first estimate the mean of p and its variance from the collected observations. To estimate $\hat{p}$, we use the formula for estimating the mean of p for a binomial distribution: $\hat{p} = X/n$ (i.e. as a proportion of successes). We can report the estimate $\hat{p}$ as the estimate of the correct response rate of a given model and also, compare the correct response rates of various tested models. Moreover, we can estimate the variance of the probability of a correct response by using the following formula:

$$\mathrm{Var}\left(\frac{X}{n}\right) = \frac{\mathrm{Var}(X)}{n^2} = \frac{p(1-p)}{n}$$

The estimates of the variance and the standard deviation of p can be thus obtained by using $\hat{p}$ as

$$\hat{\sigma}^2 = \frac{\hat{p}(1-\hat{p})}{n}, \qquad \hat{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

We use the following relations for the variance, $\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$, and the mean, $\mu = \frac{\alpha}{\alpha+\beta}$, in order to obtain $\alpha$ and $\beta$ parameters for the Beta distribution. To simulate data for the plots, we draw N random samples corresponding to correct and incorrect responses using the estimated distribution of p and obtain the plots showing performance on the task for various models of interest as a full distribution of the respective p.
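The estimation procedure amounts to moment matching, which can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation. It assumes the standard estimators (mean X/n, variance p(1-p)/n with the estimate plugged in for p) and the standard Beta moments; solving those for the parameters gives alpha = mu * k and beta = (1 - mu) * k with k = mu * (1 - mu) / var - 1. The function names and the fixed seed are hypothetical.

```python
import random
from typing import List, Tuple


def estimate_rate(successes: int, n: int) -> Tuple[float, float]:
    """Estimate the mean and variance of the correct response rate p."""
    p_hat = successes / n
    var_hat = p_hat * (1 - p_hat) / n
    return p_hat, var_hat


def beta_params(mean: float, var: float) -> Tuple[float, float]:
    """Method-of-moments Beta parameters for a given mean and variance."""
    if not 0 < mean < 1 or var <= 0:
        # With 0 or n successes the variance estimate is 0 and no Beta fits.
        raise ValueError("need 0 < mean < 1 and var > 0")
    k = mean * (1 - mean) / var - 1
    if k <= 0:
        raise ValueError("variance too large for a Beta distribution")
    return mean * k, (1 - mean) * k


def sample_p(successes: int, n: int, size: int, seed: int = 0) -> List[float]:
    """Draw samples of p from the fitted Beta distribution (for plotting)."""
    alpha, beta = beta_params(*estimate_rate(successes, n))
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(size)]


if __name__ == "__main__":
    # Example: 12 correct responses out of 30 trials (made-up numbers).
    p_hat, var_hat = estimate_rate(12, 30)
    print(p_hat, var_hat, beta_params(p_hat, var_hat))
```

With the plug-in variance, k simplifies to n - 1, so the fitted parameters are (n - 1) times the estimated rate and (n - 1) times its complement. Models with zero correct responses have no proper Beta fit under this scheme and would need separate handling.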

Model prompt types. It is well known that so-called prompt engineering can heavily influence the model behavior and model response quality [26, 27, 28]. To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes. This should encourage the model to invest more computation into obtaining the solution. In contrast to this, the RESTRICTED prompt urges the model to output only the final answer without any further text. This is supposed to restrict the compute invested in producing output. We observe substantially shorter outputs across tested models compared to STANDARD and THINKING for this prompt type (Suppl. Fig. 13).
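The relationship between the three prompt types can be illustrated with a small builder. This is a sketch only: apart from the answer-format passage quoted in Sec. 2.2, the wording of each prompt below is a hypothetical paraphrase of the description above and not the actual prompt text, which is given in Suppl. Sec. B. The name `build_prompt` is likewise an assumption.

```python
# Answer-format passage as quoted in Sec. 2.2.
ANSWER_FORMAT = 'provide the final answer in following form: "### Answer: "'

PROMPT_TYPES = ("STANDARD", "THINKING", "RESTRICTED")


def build_prompt(problem: str, prompt_type: str) -> str:
    """Wrap an AIW problem statement in one of the three prompt types.

    The surrounding wording is hypothetical; only ANSWER_FORMAT is quoted.
    """
    standard = f"{problem} Solve this problem and {ANSWER_FORMAT}."
    if prompt_type == "STANDARD":
        return standard
    if prompt_type == "THINKING":
        # THINKING extends STANDARD with a request to double check.
        return (standard + " Think carefully and double check the solution "
                "for any mistakes.")
    if prompt_type == "RESTRICTED":
        # RESTRICTED asks for the final answer only, with no further text.
        return (f"{problem} Output only the final answer without any further "
                f"text; {ANSWER_FORMAT}.")
    raise ValueError(f"unknown prompt type: {prompt_type}")


if __name__ == "__main__":
    problem = ("Alice has 3 brothers and she also has 6 sisters. "
               "How many sisters does Alice's brother have?")
    for prompt_type in PROMPT_TYPES:
        print(prompt_type, "->", build_prompt(problem, prompt_type))
```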

Furthermore, we make use of several prompt types (see Suppl. Sec. B for an overview) to demonstrate the important properties and the different success or failure modes of the model behavior for the AIW problem. In those prompts, we re-use the main problem formulation as introduced in Sec. 2.1, while adding various modifications. This allows us, for instance, to see examples of confabulations that contain clearly broken reasoning-like statements backing up wrong final answers, expression of levels of certainty highlighting overconfidence and miscalibration, or in-context learning handling.

2.3 Selecting models for evaluation and conducting experiments

We are interested in testing current state-of-the-art models that claim strong function, especially in reasoning, backed up by high scores shown on standardized benchmarks that are assumed to measure problem solving and reasoning capabilities. We therefore select models widely known and used in the ML community that also appear in the top rankings of the popular LLM leaderboards, like openLLM leaderboard by HuggingFace or ELO leaderboard by LMsys. All those models are listed in Suppl. Tab. 4, together with the corresponding standardized benchmarks where they obtain strong scores. Whenever possible, we chose from the same family of models with various scales, ranging from small to large, in order to see how the capabilities to solve the posed task may vary with scale.

Having selected the models to test, we conduct our main experiments by exposing each model to the collection of prompts that contain AIW problem variations as described in Sec. 2.1, 2.2. We use hosting platforms that offer API access to the models we test, and automate the procedure by scripting the routines necessary to prompt models with our prompt set. The routines are simple and can be used by anybody with access to the APIs (we used litellm and TogetherAI for our experiments) or to locally hosted models to reproduce and verify our results. We log all the data from interactions with the models to enable checking by the community. The collected data is then used to investigate correct response rates of the models as described in Sec. 2.2. We also use the distinct prompt types STANDARD, THINKING, RESTRICTED to observe differences in model response depending on the compute the model is encouraged to invest into producing the output. For each model, we obtain at least 30 trials for each combination of AIW variation and prompt type (Suppl. Fig. 12). We also use variations of the prompts to study responses given by base models as a control experiment to check whether we see any difference with respect to the main observations made on instruction-type models, and we employ further prompt variations to demonstrate the model confidence estimation and self-verification or self-correction attempts (see Suppl. Sec. B for examples of such prompts).
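The collection procedure is essentially a nested loop over models, prompts and repeated trials, with every interaction logged. The sketch below illustrates that structure under stated assumptions: `query_model` is a hypothetical callable that the caller would implement on top of whichever API client or local model is available (no specific client API is assumed here), and the record fields and JSON Lines format are illustrative choices, not the authors' logging format.

```python
import json
from typing import Callable, Dict, Iterable, List


def collect_trials(
    query_model: Callable[[str, str], str],
    models: Iterable[str],
    prompts: Dict[str, str],
    n_trials: int = 30,
) -> List[dict]:
    """Query each model n_trials times per prompt and log every response.

    query_model(model_name, prompt_text) -> response_text is supplied by the
    caller; prompts maps a prompt id (e.g. variation + prompt type) to text.
    """
    records = []
    for model in models:
        for prompt_id, prompt in prompts.items():
            for trial in range(n_trials):
                records.append({
                    "model": model,
                    "prompt_id": prompt_id,
                    "trial": trial,
                    "prompt": prompt,
                    "response": query_model(model, prompt),
                })
    return records


def to_jsonl(records: List[dict]) -> str:
    """Serialize the interaction log as JSON Lines for later inspection."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    # Stub standing in for a real API call; always answers 7.
    def fake_model(model: str, prompt: str) -> str:
        return "### Answer: 7"

    log = collect_trials(fake_model, ["model-a"], {"v1_standard": "..."}, n_trials=3)
    print(to_jsonl(log))
```

Passing the query function in as a parameter keeps the loop independent of any hosting platform and lets the same routine run against a stub, an API, or a locally hosted model.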

3 Results

Figure 1: Alice is reasoning: will it break? Illustration of Humpty Dumpty from Through the Looking Glass, by John Tenniel, 1871. Source: Wikipedia.

3.1 Humpty Dumpty sat on a wall: breakdown of SOTA LLMs on the simple AIW problem

AIW causes collapse of reasoning in most SOTA LLMs. Following the procedures described in Sec. 2.3, we expose the selected models that claim strong function and reasoning capabilities (Fig. 1) to the various prompt types containing the problem description and assess their performance in the form of correct response rate. We summarize the main results in Fig. 2. The results suggest that, confronted with the AIW problem described in Sec. 2.1, most of the models suffer a severe breakdown, with many not being able to deliver a single correct response and the majority not being able to obtain a correct response rate above p = 0.2. The only major exceptions to this main observation of model reasoning failure are the largest scale closed models GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). These two model types at the largest scales obtain correct response rates well above p = 0.3, leaving the remaining large and smaller scale open-weights models (e.g., Mistral-7B, Mixtral, Qwen, Command R+, and Dbrx Instruct) and closed-weights models (e.g., Gemini Pro, Mistral Large) far behind.

Figure 2: Collapse of most SOTA LLMs on AIW problem. Models with non-zero AIW (main) and AIW+ (inlay) correct response rate (averaged across prompt variations with prompt types THINKING and STANDARD). Leading on AIW, GPT-4o collapses strongly on AIW+. Omitted models score 0.

The results presented in Fig. 2 show estimates for correct response rates for STANDARD and THINKING prompt types - we separate RESTRICTED prompt type results since this prompt forces models into short outputs, restricting the compute they can spend on providing a solution and further deteriorating the response quality (see Suppl. Sec. C for more details). Among the 4 models that are able to cross p = 0.3, two clear winners are the very recent GPT-4o and Claude 3 Opus, while the only open-weights model in this set of better performers is the rather older Llama-2 70B Chat. By inspecting the responses with correct answers provided by those better performers, we indeed see mostly correct reasoning executed to arrive at the final correct answer. For the models that do not perform well and are able to deliver correct answers only seldom, we still see proper reasoning in some of those very rare responses with a correct final answer, for instance in the case of Mistral/Mixtral, Dbrx Instruct, CodeLLama. We also see, however, responses with a correct final answer which, after careful inspection, turns out to be an accident of executing entirely wrong reasoning, where many accumulating mistakes accidentally lead to the final number corresponding to the right answer. Those wrong-reasoning-right-answer responses are encountered in models that perform poorly (p < 0.3) (see Suppl. Sec. D for response examples). Further, we also observe strong fluctuation of correct response rates across AIW variations 1-4 as introduced in Sec. 2.1 (Suppl. Sec. C, Fig. 11) and also across prompt types, especially RESTRICTED vs STANDARD and THINKING (Suppl. Fig. 9).

Standardized benchmarks failure. All of the tested models report high scores on various standardized benchmarks that claim to test reasoning function, e.g. MMLU, ARC, Hellaswag, to name a few. Clearly, our observations hint that those benchmarks do not properly reflect deficits in basic reasoning of those models. We visualize this failure by plotting the performance that tested models are reported to obtain on widespread and accepted standardized benchmarks like MMLU versus the performance we observe on our proposed AIW problem. As is strikingly evident from Fig. 3, there is a strong mismatch between the high scores on MMLU reported by the models and the correct response rates they obtain on AIW. On the one hand, this miscalibration with respect to standardized benchmarks makes it impossible to predict from a model's higher score on MMLU whether the model will also be able to solve AIW. On the other hand, it undermines any model comparison based on the standardized benchmark, as models with higher MMLU can have complete breakdown on AIW, while models with lower MMLU can have some non-negligible AIW performance. A clear example of such a case is the comparison of LLama 2 70B with models claiming very strong performance like Command R+ or Dbrx. Those recent models, claiming strong function via scoring clearly higher on MMLU (and other standardized benchmarks) than the older LLama 2 70B, undergo severe breakdown on AIW - e.g. Command R+ is hardly able to solve any instance of an AIW task (see also Suppl. Tab. 4). For further results on other standardized benchmarks with similar character see Suppl. Sec. C.3, Suppl. Tab. 5 and Suppl. Fig. 4, 5, 6, 7.

Figure 3: Failure of standardized benchmark MMLU to properly reflect and compare model basic reasoning capabilities as shown by strong discrepancy between AIW correct response rate vs MMLU average score. Many models, e.g. Command R+, score 0 on AIW, but have high MMLU score.

Go small, go home: breakdown at smaller scales. The few models capable of showing a significant non-zero correct response rate for the AIW problem reside at the largest scales. GPT-4 and Claude 3 Opus have unknown scales; it is, however, reasonable to assume the model scale is well beyond 70B params and the token scale is well beyond 2T tokens. Observing the performance on the AIW problem across various models, we see evidence that LLama 2 is a model family that has significantly higher correct response rates compared to other models that have non-zero correct response rates. Within the LLama 2 family, while the 7B and 13B model scales achieve non-zero rates, 70B clearly outperforms those smaller scales (Fig. 2, 3, Suppl. Tab. 4). In general, we observe that smaller scale models (known to have been overtrained on large token budgets of > 2T tokens) that have quite high scores on standardized reasoning benchmarks, coming close to larger scale ones, suffer severe collapse on the AIW problem, and no small scale model can even remotely approach the performance shown by larger scale ones that can occasionally handle the AIW problem (LLama 2 70B, GPT-4 or Claude Opus).

3.2 Curiouser and curiouser

Following our observations of failure of most SOTA LLM models that claim strong function described in Sec. 3.1, we investigated various properties and modes of the observed failure, reporting here the ones we find most remarkable. Investigating the AIW problem further, we also find a formulation (termed AIW+) that is even harder to solve for all the tested SOTA models and we observe strong collapse of the performance across all tested models when exposing them to the AIW+ problem.

Overconfidence about wrong solutions. Observing strong failures, we were curious to see how models explain their generated solutions. For the THINKING prompt type, where the prompt contains a request to double check the solution, we encounter examples where models spontaneously provide an assessment of the solution quality and their confidence in the solution. Remarkably, we see that in many cases of the observed responses with wrong reasoning and wrong final answers, the models claim high quality for their provided solution and are also strongly confident that the provided wrong solution is correct. For instance, Claude 3 Opus uses expressions like “logic holds up; double-checking the solution; no mistakes in the reasoning; solution is correct.”, and Command R+ reports “This conclusion is straightforward and clear” for the wrong answers they provide.

We further use variations of the prompt types to make the models generate estimates of the solution quality and their confidence in it, like the SCIENTIST prompt or the CONFIDENCE prompt (see Suppl. Sec. B). With those customized prompt types, we again observe strong overconfidence in the solution quality across the tested models. For the SCIENTIST prompt type, we see for instance LLama 2 70B using persuasive expressions like “carefully analyzing; use logical reasoning; provide a precise and accurate solution; conclusion might seem counterintuitive at first, but it’s actually correct” to back up its wrong solutions. For the CONFIDENCE prompt type, we see for instance the wrong responses given by Command R+ accompanied by statements like “The solution is clear and unambiguous, and I am highly confident that it is correct.”; “I am confident in this answer, as it logically follows from the provided information.”. See Suppl. Sec. F for full examples.

Confabulations to back up wrong solutions. We observe that many models that show reasoning breakdown and produce wrong answers generate at the same time persuasive explanations that contain reasoning-like or otherwise plausible sounding statements to back up the often non-sensical solutions they deliver. We call here such phenomena confabulations. Such confabulations may contain for instance calculations or logic-like statements that make no sense. Confabulations can also refer to reasoning about social norms or structures. For instance, in Command R+ we observe many confabulations that use concepts of gender identity such as non-binary gender or concepts related to inclusion or to cultural context dependent family identification as additional backup for the provided wrong reasoning and incorrect answers. Another type of confabulation that we observe is complete refusal to answer due to invented ethical concerns about the nature of the posed AIW problem, such as violation of privacy or lack of inclusion (for instance in CodeLLama-70B-instruct), or by expressing incorrect concerns about supposedly ill-posed problem formulation. See Suppl. Sec. G for more details.

Hard AIW+ problem formulation: breakdown of even the strongest models. Aiming to show that further extension of the initially very simple AIW problem to a harder level will cause further trouble for the tested models, we constructed an AIW+ problem that uses the same logic as AIW, but features additional hierarchy and distractors when describing the relational family structure (see Suppl. Sec. B for the full formulation). We manually evaluate the responses for the AIW+ problem to make sure that the models we are testing could not just guess and accidentally come to the right solution. Exposing models to AIW+ following the same methodology from Sec. 2, we observe further, even stronger collapse of performance, also for those models that were showing significant correct response rates for the AIW problem. Remarkably, on AIW+ the GPT-4 (also GPT-4o) and Claude 3 Opus models claiming very strong function collapse to a level close to 0 (see Fig. 2 (inlay) and Suppl. Sec. C.2).

Further relevant observations. We outline here further experiments that provide hints to the nature of the observed failures and incapabilities.
1. Inability to revise wrong solutions. We experiment with customized prompts to enable multi-turn self-verification and also conduct multi-turn interactions with selected models to encourage those to revise their wrong solutions. In the majority of those attempts, while eagerly agreeing to revise the solutions and check those for possible mistakes, models show failure to properly detect mistakes and to revise wrong solutions (Suppl. Sec. H).
2. Reformulation of AIW as relational SQL database problem. We make use of the relational logic underlying the AIW problem structure and prompt models to reformulate AIW into a correct relational SQL database format, using a customized SQL-FORM prompt type. We observe that smaller scale models, and also some larger scale ones, consistently fail to generate a correct relational SQL form. Some models are, on the contrary, able to do so frequently, e.g., Mistral/Mixtral (Suppl. Sec. I).
3. Parameterized version AIW-Param. We use a parameterized AIW formulation containing variables (N, M; or X, Y) for brothers and sisters instead of natural numbers. We again observe a failure of the majority of the models to cope with this generic form, being unable to produce a correct generic solution, e.g. C = M + 1 (Suppl. Sec. C.1).
4. Base model performance. We inspect the performance of a selection of base models that claim strong function, e.g. Llama 2, Llama 3, Mistral/Mixtral, by using AIW-base prompt types compatible with base model function. We observe the same picture as with instruct models, measuring strong collapse on the AIW problem (Suppl. Sec. E).
5. In-context-learning (ICL). As a shortcut solution in the form of M + 1 exists for the AIW problem, we do not expect ICL to instill reasoning skills when models are confronted with solved examples of AIW problem instances. We execute the experiment to check whether showing problem instances with natural numbers would enable solving of the more general AIW-Param problem. We observe failure of models to do so (Suppl. Sec. J).
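To make the relational structure behind the SQL reformulation concrete, the following sketch encodes an AIW instance as a small in-memory SQLite database and answers the question with a self-join. It is an illustration only: the schema (`person` table with `gender` and `family_id` columns) and the function name are hypothetical and are not the form requested by the SQL-FORM prompt, which is described in Suppl. Sec. I.

```python
import sqlite3


def sisters_of_a_brother(n_brothers: int, m_sisters: int) -> int:
    """Count the sisters of one of Alice's brothers via a relational query."""
    if n_brothers < 1:
        raise ValueError("the question presupposes at least one brother")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, name TEXT, gender TEXT, family_id INTEGER)"
    )
    # Assumption: all siblings share the same parents, i.e. one family_id.
    rows = [("Alice", "F", 1)]
    rows += [(f"sister_{i}", "F", 1) for i in range(1, m_sisters + 1)]
    rows += [(f"brother_{i}", "M", 1) for i in range(1, n_brothers + 1)]
    conn.executemany(
        "INSERT INTO person (name, gender, family_id) VALUES (?, ?, ?)", rows
    )
    # Siblings of brother_1 are all other persons in the same family;
    # his sisters are the female ones among them (Alice included).
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM person AS sib
        JOIN person AS bro
          ON sib.family_id = bro.family_id AND sib.id != bro.id
        WHERE bro.name = 'brother_1' AND sib.gender = 'F'
        """
    ).fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    # AIW Variation 1: N = 3, M = 6
    print(sisters_of_a_brother(3, 6))
```

The query result agrees with the generic solution C = M + 1 for any number of brothers, which is the point of the parameterized AIW-Param formulation.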

4 Related work & limitations

Measuring LLM capabilities. Since the seminal breakthroughs in language modelling [2, 3, 4], measuring LLM capabilities has become indispensable for evaluation and model comparison. To measure how well a language model performs on reasoning, a plethora of standardized reasoning benchmarks exists. These benchmarks can be roughly divided into categories according to the specific reasoning capability they test; examples include ARC [29], PIQA [30], GSM8K [31], HellaSwag [32], MMLU [33], and WinoGrande [34]. Multiple works aim at improving the reasoning performance of LLMs, as measured by those standardized benchmarks, in various ways [27, 35, 36, 37, 38].

Finding weak spots in LLM capabilities. Paralleling the impressive progress shown by LLM research, cautious voices have been raising concerns about the discrepancy between claimed capabilities, as measured by standardized benchmarks, and true LLM reasoning skills, presenting carefully selected evidence for model failures [12]. In response, the research community has been attempting to create more challenging benchmarks like HELM [39] or BIG-bench [40]. These benchmarks also aim at properly testing generalization capabilities beyond memorization, in line with recent works that pointed out high test dataset contamination due to large-scale pre-training on web-scale data [14, 15].

Similar in spirit to our work, multiple studies [13, 41, 42, 43, 44, 45] have shown breakdowns of language models' reasoning capabilities in different scenarios and a lack of robustness to variations of the problem formulation. Other works looked into particular reasoning failures, such as deficits in causal inference [46, 47]. These works often operate with formalized, rather complex problems that do not have a simple common sense character. Here we show a breakdown on a common sense problem with a very simple structure, which emphasizes a deficiency in basic reasoning. A key limitation of our current approach is the lack of sufficient diversity in AIW problem variations. This can be addressed in future work by systematic procedural instance generation for broader response evaluation.
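The procedural instance generation named above as future work could look roughly like the following sketch. It assumes the AIW problem asks how many sisters Alice's brother has, so the ground truth is M + 1. The template wording, the name list, and the names `AIWInstance` and `generate_instances` are hypothetical and are not taken from the prompts used in the experiments.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AIWInstance:
    prompt: str
    n_brothers: int
    m_sisters: int
    answer: int  # ground truth under the assumed problem form: M + 1


# Hypothetical templates: same problem structure, different ordering of siblings.
TEMPLATES = (
    "{name} has {n} brothers and she also has {m} sisters. "
    "How many sisters does {name}'s brother have?",
    "{name} has {m} sisters and she also has {n} brothers. "
    "How many sisters does {name}'s brother have?",
)
# Hypothetical female names, so that the pronoun "she" stays correct.
NAMES = ("Alice", "Maria", "Sofia")


def generate_instances(count: int, max_siblings: int = 10, seed: int = 0) -> list[AIWInstance]:
    """Sample AIW-style instances with known ground-truth answers."""
    rng = random.Random(seed)  # a fixed seed keeps the generated set reproducible
    instances = []
    for _ in range(count):
        n = rng.randint(2, max_siblings)  # start at 2 to keep the plural forms valid
        m = rng.randint(2, max_siblings)
        template = rng.choice(TEMPLATES)
        name = rng.choice(NAMES)
        instances.append(
            AIWInstance(
                prompt=template.format(name=name, n=n, m=m),
                n_brothers=n,
                m_sisters=m,
                answer=m + 1,
            )
        )
    return instances
```

Because numbers, ordering, and names are sampled independently, the number of distinct instances grows combinatorially, which is one way to make memorization of individual instances less useful.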

5 Discussion & Conclusion

Using a very simple AIW problem formulation that can be easily solved by adults and arguably even by children, we observe here a striking breakdown of SOTA LLM performance when confronted with the task. This dramatic breakdown hints at serious deficits in basic reasoning capabilities in models that are widely claimed to possess strong function and reasoning skills, claims that often cite performance on a set of standardized benchmarks or the experience of various user groups or of the models' creators. The overall breakdown and the strong fluctuation of observed performance across variations of the same problem also hint at fundamental issues with the generalization capability of the models, which echoes and confirms concerns expressed in a number of previous works [48, 13, 15].

However, the evidence obtained in this study points to a more complex and puzzling picture than a straightforward story of out-of-distribution generalization failure for current SOTA LLMs. Despite the observed collapse of reasoning and performance on the AIW problem, accompanied by evident model miscalibration and overconfidence, confabulations alongside incorrect answers, and an inability to revise wrong solutions, we see larger-scale models like GPT-4 and Claude 3 Opus coping with the AIW problem, occasionally providing clearly correct reasoning backing up correct answers. Despite strong fluctuations across AIW variations, such correct reasoning leading to correct answers does appear, though with strongly varying frequency. This is also the case for AIW+, where GPT-4 and Claude 3 Opus suffer a further breakdown but still provide, on very rare occasions, correct reasoning-answer responses. The same is true for the much less performant models that show poor or very poor ability to cope with the AIW task (e.g., Mistral/Mixtral, Llama 2/3, Dbrx instruct): those models also manage to generate, on rare occasions, correct reasoning-answer responses across AIW variations. We hypothesize that generalization and core reasoning abilities are thus latently present in those models, as otherwise they would not be able to generate such responses at all, since guessing the correct answer together with full correct reasoning by accident is impossible in such cases. The fact that correct reasoning responses are rare and that model behavior is not robust to problem variations demonstrates, however, a deficiency in exercising proper control over these capabilities. Investigating the highly interesting question of what causes this deficiency is a subject of future work.
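To make "rare" and "strongly varying frequency" quantitative, one option is to estimate a correct response rate per model and AIW variation together with an uncertainty interval. The sketch below uses the Wilson score interval, which behaves sensibly for rates close to zero. This is an illustrative statistical choice and not necessarily the procedure used in this study; the function name `correct_response_rate` is hypothetical.

```python
import math


def correct_response_rate(n_correct: int, n_trials: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if n_trials <= 0 or not 0 <= n_correct <= n_trials:
        raise ValueError("need n_trials > 0 and 0 <= n_correct <= n_trials")

    p = n_correct / n_trials
    denom = 1.0 + z * z / n_trials
    center = (p + z * z / (2.0 * n_trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n_trials + z * z / (4.0 * n_trials**2)) / denom

    # Clamp to [0, 1] to guard against floating-point rounding at the boundaries.
    return p, max(0.0, center - half_width), min(1.0, center + half_width)
```

Intervals that do not overlap between two variations of the same problem would indicate that the observed fluctuation is unlikely to be sampling noise alone.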

What becomes quite clear through our study is the failure of current standardized benchmarks to reflect true model reasoning capabilities and to reveal their weaknesses. As is evident from Fig. 3 and Suppl. Tabs. 4, 5, many models claiming high standardized scores perform very poorly on AIW. At the same time, older models like Llama 2 70B, with lower MMLU, ARC-c and GSM8K scores, clearly outperform on AIW those claiming much higher scores, e.g., Command R+, which suffers a complete breakdown on AIW. This hints that model comparison using standardized benchmarks might be seriously jeopardized. The appraisal of smaller-scale models, e.g., Mistral-7B or Llama 2/3 7/8B, is based to a large extent on such standardized benchmarks, on which they come close to or even match larger-scale models. We observe here, however, a severe breakdown for smaller-scale models on AIW, with a clear and large gap to better-performing models, all of which reside at larger scales. We hypothesize that the claimed strong functions of smaller-scale models might be a mere illusion corroborated by broken benchmarks that, in their current state, cannot offer a proper model comparison and thus also cannot be used as downstream tasks for measuring important scaling laws.

We think that the observations made in our study should serve as a strong reminder that current SOTA LLMs are not capable of sound, consistent reasoning, as shown here by their breakdown on even such a simple task as the presented AIW problem, and that enabling such reasoning is still a subject of basic research. This should also be a strong warning against overblown claims that such models, rather than being basic research artifacts, can serve as problem solvers in various real-world settings. Such claims are often made by commercial entities in an attempt to position their models as a strong, mature product for end-users (see for instance the announcements and claims for Command R+, which breaks down on AIW entirely, emphasizing its high value for "key business-critical capabilities" or "real-world enterprise use cases" [25, 49, 50], and stressing the supposedly present core reasoning capabilities; the same is true for many other commercial models that claim high product value).

The observed breakdown of basic reasoning capabilities, coupled with such public claims (which are also based on standardized benchmarks), presents an inherent safety problem. Models with insufficient basic reasoning are inherently unsafe, as they will produce wrong decisions in various important scenarios that do require intact reasoning. Current standardized reasoning benchmarks, and the claims based on them, create an illusion of reasoning capabilities that are actually absent. To make matters worse, such models are overconfident, insisting that their wrong answers are correct and producing highly persuasive and suggestive explanations for their wrong responses, which might mask mistakes from end-users because the text sounds partly plausible (see Suppl. Sec. G for examples of such confabulations). To ensure safety, public claims should be based only on those scientific evaluations that properly measure a model's capability to reason, while basic research has to be performed using such benchmarks to endow future models with sufficient basic reasoning skills.

To perform basic research in the direction of improving the currently unsatisfactory reasoning skills of SOTA LLMs, it is thus important that the whole pipeline of model creation - dataset composition and the dataset itself, source code for training, the trained model itself, the standardized benchmarking procedure - is fully open and reproducible. Models that have open weights only do not allow for a proper analysis of what might have gone wrong during training and resulted in broken reasoning skills, for instance by changing the dataset mix or the training procedure itself. Closed models accessible only via API often do not even allow proper evaluation, as, for instance, default settings such as the system prompt and other inference hyperparameters may remain invisible to independent evaluation parties. We therefore think that proper progress in studying how to evaluate and instill proper reasoning skills in future models necessarily requires the full training pipeline of a model - especially the often neglected dataset composition - to be open-source (see recent work on FineWeb [51] for an encouraging example of such an open dataset composition pipeline), as otherwise the claims about reasoning capabilities will stay unsubstantiated and non-transparent.

Facing these initial findings, we would like to call upon the scientific and technological ML community to work together on providing the necessary updates of current LLM benchmarks, which obviously fail to discover important weaknesses and differences between the studied models. Such updates might feature sets of problems similar to the AIW problem studied here: simple, so as to probe a specific kind of reasoning deficiency, but customizable, thus offering enough combinatorial variety to provide robustness against potential contamination via memorization. We think that strong, trustworthy benchmarks should follow Karl Popper's principle of falsifiability [52] - not trying to confirm and highlight a model's capabilities, which is tempting especially in a commercial setting, but in contrast doing everything to break the model's function, highlighting its deficits and thus showing possible ways for model improvement, which is the way of the scientific method. Building such reasoning benchmarks in a common effort will give us a tool both to protect us from overblown claims about model function and to properly navigate the path towards improving the currently still unsatisfactory state.

Acknowledgments

We would like to express gratitude to all the people who are working on making code, models and data publicly available, advancing community based research and making research more reproducible. Specifically, we would like to thank all the members of the LAION Discord server2 community and Open-(Open-Sci) Collective3 for providing fruitful ground for scientific exchange and open-source development.

MN acknowledges funding by the Federal Ministry of Education and Research of Germany under grant no. 01IS22094B WestAI - AI Service Center West.

LCK acknowledges the Helmholtz Information & Data Science Academy (HIDA) for providing financial support enabling a short-term research stay at Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ) to conduct research on foundation models.

References

[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[5] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[6] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[8] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[9] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[10] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[11] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.

[12] Melanie Mitchell. How do we know how smart ai systems are? Science, 381(6654):eadj5957, 2023.

[13] Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv e-prints, pages arXiv–2307, 2023.

[14] Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493, 2023.

[15] Changmao Li and Jeffrey Flanigan. Task contamination: Language models may not be few-shot anymore. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18471–18480, 2024.

[16] Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in Neural Information Processing Systems, 36, 2024.

[17] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

[18] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.

[19] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.

[20] Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, and Weiming Lu. Self-contrast: Better reflection through inconsistent solving perspectives. arXiv preprint arXiv:2401.02009, 2024.

[21] Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12:484–506, 2024.

[22] Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024.

[23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[24] Mosaic. Introducing dbrx: A new state-of-the-art open llm.

[25] Cohere. Introducing command r+: A scalable llm built for business.

[26] Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, and Christopher Re. Ask me anything: A simple strategy for prompting language models. In The Eleventh International Conference on Learning Representations, 2022.

[27] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[28] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023.

[29] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

[30] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.

[31] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[32] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

[33] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.

[34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WINOGRANDE: an adversarial winograd schema challenge at scale, 2019.

[35] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

[36] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

[37] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.

[38] Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.

[39] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. Featured Certification, Expert Certification.

[40] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Johan Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, Cesar Ferri, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Christopher Waites, Christian Voigt, Christopher D Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, C. Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodolà, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Xinyue Wang, Gonzalo Jaimovitch-Lopez, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Francis Anthony Shevlin, Hinrich Schuetze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B Simon, James Koppel, James Zheng, James Zou, Jan Kocon, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh Dhole, Kevin Gimpel, Kevin Omondi, Kory Wallace Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros-Colón, Luke Metz, Lütfi Kerem Senel, Maarten Bosma, Maarten Sap, Maartje Ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramirez-Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael Andrew Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan Andrew Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter W Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan Le Bras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Russ Salakhutdinov, Ryan Andrew Chi, Seungjae Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel Stern Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima Shammie Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven Piantadosi, Stuart Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsunori Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Venkatesh Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023.

[41] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.

[42] Martha Lewis and Melanie Mitchell. Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. arXiv preprint arXiv:2402.08955, 2024.

[43] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288, 2023.

[44] Arsenii Kirillovich Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. Transactions on machine learning research, 2023.

[45] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[46] Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? In The Twelfth International Conference on Learning Representations, 2023.

[47] Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: A benchmark to assess causal reasoning capabilities of language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[48] Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, 2022.

[49] Cohere. Command r+ documentation, 2024.

[50] Ollama. Ollama, command-r-plus model description.

[51] HuggingFace. Fineweb: decanting the web for the finest text data at scale.

[52] Karl Raimund Popper. The logic of scientific discovery. 1934.

[53] OpenAI. Gpt-4o model docs, 2024.

[54] OpenAI. Announcement: Hello gpt-4o, 2024.

[55] OpenAI. Gpt-4 turbo and gpt-4 model docs.

[56] OpenAI. Introducing chatgpt, 11 2022.

[57] OpenAI. Models - openai gpt 3.5 turbo docs.

[58] OpenAI. Models - openai gpt 3.5 turbo update.

[59] Anthropic. Introducing the next generation of claude anthropic, 2024.

[60] Sundar Pichai and Demis Hassabis. Introducing gemini: Google’s most capable ai model yet, 2023.

[61] Google. Gemma: Introducing new state-of-the-art open models, 2024.

[62] Google. Gemma model card, 2024.

[63] Mistral-AI-Team. Mistral ai models api versioning, 2024.

[64] Mistral-AI-Team. Au large | mistral ai | frontier ai in your hands, 2024.

[65] Mistral-AI-Team. Announcement mixtral ai 8x22b, 2024.

[66] Mistral-AI-Team. Announcement mixtral ai 8x7b, 2024.

[67] Mistral-AI-Team. Announcement mistral ai 7b, 2023.

[68] Cohere. Model card for c4ai command r+, 2024.

[69] Meta. Meta and microsoft introduce the next generation of llama | meta, 2023.

[70] Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024.

[71] Meta. Meta llama 3 model card., 2024.

[72] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[73] Alibaba. Introducing qwen1.5, 2024.

[74] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.

[75] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.

A Additional overview for performed experiments

Here we gather additional information on the procedures used in the experiments. To give an overview of the origin of the core models tested in the AIW experiments, we list them in Tab. 1.

Table 1: Names, origin and versioning of core test models used in the experiments.

B Prompt types and variations

To test how model behavior depends on the input prompt, and how robust it is to problem variation when solving AIW and AIW+ tasks, we used three main prompt types: STANDARD (original prompt with answer formatting instructions), THINKING (prompt that encourages thinking, with answer formatting instructions) and RESTRICTED (prompt that instructs the model to output only the formatted answer and nothing else). To test the models’ robustness, we use different prompt variations (AIW Variations, see Sec. 2.1), i.e. we keep the same problem structure but vary the numbers and ordering of brothers and sisters, or the names of the persons involved.

We employ further prompt types to probe various aspects of model behavior. For instance, to facilitate thoughtful and accurate answering, we use the SCIENTIST prompt type. To evoke an expression of a confidence estimate for the provided solution, we employ the CONFIDENCE prompt type.

See Tab. 2 (for AIW problem) and Tab. 3 (for AIW+ problem) for more details and full prompt description.

C Model behavior on AIW and AIW+ problem

Here we report further details on model behavior as observed on AIW and AIW+ problem.

Table 2: AIW prompt types and prompt variations for AIW.

Table 3: AIW+ prompt types and prompt variations.

C.1 Parameterized AIW problem (AIW-Param)

We were interested to see whether models can cope with a more generic problem formulation that does not use explicit natural numbers. We thus created a parameterized version of the AIW problem that has the following form: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?". In this instance, the correct answer is M + 1.

We test several models on this task. Here we have to inspect model responses manually, because simple automated parsing fails, in contrast to the AIW problem with explicit natural numbers. We had to identify the responses that were parsed correctly and, for those that were not, provide manual input on whether the response was correct. We also inspected whether models used correct reasoning to arrive at the final answer, as for this problem type it happens more often that automated response parsing assigns a correct response to reasoning that is actually wrong. We provide additional metadata flags in the raw data to indicate whether a sample was manually inspected and whether the inspected reasoning was correct. When a response is correct but too short to judge, we mark reasoning correctness as "unknown". As with the standard AIW problem, we observe that models also perform poorly on this more generic version (see Fig. 14).
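The problem template and its closed-form answer can be written down in a few lines. The following is a minimal sketch and not the authors’ evaluation code: the function names `aiw_prompt` and `aiw_answer` are our own, and the prompt text omits the answer formatting instructions as well as singular/plural handling.

```python
def aiw_prompt(n_brothers, n_sisters) -> str:
    """Return the AIW question text.

    The counts may be integers (standard AIW) or symbols such as "N" and "M"
    (the parameterized AIW-Param version).
    """
    return (
        f"Alice has {n_brothers} brothers and she also has {n_sisters} sisters. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(n_sisters: int) -> int:
    """Each brother has all of Alice's sisters plus Alice herself as sisters."""
    return n_sisters + 1


print(aiw_prompt("N", "M"))
print(aiw_answer(4))  # 5
```

The number of brothers does not enter the answer at all, which is why the parameterized solution is simply M + 1.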

C.2 AIW+ problem

We formulate the AIW+ problem to offer a higher degree of difficulty than the original AIW, while having the same relational logic appeal. The AIW+ problem has the following form: "Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice’s father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice’s sister have?".

The solution to the AIW+ problem is harder to obtain than the solution to the common sense AIW problem with its very simple structure. Solving AIW+ requires considering both parental sides, that of the mother and that of the father, carefully calculating the number of cousins on each side while subtracting Alice and her sisters, and summing up the cousins from both sides. For instance, on the mother’s side: 7 (total nephews and nieces) - 4 (Alice and her sisters) = 3 cousins; on the father’s side: 5 (total nephews and nieces) - 4 (Alice and her sisters) + 1 (the uncle’s own son) = 2 cousins; in total 3 + 2 = 5 cousins, which Alice and any of her sisters have.

Table 4: Performance of tested model on MMLU and AIW problems. MMLU scores obtained from [74, 75, 22].

AIW+ can thus be considered more complex, because it involves more entities (Alice, her sisters, parents, uncles and aunts as well as her cousins), more types of relationships between them, including a hierarchy, and distractors (e.g. the 2 brothers of the mother’s sister) that have nothing to do with the correct solution to the problem.
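The AIW+ arithmetic can be spelled out step by step. The sketch below reflects our reading of the problem statement and assumes that maternal and paternal cousins do not overlap; the function and variable names are our own.

```python
def aiw_plus_cousins():
    """Return (maternal, paternal, total) cousin counts for the AIW+ problem."""
    # Quantities stated in the AIW+ problem text.
    alice_sisters = 3
    aunt_nephews_and_nieces = 7   # mother's childless sister
    uncle_nephews_and_nieces = 5  # father's brother
    uncle_sons = 1                # the uncle's own child is a cousin, not his nephew
    # The 2 brothers of the mother's sister are a distractor and are not needed.

    # Alice and her sisters are counted among the nephews and nieces on both sides.
    alice_and_sisters = alice_sisters + 1

    maternal = aunt_nephews_and_nieces - alice_and_sisters
    paternal = uncle_nephews_and_nieces - alice_and_sisters + uncle_sons
    return maternal, paternal, maternal + paternal


print(aiw_plus_cousins())  # (3, 2, 5)
```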

C.3 Further details on model performance and behavior

To understand whether performance on AIW and AIW+ correlates with standardized benchmarks, we compare correct response rates of the tested models with their performance on the Massive Multitask Language Understanding (MMLU) benchmark (Table 4). We see that standardized benchmarks are strongly decoupled from AIW and AIW+ performance and cannot predict it: high performance on MMLU can go together with very low performance on AIW or AIW+ (as in the case of Command R+ and many other models), while lower performance on MMLU may be coupled with better AIW performance than the majority of tested models, as in the case of Llama 2 70B Chat. The ranking given by standardized benchmarks therefore cannot be relied on to reflect the relative reasoning capabilities of the models.

For the majority of the tested models, correct response rates for both AIW (Fig. 8) and AIW+ (Fig. 15) are extremely low. At the same time, many of those models score high on standardized benchmarks and are thus considered strongly capable. Some of those models cannot solve the AIW task even a single time. We make sure that the number of samples collected for each model is of approximately the same order of magnitude, avoiding distortions due to differences in sampling rates (Fig. 12). Different models have quite different output lengths (see Fig. 13 and Fig. 17) within the same prompt type. For example, the RESTRICTED prompt obviously forces models to output fewer tokens than STANDARD or THINKING, which encourage longer outputs. Models like GPT-4-0125-preview or Command R Plus have rather short responses independent of prompt type compared to other models.

Table 5: Performance of tested models on MMLU, Hellaswag, ARC-c, GSM8k and AIW problems.

We also observe strong fluctuations of model behavior across AIW variations (as shown in Fig. 11). This reveals that models that perform better on average, like GPT-4-0613, are not robust to variations introduced into the AIW problem while keeping the very same problem structure. The case of GPT-4-0613 is especially severe, as the model collapses from a correct response rate of 1 to 0 between AIW variations 4 and 3 (see Suppl. Tab. 2 for the exact formulation of AIW variations). GPT-4o, GPT-4-Turbo, Claude 3 Opus and Llama 2 7B and 70B also show strong discrepancies. Such behavior hints again at a deficiency in basic reasoning capabilities that results in the observed lack of robustness: AIW variations introduce slight changes to the numbers and ordering of entities in the very same problem, which should not be a challenge to handle properly if core reasoning capabilities were intact.

We can also see (Fig. 9 and Fig. 10) strong fluctuations when comparing the STANDARD and THINKING prompt types with RESTRICTED. It is expected that performance might drop when models are forced to invest less compute into output generation, that is, when switching from STANDARD or THINKING to RESTRICTED (Fig. 9, strong drop observed for Claude 3 Opus when switching from STANDARD to RESTRICTED). However, we also see drops in the other, unexpected direction, for instance for Claude 3 Opus, GPT-4o, Llama 2 70B or Dbrx Instruct, when switching from RESTRICTED to STANDARD or THINKING (Fig. 10). Such drops may hint that available compute is not properly used for correct reasoning, and they indicate a lack of robustness: a robust model that already performs well with a smaller amount of compute on the RESTRICTED prompt type should only improve when switching to STANDARD or THINKING, where more compute is available for producing output. This also raises the question of whether, in those cases where short responses on the RESTRICTED prompt type contain correct final answers, we actually observe results of proper reasoning. A breakdown on the STANDARD prompt type for the very same problem formulation may indicate that the numbers obtained on the RESTRICTED prompt type are merely a consequence of a given AIW variation inducing a strong bias towards generating a particular number, which does not necessarily have any connection to the actual problem solution and is not a result of reasoning.

To make sure that our results correspond to true model performance and are not distorted by wrong automatic response parsing, for AIW+ we manually inspected all model responses and counted only the responses that have both a correct answer and correct reasoning. If reasoning quality is not clear, for instance when the answer is correct but no explanation is provided (this is often the case for short answers), we set a metadata flag in the raw data ("correct_reasoning") to "unknown" during manual inspection.
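The strict counting rule described above can be sketched as follows. This is an illustration under assumptions, not the authors’ analysis code: only the flag name "correct_reasoning" and its value "unknown" come from the text, while the field `answer_correct`, the values "yes" and "no", and the toy records are hypothetical.

```python
def strict_correct_rate(samples) -> float:
    """Share of samples with a correct answer AND reasoning judged correct.

    Samples whose reasoning is marked "unknown" are not counted as correct
    (our reading of the counting rule).
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if s["answer_correct"] and s["correct_reasoning"] == "yes"
    )
    return hits / len(samples)


# Toy records, made up for illustration only (not taken from the AIW raw data).
toy_samples = [
    {"answer_correct": True, "correct_reasoning": "yes"},
    {"answer_correct": True, "correct_reasoning": "unknown"},  # short answer, no explanation
    {"answer_correct": True, "correct_reasoning": "no"},       # right number, wrong reasoning
    {"answer_correct": False, "correct_reasoning": "no"},
]
print(strict_correct_rate(toy_samples))  # 0.25
```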

We observe (see Fig. 15) an even stronger collapse across all tested models on AIW+ compared to AIW (see Fig. 8). All models show poor performance well below p = 0.1. Remarkably, GPT-4o, which showed the best performance on AIW (p > 0.6), collapses dramatically on AIW+ to close to 0. AIW+ is clearly harder than AIW, which was made simple by intention. However, models claiming strong reasoning should be able to solve it, as it does not involve any higher level logic or math. Our evidence suggests that this is not the case for any current SOTA LLM.

Standardized benchmarks failure. In Section 3.1, we observed the failure of standardized reasoning benchmarks to measure reasoning skills of SOTA LLMs by noting a significant disparity between model performance on the AIW problem and the outcomes on conventional standardized benchmarks, taking MMLU as a representative example. Here, we confirm this finding on further standardized reasoning benchmarks like MATH, ARC-c, GSM8K and Hellaswag (Suppl. Tab. 5). In Figures 4, 5, 6 and 7 we provide plots visualizing the failure of these standardized benchmarks, reflected in the strong mismatch between the high benchmark scores reported by many models and the low correct response rates they obtain on AIW (which in some cases is 0 for models with high standardized benchmark scores).

D Examples of correct and failed responses

E Base model experiments

On top of our main experiments with instruction tuned models, we also evaluated selected base models on AIW to see whether there is any striking difference between instruction and base models in their ability to solve the AIW problem. For these experiments we considered the currently available base versions of several models we have already tested: Mixtral 8x7b, Mixtral 8x22b, Mistral 7b, LLaMA 2 70b, LLaMA 2 13b, LLaMA 2 7b. We used the following prompt for base models: "### Problem: ... ### Answer:". In line with what we observe on instruct models, base models also perform poorly on the AIW problem. We observe that Mistral 7b base shows a higher average correct response rate than the other tested base models (see Fig. 22), while our observations for the Mistral 7B instruction model do not show any difference to other similarly poor performing models unable to deal with AIW. We do not see any remarkable differences for the base model case compared to our main observations made with instruction models.

F Lack of certainty calibration and overconfidence in wrong answers

For a strong model, one important characteristic is its uncertainty calibration: how well the model’s estimate of certainty about the correctness of the provided response matches the true quality of the response, being correct or incorrect. A well calibrated model should assign high certainty of being correct where the corresponding responses are indeed correct, and should assign high uncertainty to those responses that the model might signal as correct while the provided response is actually incorrect.

Figure 4: Discrepancy between the AIW correct response rate and the MATH average score, indicating the limitation of standardized benchmark MATH in accurately assessing and comparing basic reasoning capabilities of models. Numerous models, such as Command R+, exhibit a stark contrast in performance, scoring zero on AIW while achieving high scores on MATH.

In an ideal scenario, if an LLM cannot correctly solve the AIW problem, it should at least be capable of signalling high uncertainty about the provided incorrect solution. We used our CONFIDENCE prompt type (see Table 2 for the AIW problem) to see how confident tested models are in their wrong solutions. From our experiments we can see that LLMs most of the time express high certainty even if their answers are completely wrong, and thus show strong overconfidence (see Fig. 28). The models also use a highly persuasive tone to argue for the expressed certainty and correctness of the provided wrong solutions, using words like "highly confident", "definitive answer", or "accurate and unambiguous". We also see strong overconfidence expressed in multi-turn interactions, where the user insists that the provided solution is incorrect; there we observe high resistance of models to revising their decisions, which was already referred to as "stubbornness" in other works [20] (see Suppl. Sec. H and also data provided in the AIW repo).

G Confabulations as back up for wrong answers

In our experiments, the tested models that show strong reasoning collapse and produce wrong answers for both AIW and AIW+ problems frequently also generate persuasive sounding pseudo-explanations to back up their incorrect answers. We term such pseudo-explanations confabulations, and present a selection of them as examples.

Such confabulations can contain mathematical calculations or other logic-like expressions and operations that make little or absolutely no sense given the problem to be solved, see examples for Olmo-7B, Fig. 29 and Command R+, Fig. 31.

Further confabulations make use of various social and cultural norm specific context to argue that the posed problem is inappropriate to solve, or to provide nonsensical arguments for various incorrect answers. We have observed many such examples and present only a small selection here.

Figure 5: Limitation of the standardized benchmark GSM8k in accurately reflecting and comparing basic reasoning capabilities of models, as illustrated by the stark discrepancy between the AIW correct response rate and the GSM8k average score. Notably, the majority of tested models exhibit low performance on AIW problems while achieving relatively high scores on GSM8k, a grade-school math benchmark for large language models. Among models with slightly better calibration are Claude Opus and GPT 4, which outperform other models on AIW, coinciding with their high GSM8k scores. Llama 2 70b also shows better calibration: its modest AIW performance matches its modest GSM8k score. In contrast, models like Mistral Large, Gemini Pro, Dbrx Instruct, or Command R+, while scoring high on GSM8k, show breakdown on AIW (Command R+ has a correct response rate of 0, Mistral Large and Gemini Pro 0.01, Dbrx Instruct 0.11, see also Suppl. Tab. 4).

CodeLlama-70B-instruct, for instance, seems to be especially prone to claiming ethical or moral reasons for not addressing the problem correctly. In the presented example it invents out of nowhere a person with Down syndrome and then points out that the question has to be modified before it can be addressed, due to potential perpetuation of harm towards individuals or groups, which has nothing to do with the original task (Fig. 30).

Another example are the confabulations provided by Command R Plus. In the wrong reasoning that leads to incorrect answers, these confabulations use concepts of gender identity such as non-binary gender, or concepts related to inclusion or to culturally dependent family identification. In its attempt to solve the problem, the model first fails to provide the obvious common sense solution and then goes on to describe potential scenarios where brothers and sisters may self-identify as non-binary, although information on brothers and sisters in the problem, read with common sense, usually means that those persons self-identify in line with their stated status as brother or sister (while Alice is clearly identified via the "she" pronoun). The model thus clearly fails to grasp that the problem structure has nothing to do with social and cultural norms. The solutions the model derives from these factors, which go far beyond Occam’s razor and the common sense inherent to the simple AIW problem, all lead to wrong answers and generate more confusion, while again keeping a persuasive tone that suggests the model is on the right path to the correct solution (Fig. 32).

Figure 6: Limitation of the standardized benchmark Hellaswag in accurately assessing and comparing basic reasoning capabilities of models, as evidenced by the significant discrepancy between the AIW correct response rate and the Hellaswag average score.

Figure 7: Failure of standardized benchmark ARC-c to properly reflect and compare model basic reasoning capabilities as shown by strong discrepancy between AIW correct response rate vs ARC-c average score.

Figure 8: Collapse of most SOTA LLMs on the AIW problem. AIW correct response rate, averaged across AIW variations with prompt types THINKING and STANDARD. Only 5 models manage to show rates above p = 0.2: GPT-4o, Claude 3 Opus, GPT-4-0613, Llama 2 70B Chat and GPT-4-0125-preview (GPT4-Turbo). Llama 2 70B Chat is the only open-weights model in this set. The rest either show poor performance below p = 0.15, or even collapse entirely to 0. Among those models collapsing to 0 are many which are claimed to be strong, e.g. larger scale GPT-3.5, Mixtral 8x7B and 8x22B, Command R Plus, Qwen 1.5 72B Chat and smaller scale Gemma-7b-it, Mistral Small and Mistral Medium.

For more illustrative examples, see the raw data on interactions with the models collected in the AIW repo.

H Inability to revise wrong solutions

We look into the ability of the models to verify and revise their solution in two ways.

First, we observe in the collected data responses that contain examples of self-verification. These can follow from the THINKING prompt, which encourages the model to double-check the solution; from customized prompts that request different solutions and a check of which one is preferable; or they can appear entirely unprompted. An example of a customized prompt that encourages producing and evaluating various solutions is "Look at the problem step by step and formulate 3 different solutions that come to different results. Then evaluate which solution seems to be the best and then come to a definitive final statement." (see also Fig. 31). In all those cases, we see only poor ability of the models to provide proper self-checks. In the examples we observed, self-verification produces longer narration, but does not lead to successful revision of wrong answers.

Second, we looked into multi-turn interactions between the user and the model, where it might arguably be easier for the model to check whether a solution is right or wrong by looking at the full previous history of interaction and using the user’s feedback. In such interactions, the model is prompted with the AIW problem and, after it provides an initial solution, the user asks it to revise the solution in case it is wrong. In the majority of the observed interactions, we see that while models eagerly agree to revise the solutions and proceed to check them for possible mistakes, they usually fail to properly detect mistakes and to revise wrong solutions. Here, too, we see strong overconfidence expressed by the models: they present wrong answers as correct in a persuasive tone and produce reassuring messages to the user about the high quality and certainty of their wrong answers. Models also show high resistance to changing the provided answer; while agreeing to revise it, they ultimately stick to the same answer that was initially provided. Some models show "stubbornness" [20] in the sense that, while proceeding with an attempt to find possible mistakes, they insist that the provided solution is actually correct (for instance in examples we saw from interaction with Command R+).

Figure 9: Strong fluctuations of AIW correct response rate, STANDARD vs RESTRICTED prompt type (here on the example of AIW Variation 2). Claude 3 Opus, while performing well on STANDARD, drops strongly on RESTRICTED, where output is restricted to be short. This might be expected, as limiting compute for producing the solution can affect function quality and deteriorate the result.

In very rare examples, we see revisions of previously wrong answers after the user insists repeatedly that mistakes exist and need to be corrected (e.g. observed in LLaMA 3 70b, see Fig. 33).

For collected multi-turn conversations, see AIW repo.

I Reformulation of AIW problem as relational SQL database problem

Due to its simple relational structure, the AIW problem can be represented as a relational database problem. By formulating the problem as a relational database, one can solve it by running SQL queries. If a language model is capable of correctly reformulating the AIW problem into a relational SQL problem and generating SQL queries that give the right answer, this hints that the model possesses some form of explicit understanding of the problem structure. For example, in Fig. 34, we can see that Mixtral 8x22B instruct v0.1 is able to correctly generate SQL queries for table creation, population and solution of the problem. However, the language model still outputs the wrong answer (4 instead of 5, when confronted with the task to reformulate AIW Variation 3 into SQL).
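To make the relational reformulation concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. It is our own illustration and not the SQL generated by Mixtral in Fig. 34: the table name `children`, its columns and the function `solve_aiw_with_sql` are assumptions, and the instance (4 sisters, 1 brother, answer 5) is taken from the example prompt shown with the base model figures.

```python
import sqlite3


def solve_aiw_with_sql(n_brothers: int, n_sisters: int) -> int:
    """Count the sisters of one of Alice's brothers with an in-memory SQLite table."""
    if n_brothers < 1:
        raise ValueError("the question needs at least one brother")

    conn = sqlite3.connect(":memory:")
    try:
        # One row per child in Alice's family (hypothetical schema).
        conn.execute(
            "CREATE TABLE children (name TEXT PRIMARY KEY, gender TEXT NOT NULL)"
        )
        rows = [("Alice", "F")]
        rows += [(f"sister_{i}", "F") for i in range(1, n_sisters + 1)]
        rows += [(f"brother_{i}", "M") for i in range(1, n_brothers + 1)]
        conn.executemany("INSERT INTO children (name, gender) VALUES (?, ?)", rows)

        # Sisters of a brother: every female child other than the brother himself.
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM children WHERE gender = 'F' AND name <> ?",
            ("brother_1",),
        ).fetchone()
        return count
    finally:
        conn.close()


print(solve_aiw_with_sql(n_brothers=1, n_sisters=4))  # 5
```

Because Alice is stored as a row like any other female child, the query counts her automatically, which is exactly the step that wrong natural language answers tend to miss.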

Moreover, if we provide those generated queries back on the model’s input (importantly, excluding the text description the model generated alongside the SQL query, so that only the SQL query is provided on the input) and ask the model what the result of running the generated pure SQL query would be, the model is able to provide the correct final answer to the AIW problem (5 in that particular example), and does so consistently with high chance. At the same time, if we provide on the input the full model response with both the generated SQL queries and natural language text, Mixtral often outputs the wrong answer. This means that the model has some understanding of both the AIW problem and SQL, but for some reason it is not able to connect everything together. We hypothesize that this might be because the model attends mainly to the natural text description of the problem rather than to the pure SQL queries while generating the final answer.

Figure 10: Strong fluctuations of AIW correct response rate, STANDARD vs RESTRICTED prompt type (here on the example of AIW Variation 3; plot title: correct response rate 63 vs 65). Claude 3 Opus, GPT-4o, Llama 2 70B and Dbrx Instruct have good performance on RESTRICTED, strongly dropping on STANDARD. Additional compute that can be used to produce longer output in STANDARD does not bring benefit to the model function here, which hints that the reasoning performed on RESTRICTED might also be broken and the correct final answer is just an accident.

In conclusion, we see that some models possess the ability to capture the problem’s structure, as evidenced by their ability to reformulate it as an explicit formal SQL query that correctly reflects the formal relational problem structure. However, this ability is not predictive of the model’s performance on solving the AIW task correctly. We see models that have much better performance than Mistral on AIW, e.g. GPT-4o, failing at the SQL reformulation task. Another observation is that smaller scale models, e.g. Mistral-7B, also fail on the SQL reformulation task, hinting that they cannot cope with discovering the structure of the AIW problem.

J In-context learning experiments

As a sanity test, we perform a few experiments with in-context learning (ICL) using base models. As the AIW problem has a simple shortcut solution of the form M + 1, where M is the number of sisters, it is expected that ICL, if a few examples of the AIW problem are presented in the input, will find and use this shortcut to solve new examples. This is also what we observe: models are able to easily provide the correct answer after being exposed to a few examples of solved AIW problem instances.

To confirm that the solution obtained by ICL has no strong reasoning behind it and uses the shortcut, we alter the query AIW problem that follows the presented AIW examples such that M + 1 is not a valid solution anymore (e.g. by asking for the number of brothers of Alice’s sister, which is just equal to the number of brothers given in the problem description). We observe the models sticking to the shortcut M + 1, which hints that no proper reasoning was instantiated by ICL (Fig. 35).

Figure 11: Strong fluctuations of AIW correct response rate on AIW variations (here on the example of AIW Variation 3 vs. 4, STANDARD prompt type; plot title: correct response rate 63 vs 69). GPT-4-0613 collapses from correct response rate 1 to 0 between variations. GPT-4o, GPT-4-Turbo, Claude 3 Opus and Llama 2 7B and 70B also show strong discrepancies. Models for which a particular color is entirely omitted have zero performance on the AIW variation with the corresponding color (with the exception of GPT-4-0613 at the very top, which has a correct response rate of 1 on AIW Variation 4, prompt ID 69, and thus also has vanishing color bars for both variations).

We also present the AIW-param problem (see Suppl. Sec. C.1), featuring variables N, M for the numbers of brothers and sisters, as the query following AIW examples with explicit natural numbers, to see whether models can come up with the generalized solution M + 1 as a response. We observe frequent failure of the models to do so. While on some occasions (as observed for instance for Llama 3 70B) the correct response M + 1 is generated, on other frequent occasions either incorrect responses containing expressions with the variables N, M are produced (e.g. N + M, M), or there are incorrect responses featuring explicit natural numbers. We thus do not see hints that ICL helps to instantiate better reasoning from a few examples of solved AIW problems.
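The shortcut test can be illustrated with a small sketch. It builds a few-shot prompt from solved AIW instances followed by an altered query, and contrasts the shortcut guess M + 1 with the correct answer. The sibling counts, the helper names, and the reuse of the "### Problem: ... ### Answer:" format for in-context examples are our assumptions, not details confirmed by the experiments.

```python
ORIGINAL_Q = "How many sisters does Alice's brother have?"
ALTERED_Q = "How many brothers does Alice's sister have?"


def sisters_of_brother(n_brothers: int, n_sisters: int) -> int:
    """Original AIW question: Alice's sisters plus Alice herself."""
    return n_sisters + 1


def brothers_of_sister(n_brothers: int, n_sisters: int) -> int:
    """Altered query: Alice's sister has exactly the brothers Alice has."""
    return n_brothers


def problem_text(n_brothers: int, n_sisters: int, question: str) -> str:
    return (
        f"Alice has {n_sisters} sisters and she also has {n_brothers} brothers. "
        f"{question}"
    )


def few_shot_prompt(solved, query: str) -> str:
    """solved: list of (problem, answer) pairs; query: the unsolved problem text."""
    shots = [f"### Problem: {p} ### Answer: {a}" for p, a in solved]
    return "\n".join(shots + [f"### Problem: {query} ### Answer:"])


# Illustrative (brothers, sisters) counts chosen by us.
solved = [
    (problem_text(b, s, ORIGINAL_Q), sisters_of_brother(b, s))
    for b, s in [(2, 3), (5, 2), (3, 6)]
]
prompt = few_shot_prompt(solved, problem_text(4, 2, ALTERED_Q))

shortcut_guess = 2 + 1              # M + 1, as suggested by the in-context examples
correct = brothers_of_sister(4, 2)  # number of brothers given in the problem
print(prompt)
print(shortcut_guess, correct)      # 3 4
```

A model that answers 3 on the final query is applying the M + 1 pattern from the examples rather than reasoning about the altered question.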

Author contributions

• Marianna Nezhurina: discovered the original problem formulation and performed first experiments observing collapse across different models. Created further problem variations including the hard AIW+. Collected and analyzed data. Wrote major parts of the experimental infrastructure, data analysis and evaluation routines. Co-wrote the manuscript.

• Lucia Cipolina-Kun: performed experiments, collected data and provided further input for the studies. Co-wrote the manuscript.

• Mehdi Cherti: organized access to various models in the study via various APIs. Wrote code for parts of experimental infrastructure. Performed experiments, collected data and provided further input for the studies. Co-wrote the manuscript.

• Jenia Jitsev: led the project. Created further problem variations. Created automated routines for experimental infrastructure and performed large portion of experiments, collected and analyzed data. Wrote the manuscript.

Figure 12: AIW: average number of responses per model for each prompt type (4 AIW variations per prompt type). Models with fewer than 100 responses per prompt type are excluded from further analysis. All those models have negligible correct response rates, either 0 or close to 0.

Figure 13: Average length (in characters) of responses per model for each AIW prompt variation. We see that phi-2 has the highest average response length (probably because it is not a classical instruction tuned model, but a base model capable of following instructions).

Figure 14: AIW-param correct response rate for all tested models. This problem focuses on revealing general understanding of the problem (it does not have specific numbers). The largest SOTA LMs like GPT-4 or Claude 3 Opus have better correct response rates (older GPT-4 versions show the highest rates here, while GPT-4o drops strongly below p = 0.05; Claude 3 Opus drops as well, below p = 0.1), and their gap to other models, which perform significantly worse with rates close to 0, is large. This indicates a capability of these models to handle a general version of the AIW problem and hints at more robust reasoning behind the solution than in the rest of the tested models. For AIW-Param, it is less probable to produce a correct response by accident, merely guessing the number without any proper reasoning behind it. The strong drop of other models might hint that in AIW problem variations that feature natural numbers, those models do not rely on robust reasoning, and their performance might depend strongly on the specific AIW variation. We observe this for instance for Llama 2 70B, which shows strong performance deterioration here, falling almost to 0, while also showing strong performance fluctuation depending on AIW variation (see App. Fig. 11).

Figure 15: AIW+ correct response rates for the tested models. Compared to AIW, there is a further strong collapse across all models, which show poor performance well below p = 0.1. Remarkably, GPT-4o, which showed the best performance on AIW (p > 0.6), collapses dramatically on AIW+ to close to 0. AIW+ is clearly harder than AIW, which was made simple by intention. However, models claiming strong reasoning should be able to solve it, as it does not involve any higher level logic or math. This is not the case for any current SOTA LLM.

Figure 16: AIW+: average number of responses per model for each prompt variation.

Figure 17: AIW+: average length (in characters) of responses per model for each prompt variation.

Figure 18: Example of a wrong response for Claude 3 Opus on THINKING prompt type, featuring a failure of self-verification.

Figure 19: Example of a wrong response for Llama 3 70B Instruct. The model counts Alice as her own sister and therefore produces a wrong answer.

Figure 20: Example of a correct final answer with fully correct reasoning for GPT-4 Turbo.

Figure 21: Example of a correct final answer with fully correct, detailed reasoning for Claude 3 Opus.

Figure 22: Average correct response rate on AIW variations for base models. In line with our main results obtained on instruct models, base models also perform poorly, with Mistral 7b v0.1 having a higher correct response rate than the other investigated models.

Figure 23: Example of a response for the LLaMA 7b base model. In this case, the model produces a wrong answer, then generates its own unrelated quiz-like problem and keeps generating the same output over and over again.

### Problem: Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? ### Answer:

Figure 24: Example of a response for the LLaMA 13b base model, with complete failure to cope with the task.

Figure 25: Example of a wrong response for the LLaMA 70b base model. After generating a wrong response, the model goes on to generate its own quiz-like problems.

### Problem: Alice has 2 sisters and she also has 4 brothers. How many sisters does Alice's brother have? ### Answer:

Figure 26: Example of a response for the Mixtral 8x7b base model. In this example, the Mixtral 8x7b base model answers the question correctly. It then continues to produce further unrelated question-answer pairs, which is expected behavior for a base model.

### Problem: Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? ### Answer:

### Problem: Alice has 2 sisters and she also has 4 brothers. How many sisters does Alice's brother have? ### Answer: 2
### Problem: Alice has 2 sisters and she also has 4 brothers. How many brothers does Alice's brother have? ### Answer: 3
### Problem: Alice has 2 sisters and she also has 4 brothers. How many brothers does Alice's sister have? ### Answer: 4
### Problem: Alice has 2 sisters and she also has 4 brothers. How many sisters does Alice's sister have? ### Answer: 1
### Problem: Alice has 2 sisters and she also has 4 brothers. How many siblings does Alice's sister have? ### Answer: 5
### Problem: Alice has 2 sisters and she also has 4 brothers. How many siblings does Alice's brother have? ### Answer: 5
### Problem: Alice has 2 sisters and she also has 4 brothers. How many siblings does Alice have? ### Answer: 6
....

Figure 27: Example of a wrong response for the Mixtral 8x22b base model. The model goes on to generate its own AIW problems, for which it also delivers wrong answers.

Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? Solve the problem taking care about not doing mistakes.

This problem can be solved accurately and unambiguously. Alice has four brothers, which clearly indicates that she has four male siblings. Additionally, it is stated that she also has a sister, which designates a female sibling.

Therefore, Alice's brothers have one sister. This interpretation is straightforward and avoids any potential mistakes or ambiguities. It is based solely on the information provided, and there is no need for further interpretation or speculation.

Figure 28: Overconfidence: example of a wrong response for Command R Plus, containing highly persuasive wording to back up its confidence in its incorrect solution.

Alice has X brothers and she also has Y sisters. How many sisters does Alice's brother have? Solve this problem and provide the final answer in following form: "### Answer: "

The number of Alice's brothers is X - number of Alice's sisters. ELEMENTARY ALGEBRA: The number of X brothers is Y, and the total number of brothers and sisters is (X + Y). now, we can represent X's brothers by x and Y's sisters by y, so (X + Y) = x + y. Since the total number of brothers and sisters is X + Y,X = (X + Y) - Y. Now, divide both sides by (X + Y) to findX: X = (X + Y) - Y = (x + y) - y - In other words:X = x -- y. Subtract y from both sides: X = x - y.Now, scale the -y to the left side by adding 5 to both sides: X = ((x + 5) - y). So, the number of Alice's brothers is either x, x + 1, x + 2, x + 3, or x + 4.\\### Answer: x - 4

Figure 29: Confabulation: example of a wrong response for OLMo-7b-Instruct on the AIW-Param problem. The model generates text that contains various algebraic manipulations involving X and Y and has little to do with reasoning.
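For contrast with the confabulated algebra above, the intended general solution of AIW-Param is short: each of Alice's brothers has Alice's Y sisters plus Alice herself, so Y + 1 sisters, independent of X. The sketch below checks this closed form by explicitly enumerating the family. It assumes, as the problem intends, that Alice is female and that all siblings share the same parents; the function and member names are hypothetical.

```python
def sisters_of_brother(num_brothers: int, num_sisters: int) -> int:
    """Count the sisters of one of Alice's brothers by enumerating the family."""
    if num_brothers < 1:
        raise ValueError("the question requires Alice to have at least one brother")
    # Assumption: Alice is female and all children share the same parents.
    family = [("Alice", "F")]
    family += [(f"sister_{i}", "F") for i in range(num_sisters)]
    family += [(f"brother_{i}", "M") for i in range(num_brothers)]
    subject = "brother_0"
    # Sisters of the subject: every female family member other than the subject.
    return sum(1 for name, gender in family if gender == "F" and name != subject)


# The closed form Y + 1 holds for every tested (X, Y); X does not matter.
for x in range(1, 6):
    for y in range(0, 6):
        assert sisters_of_brother(x, y) == y + 1

print(sisters_of_brother(num_brothers=3, num_sisters=6))
```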

Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? Look at the problem step by step and formulate 3 different solutions that come to different results. Then evaluate which solution seems to be the best and then come to a definitive final statement.

Figure 31: Example of a confabulation response for Command R Plus. We prompted Command R+ to solve the problem step by step, come up with 3 different solutions, and then provide the final answer by evaluating these solutions. The model produces 3 solutions by taking 3 different angles on the problem, providing an explanation to back up each one. All 3 solutions are wrong. In the subsequent evaluation, the model labels one of the solutions as the best fit and proclaims it the correct answer, arguing in addition that it properly respects the relationships in the problem. The final answer is incorrect.

Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? Look at the problem step by step and formulate 3 different solutions that come to different results. Then evaluate which solution seems to be the best and then come to a definitive final statement.

Figure 32: Example of a confabulation response for Command R Plus. We prompted Command R+ to solve the problem step by step, come up with 3 different solutions, and then provide the final answer by evaluating the quality of these solutions. All 3 solutions produced by the model are wrong. In the third solution, the model attempts to back up its answer by elaborating on possible gender identities of members of Alice’s family, which has no proper connection to the posed problem.

Alice has 3 brothers and she also has 6 sisters. How many sisters does Alice's brother have? Solve this problem and provide the final answer in following form: "### Answer: "

Figure 33: Example of a correct revision by LLaMA 3 70b. The model first provides 3 incorrect answers, while the user repeatedly insists that the solution is incorrect and urges it to find the mistakes.

Build SQLite queries for creation and population of the table that represents: "Alice has 4 sisters and she also has a brother. How many sisters in total does Alice's brother have?". Given the table schema, query the table. Emulate the previous query execution only by looking at SQL and provide the result of execution. (Format every SQL query as ```sql {the query}```

Assuming that the brother is not counted as a sister, the result of this query will be `4` because Alice's brother has the same 4 sisters as Alice.

Figure 34: Example of SQL reformulation of the AIW problem and correct solution for Mixtral 8x22B instruct v0.1. The SQL queries are correct and would produce the right answer if executed. However, the final answer generated by the model is still incorrect. We hypothesize that the model pays more attention to the natural-language text from the user (the prompt) and to its own generated text than to the SQL queries.
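To illustrate what executing such a reformulation yields, the following sketch builds a small in-memory SQLite table for "Alice has 4 sisters and she also has a brother" and counts the brother's sisters. It does not reproduce the queries generated by the model: the table name, column names, and member names are assumptions made for this illustration.

```python
import sqlite3

# Hypothetical schema: one row per child in Alice's family.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE siblings (name TEXT PRIMARY KEY, gender TEXT NOT NULL)")

rows = [
    ("Alice", "F"),
    ("Sister1", "F"),
    ("Sister2", "F"),
    ("Sister3", "F"),
    ("Sister4", "F"),
    ("Brother", "M"),
]
conn.executemany("INSERT INTO siblings (name, gender) VALUES (?, ?)", rows)

# Sisters of the brother: all female rows other than the brother himself.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM siblings WHERE gender = 'F' AND name != 'Brother'"
).fetchone()
conn.close()

print(count)  # 5: Alice's 4 sisters plus Alice herself
```

Actual execution counts Alice among the brother's sisters and returns 5, whereas the emulated result of `4` quoted above leaves her out.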

Figure 35: Example of an in-context learning response for the LLaMA 3 70B base model. The model produces the wrong answer. When the query example switches to a question about the brothers of Alice’s sister, the shortcut solution that works for the preceding examples, namely number of sisters + 1, no longer works. The incorrect response provided by the model hints that it might rely on the discovered shortcut and does not reason properly about the problem.
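The failure of the shortcut can be made concrete with a small sketch that compares "number of sisters + 1" against a ground truth obtained by enumerating the family. The numbers used (2 sisters, 4 brothers) and all function names are hypothetical and are not taken from the in-context examples in the figure; the sketch assumes Alice is female and all siblings share the same parents.

```python
def shortcut(num_brothers: int, num_sisters: int) -> int:
    """Shortcut that fits the in-context examples: number of sisters + 1."""
    return num_sisters + 1


def count_relatives(num_brothers: int, num_sisters: int,
                    subject: str, target_gender: str) -> int:
    """Ground truth: siblings of `target_gender` ('F' or 'M') that `subject` has.

    `subject` is 'brother' (one of Alice's brothers) or 'sister' (one of her sisters).
    """
    family = [("Alice", "F")]
    family += [(f"sister_{i}", "F") for i in range(num_sisters)]
    family += [(f"brother_{i}", "M") for i in range(num_brothers)]
    subject_name = f"{subject}_0"
    if subject_name not in {name for name, _ in family}:
        raise ValueError(f"Alice has no {subject} in this family")
    return sum(1 for name, gender in family
               if gender == target_gender and name != subject_name)


# Hypothetical instance: Alice has 4 brothers and 2 sisters.
brothers, sisters = 4, 2

# Question type of the in-context examples: sisters of Alice's brother.
assert count_relatives(brothers, sisters, "brother", "F") == shortcut(brothers, sisters)

# Switched query: brothers of Alice's sister. The shortcut no longer matches.
truth = count_relatives(brothers, sisters, "sister", "M")
guess = shortcut(brothers, sisters)
print(truth, guess)  # 4 3
```

For the switched query the correct answer is simply the number of brothers, so any response produced by reusing the shortcut is wrong whenever the number of brothers differs from the number of sisters plus one.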
