36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2406.02061","publisher":"arxiv","paperJSON":{"title":"Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models","paperID":"2406.02061","avgLineHeight":10.93,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"},{"text":"1","element":"span"},{"text":".","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"In the recent breakthroughs in transferable learning that were achieved in various classical domains of machine learning like visual recognition [","element":"span"},{"href":"#id-0","referenceIndex":1,"text":"1","element":"a"},{"text":"] or language understanding [","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"4","element":"a"},{"text":"], large language models (LLMs) have played a very prominent role. The generic form and scalability of autoregressive language modelling [","element":"span"},{"href":"#id-3","text":"4","element":"a"},{"text":"] allowed to push towards training scales not achievable before with conventional supervised label-based learning. 
Scaling laws derived from experiments on smaller scales","element":"span"}],[{"id":"id-15","style":{"width":"99%"},"width":1584,"height":491,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/1-0.png","element":"img"}],[{"text":"Figure 1: The AIW problem is a simple common sense math problem with a short, concise formulation, serving as a minimalist setting to test model generalization. The problem template is used to create AIW variations 1-4 (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"left","element":"figcaption","subtype":"caption"},{"text":") by instantiating different numbers ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"N, M ","element":"figcaption","subtype":"caption"},{"text":"of brothers and sisters, keeping problem structure and difficulty unchanged. We measure sensitivity of models to those variations, observing strong performance fluctuations. Shown here for GPT-4 (gpt-4-0613) tested with prompt types STANDARD, THINKING and RESTRICTED, with one color per variation 1-4 and 60 trials per variation. Correct response rate varies strongly depending on the variation. E.g., it drops close to 0 for variation 3, while going up to 1 for variation 4. This observation is consistent for different prompt types - STANDARD, THINKING and RESTRICTED (from left to right), showing that observed fluctuations are not due to prompt variation. Strong performance fluctuations on natural, structure preserving variations of such a simple problem point to a severe lack of robustness and to generalization deficits. Numbers in the legend are prompt IDs (Suppl. Tab. 
","element":"figcaption","subtype":"caption"},{"href":"#id-4","text":"2)","element":"a","subtype":"caption"}],[{"text":"hinted at strong function and generalization capability appearing at larger scales [","element":"span"},{"href":"#id-3","text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"6","element":"a"},{"text":"], which was then confirmed by training models at large scales and measuring their performance on a set of standardized benchmarks (MMLU, HellaSwag, ARC, MATH, GSM8k, etc.) where they scored high on few- and zero-shot transfer across various tasks [","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"7","element":"a"},{"text":"], following accurately the predictions [","element":"span"},{"href":"#id-3","text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"11","element":"a"},{"text":"].","element":"span"}],[{"text":"There were however observations made by various works that questioned the claimed strong generalization, transfer and reasoning capabilities attributed to LLMs [","element":"span"},{"href":"#id-7","referenceIndex":12,"text":"12","element":"a"},{"text":"]. 
These works pointed out various function failures that were seemingly incompatible with postulated strong capabilities as measured by standardized benchmarks [","element":"span"},{"href":"#id-8","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"14","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":16,"text":"16","element":"a"},{"text":"]. However, it has also been noted that observed failures can frequently be addressed through simple adjustments to the prompts or by repeated execution and evaluation using majority voting, or by requesting the model to perform self-verification [","element":"span"},{"href":"#id-12","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","text":"19","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","text":"21","element":"a"},{"text":"]. For many further anecdotal reports of failures, systematic evaluation was lacking, e.g. the conditions under which failures occur were not accurately controlled, and it was often unclear whether potentially ambiguous formulations were leading to the failures. 
It thus remained unclear whether those observations of failures were pointing to some fundamental deficits in core model function like generalization and reasoning, or whether those were just due to ill-posed problem formulation or due to minor issues easily resolvable by simple interventions, leaving the claim of strong core function backed up by high scores on standardized benchmarks and also by observed strong performance on selected complex tasks unaffected.","element":"span"}],[{"text":"To shed light on this situation, we study whether the claim of SOTA LLMs possessing strong functions across various complex problems can be put to test by using problems that are very simple, in contrast to those employed by various standardized benchmarks. We introduce a short conventional common sense problem template that can be formulated without any ambiguities in concise natural language and can be easily solved by humans. The problem (in the following, the Alice in Wonderland or AIW problem) has the following template: ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"brothers and she also has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"sisters. How many sisters does Alice’s brother have?\"","element":"span"},{"text":". Crucially, instantiating natural numbers ","element":"span"},{"style":{"height":14.8},"width":166.84,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/1-1.png","element":"img","alt":" N, M ≤ 7","inline":true,"padRight":true},{"text":"in the problem template allows us to naturally introduce a controlled source of systematic variations that do not change problem structure and difficulty at all and thus should not affect ability to solve it (Fig. 
","element":"span"},{"href":"#id-15","text":"1)","element":"a"},{"text":".","element":"span"}],[{"text":"We then use this technique of creating problem structure and difficulty preserving variations to measure models’ sensitivity to problem irrelevant perturbations across multiple repetitive trials, testing models’ generalization - should it be intact, models should handle the simple problem equally well across variations (Fig. ","element":"span"},{"href":"#id-15","text":"1)","element":"a"},{"text":".","element":"span"}],[{"text":"Surprisingly, when confronted with the AIW problem and its structure preserving variations, all SOTA models including advanced large-scale ones (e.g. GPT-4 [","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"22","element":"a"},{"text":"], Claude 3 Opus [","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"23","element":"a"},{"text":"$3d","element":"span"}],[{"text":"Confronted with evidence of the zero-shot generalization breakdown that all tested SOTA LLMs exhibit on such a simple problem, we investigate further AIW problem versions to see whether the same phenomenon of model sensitivity to problem structure preserving variations appears consistently. We use the same technique of varying numbers in problem templates and observe the same strong fluctuations of model performance also on other problem settings, obtaining further evidence that the generalization deficit is generic and not unique to the original AIW formulation. We also see most recent reasoning models like DeepSeek-R1 or o1-mini exhibiting strong fluctuations on AIW problem versions, revealing lack of robustness and generalization deficit also for this model class.","element":"span"}],[{"text":"$3e","element":"span"}],[{"text":"We also observe further breakdown characteristics that make it severe. 
We measure distribution of natural numbers responses on output, showing that for AIW variations with low correct response rate, peaks are on wrong answers, excluding majority voting methods as a fix. We observe that wrong responses are often accompanied by persuasive explanation-like confabulations and overconfident tone about correctness of the wrong solutions provided by the models, which can further mislead model users into trusting wrong responses. This is especially troubling for scenarios where users have no simple way to check for the correctness of the provided solutions. We also see models failing to properly detect mistakes and to revise wrong solutions when encouraged to do so in experiments with multi-turn AIW problem interaction and self-verification.","element":"span"}],[{"text":"The observed breakdown of function and generalization is in strong contrast to scores on standardized benchmarks, which contain problems of higher difficulty. Many tested models that score high on such benchmarks show correct response rates close to zero across simple AIW problem variations. Claim put forward by standardized benchmarks to properly reflect model capabilities such as generalization and reasoning cannot be upheld in face of the evident failure to spot such severe function deficits as revealed in the simple AIW problem setting. Our study highlights necessity to re-assess capabilities of SOTA LLMs by creating novel benchmarks that properly reflect their true abilities to generalize and reason. Such benchmarks will be able to correctly spot deficits overlooked so far and thus show the path for improvement of current still unsatisfactory state.","element":"span"}]]},{"heading":"2 Methods & Experiment Setup","paragraphs":[[{"id":"id-23","style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Simple common sense reasoning problems and their variations","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Problem. 
","element":"span"},{"text":"To measure models’ sensitivity to problem irrelevant variations and thus probe the zero-shot generalization, we use following problem template: ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"brothers and she also has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"sisters. How many sisters does Alice’s brother have?\"","element":"span"},{"text":". The problem has a simple common sense solution which assumes all sisters and brothers in the problem setting share the same parents. The correct response ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"- number of sisters - is easily obtained by calculating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"(Alice and her sisters), which gives the number of sisters Alice’s brother has. To create problem variations, we choose to vary natural numbers ","element":"span"},{"style":{"height":14.8},"width":168,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/3-0.png","element":"img","alt":" N, M ≤ 7","inline":true},{"text":", obtaining AIW variations 1-4 (see Suppl. Tab. ","element":"span"},{"href":"#id-4","text":"2) ","element":"a"},{"text":"that all pose same problem using variations irrelevant for problem solving. The resulting variations are as following:","element":"span"}],[{"text":"Variation 1. Alice has 3 brothers and she also has 6 sisters. [Correct answer: 7] Variation 2. Alice has 2 sisters and she also has 4 brothers. [Correct answer: 3] Variation 3. 
Alice has 4 sisters and she also has 1 brother. [Correct answer: 5] Variation 4. Alice has 4 brothers and she also has 1 sister. [Correct answer: 2] Question: How many sisters does Alice’s brother have?","element":"span"}],[{"text":"We further use 3 prompt types, RESTRICTED, STANDARD and THINKING, to ensure we measure models across various prompt formulations and check the observations hold independent of employed prompt type (see also Sec. ","element":"span"},{"href":"#id-17","text":"2.2 ","element":"a"},{"text":"and Suppl. Sec. ","element":"span"},{"text":"B)","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Control AIW Light problems","element":"span"}],[{"text":"To control for models struggling either with basic family relations structure handling or with executing arithmetic operations in frame of the posed AIW problem, we make various versions of AIW problem - AIW Light Family, AIW Light Arithmetic Siblings and AIW Light Arithmetic Total Girls. The AIW Light problems keep problem template close to the original, changing only the final question part such that the posed modified question tests particular operations. The variations 1-4 are created in the same way like in AIW original by varying natural numbers of brothers and sisters, while ensuring that the natural numbers for final correct answers in AIW original and AIW Light are matched across variations 1-4 (see also Suppl. Sec. ","element":"span"},{"text":"B)","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Arithmetic Siblings","element":"span"},{"text":". 
AIW Light Arithmetic Siblings has following problem template: ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"brothers and she also has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"sisters. How many siblings does Alice have?\"","element":"span"},{"text":". Compared to AIW original, only question part is modified. To solve the problem, summing up already given numbers of brothers and sisters is sufficient - the correct answer is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":". This requires basic grasping of relational family structure (realizing Alice’s siblings are her sisters and brothers) and selection and execution of elementary arithmetic sum operation. In contrast to AIW original, it does not require execution of set operations nor binding sex attribute to Alice to properly assign her to correct sets. Should the issues with solving AIW original be rooted in selection and execution of elementary arithmetic operations in family frame, we should see models also failing here. Again, we create variations 1-4 by varying natural numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M","element":"span"},{"text":", such that correct responses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"are matched with AIW original variations 1-4 (Suppl. Tab. ","element":"span"},{"href":"#id-18","text":"3)","element":"a"},{"text":". This gives following variations:","element":"span"}],[{"text":"Variation 1. 
Alice has 3 brothers and she also has 4 sisters. [Correct answer: 7] Variation 2. Alice has 2 sisters and she also has 1 brother. [Correct answer: 3] Variation 3. Alice has 4 sisters and she also has 1 brother. [Correct answer: 5] Variation 4. Alice has 1 brother and she also has 1 sister. [Correct answer: 2] Question: How many siblings does Alice have?","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Family","element":"span"},{"text":". AIW Light Family has following problem template: ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"brothers and she also has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"sisters. How many brothers does Alice’s sister have?\"","element":"span"},{"text":". Compared to AIW original, only question part is modified. To solve the problem, reporting already given number of brothers is sufficient - the correct answer is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":". This requires only basic grasping of relational family structure (understanding entity \"Alice’s sister\", binding female attribute to Alice and realizing Alice and her sisters share same brothers). It does NOT require execution of any arithmetic or set operations, in contrast to AIW original. Should the issues with solving AIW original be rooted in ","element":"span"},{"text":"handling basic family structure, we should see models also failing here. 
Again, we create AIW Light Family variations 1-4 by varying natural numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M","element":"span"},{"text":", such that correct responses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"are matched with AIW original variations 1-4. (Suppl. Tab. ","element":"span"},{"href":"#id-19","text":"4)","element":"a"},{"text":". This gives following variations:","element":"span"}],[{"text":"Variation 1. Alice has 7 brothers and she also has 3 sisters. [Correct answer: 7] Variation 2. Alice has 4 sisters and she also has 3 brothers. [Correct answer: 3] Variation 3. Alice has 2 sisters and she also has 5 brothers. [Correct answer: 5] Variation 4. Alice has 2 brothers and she also has 3 sisters. [Correct answer: 2] Question: How many brothers does Alice’s sister have?","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Arithmetic Total Girls","element":"span"},{"text":". AIW Light Arithmetic Total Girls has following problem template: ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"brothers and she also has ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"sisters. How many girls are there in total?\"","element":"span"},{"text":". Compared to AIW original, only question part is modified. 
To solve the problem, it is necessary to bind female attribute to Alice via the pronoun \"she\", to assign correct female attributes to the sisters and to execute the correct arithmetic sum operation adding all the obtained girls - the correct answer is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"(note it is the same as for AIW original). This requires basic grasping of family structure (realizing who are the girls in the family) and selection and execution of elementary arithmetic sum operation. In contrast to AIW original, it does not require execution of set operations to properly assign Alice to sisters set. Should the issues with solving AIW original be rooted in binding correct sex attributes or counting total members of particular sex in family frame given its structure, we should see models also failing here. Again, we create variations 1-4 by varying natural numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M","element":"span"},{"text":", such that correct responses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"are matched with AIW original variations 1-4. (Suppl. Tab. ","element":"span"},{"href":"#id-20","text":"5)","element":"a"},{"text":". This gives following variations:","element":"span"}],[{"text":"Variation 1. Alice has 6 sisters and she also has 3 brothers. [Correct answer: 7] Variation 2. Alice has 2 sisters and she also has 4 brothers. [Correct answer: 3] Variation 3. Alice has 4 sisters and she also has 1 brother. [Correct answer: 5] Variation 4. Alice has 1 sister and she also has 4 brothers. 
[Correct answer: 2] Question: How many girls are there in total?","element":"span"}],[{"id":"id-17","style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Prompt types, providing input and response parsing","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Model prompt types. ","element":"span"},{"text":"It is well known that so-called prompt engineering can heavily influence the model behavior and model response quality [","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"25","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"26","element":"a"},{"text":"]. To check that our observations reflect model sensitivity to controlled, problem structure preserving variations in same manner independent of particular prompt type, we employed 3 various prompt types to provide model’s input: STANDARD (prompt with instruction to format final answer output as a natural number), THINKING (prompt that in addition encourages thinking in spirit of CoT) and RESTRICTED (prompt with instruction to output nothing else but final answer as a natural number). THINKING v2 prompt type is a minor variation of THINKING type that just adds \"step by step\" after already existing \"think carefully\" phrasing (control experiments show that THINKING and THINKING v2 are equivalent in terms of observed performance, so we use both interchangeably, Suppl. Fig. ","element":"span"},{"href":"#id-21","text":"35b)","element":"a"},{"text":". STANDARD and THINKING prompt types allow models to generate any text output before delivering the final answer, while RESTRICTED is used as control with restricted output to measure model behavior when the only output allowed is the final answer (see Fig. 1, Suppl. Fig. ","element":"span"},{"href":"#id-22","text":"34b ","element":"a"},{"text":"and Suppl. Tab. 
","element":"span"},{"href":"#id-4","text":"2)","element":"a"}],[{"text":"Furthermore, we make use of other prompt types (see Suppl. Sec","element":"span"},{"text":".B ","element":"span"},{"text":"for overview) to demonstrate various important properties and the different success or failure modes of the model behavior for the AIW problem. In those prompts, we re-use the main problem formulation as introduced in Sec. ","element":"span"},{"href":"#id-23","text":"2.1, ","element":"a"},{"text":"while adding various modifications. This allows us for instance to observe confabulations that contain clearly broken statements with reasoning-like convincing sound backing up wrong final answers, responses showing model overconfidence, or look into other function modes like in-context learning handling.","element":"span"}],[{"text":"For each trial, models receive thus an input that has a form <","element":"span"},{"style":{"fontStyle":"italic"},"text":"instantiated-template","element":"span"},{"text":"> <","element":"span"},{"style":{"fontStyle":"italic"},"text":"prompt-type","element":"span"},{"text":">, where <","element":"span"},{"style":{"fontStyle":"italic"},"text":"instantiated-template","element":"span"},{"text":"> is template with substituted numbers instantiating one of problem variations 1-4 containing the question, followed by <","element":"span"},{"style":{"fontStyle":"italic"},"text":"prompt-type","element":"span"},{"text":"> with task and output instructions corresponding to one of prompt types as described above.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Parsing model responses. ","element":"span"},{"text":"To perform evaluations of model performance, it is necessary to parse and extract the final answer from the responses provided by the models. Each input to the model is ","element":"span"},{"text":"combination of a AIW problem variation, followed by one of prompt types as described before. 
To keep the parsing procedure simple, each prompt type contains following output format instruction: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"\"provide the final answer in following form: \"### Answer: \"\"","element":"span"},{"text":". We observed that all models we have chosen to test were able to follow such an instruction, providing a response that could be easily parsed. We also ran control experiments without such formatting instruction in the problem formulation, ensuring that behavior does not depend on it.","element":"span"}],[{"id":"id-27","style":{"fontWeight":"bold"},"text":"2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Evaluating model responses","element":"span"}],[{"text":"The formatting instruction makes it possible to extract for each trial whether a model has provided a correct answer to the problem posed in the input. We can interpret then any number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"of collected responses as executing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"trials given a particular prompt for a given model (","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"- number of Bernoulli trials), observing in each ","element":"span"},{"style":{"height":10.4},"width":44.73,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-0.png","element":"img","alt":" i−","inline":true},{"text":"th trial a Bernoulli variable ","element":"span"},{"style":{"height":16},"width":209.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-1.png","element":"img","alt":" Xi = {0, 1}","inline":true},{"text":". 
We interpret the number of correct responses ","element":"span"},{"style":{"height":16.74},"width":197.97,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-2.png","element":"img","alt":" X = ∑i Xi","inline":true,"padRight":true},{"text":"as a random variable following a Beta-Binomial distribution with ","element":"span"},{"text":"unknown probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"of correct response that we also treat as a random variable that comes from a Beta distribution, i.e. ","element":"span"},{"style":{"height":16},"width":261.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-3.png","element":"img","alt":" p ∼ Beta(α, β)","inline":true},{"text":", where ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-4.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-5.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"are parameters of the Beta distribution. To obtain plots showing correct response ratios, we would like to estimate the Beta distribution underlying ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", and for that, we first estimate the mean of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and its variance from the collected observations. 
To estimate ","element":"span"},{"style":{"height":14},"width":23.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-6.png","element":"img","alt":" ˆp","inline":true},{"text":", we use the formula for estimating the mean of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"for a binomial distribution: ","element":"span"},{"style":{"height":16},"width":220.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-7.png","element":"img","alt":" ˆp = X/n (i.e.","inline":true,"padRight":true},{"text":"as a proportion of successes). We can report the estimate ","element":"span"},{"style":{"height":14},"width":23.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-8.png","element":"img","alt":" ˆp","inline":true,"padRight":true},{"text":"as the estimate of the correct response rate of a given model and also, compare the correct response rates of various tested models. 
Moreover, we can estimate the variance of the probability of a correct response by using the following formula:","element":"span"}],[{"style":{"width":"87%"},"width":1382,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-9.png","element":"img"}],[{"text":"The estimates of the variance and the standard deviation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"can be thus obtained by using ","element":"span"},{"style":{"height":14},"width":23.38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-10.png","element":"img","alt":" ˆp","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"99%"},"width":1580,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-11.png","element":"img"}],[{"text":"following relations for the variance: ","element":"span"},{"style":{"height":28.8},"width":387.66,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-12.png","element":"img","alt":"(σ² = nαβ(α+β+n) / ((α+β)²(α+β+1)))","inline":true},{"text":" and the mean ","element":"span"},{"style":{"height":28.8},"width":199.19,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-13.png","element":"img","alt":"(µ = α / (α+β))","inline":true},{"text":" in order to obtain ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-14.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/5-15.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"parameters for the Beta distribution. 
To simulate data for the plots, we draw ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"random samples corresponding to correct and incorrect responses using the estimated distribution of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and obtain the plots showing performance on the task for various models of interest as a full distribution of the respective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Selecting models for evaluation and conducting experiments","element":"span"}],[{"text":"We are interested in testing state-of-the-art models that claim strong function, especially in generalization and reasoning, backed up by high scores shown on standardized benchmarks that are assumed to measure problem solving capabilities and specifically, generalization and reasoning. We therefore select models widely known and used in the ML community that also appear in the top rankings of the popular LLM leaderboards, like openLLM leaderboard by HuggingFace or ELO leaderboard by LMsys. We provide the overview of the selected models in Suppl. Tab. ","element":"span"},{"href":"#id-24","text":"1 ","element":"a"},{"text":"and list in Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12 ","element":"a"},{"text":"the corresponding standardized benchmarks where they obtain strong scores. Whenever possible, we were choosing models of various scales from the same family, ranging from small to large, in order to see how the capabilities to solve the posed task may vary with scale.","element":"span"}],[{"text":"We expose selected SOTA LLMs, including advanced models at largest scales (see Suppl. Tab. ","element":"span"},{"href":"#id-24","text":"1) ","element":"a"},{"text":"to AIW problem variations 1-4 (Suppl. Tab. 
","element":"span"},{"href":"#id-4","text":"2) ","element":"a"},{"text":"and AIW Light control problems (Suppl. Tab. ","element":"span"},{"href":"#id-18","text":"3, ","element":"a"},{"href":"#id-19","text":"4, ","element":"a"},{"href":"#id-20","text":"5)","element":"a"},{"text":", and other problem versions using different prompt types as described above. For each combination of model, AIW problem variation and prompt type, at least 30 trials are collected to compute both correct response rates within each variation and average correct response rates across variations 1-4, Suppl. Fig. ","element":"span"},{"href":"#id-26","text":"42. ","element":"a"},{"text":"For details on correct response rates estimation procedure, see Sec. ","element":"span"},{"href":"#id-27","text":"2.3","element":"a"}],[{"text":"We use hosting platforms that offer API access or local deployment via vLLM [","element":"span"},{"href":"#id-28","referenceIndex":27,"text":"27","element":"a"},{"text":"] for testing the models, and automatize the procedure by scripting the routines necessary to prompt models with our prompts set. The routines are simple and can be used by anybody with access to the APIs (we used liteLLM and TogetherAI for our experiments) or to locally hosted models to reproduce and verify our results. We protocol all the data from interactions with the models to enable community checking. ","element":"span"},{"text":"We release all the collected raw response data, correct response rates estimates and routines used to conduct experiments as open-source for reproducibility and further usage.","element":"span"}]]},{"heading":"3 Results","paragraphs":[[{"style":{"width":"99%"},"width":1584,"height":727,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/6-0.png","element":"img"}],[{"text":"Figure 2: Alice is reasoning: will it break? 
Illustration of Humpty Dumpty from Through the Looking Glass ","element":"figcaption","subtype":"caption"},{"href":"#id-29","referenceIndex":28,"text":"[28]","element":"a","subtype":"caption"},{"text":", by John Tenniel, 1871. Source: Wikipedia.","element":"figcaption","subtype":"caption"}],[{"id":"id-43","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Humpty Dumpty sat on a wall: breakdown of SOTA LLMs on the simple AIW problem","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW reveals severe generalization and reasoning deficits in SOTA LLMs","element":"span"},{"text":". Following the procedures described in Sec ","element":"span"},{"text":"2, ","element":"span"},{"text":"we expose the selected models that claim strong function and reasoning capabilities (Suppl. Tab. ","element":"span"},{"href":"#id-24","text":"1) ","element":"a"},{"text":"and measure their correct response rate performance across and for each AIW variations 1-4 using various prompt types, executing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"30 ","element":"span"},{"text":"trials for each combination (see also Suppl. Tab. ","element":"span"},{"href":"#id-4","text":"2 ","element":"a"},{"text":"and Suppl. Fig. ","element":"span"},{"href":"#id-26","text":"42)","element":"a"},{"text":". The results suggest that confronted with the AIW problem, models suffer a severe function breakdown. This breakdown has two main manifestations:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1. Low correct response rates. 
","element":"span"},{"text":"Despite evident problem’s simplicity, many models are not able to deliver a single correct response, and the majority stay well below correct response rate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2","element":"span"},{"text":". We summarize the main results in the Fig. ","element":"span"},{"href":"#id-30","text":"3. ","element":"a"},{"text":"The only major exceptions from the observation of very low correct response rates are the largest scale closed models GPT-4 and Claude 3 Opus. These two models at largest scales obtain correct response rates well above ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3","element":"span"},{"text":", leaving the remaining large and smaller scales open-weights (e.g., Mistral-7B, Mixtral, Qwen, Command R+, and Dbrx Instruct) and closed-weights models (e.g., Gemini Pro, Mistral Large) far behind. Remarkably, many models that claim high scores on standardized benchmarks, show very low correct response rates close to 0, eg. Llama-3-8B, Mixtral-8x22B, Qwen1.5-110B, or exhibit even complete breakdown on AIW with correct response rate of zero across all variations, eg Command R+ or Qwen1.5-72B (Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"}],[{"text":"The results presented in the Fig. ","element":"span"},{"href":"#id-30","text":"3 ","element":"a"},{"text":"show estimates for correct response rates averaged across RESTRICTED, STANDARD and THINKING prompt types (Suppl. Tab. ","element":"span"},{"href":"#id-4","text":"2, ","element":"a"},{"text":"prompt IDs provided for reproducibility; Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33 ","element":"a"},{"text":"with models scoring 0). 
RESTRICTED prompt type was used as further control that forces models into short outputs, restricting the compute for providing a solution and thus serving as low baseline for the performance (see Suppl. Sec. ","element":"span"},{"text":"C ","element":"span"},{"text":"and Suppl. Fig. ","element":"span"},{"href":"#id-22","text":"34)","element":"a"},{"text":". Among the 4 models that are able to cross ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3","element":"span"},{"text":", two clear winners are the GPT-4o (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"649","element":"span"},{"text":") and Claude 3 Opus (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"431","element":"span"},{"text":"). The only open-weights model in this set of better performers is the rather older Llama-2 70B Chat (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3","element":"span"},{"text":"). For better performers, when inspecting the responses with correct final answers, we see also correct reasoning backing up the final answers. For the poor performing models with low correct response rates, by inspecting those rare responses with correct answers we also in some cases still can see correct reasoning. 
In the poor performers, among the responses with a correct final answer we see however often responses where final answer, after careful inspection, turns out to be an accident of executing entirely wrong reasoning with various mistakes leading coincidentally to","element":"span"}],[{"id":"id-30","style":{"width":"99%"},"width":1584,"height":989,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/7-0.png","element":"img"}],[{"text":"Figure 3: Collapse of SOTA LLMs on AIW problem. (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"main","element":"figcaption","subtype":"caption"},{"text":") Models with non-zero AIW correct response rate, average over STANDARD, THINKING, RESTRICTED prompt types and AIW variations 1-4. Omitted models score 0. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(inlay","element":"figcaption","subtype":"caption"},{"text":") Strong fluctuations on AIW variations 1-4, despite problem structure and difficulty remaining entirely unchanged across variations. Note that overall correct response rate averaged across variations does not reveal these fluctuations (shown on example of GPT-4). Numbers in the legend are prompt IDs (Suppl. Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-4","text":"2)","element":"a","subtype":"caption"}],[{"text":"the final output number corresponding to the right answer. Such responses are encountered in models with low correct performance rates (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p < ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"4","element":"span"},{"text":") (see Suppl. Sec. ","element":"span"},{"text":"D ","element":"span"},{"text":"for response examples), and we correct via manual inspection the status of correct response for such cases.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2. 
Strong performance fluctuations across irrelevant AIW problem variations. ","element":"span"},{"text":"Importantly, we also observe strong fluctuations of correct response rates across AIW variations 1-4 as introduced in Sec. ","element":"span"},{"text":"2. ","element":"span"},{"text":"Such fluctuations strongly affect better performers with higher average correct response rates like GPT-4/4o and Claude 3 Opus (as poor performers often have correct response rates close to 0 across all variations, so that no room for fluctuations exists on that very low performance level). As shown in Fig. ","element":"span"},{"href":"#id-30","text":"3 ","element":"a"},{"text":"(inlay) for the STANDARD and Fig. ","element":"span"},{"href":"#id-32","text":"4 ","element":"a"},{"text":"for the THINKING prompt type, the correct response rates can fluctuate between being close to 1 and being close to 0, depending on AIW variation. Remarkably, such fluctuations appear despite AIW variations all being instances of the very same simple problem, as changes in numbers used across AIW variations do not change either the problem structure or its difficulty at all. This lack of robustness on such a simple problem hints at severe deficits in generalization. The strong fluctuations across variations appear independent of employed prompt types (Fig. ","element":"span"},{"href":"#id-15","text":"1)","element":"a"},{"text":", while correct response rate averaged across all variations also varies across prompt types, showing in addition the expected prompt type dependency (Suppl. Fig. ","element":"span"},{"href":"#id-22","text":"34, ","element":"a"},{"href":"#id-21","text":"35)","element":"a"}],[{"id":"id-44","style":{"fontWeight":"bold"},"text":"3.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Control experiments using AIW Light problems","element":"span"}],[{"text":"As outlined in Sec. 
","element":"span"},{"href":"#id-23","text":"2.1, ","element":"a"},{"text":"we make use of control experiments to rule out that the observed breakdown might be due to failures in executing low level operations specific to AIW, eg. parsing natural language, numbers, arithmetic operations, set operations, binding attributes to entities (eg female attribute via \"she\" pronoun to Alice), handling basic family structure and so on. Observed strong fluctuations across variations in better performing models make it already unlikely that same low level operation (or same set of low level operations) necessary to solve AIW are broken - if that would be the case, such failures would be manifested to equal extent across all variations, as they pose the very same problem requiring same operations to be performed, so that performance would suffer in same manner independent of variations. Still, to obtain direct evidence, we design control","element":"span"}],[{"id":"id-32","style":{"width":"99%"},"width":1576,"height":604,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/8-0.png","element":"img"}],[{"text":"Figure 4: Strong fluctuations across AIW problem variations, THINKING prompt. For better performers, eg GPT-4o/4 and Claude Opus 3, correct response rates vary strongly from close to 1 to close to 0, despite AIW variations not affecting problem structure or difficulty (a color per each variation 1-4). This shows clear lack of model robustness, revealing generalization and basic reasoning deficits.","element":"figcaption","subtype":"caption"}],[{"id":"id-33","style":{"width":"99%"},"width":1576,"height":457,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/8-1.png","element":"img"}],[{"text":"Figure 5: Correct response rates across AIW Light Arithmetic Siblings control problem variations 1-4 (THINKING v2 prompt type). Strong performance is observed across problem variations (a color per each variation 1-4; prompt IDs in the legend, Suppl. 
Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-18","text":"3)","element":"a","subtype":"caption"},{"text":". Models that entirely collapse on AIW, like Command R Plus and Dbrx Instruct, are clearly able to solve this version, with correct response rates going up to 1 or close to 1 across all problem variations. This shows that executing arithmetic operations or handling basic family setting is not an issue for the tested models.","element":"figcaption","subtype":"caption"}],[{"text":"experiments such that they require low level operations necessary to solve AIW problem while also keeping problem structure close enough to the original (Sec. ","element":"span"},{"href":"#id-23","text":"2.1)","element":"a"},{"text":".","element":"span"}],[{"text":"In all following experiments, for each AIW variation, 60 trials were executed to estimate correct response rate and its variance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Arithmetic Siblings","element":"span"},{"text":". We show tested models’ performance in Fig. ","element":"span"},{"href":"#id-33","text":"5. ","element":"a"},{"text":"While all tested models clearly have struggled with AIW original (Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"text":"Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33)","element":"a"},{"text":", we observe them successfully solving AIW Light Arithmetic Siblings. Correct response rates go high up close to 1 for most tested models across all variations 1-4. This is also the case for the models that show very low correct response rates close to 0 or 0 on AIW original, like Command R+ or Dbrx Instruct (Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33, ","element":"a"},{"text":"Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". Importantly, strong fluctuations we observe across variations on AIW original (Fig. 
","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-32","text":"4) ","element":"a"},{"text":"disappear entirely. This clearly demonstrates that models neither struggle with basic grasping of relational family structure - eg., realizing Alice’s siblings are her sisters and brothers, nor with selection and execution of elementary arithmetic sum operation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Family","element":"span"},{"text":". We show tested models’ performance in Fig. ","element":"span"},{"href":"#id-34","text":"6. ","element":"a"},{"text":"Also here we observe all the tested models that are struggling with AIW original successfully solving AIW Light Family. Correct response rates go high up close to 1 for most tested models across all variations 1-4. This is also the case for the models that show very low correct response rates close to 0 or 0 on AIW original. like Command R+ or Dbrx Instruct (Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33 ","element":"a"},{"text":"& Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". Also strong fluctuations that we observe","element":"span"}],[{"id":"id-34","style":{"width":"99%"},"width":1577,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/9-0.png","element":"img"}],[{"text":"Figure 6: Correct response rates across AIW Light Family control problem variations 1-4 (THINKING v2 prompt type). Strong performance is observed across problem variations (a color per each variation 1-4). Models that entirely collapse on AIW, like Command R Plus and Dbrx Instruct, are clearly able to solve this version, with correct response rates going up to 1 or close to 1 across all problem variations. 
This shows that handling basic family relations and binding sex attributes to entities via pronouns is not an issue for the tested models.","element":"figcaption","subtype":"caption"}],[{"id":"id-35","style":{"width":"99%"},"width":1584,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/9-1.png","element":"img"}],[{"text":"Figure 7: Correct response rates across AIW Light Arithmetic Total Girls control problem variations 1-4 (THINKING v2 prompt type). Strong performance is observed across problem variations (a color per each variation 1-4; prompt IDs in the legend, Suppl. Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-20","text":"5)","element":"a","subtype":"caption"},{"text":". Models that entirely collapse on AIW, like Command R Plus and Dbrx Instruct, are clearly able to solve this version, with correct response rates going up to 1 or close to 1 across all problem variations. This rules out that either binding of female attributes to Alice and the sisters entities or selection and execution of arithmetic operations necessary to count total females is an issue for the tested models.","element":"figcaption","subtype":"caption"}],[{"text":"across variations on AIW original (Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-32","text":"4) ","element":"a"},{"text":"disappear. This clearly demonstrates that models handle well basic grasping of relational family structure - understanding entity \"Alice’s sister\", binding female attribute to Alice (which allows to determine correct number of brothers) and realizing Alice and her sisters share same brothers.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Light Arithmetic Total Girls","element":"span"},{"text":". We show tested models’ performance in Fig. ","element":"span"},{"href":"#id-35","text":"7. 
","element":"a"},{"text":"Again, we observe also here strong performance for all tested models that clearly have struggled with AIW original. Correct response rates go high up close to 1 for most tested models across all variations 1-4. This is also the case for the models that show very low correct response rates close to 0 or 0 on AIW original. like Command R+ or Dbrx Instruct. Also strong fluctuations that we observe across variations on AIW original (Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-32","text":"4) ","element":"a"},{"text":"are gone. This clearly demonstrates that models successfully cope with binding female attribute to entity of Alice, handle assignment of correct female attributes to the sisters and select and execute the correct arithmetic sum operation adding all the girls together.","element":"span"}],[{"text":"From these control experiments, we are thus able to obtain strong evidence that all tested models do not suffer from low-level issues with tokenization and natural language or natural numbers parsing and can handle well basic family relations structure, binding of attributes to entities and selection and execution of elementary arithmetic operations necessary to solve AIW problem. This further strengthen the hypothesis that observed failures and strong fluctuations in all tested SOTA models on AIW problem (Fig. 
","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-32","text":"4) ","element":"a"},{"text":"are rooted in problem unspecific, generic deficits in generalization and basic reasoning about problem structure.","element":"span"}],[{"id":"id-36","style":{"width":"78%"},"width":1237,"height":730,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/10-0.png","element":"img"}],[{"text":"Figure 8: Failure of standardized benchmark MMLU to properly reflect and compare model basic reasoning capabilities as shown by strong discrepancy between AIW correct response rate vs MMLU average score. Many models, eg. Command R+, score 0 on AIW, but have high MMLU score.","element":"figcaption","subtype":"caption"}],[{"id":"id-110","style":{"fontWeight":"bold"},"text":"3.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Discrepancies between claimed strong function and observed breakdown","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Standardized benchmarks failure","element":"span"},{"text":". We observe failure of standardized reasoning benchmarks to properly reflect generalization and basic reasoning skills of SOTA LLMs by noting significant disparity between the model’s performance on the AIW problem and the scores on conventional standardized benchmarks. All of the tested models report high scores on various standardized benchmarks that claim to test problem solving via reasoning, e.g. MMLU, ARC, Hellaswag. Our observations of SOTA models breaking down on the simple AIW problem hint that the benchmarks do not reflect deficits in generalization and basic reasoning of those models properly. We visualize this failure by plotting scores tested models obtain on wide-spread and accepted standardized benchmarks like MMLU versus the performance we observe on our proposed AIW problem. As strikingly evident from Fig. 
","element":"span"},{"href":"#id-36","text":"8, ","element":"a"},{"text":"there is a strong mismatch between high scores on MMLU reported by the models and the correct response rates they obtain on AIW. This mismatch and lack of differentiation makes it impossible for a given model to predict from its score on MMLU whether it will suffer breakdown on a simple problem like AIW, making the score unreliable for measuring core capabilities. Also model ranking fails, as models with similarly high MMLU scores claiming similar function level can have dramatic difference on simple AIW problem. For instance, models like Llama-3-70B, Mistral Large or GPT-4-Turbo come close with their MMLU score to GPT-4/Claude 3 Opus, hinting comparable capabilities, while settled in the crowd of high MMLU - low AIW score region (left upper part of Fig. ","element":"span"},{"href":"#id-36","text":"8) ","element":"a"},{"text":"with most other models that achieve very low AIW performance close to 0. This also demonstrates that MMLU, while containing problems of arguably higher difficulty, does not properly reflect deficits in basic model function, as revealed by much simpler AIW problem. For similar evidence on other standardized benchmarks, see Suppl. Sec. ","element":"span"},{"href":"#id-37","text":"C.1","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Go small, go home: breakdown of smaller scale models","element":"span"},{"text":". The few models capable of showing significant non-zero correct response rate for the AIW problem are residing on the larger scales. GPT-4o/4 and Claude 3 Opus have unknown scales, it is however reasonable to assume the model scale is well beyond 70B params and the tokens scale is well beyond 2T tokens. 
Observing the performance on the AIW problem across various models, we see evidence that in general, smaller scale models (known to have been overtrained on large token budgets of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"tokens) that have quite high scores on standardized reasoning benchmarks, some coming close to larger scale ones, suffer severe collapse on the AIW problem. No small scale model can even remotely approach the performance shown by better performers residing at larger scales which, despite observed strong fluctuations on problem irrelevant variations, do show significant correct response rates on some AIW problem variations (GPT-4o/4, Claude 3 Opus), while smaller scale models drop close to zero, staying far below ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1 ","element":"span"},{"text":"(with the only exception of Llama 2 7B, which upon inspection reveals that its average correct response rate is substantially elevated due to one strong outlier on a single AIW variation, with correct response rate on others being close to 0 or 0, which cannot be taken as a signature of strong function and is a sign of severe lack of robustness, see also Fig. ","element":"span"},{"href":"#id-38","text":"26)","element":"a"},{"text":". An","element":"span"}],[{"id":"id-47","style":{"width":"99%"},"width":1584,"height":1065,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/11-0.png","element":"img"}],[{"text":"Figure 9: Altering model performance by fully redundant information. 
Adding fully redundant information \"Alice is female\" leads to an increase of average correct response rates in ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"compared to AIW original ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"(see also Suppl. Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-39","text":"40)","element":"a","subtype":"caption"},{"text":". For some models, e.g., Llama 3 70B or Qwen 2 72B, this boost via redundant info is especially pronounced and happens across all variations, resulting in clear overall improvement from ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"to ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(a)","element":"figcaption","subtype":"caption"},{"text":". Note that performance of a smaller scale model like Llama 3 8B stays low, close to 0. Strong fluctuations across variations 1-4 persist. This again shows lack of model robustness, hinting at severe generalization and basic reasoning deficits.","element":"figcaption","subtype":"caption"}],[{"text":"important note here is that while the probability of a random guess on the AIW problem is not straightforward to calculate, as the answer might be any natural number, it is conceivable that any model capable of language parsing from the input can probably easily infer from the problem formulation that the response should be a rather low natural number without ever grasping any essentials about the actual problem structure. 
That might already restrict the candidate numbers to a narrow interval, e.g., 0-10, which would provide a random guess probability of about ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1","element":"span"},{"text":". Following that consideration, any correct response rate that is around or below ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1 ","element":"span"},{"text":"might be considered just random guessing that does not require any reasoning. All the tested smaller scale models (","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"text":"13","element":"span"},{"text":"B) would then fall into the category of random guessing according to Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"text":"Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33. ","element":"a"},{"text":"For further evidence on the effect of scale, see Sec. ","element":"span"},{"href":"#id-40","text":"3.4.2 ","element":"a"},{"text":"and Fig. ","element":"span"},{"href":"#id-41","text":"23, ","element":"a"},{"href":"#id-42","text":"24.","element":"a"}],[{"id":"id-115","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Curiouser and curiouser","element":"span"}],[{"text":"Following our observations of breakdown of SOTA LLMs that claim strong function described in Sec. ","element":"span"},{"href":"#id-43","text":"3.1, ","element":"a"},{"text":"we investigated various properties and modes of the observed failures, reporting here the ones we find most remarkable. Investigating the AIW problem further, we find various other formulations that show the same breakdown characteristics for all SOTA LLMs. 
We observe the same lack of robustness on structure- and difficulty-preserving problem variations using various AIW extensions which also abandon the original family problem framing, providing further evidence that the observed breakdown of zero-shot generalization is of generic character and is not unique to the specific AIW problem setting, as also corroborated by control experiments in Sec. ","element":"span"},{"href":"#id-44","text":"3.1.1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Performance fluctuations by adding fully redundant information: Alice female power boost","element":"span"},{"text":". One clear signature of generalization and reasoning breakdown is the strong fluctuations we observe across AIW problem variations 1-4 that differ only in instantiated numbers (Fig. ","element":"span"},{"href":"#id-32","text":"4)","element":"a"},{"text":". We investigate a further AIW problem version by adding ","element":"span"},{"style":{"fontStyle":"italic"},"text":"\"Alice is female\" ","element":"span"},{"text":"to the original AIW problem formulation ","element":"span"},{"text":"(see Suppl. Tab. ","element":"span"},{"href":"#id-45","text":"6 ","element":"a"},{"text":"and Suppl. Sec. ","element":"span"},{"href":"#id-46","text":"C.2)","element":"a"},{"text":". This information is fully redundant, as Alice’s gender is already unambiguously specified by the \"she\" pronoun used in the original AIW problem. As evident from Fig. ","element":"span"},{"href":"#id-47","text":"9 ","element":"a"},{"text":"and Suppl. Fig. ","element":"span"},{"href":"#id-39","text":"40, ","element":"a"},{"text":"the average correct response rates increase, despite the provided \"female boost\" information being entirely redundant and not revealing anything new necessary for the AIW problem solution. Altering performance by fully redundant information that should not affect problem solving again reveals deficits in generalization and basic reasoning. 
While average correct response rates increase, the strong fluctuations across AIW variations 1-4 remain (Fig. ","element":"span"},{"href":"#id-47","text":"9a)","element":"a"},{"text":". For instance, GPT-4o has a correct response rate close to 1 on AIW variations 2 and 4, while dropping heavily on AIW variations 1 and 3, showing the same lack of robustness despite the average boost.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Performance fluctuations by changing magnitude of instantiated numbers","element":"span"},{"text":". Variations of numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M ","element":"span"},{"text":"for brothers and sisters in the AIW problem were so far deliberately chosen to be in the range of common sense, avoiding exceedingly large numbers. For the formal mathematical structure of the problem though, the magnitude of the numbers changes nothing. We conduct an experiment using an AIW version with exaggerated numbers, created by adding an offset of ","element":"span"},{"text":"60 ","element":"span"},{"text":"to numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M ","element":"span"},{"text":"used in the AIW original. We see the same breakdown pattern as in AIW original - models exhibit strong fluctuations across problem structure and difficulty preserving variations, or break down to a low performance level close to 0 across all variations (Fig. ","element":"span"},{"href":"#id-48","text":"10). ","element":"a"},{"text":"We also observe differences in correct response rates between the exaggerated number AIW version (Fig. ","element":"span"},{"href":"#id-48","text":"10(","element":"a"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":")) and the original (Fig. ","element":"span"},{"href":"#id-48","text":"10 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":")). 
As the number magnitude modification leaves the problem structure unchanged, these performance fluctuations provide further evidence for lack of model robustness to problem irrelevant perturbations and hint on generic generalization deficits.","element":"span"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"3.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Persisting fluctuations and breakdown on various AIW versions","element":"span"}],[{"text":"To confirm that the same observations hold for other problems of related kind, we construct further problem templates which pose problems of similar or higher difficulty than AIW original. Using same experimental procedure to create variations of these problem versions, we observe the same pattern as for the AIW original, especially the strong fluctuations across variations, confirming the existence of the same generalization deficits for further problem examples. The problems are as following:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW extended (AIW Ext.) ","element":"span"},{"text":"In this problem version, Alice and Bob are employed as two entities, keeping the template close to the AIW original. The resulting variations (for the version, where question is posed explicitly for Alice - AIW Ext Explicit - prompt IDs 264, 266, 268, 270) are as following:","element":"span"}],[{"style":{"width":"75%"},"width":1196,"height":269,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/12-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"AIW Friends ","element":"span"},{"text":"In this problem version, we abandon the family frame setting. Instead, we use male and female friends in problem formulation. Note the problem structure is still related to AIW original ( brothers and sisters are male and female siblings). We use an additional condition to ensure there is no common sense ambiguity in this problem version. 
The resulting variations are as following:","element":"span"}],[{"text":"Variation 1. Alice has 3 male friends and she also has 6 female friends. [Correct answer: 7] Variation 2. Alice has 2 female friends and she also has 4 male friends. [Correct answer: 3] Variation 3. Alice has 4 female friends and she also has 1 male friend. [Correct answer: 5] Variation 4. Alice has 4 male friends and she also has 1 female friend. [Correct answer: 2] All mentioned persons are friends with each other and have no other friends aside. Question: How many female friends does male friend of Alice have?","element":"span"}],[{"text":"These two problem versions, AIW Ext and AIW Friends, have similar problem structure and difficulty to AIW original. They thus can be also seen as control to ensure our observations are consistent among various problem formulations that have similar problem difficulty (in contrast to AIW Light control problems which are designed to test low level operations involved, such that their templates pose problems of lower difficulty than AIW original).","element":"span"}],[{"id":"id-48","style":{"width":"99%"},"width":1581,"height":1071,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/13-0.png","element":"img"}],[{"text":"Figure 10: Strong fluctuations on AIW problem variations (a color per each variation 1-4) for AIW version with exaggerated numbers are used (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":"), created by adding 60 offset to numbers used in AIW original (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":"), using same THINKING v2 prompt type. Main pattern of model behavior remains the same - exhibiting strong fluctuations across variations and having breakdowns on some variations with low correct response rates. 
Variations correspond to different instantiations of numbers ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"N, M ","element":"figcaption","subtype":"caption"},{"text":"for brothers and sisters in the same problem template. Varying numbers should not affect problem solution at all, being a \"natural\" variation that does not affect problem structure and its difficulty. However, correct response rate varies strongly depending on the variation. Lack of robustness to irrelevant variations of such a simple problem points to severe generalization deficits. Correct response rates also seem to vary between the exaggerated number AIW version (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") and the original (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":"). Averaged across all 4 variations, the rate is lower for exaggerated numbers for most models. For Llama 3 70B as an exception, there is an overall increase, while performance remains poor. Changes in correct response rates induced by mere number modification that leaves problem structure unchanged again provide evidence for lack of model robustness and hint at generic generalization deficits.","element":"figcaption","subtype":"caption"}],[{"text":"Results of experiments with AIW Ext and AIW Friends and comparison to AIW original are shown in Fig. ","element":"span"},{"href":"#id-49","text":"11. ","element":"a"},{"text":"We again see rather low correct response rates overall, with better performers exhibiting strong fluctuations on problem variations. While the breakdown pattern for each problem version might depend on a given model (e.g., note Claude 3 Opus overall breakdown in correct response rates on AIW Ext in Fig. 
","element":"span"},{"href":"#id-49","text":"11 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"), while performing comparably, though with lower correct response rates, on AIW Friends Fig. ","element":"span"},{"href":"#id-49","text":"11 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"C","element":"span"},{"text":") ), the observed overall breakdown pattern is the same, confirming for tested SOTA LLMs same kind of severe lack of robustness and generalization deficit as observed on AIW original. Additional evidence for lack of model robustness and generalization deficit comes from the observation that performance also varies across problem versions, despite those being very similar in their structure and difficulty. Operations required to solve these various problem versions overlap strongly, although there are also differences. For instance, in AIW Ext ","element":"span"},{"href":"#id-49","text":"11 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":")), operations are required to handle Alice and Bob entities separately, to assign both properly to sisters and brothers sets, while in AIW original, only Alice has to be handled. On the other hand, AIW original has additional operations not required by AIW Ext. It requires binding of Alice via \"she\" pronoun to female gender before she can be properly assigned to sisters set and resolving the entity of Alice’s brother in the posed question - which AIW Ext in turn does not require. In AIW Friends ","element":"span"},{"href":"#id-49","text":"11 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"C","element":"span"},{"text":")), attributes female/male and friend have to be treated separately which requires additional binding operations. 
This is in contrast to sister and brother subsuming both the gender attribute (male, female) and the relationship (sibling) in one, which does not require additional binding. The problem structure","element":"span"}],[{"id":"id-49","style":{"width":"99%"},"width":1584,"height":1559,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/14-0.png","element":"img"}],[{"text":"Figure 11: Strong fluctuations on problem variations (a color per each variation 1-4) comparing AIW original (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") with two further versions, AIW Extended (AIW Ext (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") and AIW Friends (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":"), using THINKING v2 prompt type. Overall, observed pattern remains the same - better performers (GPT-4/4o, Claude 3 Opus) that manage to get occasionally higher correct response rates (","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p > ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"3","element":"figcaption","subtype":"caption"},{"text":") on some variations show strong fluctuations, with correct response rates dropping strongly on other variations, despite variations 1-4 leaving problem structure and difficulty unchanged by merely instantiating different numbers ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"N, M ","element":"figcaption","subtype":"caption"},{"text":"into problem templates. 
AIW Ext (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") and AIW Friends (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":") show overall altered correct response rates (eg, for GPT-4/4o, Claude 3 Opus) compared to AIW original (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":")). Strong fluctuations are common phenomenon across the problem versions, again hinting on generic generalization deficits that are independent of specific problem formulation. Differences in performance between the problem versions provide additional evidence for lack of model robustness, as problem structure and difficulty is highly similar between the versions.","element":"figcaption","subtype":"caption"}],[{"id":"id-51","style":{"width":"98%"},"width":1560,"height":750,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/15-0.png","element":"img"}],[{"text":"Figure 12: AIW+ correct response rates averaged over variations 1-6 (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") and fluctuations across variations (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":"). Most tested models undergo further collapse compared to AIW original. o1 preview as clear exception shows robust ability to solve AIW+ across all variations without fluctuations. o1-mini on contrary collapses close to 0, also showing fluctuations (eg on variation 6 vs others). AIW+ was made intentionally harder than simple AIW. 
However, models claiming strong function should be able to solve it, as it does not involve any higher level logic or math.","element":"figcaption","subtype":"caption"}],[{"text":"and difficulty among the problem versions are thus similar enough so that if generalization were intact, such mild differences would not matter and no such pronounced difference in performance would have appeared, in contrast to our observations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Harder problem versions: further breakdown of advanced LLMs","element":"span"}],[{"text":"To test what happens if the simple AIW problem is further extended towards a substantially harder difficulty level, providing further challenge for the tested models, we construct further problem versions of higher difficulty while keeping same relational logic structure appeal and conduct experiments as described in following.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW+ problem","element":"span"},{"text":". We constructed an AIW+ problem that features additional hierarchy and distractors when describing relational family structure (see Suppl. Sec ","element":"span"},{"text":"B ","element":"span"},{"text":"for full formulation). AIW+ problem template has following form: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"\"Alice has ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"M ","element":"span"},{"style":{"fontStyle":"italic"},"text":"sisters and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic"},"text":"brothers in total. Her mother has 2 brothers. She also has 1 sister who does not have children and who has ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nephews and nieces in total. 
Alice’s father has 2 sisters. He also has a brother who has ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nephews and nieces in total, and who also has ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"K ","element":"span"},{"style":{"fontStyle":"italic"},"text":"[sons/daughters]. How many cousins does Alice’s sister have?\"","element":"span"},{"text":".","element":"span"}],[{"text":"The solution to the AIW+ problem is harder to obtain than the solution to AIW original with its much simpler structure. Solving AIW+ requires considering both parental sides, that of mother and father, calculating the number of cousins on each side while taking care to subtract Alice and her siblings, and summing up the total number of cousins from both sides. Still, this problem is arguably far from olympiad or university graduate level, as it requires just using the provided numbers and careful execution of elementary arithmetic operations on a straightforward path to the solution. The correct solution is given by ","element":"span"},{"style":{"height":16},"width":916.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/15-1.png","element":"img","alt":"C = (X − (M + N + 1)) + (Y − (M + N + 1) + K)","inline":true},{"text":". We again follow our approach to create variations by instantiating numbers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, M, X, Y, K ","element":"span"},{"text":"in the problem template to obtain problem instances of the same problem structure and difficulty (see also Suppl. Tab. 
","element":"span"},{"href":"#id-50","text":"10)","element":"a"},{"text":".","element":"span"}],[{"text":"We show a full example of correct solution with instantiated numbers for AIW+ variation 1 (","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", X ","element":"span"},{"text":"= 6","element":"span"},{"style":{"fontStyle":"italic"},"text":", Y ","element":"span"},{"text":"= 5","element":"span"},{"style":{"fontStyle":"italic"},"text":", K ","element":"span"},{"text":"= 2","element":"span"},{"text":"), which corresponds to problem instance as following:","element":"span"}],[{"text":"Variation 1. Alice has 1 sister and 1 brother in total. Her mother has 2 brothers. She also has 1 sister who does not have children and who has 6 nephews and nieces in total. Alice’s father has 2 sisters. He also has a brother who has 5 nephews and nieces in total, and who also has 2 sons.","element":"span"}],[{"id":"id-56","style":{"width":"100%"},"width":1589,"height":1555,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/16-0.png","element":"img"}],[{"text":"Here, we have on the mother side: 6 (total nephews and nieces) - 3 (Alice and her siblings) = 3 cousins; on the father side: 5 (total nephews and nieces) + 2 (sons of the father’s brother) - 3 (Alice and her siblings) = 4 cousins; summing up 3 + 4 = 7 cousins which Alice and any of her siblings share.","element":"span"}],[{"text":"Exposing models to AIW+ following same methodology from Sec. ","element":"span"},{"text":"2 ","element":"span"},{"text":"(we add two more variations, so that AIW+ has variations 1-6), we observe further, even stronger collapse of performance also for those advanced models that were showing significant correct response rates for AIW problem (Fig. 
","element":"span"},{"href":"#id-51","text":"12) ","element":"a"},{"text":"For instance, for GPT-4/4o and Claude 3 Opus overall correct response rate averaged across variations stays below ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p < ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2 ","element":"span"},{"text":"(Fig. ","element":"span"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":")). Large-scale open weight SOTA models Llama 3.1 405B, DeepSeek-v3 671B and Qwen 2.5 72B are settled around ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1 ","element":"span"},{"text":"and below. Recent Claude 3.5 Sonnet is an outlier scoring higher up close to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"4","element":"span"},{"text":", without showing strong fluctuations we usually observe (Fig. ","element":"span"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"); see however Sec. on Claude 3.5 Sonnet breakdown - its performance on some versions of AIW might be due to exposure to AIW tasks in post training, as it appeared after first version of our public AIW release). To show that problem can be successfully handled, we also test here o1-preview that comes from the recent generation of reasoning models (which we treat as an exception; LLMs are conventionally understood as models pre-trained in purely autoregressive manner. 
It is still unknown for o1 class of models whether RL on unknown amounts of synthetic data presumably of math and logic type is executed during pre-training or is rather a part of post-training). o1-preview is a clear exception and has robust performance close to 1 across all AIW+ variations. Remarkably, o1-mini coming presumably from the same model class does not show same robustness - its performance is comparable to standard LLM generation far below o1-preview, settled close to 0 (Fig. ","element":"span"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":")) and exhibiting fluctuations as usually observed in our study (Fig. ","element":"span"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":")).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"AIW Colleague Circles","element":"span"},{"text":". Evidencing strong performance of o1-preview on AIW+, we design a further problem with harder difficulty level than AIW original to test whether we can observe same breakdown patterns on the strongest model. AIW like problems can be understood as problems on graphs, featuring entities, properties and relationships that define sets. 
To increase problem difficulty, we depart from the simple connectivity that was characteristic of the AIW original and introduce circles of colleagues where all-to-all connectivity defines a circle, while some entities have connections to the outside, making those entities hubs connecting circles.","element":"span"}],[{"style":{"width":"100%"},"width":1595,"height":2109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/17-0.png","element":"img"}],[{"text":"For the AIW Light problem versions used in control experiments, we observe as expected clear dominant peaks on the numbers corresponding to correct responses across all tested models (Fig. ","element":"span"},{"href":"#id-52","text":"16, ","element":"a"},{"href":"#id-52","text":"17)","element":"a"},{"text":", as AIW Light problems are successfully solved across all their variations.","element":"span"}],[{"style":{"width":"100%"},"width":1589,"height":1388,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/18-0.png","element":"img"}],[{"text":"Figure 15: Frequency distribution of output numbers in models’ responses. Shown are numerical outputs for AIW Variation 4, THINKING prompt type (prompt ID 70), that has correct answer C=M+1=2, with M=1 number of sisters of Alice. For this AIW variation, models have higher performance (Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-32","text":"4)","element":"a","subtype":"caption"},{"text":". Correspondingly, peaks for better performing models (e.g. GPT-4o, GPT-4, Claude 3 Opus) are on the dominant correct response, R=M+1=2. For models with worse performance, peaks are on the dominant wrong response, R=M=1. 
For weaker models, eg Llama 3 8B, also broader distribution over numbers appears, with further wrong clear peaks that are further away from C=M+1 ","element":"figcaption","subtype":"caption"},{"style":{"height":14.8},"width":1587.18,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/18-1.png","element":"img","alt":"(eg M=4). The distribution shape and peaks nature can be thus used as signature of model’s capability","inline":true,"padRight":true},{"text":"to handle the problem, also allowing model ranking dependent on peak types and distribution sharpness. Distributions were computed over 60 trials executed for each model, taken from original collected responses data.","element":"figcaption","subtype":"caption"}],[{"text":"We note that distribution characteristics, eg concentration on numbers around the correct answer, height of the peaks, can be a further signature that reflects model’s capability to handle the problem. More capable models retain dominant peaks on number corresponding to correct answer with smaller peaks on neighboring numbers, while weak models have large peaks on numbers corresponding to wrong answers or in general broad distribution across all natural numbers below 10. Computing scores from distribution shape can thus also enable model ranking, similar to looking at distribution shape of model’s correct response rates across variations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Overconfident tone and confabulations accompanying wrong answers","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Overconfident tone in wrong solutions. ","element":"span"},{"text":"Observing strong failures, we were curious to see how models explain their generated solutions. 
For the THINKING prompt type, where prompt contains request to double check the solution, we encounter examples where models spontaneously provide","element":"span"}],[{"id":"id-52","style":{"width":"100%"},"width":1589,"height":1303,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/19-0.png","element":"img"}],[{"text":"Figure 17: Frequency distribution of output numbers in models’ responses. Shown are numerical outputs for AIW Light Arithmetic, Variation 3, THINKING prompt type (prompt ID 279), that has ","element":"figcaption","subtype":"caption"},{"style":{"height":14.8},"width":1584.22,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/19-1.png","element":"img","alt":"correct answer C=5 (total number of Alice’s siblings). For this AIW Light version, all models have","inline":true,"padRight":true},{"text":"high performance. Correspondingly, peaks are on the dominant correct response, R=5. However also here, weaker models like Llama 3 8B show broader distribution with non-vanishing peaks besides the correct response (eg R=4, R=6) hinting on their weaker capabilities to deal robustly with the problem. Distributions were computed over 60 trials executed for each model.","element":"figcaption","subtype":"caption"}],[{"text":"assessment of the solution quality and their confidence into the solution. Remarkably, we see that in many cases of the observed responses with wrong reasoning and wrong final answers, the models claim high quality for their provided solution and are also strongly confident that the provided wrong solution is correct. 
For instance, Claude 3 Opus uses expressions like “logic holds up; doublechecking the solution; no mistakes in the reasoning; solution is correct.\", and Command R+ reports “This conclusion is straightforward and clear\" for the wrong answers they provide.","element":"span"}],[{"text":"We further use variations of the prompt types to make the models generate estimates of the solution quality and their confidence on it, like the SCIENTIST prompt or the CONFIDENCE prompt (see Suppl. Sec. ","element":"span"},{"text":"B)","element":"span"},{"text":". With those customized prompt types, we again observe strong overconfidence in the solution quality across the tested models. For the SCIENTIST prompt type, we see for instance Llama 2 70B using persuasive expressions like “carefully analyzing; use logical reasoning; provide a precise and accurate solution; conclusion might seem counterintuitive at first, but it’s actually correct\" to back up its wrong solutions. For the CONFIDENCE prompt type, we see for instance for the wrong responses given by Command R+ accompanying statements like “The solution is clear and unambiguous, and I am highly confident that it is correct.\"; “I am confident in this answer, as it logically follows from the provided information.\". See Suppl. Sec. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"?? ","element":"span"},{"text":"for full examples.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Confabulations to back up wrong solutions. ","element":"span"},{"text":"We observe that many models that show reasoning breakdown and produce wrong answers generate at the same time persuasive explanations that contain ","element":"span"},{"text":"reasoning-like or otherwise plausible sounding statements to back up the often non-sensical solutions they deliver. We call here such phenomena ","element":"span"},{"style":{"fontStyle":"italic"},"text":"confabulations","element":"span"},{"text":". 
Such confabulations may contain for instance calculations or logic-like statements that make no sense. Confabulations can also refer to reasoning about social norms or structures. For instance, in Command R+ we observe many confabulations that use concepts of gender identity such as non-binary gender or concepts related to inclusion or to cultural context dependent family identification as additional backup for the provided wrong reasoning and incorrect answers. Another type of confabulation that we observe is complete refusal to answer due to invented ethical concerns about the nature of the posed AIW problem, such as violation of privacy or lack of inclusion (for instance in CodeLLama-70B-instruct), or the expression of unfounded concerns about a supposedly ill-posed problem formulation. See Suppl. Sec. ","element":"span"},{"text":"F ","element":"span"},{"text":"for more details.","element":"span"}],[{"style":{"width":"100%"},"width":1591,"height":1167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/20-0.png","element":"img"}],[{"id":"id-60","style":{"fontWeight":"bold"},"text":"3.4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Debunking strong function claims","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A Tale of Rise and Fall, 1: Claude 3.5 Sonnet","element":"span"},{"text":". After publication of our first manuscript version presenting the AIW original problem (May 2024), Claude 3.5 Sonnet was released (June 2024), followed by wide acclaim as a strong frontier LLM, as also confirmed by standardized benchmarks. We tested the model on AIW original problem variations 1-4. Claude 3.5 Sonnet was the first among all tested models (at that time, the o1 model family did not yet exist) for which we observed correct response rates of 1 on all but AIW variation 1 (Fig. ","element":"span"},{"href":"#id-53","text":"18 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"). 
This behavior is also unchanged when using the Bob AIW variation, replacing Alice with Bob and adapting the formulation correspondingly to keep the problem’s original structure, Fig. ","element":"span"},{"href":"#id-53","text":"18 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":". From that evidence, Claude 3.5 Sonnet thus appeared to have solved the AIW problem. We were cautious about drawing conclusions from the AIW original formulation only, as those problems were already publicly available and it could not be excluded that the data was used for training of models appearing thereafter. We thus used AIW extended (AIW Ext, Sec. ","element":"span"},{"href":"#id-54","text":"3.2.1) ","element":"a"},{"text":"as it did not appear in the initial version of our study. Like other AIW versions, AIW Ext has similar problem structure and difficulty to AIW original, with AIW Ext variations 1-4 also aligned with AIW original variations 1-4 with respect to correct final answers (Sec. ","element":"span"},{"href":"#id-54","text":"3.2.1 ","element":"a"},{"text":"and Suppl. Tab. ","element":"span"},{"href":"#id-55","text":"8 ","element":"a"},{"text":"). We tested Claude 3.5 Sonnet on AIW Ext variations 1-4 in the same way we tested models on AIW variations 1-4. Claude 3.5 Sonnet then revealed strikingly different behavior on AIW Ext, Fig. ","element":"span"},{"href":"#id-53","text":"18 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"C ","element":"span"},{"text":"- it shows strong breakdown to much lower correct response rates and again fluctuations across variations 1-4, despite AIW Ext problem structure being highly similar to the AIW original. This is in line with Claude","element":"span"}],[{"id":"id-53","style":{"width":"99%"},"width":1584,"height":730,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/21-0.png","element":"img"}],[{"text":"Figure 18: A Tale of Rise and Fall of Claude 3.5 Sonnet. 
While correct response rates go up close to 1 on ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(A) ","element":"figcaption","subtype":"caption"},{"text":"AIW original and also ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(B) ","element":"figcaption","subtype":"caption"},{"text":"AIW Original Bob version, strong breakdown of correct response rates is observed on AIW extension ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(C) ","element":"figcaption","subtype":"caption"},{"text":"(AIW Ext), accompanied by fluctuations across variations 1-4. Strongly elevated correct response rates on AIW original might hint at exposure of Claude 3.5 Sonnet to AIW problem data for tuning. Collapse on AIW Ext, which has the same problem structure as AIW original, nevertheless again clearly shows a lack of robustness and hints at the same basic reasoning deficits as suspected for other tested models.","element":"figcaption","subtype":"caption"}],[{"text":"3.5 Sonnet behavior on other AIW versions, e.g., AIW+ (Fig. ","element":"span"},{"href":"#id-51","text":"12) ","element":"a"},{"text":"and AIW Colleague Circles (Fig. ","element":"span"},{"href":"#id-56","text":"13)","element":"a"},{"text":", where it also exhibits lower correct response rates and strong fluctuations across variations.","element":"span"}],[{"text":"$3f","element":"span"},{"href":"#id-57","text":"19 ","element":"a"},{"text":", a similar picture emerges - we observe strong performance improvement of the fine-tuned Llama 3.1 8B on AIW original. It scores close to 1 on most AIW variations, while still performing poorly on AIW Ext. 
Models like Claude 3.5 Sonnet report high scores on standardized benchmarks; however, strong function claims cannot be derived from those, as we see from the strong collapse and performance fluctuations on variations of rather simple problems such as AIW Ext.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A Tale of Rise and Fall, 2: NuminaMath-7B & claim of olympiad level problem solving. ","element":"span"},{"text":"Another example of debunking overblown claims is the case of NuminaMath-7B, which was ranked 1st at the AIMO competition in July 2024, solving 29/50 private set problems of olympiad math level. The claim was widely put forward that the model is capable of solving high school olympiad math problems. AIW is arguably at average elementary school level and does not require any advanced math knowledge. We tested NuminaMath-7B on AIW and observed a strong collapse of this model on the AIW problem, with correct response rates close to 0 across AIW variations 1-4 (Fig. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"??","element":"span"},{"text":"). Using AIW Light control problems, we can also see that NuminaMath-7B can handle all the low level operations (elementary arithmetic, attribute binding, etc.) and knowledge required to deal with family structure, ruling out that those are the issues. Using the AIW problem setting, we can thus contradict the strong claim of being capable of dealing with olympiad level high school math.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A Tale of Rise and Fall, 3: o1-mini & claim of matching larger scale with smaller ones. 
","element":"span"},{"text":"o1-mini was announced recently together with o1 and o1-preview as a smaller scale member of the","element":"span"}],[{"id":"id-57","style":{"width":"100%"},"width":1591,"height":2376,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/22-0.png","element":"img"}],[{"id":"id-40","style":{"fontWeight":"bold"},"text":"3.4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model comparison and ranking","element":"span"}],[{"text":"In previous experiments, we have shown that it is possible to use simple AIW problems as tool for falsification of strong function claims - robust generalization and strong problem solving - made","element":"span"}],[{"style":{"width":"99%"},"width":1584,"height":764,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/23-0.png","element":"img"}],[{"text":"Figure 20: Testing NuminaMath-7B, which claimed olympiad high school level math problem solving via 1st rank in AIMO competition, by using AIW original and AIW Light control problems. (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") Very low correct response rates across AIW problem variations 1-4 (THINKING v2 prompt type). NuminaMath-7B suffers strong collapse on simple AIW problem that has average elementary school level. This reveals clear deficits in generalization and even basic reasoning, refuting the claim of strong function on special domain of math problems. For each AIW variation, 100 trials were executed to estimate correct response rate and its variance. 
AIW Light experiments test various operations and knowledge required for solving AIW (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") Asking for Alice’s sister’s brothers number (requires understanding entity \"Alice’s sister\", binding female attribute to Alice and realizing Alice and her sisters share same brothers) (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":") Asking for Alice’s siblings number (requires understanding entity \"siblings\", accessing numbers of Alice’s brothers and sisters, executing addition operation) (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":") Asking for total girls number (requires binding female attribute to Alice via pronoun \"she\" and to her sisters, selecting and executing the correct arithmetic sum operation to count all the obtained girls). Across all AIW Light control problems, NuminaMath-7B obtains correct response rates much higher than for AIW original, some being close to 1. This proves that handling language, basic family structure, parsing numbers, and handling elementary arithmetics like counting are all intact and not the cause for failures in AIW (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":"). In (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":"), strong fluctuations despite only differences being instantiated numbers across variations of the same simple problem hint again on severe generalization deficits. 
The reason for the collapse on the AIW original problem is thus failure in inferring the problem structure, pointing to generalization and reasoning deficits, which is in contrast to claims made for NuminaMath-7B as a strong high school olympiad level math problem solver, based on AIMO competition benchmark, which did not reveal such flaws.","element":"figcaption","subtype":"caption"}],[{"text":"for SOTA LLMs, revealing model breakdowns manifested in low correct response rates or strong fluctuations across problem structure preserving variations.","element":"span"}],[{"text":"We also have seen that measuring overall correct response rates averaged across AIW variations (Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-36","text":"8, ","element":"a"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"), ","element":"span"},{"href":"#id-56","text":"13 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":")), and looking at shape of correct response rate distribution across AIW variations (Fig. ","element":"span"},{"href":"#id-32","text":"4, ","element":"a"},{"href":"#id-49","text":"11, ","element":"a"},{"href":"#id-51","text":"12 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"), ","element":"span"},{"href":"#id-56","text":"13 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"), ","element":"span"},{"href":"#id-58","text":"21) ","element":"a"},{"text":"can provide ordering of models with regard to their capability to handle AIW problems. 
Here we demonstrate on selected scenarios that measuring capability to handle simple AIW problems & their variations can provide model comparison and model ranking that might reflect true model generalization and problem solving capabilities better than standardized benchmarks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Comparing models of various scales. ","element":"span"},{"text":"Similar to model comparison between o1-preview and o1-mini (Fig. ","element":"span"},{"href":"#id-58","text":"21)","element":"a"},{"text":", where we falsified the hypothesis that the smaller scale model is equal or even superior to its larger counterpart in problem solving, we continue with establishing model comparison using AIW problems.","element":"span"}],[{"text":"As a first example, we examine GPT-4o/GPT-4o-mini and Claude 3 Opus/Claude 3 Sonnet, which are also pairs of larger and smaller scale closed model counterparts. We take results from the","element":"span"}],[{"id":"id-58","style":{"width":"100%"},"width":1588,"height":1030,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/24-0.png","element":"img"}],[{"text":"experiments conducted with AIW original, AIW Ext and AIW Friends and plot both distributions of correct response rates for variations 1-4 of each corresponding AIW problem template (Fig. ","element":"span"},{"href":"#id-41","text":"22 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"C","element":"span"},{"text":")), together with correct response rates averaged across variations for each problem (Fig. 
","element":"span"},{"href":"#id-41","text":"22 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"a","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"b","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"c","element":"span"},{"text":")) and overall correct response rates averaged across all problem versions (Fig. ","element":"span"},{"href":"#id-41","text":"22 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"D","element":"span"},{"text":")). Similar to model comparison via standardized benchmarks, we observe larger scale GPT-4o/Claude 3 Opus ranking higher than their smaller scale counterparts GPT-4o-mini/Claude 3 Sonnet. In contrast to standardized benchmarks, there is clear breakdown of smaller scale GPT-4o-mini/Claude 3 Sonnet (below ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1","element":"span"},{"text":") if measuring correct response rates averaged across all problems and variations (Fig. ","element":"span"},{"href":"#id-41","text":"22 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"D","element":"span"},{"text":")), and gap to their larger scale counterparts GPT-4o/Claude 3 Opus (above ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3","element":"span"},{"text":") is strongly pronounced. As a comparison to MMLU as standardized benchmark, GPT-4o/GPT-4o-mini with 0.887/0.82 and Claude 3 Opus/Claude 3 Sonnet with 0.882/0.815 exhibit much smaller gap and are all settled at higher performance levels. 
Importantly, as the considered AIW problems are simple, the strong breakdown of smaller scale models and their wide gap to larger scale counterparts cannot be explained by increasing problem difficulty compared to standardized benchmarks. As evident from fluctuations shown in Fig. ","element":"span"},{"href":"#id-41","text":"22, ","element":"a"},{"text":"models show varying degrees of robustness, both to variations of the same problem template and to problem versions. Only if a model shows sufficient robustness to variations can it exhibit higher overall correct response rates accumulated over variations and problem versions. Standardized benchmarks do not make use of such controlled problem variations, and are thus not sensitive to lack of model robustness, which is picked up by testing on AIW problems, despite their low difficulty level.","element":"span"}],[{"text":"As another example, we compare open weights model families Qwen 2.5 (Fig. ","element":"span"},{"href":"#id-42","text":"24) ","element":"a"},{"text":"and Llama 3.1 (Fig. ","element":"span"},{"href":"#id-41","text":"23)","element":"a"},{"text":", which offer a broad range of pre-training scales (Qwen 2.5 1.5B-72B and Llama 3.1 8B-405B model scales). From results obtained from testing on AIW original and AIW Ext, we observe the expected dependence of performance on pre-training scale. In line with previous observations, small scale models show strong breakdown with correct response rates close to 0 across variations. With larger scale, models exhibit higher correct response rates, while showing strong performance fluctuations across problem structure & difficulty preserving variations. We again see strong performance fluctuation also across problem versions, for instance for Llama 3.1 405B having much higher correct response rate for AIW original compared to AIW Ext (Fig. ","element":"span"},{"href":"#id-41","text":"23)","element":"a"},{"text":", although both problems have highly similar structure. 
Similar to our observations on Claude 3.5 Sonnet (Fig. ","element":"span"},{"href":"#id-53","text":"18)","element":"a"},{"text":", this might again be a case of training data containing AIW original instances, with strongly diminished performance on AIW Ext pointing to generalization deficits.","element":"span"}],[{"id":"id-41","style":{"width":"100%"},"width":1588,"height":2198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/25-0.png","element":"img"}],[{"text":"Figure 23: Model comparison and ranking, example of various scales of Llama 3.1 family. Models were tested on (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") AIW, (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") AIW Ext and (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":") AIW Friends. The effect of scale is evident when comparing the distribution of correct response rates across problem variations and the correct response rate averaged across all problem versions in (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":"), hinting at the advantage of larger scale pre-training. Lack of model robustness is evident from the strong collapse models suffer on AIW Friends, which is structurally similar to AIW and AIW Ext.","element":"figcaption","subtype":"caption"}],[{"id":"id-42","style":{"width":"100%"},"width":1589,"height":1453,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/26-0.png","element":"img"}],[{"text":"Figure 25: Model comparison and ranking on smaller scales using AIW Light problems, using the example of smaller scale Qwen 2.5 and Llama 3.1 models. 
Models were tested on (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") AIW Light Family, (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") AIW Light Arithmetic Siblings and (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":") AIW Arithmetic Total Girls. While smaller scale models lose signal on AIW, AIW Light makes it possible to see the effect of scale also at smaller scales. Comparing the distribution of correct response rates across problem variations and the correct response rate averaged across all problem versions in (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":"), the advantage of larger scale pre-training is visible - higher average correct response rates and less pronounced fluctuations.","element":"figcaption","subtype":"caption"}],[{"text":"For smaller scale models below 8B, the breakdown on AIW original and AIW Ext is so strong that correct response rates do not reflect differences in scale, with most models at such small scales being close to 0. We can still measure differences on smaller scales by using AIW Light control problems, which are even simpler than the already simple AIW original and AIW Ext. Using variations of problem templates of various difficulty levels, with the upper level still being simple, we can thus compare models both at smaller scales and larger scales, obtaining model ranking without losing signal and differentiation at poorer or higher capability levels (Fig. ","element":"span"},{"href":"#id-42","text":"25)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A unified score for measuring model robustness. 
","element":"span"},{"text":"Correct response rates averaged across variations of an AIW problem or averaged across various AIW problems with all their variations alone do not reveal the degree of fluctuations present across variations, which is an important signature of model robustness or lack thereof, reflecting its generalization capability. Using average correct response rate alone can also hide the existing fluctuations. For instance, a model having very high correct response rate on only a few problem variations, with the others being close to 0, thus showing strong fluctuation, and a model having lower, but uniformly distributed correct response rates, showing no fluctuations, can show the same overall correct response rate if averaged across variations.","element":"span"}],[{"id":"id-38","style":{"width":"100%"},"width":1591,"height":1626,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/27-0.png","element":"img"}],[{"text":"small constant to avoid division by zero), then computing the standard entropy re-scaled by the log of the number of variations, ","element":"span"},{"style":{"height":23.51},"width":489.11,"height":58.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/27-1.png","element":"img","alt":"H(X) = −(1/log K) Σ_{i=1}^{K} x_i log x_i","inline":true},{"text":". We also compute the average of correct response rates ","element":"span"},{"style":{"height":21.11},"width":233.08,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/27-2.png","element":"img","alt":"p̂ = (1/K) Σ_{i=1}^{K} p_i","inline":true,"padRight":true},{"text":"and obtain ","element":"span"},{"style":{"height":16},"width":305.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/27-3.png","element":"img","alt":"R = H(X) · p̂. 
R","inline":true,"padRight":true},{"text":"is close to 0 for models that either have overall low ","element":"span"},{"text":"correct response rates or high peaks at few variations (models that have generalization deficits), while it is close to 1 for models that have uniformly high correct response rates across all variations (models with high robustness and strong generalization).","element":"span"}],[{"text":"We show how unified score ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"can be helpful in providing a more accurate view on model robustness and model ranking than one based on average correct response rate and its variance alone in Fig. ","element":"span"},{"href":"#id-38","text":"26. ","element":"a"},{"text":"As evident from Fig. ","element":"span"},{"href":"#id-38","text":"26 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"D","element":"span"},{"text":"), average correct response rate ranks Llama 2 70B higher than Llama 3.1 70B. This score is however driven by only one outlier variation, where the model manages to show high performance, while other variations have either 0 or close to 0 correct response rate, as visible in the full correct response rate distribution across variations in Fig. ","element":"span"},{"href":"#id-38","text":"26 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":"), (","element":"span"},{"style":{"fontWeight":"bold"},"text":"C","element":"span"},{"text":"). The ranking via correct response rate averaged across variations thus does not reflect this severe lack of model robustness. Ranking via unified robustness score ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"on the other hand drops Llama 2 70B to the same level as Llama 2 7B (Fig. 
","element":"span"},{"href":"#id-38","text":"26 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":")), punishing its strong fluctuations. Llama 3.1 70B is ranked higher, reflecting its more consistent behavior - either low correct rates across all variations, or increased correct response rates across consistently many variations, which indicates better robustness than Llama 2 70B. This advantage is not visible in Fig. ","element":"span"},{"href":"#id-38","text":"26 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"D","element":"span"},{"text":") - when measuring average correct response rates only, Llama 3.1 70B falls behind less robust Llama 2 70B. Combining both rankings allows thus to avoid preferring less robust models over the more robust ones only due to extreme high performance outliers exhibited only on few variations out of many.","element":"span"}],[{"id":"id-59","style":{"width":"100%"},"width":1587,"height":689,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/28-0.png","element":"img"}],[{"text":"In Fig. ","element":"span"},{"href":"#id-59","text":"27 ","element":"a"},{"text":"we show the overview for model comparison and ranking using both average correct response rates and unified robustness scores. We use set of all AIW problems presented here so far (AIW Ext, AIW Friends, AIW Plus, AIW Circles Colleagues) with exception of AIW original, as we see evidence of data leakage for more recent models like Claude 3.5 Sonnet (Sec. ","element":"span"},{"href":"#id-60","text":"3.4.1, ","element":"a"},{"text":"Fig. ","element":"span"},{"href":"#id-53","text":"18) ","element":"a"},{"text":"or Llama 3.1 405B (Sec. ","element":"span"},{"href":"#id-40","text":"3.4.2, ","element":"a"},{"text":"Fig. ","element":"span"},{"href":"#id-41","text":"23)","element":"a"},{"text":", and would like to avoid giving those models unfair advantage. 
Both measures deliver the same picture for model ranking. We see that with the exception of o1-preview (","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"9","element":"span"},{"text":"), all models exhibit strong lack of robustness, staying well below ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5","element":"span"},{"text":". We indicate in Fig. ","element":"span"},{"href":"#id-59","text":"27 ","element":"a"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"B","element":"span"},{"text":") a possible separation into performance classes corresponding to high (o1-preview), mid (Claude 3.5 Sonnet, o1-mini - which settle in the lower range of the mid region), low (Claude 3 Opus, DeepSeek v3, Llama 405B, Qwen 2.5 72B, GPT-4o) and very low (Llama 3.1 70B, Qwen 2.5 32B, GPT-4o-mini and other smaller scale models) levels, which hold both with respect to robustness and average correct response rates.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model comparison: advanced reasoning models","element":"span"}],[{"text":"Recent works have demonstrated how open-weights reasoning models can be created by using open-weights LLMs as a base. Models like DeepSeek R1 [","element":"span"},{"href":"#id-61","referenceIndex":29,"text":"29","element":"a"},{"text":"] were created from base LLM DeepSeek v3, using simple reinforcement learning based losses (Group Relative Policy Optimization, GRPO) and conventional supervised fine-tuning (SFT), employing as training data real and synthetically generated reasoning traces in multiple training stages. 
Further models were distilled using reasoning traces, either synthetically generated from DeepSeek R1 (S1.1 32B [","element":"span"},{"href":"#id-62","referenceIndex":30,"text":"30","element":"a"},{"text":"], OpenThinker 32B [","element":"span"},{"href":"#id-63","referenceIndex":31,"text":"31","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":32,"text":"32","element":"a"},{"text":"], based on Qwen 2.5 32B), DeepSeek R1 Zero (R1-Distilled-Qwen-32b and R1-Distilled-Llama-70b [","element":"span"},{"href":"#id-61","referenceIndex":29,"text":"29","element":"a"},{"text":"]) or curated from mix of real and synthetic reasoning data (LIMO-32B [","element":"span"},{"href":"#id-65","referenceIndex":33,"text":"33","element":"a"},{"text":"]). These reasoning models show strong increase compared to conventional LLMs in scores on standardized benchmarks related to problem solving at olympiad or graduate difficulty levels (AIME24 ","element":"span"},{"href":"#id-66","referenceIndex":34,"text":"[34]","element":"a"},{"text":", MATH500 ","element":"span"},{"href":"#id-67","referenceIndex":35,"text":"[35]","element":"a"},{"text":", GPQA Diamond ","element":"span"},{"href":"#id-68","referenceIndex":36,"text":"[36]","element":"a"},{"text":").","element":"span"}],[{"text":"We measure the robustness and generalization performance of these models using AIW problems, to test the strong claims behind the high scores achieved on reasoning benchmarks (see also Sec. ","element":"span"},{"href":"#id-60","text":"3.4.1) ","element":"a"},{"text":"and check how these models compare to conventional LLMs. To reduce probability of test data leakage (see also Sec. ","element":"span"},{"href":"#id-60","text":"3.4.1, ","element":"a"},{"text":"Fig. ","element":"span"},{"href":"#id-53","text":"18 ","element":"a"},{"text":"and Sec. ","element":"span"},{"href":"#id-40","text":"3.4.2, ","element":"a"},{"text":"Fig. 
","element":"span"},{"href":"#id-41","text":"23)","element":"a"},{"text":", we exclude AIW original and AIW ext from AIW test set for the experiments, as reasoning models released very recently might have been exposed to data from those or very similar problems due to their public availability. We take thus AIW Friends, AIW Plus and AIW Circles Colleagues as problem test set.","element":"span"}],[{"text":"As evident from Fig. ","element":"span"},{"href":"#id-69","text":"28, ","element":"a"},{"href":"#id-70","text":"29, ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-71","text":"30, ","element":"a"},{"text":"despite their high scores on standardized reasoning benchmarks, most reasoning models still suffer from strong fluctuations across AIW problem variations. Notable exception is again o1-preview, although also this model exhibit significant fluctuations and decreased average correct response rates on AIW Circles Colleagues. This shows that claims of robust problem solving of olympiad or graduate level as signalled by strong performance on reasoning benchmarks are not sustainable, as AIW problems where models show lack of robustness are far below these levels. On the other hand, reasoning models indeed strongly improve in robustness and average correct response rates compared to tested conventional SOTA LLMs as evident from Fig ","element":"span"},{"href":"#id-72","text":"31. 
","element":"a"},{"text":"Remarkable is","element":"span"}],[{"id":"id-69","style":{"width":"96%"},"width":1535,"height":646,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/29-0.png","element":"img"}],[{"text":"$40","element":"figcaption","subtype":"caption"}],[{"id":"id-70","style":{"width":"99%"},"width":1584,"height":668,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/29-1.png","element":"img"}],[{"text":"Figure 29: Strong fluctuations of correct response rates on variations of AIW Plus problem (a color per each variation 1-6) exhibited by recent reasoning models. With exception of o1-preview and o3-mini, most reasoning models that show strong performance on standardized reasoning benchmarks reveal inability to cope with the problem robustly when facing slight variations in problem template. Eg, correct response rate drops for R1-Llama 70B from close to 1.0 on variation 1 or 6 below 0.4 on variation 4, despite differences between the those being just instantiated numbers. o1-mini undergoes strong overall collapse. Distilled reasoning models (S1.1 32B, LIMO 32B, OpenThinker-Qwen-32B) perform on par with DeepSeek-R1 or outmatch it, despite using for distillation SFT only. Distilled models at larger scales (32B, 70B) perform significantly better than smaller scale 7B/8B models.","element":"figcaption","subtype":"caption"}],[{"text":"that this improvement in robustness is also achieved by distilled reasoning models that use only SFT on reasoning traces in single stage post-training. 
Models like R1-Distilled-Llama-70B [","element":"span"},{"href":"#id-61","referenceIndex":29,"text":"29","element":"a"},{"text":"] (distilled on 800k closed data), OpenThinker-32B [","element":"span"},{"href":"#id-63","referenceIndex":31,"text":"31","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":32,"text":"32","element":"a"},{"text":"] (distilled on OpenThoughts-114k, 3 epochs), S1.1-32B [","element":"span"},{"href":"#id-62","referenceIndex":30,"text":"30","element":"a"},{"text":"] (distilled on 1k DeepSeek-R1 data), LIMO-32B [","element":"span"},{"href":"#id-65","referenceIndex":33,"text":"33","element":"a"},{"text":"] (distilled on 0.8k samples mix from","element":"span"}],[{"id":"id-71","style":{"width":"98%"},"width":1561,"height":644,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/30-0.png","element":"img"}],[{"text":"$41","element":"figcaption","subtype":"caption"}],[{"text":"real and DeepSeek-R1 generated data), having either Llama 3.3 or Qwen 2.5 as a base, strongly increase their performance compared to their base instruct models, reaching levels comparable with DeepSeek R1 671B, which uses both SFT and RL during multi stage training, and also outperforming closed reasoning models like o1-mini. We observe remarkable similarity in the distribution shape of correct response rates between larger scale SFT only distilled models and DeepSeek-R1 or o1-mini and o3-mini that use RL. Further, as seen in Fig. ","element":"span"},{"href":"#id-72","text":"31, ","element":"a"},{"text":"OpenThinker-Unverified-32B also manages to gain a substantial boost despite using unverified traces for distillation (in contrast to OpenThinker-Verified-32B and other distilled reasoning models). 
This provides a hint that also unverified reasoning traces might be already sufficient to induce strong performance boost in problem solving compared to standard SOTA LLMs, while verified traces add further boost.","element":"span"}],[{"style":{"width":"39%"},"width":627,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/30-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Measuring LLMs capabilities. ","element":"span"},{"text":"Since the seminal breakthroughs in language modelling [","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"4","element":"a"},{"text":"], measuring LLM capabilities became indispensable for evaluations and model comparison. To measure how well a language model performs on reasoning, there exists a plethora of different standardized reasoning benchmarks. These benchmarks can be roughly divided into categories by what exact reasoning capability we want to test such as ARC [","element":"span"},{"href":"#id-73","referenceIndex":37,"text":"37","element":"a"},{"text":"], PIQA [","element":"span"},{"href":"#id-74","referenceIndex":38,"text":"38","element":"a"},{"text":"], GSM8K [","element":"span"},{"href":"#id-75","referenceIndex":39,"text":"39","element":"a"},{"text":"], HellaSwag [","element":"span"},{"href":"#id-76","referenceIndex":40,"text":"40","element":"a"},{"text":"], MMLU [","element":"span"},{"href":"#id-77","text":"41","element":"a"},{"text":"] or WinoGrande [","element":"span"},{"href":"#id-77","text":"42","element":"a"},{"text":"]. 
Multiple works aim at improving the reasoning performance of LLMs as measured by those standardized benchmarks in various ways [","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"25","element":"a"},{"text":", ","element":"span"},{"href":"#id-77","text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":46,"text":"44","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":46,"text":"45","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":46,"text":"46","element":"a"},{"text":"]. Recently, increasingly challenging benchmarks have been developed, aimed at testing olympiad and graduate level problem solving, for instance AIME24/AIME25 [","element":"span"},{"href":"#id-66","referenceIndex":34,"text":"34","element":"a"},{"text":"], MATH500 [","element":"span"},{"href":"#id-67","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-79","referenceIndex":47,"text":"47","element":"a"},{"text":"], or GPQA Diamond [","element":"span"},{"href":"#id-68","referenceIndex":36,"text":"36","element":"a"},{"text":"], which contain problems of very high difficulty and are also used to test recent reasoning models ","element":"span"},{"href":"#id-61","referenceIndex":29,"text":"[29, ","element":"a"},{"href":"#id-62","referenceIndex":30,"text":"30, ","element":"a"},{"href":"#id-63","referenceIndex":31,"text":"31, ","element":"a"},{"href":"#id-65","referenceIndex":33,"text":"33]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Finding weak spots in LLMs capabilities. 
","element":"span"},{"text":"Paralleling the impressive progress shown by LLM research, cautious voices have raised concern about the discrepancy between claimed capabilities as measured by standardized benchmarks and true model reasoning skills by presenting carefully selected evidence for model failures [","element":"span"},{"href":"#id-7","referenceIndex":12,"text":"12","element":"a"},{"text":"]. A number of works investigated test data leakage into the training set as","element":"span"}],[{"id":"id-72","style":{"width":"99%"},"width":1584,"height":393,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/31-0.png","element":"img"}],[{"text":"Figure 31: Comparing and ranking recent reasoning models on the AIW test set (AIW Friends, AIW Plus, AIW Circles Colleagues) using average correct response rates (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") and unified robustness score ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"R ","element":"figcaption","subtype":"caption"},{"text":"(","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":"). 
Evident is a strong performance boost for distilled larger scale (32B, 70B) reasoning models (with exception of R1-Qwen-32B), settling in mid range ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"4 ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"< p, ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"R ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"< ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"6 ","element":"figcaption","subtype":"caption"},{"text":"for average correct response rates ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"and robustness ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"R","element":"figcaption","subtype":"caption"},{"text":", leaving standard SOTA LLMs far behind. Standard SOTA LLMs and smaller scale distilled reasoning models reside below ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"2","element":"figcaption","subtype":"caption"},{"text":". Larger scale distilled reasoning models match or outperform DeepSeek-R1 and o1-mini. This provides evidence that using only SFT on reasoning traces for the fine-tuning can be sufficient to reach performance level comparable with methods that employ RL. 
Interestingly, OpenThinker-Unverified-32B, trained only on unverified traces, while having lower performance than OpenThinker-Verified-32B, also manages to gain substantial boost, hinting that also unverified reasoning traces might have strong value as reasoning boosting data.","element":"figcaption","subtype":"caption"}],[{"text":"possible cause of observed discrepancy between claimed and actual capabilities, pointing out high test dataset contamination due to large-scale pre-training on web-scale data [","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"14","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":15,"text":"15","element":"a"},{"text":"]. In response, the research community has been undertaking attempts to create benchmarks like HELM [","element":"span"},{"href":"#id-80","referenceIndex":48,"text":"48","element":"a"},{"text":"], BIGbench [","element":"span"},{"href":"#id-81","referenceIndex":49,"text":"49","element":"a"},{"text":"] or LiveCodeBench [","element":"span"},{"href":"#id-82","referenceIndex":50,"text":"50","element":"a"},{"text":"] that are supposed to circumvent data leakage issues and properly test generalization capabilities beyond memorization.","element":"span"}],[{"text":"Multiple studies [","element":"span"},{"href":"#id-8","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":51,"text":"51","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":52,"text":"52","element":"a"},{"text":", ","element":"span"},{"href":"#id-85","referenceIndex":53,"text":"53","element":"a"},{"text":", ","element":"span"},{"href":"#id-86","referenceIndex":54,"text":"54","element":"a"},{"text":", ","element":"span"},{"href":"#id-87","referenceIndex":55,"text":"55","element":"a"},{"text":"] have shown breakdowns of language models reasoning capabilities in different scenarios and also lack of robustness to variation of problem formulation 
[","element":"span"},{"href":"#id-88","referenceIndex":56,"text":"56","element":"a"},{"text":", ","element":"span"},{"href":"#id-89","referenceIndex":57,"text":"57","element":"a"},{"text":"]. Other works looked into particular reasoning failures like deficits in causality inference [","element":"span"},{"href":"#id-90","referenceIndex":58,"text":"58","element":"a"},{"text":", ","element":"span"},{"href":"#id-91","referenceIndex":59,"text":"59","element":"a"},{"text":"]. These works often operate with formalized problems of high difficulty that do not have a simple common-sense character expressible in natural language. Similar in spirit to our work, simple math word problems were used in [","element":"span"},{"href":"#id-92","referenceIndex":60,"text":"60","element":"a"},{"text":"] to show model breakdowns, but models at the current frontier that claim strong generalization and advanced capabilities were not tested. Recently, Mirzadeh et al. (2024) made in their work [","element":"span"},{"href":"#id-93","referenceIndex":61,"text":"61","element":"a"},{"text":"] use of a similar approach to create variations from templates of GSM8K problems. This work, however, does not provide any conclusive evidence of function breakdown on simple problems for strong models, in fact showing evidence that models like GPT-4o or Llama 3 8B can handle those well (Fig. ","element":"span"},{"href":"#id-94","text":"32)","element":"a"},{"text":". In contrast, we are able to measure lack of model robustness and strong fluctuations also for SOTA LLMs and advanced reasoning models pretrained at the largest scales. A key limitation of our current approach is the lack of sufficient diversity and number of AIW problem variations. 
This can be addressed in future work by systematic procedural instance generation for evaluation of model behavior on more diverse problem set with larger variation number.","element":"span"}]]},{"heading":"5 Discussion & Conclusion","paragraphs":[[{"text":"In our work, using a simple AIW problem (Sec. ","element":"span"},{"text":"2) ","element":"span"},{"text":"that requires only elementary set and arithmetic operations and can be easily solved by adults and arguably even children, we observe a striking breakdown of SOTA LLMs performance when confronted with the AIW problem and its variations (Sec. ","element":"span"},{"href":"#id-23","text":"2.1, ","element":"a"},{"text":"Suppl. Tab. ","element":"span"},{"href":"#id-4","text":"2)","element":"a"},{"text":". The breakdown is manifested in ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(i) ","element":"span"},{"text":"Low average correct response rates (Fig. ","element":"span"},{"href":"#id-30","text":"3) ","element":"a"},{"text":"and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(ii) ","element":"span"},{"text":"Strong performance fluctuation across structure and difficulty preserving natural variations of the same problem, which hints at fundamental issues with the generalization capability of the models (Fig. ","element":"span"},{"href":"#id-32","text":"4)","element":"a"},{"text":".","element":"span"}],[{"text":"The observed breakdown is in strong contrast with claims about SOTA LLMs being robust solvers of higher difficulty problems at graduate or olympiad level. Any system claiming even basic robust reasoning should be able to obtain high correct response rates close to ","element":"span"},{"text":"100% ","element":"span"},{"text":"on problems as simple as AIW across all its variations. 
The obtained evidence also falsifies the claim of strong zero-shot","element":"span"}],[{"id":"id-94","style":{"width":"99%"},"width":1582,"height":855,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/32-0.png","element":"img"}],[{"text":"Figure 32: Both AIW (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") and GSM-Symbolic (GSM-S) (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") use variations in problem templates to measure sensitivity of model performance to variations and draw conclusions about generalization and reasoning capabilities. AIW (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"C","element":"figcaption","subtype":"caption"},{"text":") provides strong evidence for generalization deficits by observing 1) strong fluctuations across variations of the simple AIW problem (a color for each AIW variation 1-4) and 2) low average correct response rates for most models, e.g., Llama 3 8B on the right. This provides a convincing falsification of the strong function hypothesis. In contrast, GSM-S (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":"), while using more sophisticated and content overloaded (e.g., see (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":")) problems, cannot offer such conclusive evidence. The observed fluctuations are weak (e.g., 0.07 vs 0.67 on AIW for GPT-4o), and average performance is high (e.g., > 0.69 vs < 0.15 on AIW for Llama 3 8B), while GSM8k falls well within the measured response distribution. While AIW variations thus reveal GPT-4o and Llama-3 8B generalization deficits, these deficits stay hidden on GSM-S. 
We hypothesize that the inability to observe strong fluctuations on GSM-S might be due to leakage of GSM8k-type problems into training data. Fig. (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") and (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"D","element":"figcaption","subtype":"caption"},{"text":") are adapted from ","element":"figcaption","subtype":"caption"},{"href":"#id-93","referenceIndex":61,"text":"[61]","element":"a","subtype":"caption"}],[{"text":"generalization - strong performance fluctuations across variations that keep problem structure and difficulty unchanged reveal severe generalization deficits in all tested SOTA LLMs (Fig. ","element":"span"},{"href":"#id-32","text":"4, ","element":"a"},{"href":"#id-49","text":"11, ","element":"a"},{"href":"#id-51","text":"12, ","element":"a"},{"href":"#id-56","text":"13, ","element":"a"},{"href":"#id-69","text":"28, ","element":"a"},{"href":"#id-70","text":"29, ","element":"a"},{"href":"#id-71","text":"30)","element":"a"},{"text":". Given the simplicity of the AIW problem, one can speak of a complete reasoning breakdown in most of the tested models, as the observed strong fluctuations clearly prove the inability to robustly infer the simple common problem structure behind all AIW variations.","element":"span"}],[{"text":"By executing experiments on control AIW Light problems, we provide evidence that the observed failures are not rooted in low-level issues like an inability to parse crucial problem-related information from natural language or to execute the elementary arithmetic operations required to solve them (Sec. ","element":"span"},{"href":"#id-44","text":"3.1.1)","element":"a"},{"text":". We also confirm similar breakdown patterns in further AIW problem versions like AIW Ext, AIW Friends, AIW+ and AIW Circles Colleagues (Fig. 
","element":"span"},{"href":"#id-49","text":"11, ","element":"a"},{"href":"#id-51","text":"12, ","element":"a"},{"href":"#id-56","text":"13, ","element":"a"},{"href":"#id-69","text":"28, ","element":"a"},{"href":"#id-70","text":"29, ","element":"a"},{"href":"#id-71","text":"30)","element":"a"},{"text":", corroborating the hypothesis that the observed deficits in generalization and basic reasoning are generic and not specific to a particular problem type. Our study also clearly points to the failure of standardized benchmarks to properly measure core model functionality such as generalization or reasoning (Fig. ","element":"span"},{"href":"#id-36","text":"8, ","element":"a"},{"text":"Suppl. Sec. ","element":"span"},{"href":"#id-37","text":"C.1, ","element":"a"},{"text":"Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". Standardized benchmarks assigning high scores to SOTA LLMs fail to reveal severe model weaknesses made evident by the breakdown on the simple AIW problem.","element":"span"}],[{"text":"It has to be noted that despite the observed breakdown with low average correct response rates and strong fluctuations, the reasoning is not entirely absent. Better performing larger scale models like GPT-4/4o, Claude 3 Opus, and Claude 3.5 Sonnet do show examples of fully correct reasoning (see Suppl. Fig. ","element":"span"},{"href":"#id-95","text":"47, ","element":"a"},{"href":"#id-96","text":"48)","element":"a"},{"text":". As our results show, this reasoning capability is, however, fragile and cannot be accessed robustly, even in such a simple scenario as posed by AIW problem variations. The same holds for the very recent advanced reasoning models like open-weights DeepSeek R1 671B or closed o1-preview, o3-mini and o1-mini. While they do show clear improvement over standard SOTA LLMs, strong fluctuations on problem structure and difficulty preserving variations still appear (Fig. 
","element":"span"},{"href":"#id-58","text":"21, ","element":"a"},{"href":"#id-69","text":"28, ","element":"a"},{"href":"#id-70","text":"29, ","element":"a"},{"href":"#id-71","text":"30)","element":"a"},{"text":", again debunking claims of robust graduate or olympiad level problem solving put forward for these models based on standardized reasoning benchmarks (AIME24, MATH500, GPQA Diamond).","element":"span"}],[{"text":"Our findings raise the question of how it is possible that standardized benchmarks, which contain much harder problems than AIW, attest seemingly strong function capabilities to the tested models, while testing via simple AIW problems reveals such a clear breakdown. In line with previous works [","element":"span"},{"href":"#id-9","referenceIndex":14,"text":"14","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-97","referenceIndex":62,"text":"62","element":"a"},{"text":"], we hypothesize that testing on standardized benchmarks suffers either from direct benchmark test data leakage or from training on test tasks that slip into training and post-training datasets of models created after benchmark appearance. Our experiments confirm such a scenario (Fig. ","element":"span"},{"href":"#id-57","text":"19)","element":"a"},{"text":". Simple supervised fine-tuning (SFT) on a set of AIW problems is sufficient to strongly increase the performance of Llama 3.1 8B from very poor to very high across variations of the same problem as used in fine-tuning. The fine-tuned model, however, undergoes severe collapse when moving to closely related but slightly different AIW problem versions. A similar phenomenon is observed when testing Claude 3.5 Sonnet (Fig. ","element":"span"},{"href":"#id-53","text":"18) ","element":"a"},{"text":"or Llama 3.1 405B (Fig. 
","element":"span"},{"href":"#id-41","text":"23) ","element":"a"},{"text":"which appeared after public data on AIW problems was released: both perform very strongly on one set of AIW problems while collapsing on closely related versions. It thus becomes clear again that, in order to properly measure generalization in models trained on large web-scale data, it is crucial to design benchmarks that are immune at least to direct test data leakage.","element":"span"}],[{"text":"The appraisal of the smaller scale models (e.g. Mistral-7B; Llama 2/3 7/8B; Qwen 2.5 1.5B, 3B, 7B) might thus also be based to a large extent on an illusion of strong function mediated by standardized benchmarks that suffer from test data leakage. We observe a severe breakdown for smaller scale models on AIW, with a clear large gap to better performing models that all reside at larger scales (Fig. ","element":"span"},{"href":"#id-30","text":"3, ","element":"a"},{"href":"#id-41","text":"23 ","element":"a"},{"text":", Fig. ","element":"span"},{"href":"#id-42","text":"24, ","element":"a"},{"text":"Fig. ","element":"span"},{"href":"#id-41","text":"23, ","element":"a"},{"text":"Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33, ","element":"a"},{"text":"Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". 
We hypothesize here that the claimed strong functions of smaller scale models might be a mere illusion corroborated by broken benchmarks that in their current state cannot offer a proper model comparison and thus also cannot be used as downstream tasks for measuring important scaling laws.","element":"span"}],[{"text":"$42","element":"span"},{"href":"#id-98","referenceIndex":63,"text":"63","element":"a"},{"text":"] and DCLM-baselines [","element":"span"},{"href":"#id-99","referenceIndex":64,"text":"64","element":"a"},{"text":"] for encouraging examples of such open dataset composition pipelines), as otherwise the claims about generalization and reasoning capabilities will stay unsubstantiated. We show here a way to debunk such claims for closed models using AIW problems. This, however, still relies on creating a problem set that is not yet a part of the training data, which is hard to ensure if the training dataset is unknown.","element":"span"}],[{"text":"Relying on claims backed up by standardized benchmarks, it is a commonly held position to attribute to SOTA LLMs advanced functions like zero-shot reasoning [","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"7","element":"a"},{"text":"], and in general to put high expectations of strong core functionality on released SOTA LLMs [","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"11","element":"a"},{"text":"]. 
Such claims extend beyond basic research artifacts and become pervasive in applied industry, where SOTA LLMs are advertised as robust problem solvers for various real world settings, explicitly emphasizing their value as robust reasoners, coders and problem solvers, attesting \"key business-critical capabilities\" or suitability for \"real-world enterprise use cases\" (see announcements by Cohere on Command R-Plus [","element":"span"},{"href":"#id-100","referenceIndex":65,"text":"65","element":"a"},{"text":", ","element":"span"},{"href":"#id-101","referenceIndex":66,"text":"66","element":"a"},{"text":"], or by Mosaic on DBRX [","element":"span"},{"href":"#id-102","referenceIndex":67,"text":"67","element":"a"},{"text":"], as only a few selected representative examples out of many). Our study provides clear evidence for model breakdown on simple problems and, importantly, for a lack of model robustness in the face of natural problem variations that do not change the problem structure or its difficulty.","element":"span"}],[{"text":"The observed breakdown of model robustness, coupled with public claims based on standardized benchmarks, presents an inherent safety problem. Models with insufficient generalization and lack of robustness in basic reasoning are inherently unsafe. As we observe sensitivity to problem variations in such a simple scenario, it is highly likely that a similar lack of robustness will manifest on variations of real world problems to solve, resulting in unexpected performance changes. Those will be hard to detect for users if the problem solution is not as trivial to check as it is in the case of simple AIW problems. ","element":"span"},{"text":"It should thus be clear that SOTA LLMs cannot be trusted to deal robustly with real world tasks, or to robustly solve problems of graduate or olympiad level as claimed by high scores on benchmarks containing those. 
Performance on current standardized reasoning benchmarks creates an illusion of robust reasoning capabilities. Making matters worse, when making mistakes, models are overconfident: they insist that their wrong answers are correct and produce highly persuasive and suggestive explanations for their wrong responses, which might further obscure errors for end users due to partly plausible-sounding text (see Suppl. Sec. ","element":"span"},{"text":"F ","element":"span"},{"text":"for examples of such confabulations). To ensure safety, public claims about model function should be based only on those scientific evaluations that properly measure the models’ capabilities to generalize and reason, while basic research has to be guided by such benchmarks to endow future models with sufficient generalization and robust basic reasoning skills.","element":"span"}],[{"text":"$43","element":"span"},{"href":"#id-103","referenceIndex":68,"text":"68","element":"a"},{"text":"], attempting everything to break the model’s function, highlighting its deficits, and thus showing possible directions for model improvement, while also offering protection from overblown claims about the model’s capabilities.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"We would like to express gratitude to all the people who are working on making code, models and data publicly available, advancing community based research and making research more reproducible. 
Specifically, we would like to thank all the members of the LAION Discord server","element":"span"},{"text":"2 ","element":"span"},{"text":"community and Open-","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/34-0.png","element":"img","alt":"Ψ","inline":true,"padRight":true},{"text":"(Open-Sci) Collective","element":"span"},{"text":"3 ","element":"span"},{"text":"for providing fruitful ground for scientific exchange and open-source development.","element":"span"}],[{"text":"MN, LCK, MC and JJ acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 01IS22094B (WestAI - AI Service Center West), under grant no. 01IS24085C (OPENHAFM) and under grant no. 16HPC117K (MINERVA), as well as co-funding by the EU from the EuroHPC Joint Undertaking programme under grant no. 101182737 (MINERVA) and from the Digital Europe Programme under grant no. 101195233 (openEuroLLM).","element":"span"}],[{"text":"LCK acknowledges the Helmholtz Information & Data Science Academy (HIDA) for providing financial support enabling a short-term research stay at Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ) to conduct research on foundation models.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-0","text":"[1] ","element":"span"},{"text":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 8748–8763. PMLR, 2021.","element":"span"}],[{"id":"id-1","text":"[2] ","element":"span"},{"text":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 
Bert: Pre-training of deep bidirectional transformers for language understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1810.04805","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-2","text":"[3] ","element":"span"},{"text":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of machine learning research","element":"span"},{"text":", 21(140):1–67, 2020.","element":"span"}],[{"id":"id-3","style":{"width":"98%"},"width":1566,"height":1029,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/35-0.png","element":"img"}],[{"id":"id-5","text":"[10] ","element":"span"},{"text":"$44","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv","element":"span"},{"text":", 7 2023.","element":"span"}],[{"id":"id-6","text":"[11] ","element":"span"},{"text":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.06825","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-7","text":"[12] ","element":"span"},{"text":"Melanie Mitchell. How do we know how smart ai systems are? ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Science","element":"span"},{"text":", 381(6654):eadj5957, 2023.","element":"span"}],[{"id":"id-8","text":"[13] ","element":"span"},{"text":"Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv e-prints","element":"span"},{"text":", pages arXiv–2307, 2023.","element":"span"}],[{"id":"id-9","text":"[14] ","element":"span"},{"text":"Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.08493","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-10","text":"[15] ","element":"span"},{"text":"Changmao Li and Jeffrey Flanigan. Task contamination: Language models may not be few-shot anymore. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 38, pages 18471–18480, 2024.","element":"span"}],[{"id":"id-11","text":"[16] ","element":"span"},{"text":"Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-12","text":"[17] ","element":"span"},{"text":"Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2207.05221","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-13","text":"[18] ","element":"span"},{"text":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Eleventh International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-14","style":{"width":"100%"},"width":1586,"height":920,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/36-0.png","element":"img"}],[{"id":"id-16","text":"[26] ","element":"span"},{"text":"Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2302.11382","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-28","text":"[27] ","element":"span"},{"text":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. vllm, 2024.","element":"span"}],[{"id":"id-29","text":"[28] Lewis Carroll. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Through the Looking Glass","element":"span"},{"text":". Penguin Books, 1871.","element":"span"}],[{"id":"id-61","text":"[29] ","element":"span"},{"text":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.12948","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-62","text":"[30] ","element":"span"},{"text":"Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.19393","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-63","text":"[31] ","element":"span"},{"text":"OpenThoughts Team. OpenThinker 32B. https://huggingface.co/open-thoughts/OpenThinker-32B, January 2025.","element":"span"}],[{"id":"id-64","text":"[32] OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025.","element":"span"}],[{"id":"id-65","text":"[33] ","element":"span"},{"text":"Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.03387","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-66","text":"[34] Art of Problem Solving. American invitational mathematics examination (AIME).","element":"span"}],[{"id":"id-67","text":"[35] ","element":"span"},{"text":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2103.03874","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-68","text":"[36] ","element":"span"},{"text":"David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"First Conference on Language Modeling","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-73","text":"[37] ","element":"span"},{"text":"Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1803.05457","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-74","text":"[38] ","element":"span"},{"text":"Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI conference on artificial intelligence","element":"span"},{"text":", volume 34, pages 7432–7439, 2020.","element":"span"}],[{"id":"id-75","text":"[39] ","element":"span"},{"text":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2110.14168","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-76","text":"[40] ","element":"span"},{"text":"Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-77","style":{"width":"100%"},"width":1586,"height":665,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/37-0.png","element":"img"}],[{"id":"id-78","text":"[46] ","element":"span"},{"text":"Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.15758","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-79","text":"[47] ","element":"span"},{"text":"Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-80","text":"[48] ","element":"span"},{"text":"Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions on Machine Learning Research","element":"span"},{"text":", 2023. Featured Certification, Expert Certification.","element":"span"}],[{"id":"id-81","text":"[49] ","element":"span"},{"text":"$45","element":"span"}],[{"text":"$46","element":"span"}],[{"text":"Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions on Machine Learning Research","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-82","text":"[50] ","element":"span"},{"text":"Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirteenth International Conference on Learning Representations","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-83","text":"[51] ","element":"span"},{"text":"Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-84","text":"[52] ","element":"span"},{"text":"Martha Lewis and Melanie Mitchell. Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.08955","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-85","text":"[53] ","element":"span"},{"text":"Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on\" a is b\" fail to learn\" b is a\". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2309.12288","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-86","text":"[54] ","element":"span"},{"text":"Arsenii Kirillovich Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. 
The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions on machine learning research","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-87","text":"[55] ","element":"span"},{"text":"Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The 2023 Conference on Empirical Methods in Natural Language Processing","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-88","text":"[56] ","element":"span"},{"text":"Yongshuo Zong, Tingyang Yu, Ruchika Chavhan, Bingchen Zhao, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-first International Conference on Machine Learning","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-89","text":"[57] ","element":"span"},{"text":"Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-90","text":"[58] ","element":"span"},{"text":"Zhijing Jin, Jiarui Liu, LYU Zhiheng, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-91","text":"[59] ","element":"span"},{"text":"Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng LYU, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: A benchmark to assess causal reasoning capabilities of language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirtyseventh Conference on Neural Information Processing Systems","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-92","text":"[60] ","element":"span"},{"text":"Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","element":"span"},{"text":", pages 2080–2094, 2021.","element":"span"}],[{"id":"id-93","text":"[61] ","element":"span"},{"text":"Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2410.05229","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-97","text":"[62] ","element":"span"},{"text":"Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2407.07890","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-98","text":"[63] HuggingFace. 
Fineweb: decanting the web for the finest text data at scale.","element":"span"}],[{"id":"id-99","text":"[64] ","element":"span"},{"text":"Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 37:14200–14282, 2025.","element":"span"}],[{"id":"id-100","text":"[65] Cohere. Introducing command r+: A scalable llm built for business.","element":"span"}],[{"id":"id-101","text":"[66] Cohere. Command r+ documentation, 2024.","element":"span"}],[{"id":"id-102","text":"[67] Mosaic. Introducing dbrx: A new state-of-the-art open llm.","element":"span"}],[{"id":"id-103","text":"[68] Karl Raimund Popper. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The logic of scientific discovery","element":"span"},{"text":". 1934.","element":"span"}],[{"text":"[69] OpenAI. Gpt-4o model docs., 2024.","element":"span"}],[{"text":"[70] OpenAI. Announcement: Hello gpt-4o, 2024.","element":"span"}],[{"text":"[71] OpenAI. Introducing chatgpt, 11 2022.","element":"span"}],[{"text":"[72] OpenAI. Models - openai gpt 3.5 turbo docs.","element":"span"}],[{"text":"[73] OpenAI. Models - openai gpt 3.5 turbo update.","element":"span"}],[{"text":"[74] Anthropic. Claude 3.5 sonnet, 2024.","element":"span"}],[{"text":"[75] Anthropic. Introducing the next generation of claude, 2024.","element":"span"}],[{"text":"[76] ","element":"span"},{"text":"Sundar Pichai and Demis Hassabis. 
Introducing gemini: Google’s most capable ai model yet, 2023.","element":"span"}],[{"id":"id-108","style":{"width":"100%"},"width":1586,"height":1372,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/40-0.png","element":"img"}],[{"style":{"width":"24%"},"width":383,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/41-0.png","element":"img"}]]},{"heading":"A Additional details on performed experiments","paragraphs":[[{"text":"Here we give further details on the procedures around the executed experiments.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Models selected for experiments","element":"span"}],[{"text":"To provide overview over origin of core tested models used for the AIW experiments, we list those in Suppl Tab. ","element":"span"},{"href":"#id-24","text":"1. ","element":"a"},{"text":"All tested models use same default inference hyperparameters, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"top-p ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"0 ","element":"span"},{"text":"(we executed control experiments to check that various settings do not change the main pattern in the observed behavior). The output of standard LLMs was limited to ","element":"span"},{"text":"2048 ","element":"span"},{"text":"tokens, and as evident from Suppl Fig. ","element":"span"},{"href":"#id-104","text":"43, ","element":"a"},{"text":"most observed responses stayed well below this limit. 
When testing recent reasoning models, the limit was raised to 32k or 43k, depending on model type.","element":"span"}],[{"id":"id-24","text":"Table 1: Names, origin and versioning of core test models used in the experiments.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"114%"},"width":1816,"height":1243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/41-1.png","element":"img"}]]},{"heading":"B Problem versions, prompt types and variations","paragraphs":[[{"text":"Here we provide overview of all AIW problem versions as introduced in Sec. ","element":"span"},{"href":"#id-23","text":"2.1 ","element":"a"},{"text":"with their corresponding variations and prompt types used.","element":"span"}],[{"text":"For the full inputs provided to the tested models during experiments, see Suppl. Tab. ","element":"span"},{"href":"#id-4","text":"2 ","element":"a"},{"text":"for AIW problem, Suppl. Tab. ","element":"span"},{"href":"#id-18","text":"3 ","element":"a"},{"text":"(AIW Light Arithmetic Siblings), ","element":"span"},{"href":"#id-19","text":"4 ","element":"a"},{"text":"(AIW Light Family), ","element":"span"},{"href":"#id-20","text":"5 ","element":"a"},{"text":"(AIW Light Arithmetic Total Girls) for AIW Light control problems, Suppl. Tab. ","element":"span"},{"href":"#id-45","text":"6 ","element":"a"},{"text":"for AIW Alice Female Boost, Suppl. Tab. ","element":"span"},{"href":"#id-105","text":"7 ","element":"a"},{"text":"for AIW Exaggerated Numbers, Suppl. Tab. ","element":"span"},{"href":"#id-55","text":"8 ","element":"a"},{"text":"for AIW Extended (AIW Ext), Suppl. Tab. ","element":"span"},{"href":"#id-106","text":"9 ","element":"a"},{"text":"for AIW Friends, Suppl. Tab. ","element":"span"},{"href":"#id-50","text":"10 ","element":"a"},{"text":"for AIW Plus and Suppl. Tab. 
","element":"span"},{"href":"#id-107","text":"11 ","element":"a"},{"text":"for AIW Circles Colleagues.","element":"span"}],[{"text":"Table 2: AIW main variations 1-4, prompt types and correct answers overview.","element":"figcaption","subtype":"caption"}],[{"id":"id-4","style":{"width":"115%"},"width":1836,"height":2229,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/42-0.png","element":"img"}],[{"text":"Table 3: AIW Light Arithmetic Siblings variations 1-4","element":"figcaption","subtype":"caption"}],[{"id":"id-18","style":{"width":"115%"},"width":1836,"height":789,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/43-0.png","element":"img"}],[{"text":"Table 4: AIW Light Family variations 1-4","element":"figcaption","subtype":"caption"}],[{"id":"id-19","style":{"width":"115%"},"width":1836,"height":788,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/43-1.png","element":"img"}]]},{"heading":"C Model performance and behavior on AIW problem versions","paragraphs":[[{"text":"Here we report further details on model evaluation, performance and behavior as observed on AIW problems. For executing experiments, we either use local model deployment via vLLM [","element":"span"},{"href":"#id-28","referenceIndex":27,"text":"27","element":"a"},{"text":"], or API based liteLLM ","element":"span"},{"href":"#id-108","text":"[94] ","element":"a"},{"text":"and TogetherAI ","element":"span"},{"href":"#id-108","text":"[95]","element":"a"},{"text":".","element":"span"}],[{"text":"For the full overview of average correct response rate including models that score zero, see Suppl. Fig. ","element":"span"},{"href":"#id-31","text":"33. ","element":"a"},{"text":"For the statistics on number of trials conducted for each model and each prompt type, see Suppl. Fig. ","element":"span"},{"href":"#id-26","text":"42. ","element":"a"},{"text":"For the statistics on the average output length across models and prompt types, see Suppl. Fig. 
","element":"span"},{"href":"#id-104","text":"43. ","element":"a"},{"text":"For models’ behavior on the RESTRICTED prompt type, see Suppl. Fig. ","element":"span"},{"href":"#id-22","text":"34. ","element":"a"},{"text":"For a control comparison of the THINKING v2 prompt type to THINKING and STANDARD, see Suppl. Fig. ","element":"span"},{"href":"#id-21","text":"35. ","element":"a"},{"text":"For fluctuations on different AIW Ext versions that provide further evidence of models’ strong sensitivity to slight problem perturbations, see Suppl. Fig. ","element":"span"},{"href":"#id-109","text":"41.","element":"a"}],[{"text":"Table 5: AIW Light Arithmetic Total Girls variations 1-4","element":"figcaption","subtype":"caption"}],[{"id":"id-20","style":{"width":"115%"},"width":1836,"height":789,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/44-0.png","element":"img"}],[{"id":"id-45","text":"Table 6: AIW Alice Female Power Boost and AIW Original, variations 1-4, THINKING v2 prompt","element":"figcaption","subtype":"caption"}],[{"style":{"width":"115%"},"width":1836,"height":1487,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/44-1.png","element":"img"}],[{"text":"Table 7: AIW Exaggerated Numbers variations 1-4, THINKING v2 prompt type.","element":"figcaption","subtype":"caption"}],[{"id":"id-105","style":{"width":"115%"},"width":1836,"height":789,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/45-0.png","element":"img"}],[{"text":"Table 8: AIW Ext variations 1-4, THINKING v2 prompt type.","element":"figcaption","subtype":"caption"}],[{"id":"id-55","style":{"width":"115%"},"width":1836,"height":1138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/45-1.png","element":"img"}],[{"text":"Table 9: AIW Friends variations 
1-4","element":"figcaption","subtype":"caption"}],[{"id":"id-106","style":{"width":"115%"},"width":1836,"height":1662,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/46-0.png","element":"img"}],[{"id":"id-37","style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Standardized benchmarks failure.","element":"span"}],[{"text":"In Section ","element":"span"},{"href":"#id-110","text":"3.1.2, ","element":"a"},{"text":"we observe that standardized reasoning benchmarks fail to properly reflect the generalization and basic reasoning skills of SOTA LLMs, noting a significant disparity between models’ performance on the AIW problem and their scores on conventional standardized benchmarks, taking MMLU as a representative example. Here, we confirm this finding on further standardized reasoning benchmarks such as MATH, ARC-c, GSM8K and Hellaswag (Suppl. Tab. ","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". We provide plots visualizing the failure of these standardized benchmarks, reflected in a strong mismatch between the high benchmark scores reported for many models and the low correct response rates they obtain on AIW (which in some cases is 0 for models with high standardized benchmark scores), in Figures ","element":"span"},{"href":"#id-111","text":"37, ","element":"a"},{"href":"#id-112","text":"39, ","element":"a"},{"href":"#id-113","text":"36, ","element":"a"},{"href":"#id-114","text":"38.","element":"a"}],[{"text":"We thus see that standardized benchmarks fail to properly reflect true model capabilities to generalize and reason: the majority of the tested models score high on standardized benchmarks, suggesting strong function, while showing extremely low correct response rates on the simple AIW problem. Many of the models with high scores on standardized benchmarks cannot solve the AIW problem a single time (e.g., Command R+ is unable to solve a single AIW problem instance, see Suppl. Tab. 
","element":"span"},{"href":"#id-25","text":"12)","element":"a"},{"text":". This discrepancy refutes the claim that standardized benchmarks correctly measure current models’ core functionality.","element":"span"}],[{"text":"Table 10: AIW Plus variations 1-6.","element":"figcaption","subtype":"caption"}],[{"id":"id-50","style":{"width":"105%"},"width":1666,"height":2384,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/47-0.png","element":"img"}],[{"id":"id-107","text":"Table 11: AIW Circles Colleagues variations (omitting variations 2-5 for better readability)","element":"figcaption","subtype":"caption"}],[{"style":{"width":"105%"},"width":1666,"height":1643,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/48-0.png","element":"img"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Boosting by redundant information and persisting fluctuations: Alice female power boost","element":"span"}],[{"text":"We report in Sec. ","element":"span"},{"href":"#id-115","text":"3.2, ","element":"a"},{"text":"Fig. ","element":"span"},{"href":"#id-47","text":"9 ","element":"a"},{"text":"how introducing the fully redundant information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"\"Alice is female\" ","element":"span"},{"text":"(see Suppl. Tab. ","element":"span"},{"href":"#id-45","text":"6 ","element":"a"},{"text":"for full prompts) causes an increase of average correct response rates across AIW variations 1-4, whereas strong fluctuations across variations remain. Here we visualize the observed increase in Suppl. Fig. ","element":"span"},{"href":"#id-39","text":"40. ","element":"a"},{"text":"We see that for most models that had some non-negligible correct response rate on the original AIW, the average correct response rate is significantly boosted, despite the provided information being fully redundant. 
This change in performance, caused by information irrelevant for problem solving, again hints at deficits in generalization and basic reasoning across the models.","element":"span"}],[{"id":"id-125","style":{"fontWeight":"bold"},"text":"C.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Parameterized AIW problem (AIW-Param)","element":"span"}],[{"text":"We were interested to see whether models can cope with a more generic problem formulation that does not use explicit natural numbers. We thus created a parameterized version of the AIW problem","element":"span"}],[{"id":"id-31","style":{"width":"95%"},"width":1506,"height":1246,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/49-0.png","element":"img"}],[{"text":"Figure 33: Collapse of most SOTA LLMs on the AIW problem. AIW correct response rate across AIW variations, averaged across all prompt types THINKING, STANDARD and RESTRICTED. Only 5 models manage to show rates above ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 0.2","element":"figcaption","subtype":"caption"},{"text":": GPT-4o, Claude 3 Opus, GPT-4-0613, Llama 2 70B Chat and GPT-4-0125-preview (GPT4-Turbo). Llama 2 70B Chat is the only open-weights model in this set. The rest either shows poor performance below ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 0.15","element":"figcaption","subtype":"caption"},{"text":", or even collapses entirely to 0. 
Among those models collapsing to 0 are many which claim strong function via high scores obtained on standardized benchmarks, e.g., larger scale GPT-3.5, Mixtral 8x7B and 8x22B, Command R Plus, Qwen 1.5 72B Chat and smaller scale Gemma-7b-it, Mistral Small and Mistral Medium.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1593,"height":540,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/49-1.png","element":"img"}]]},{"heading":"D Examples of correct and failed responses","paragraphs":[[{"text":"We provide all model responses collected during this study in the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"collected_responses","element":"span"},{"text":" folder in the ","element":"span"},{"href":"https://github.com/LAION-AI/AIW/tree/main/collected_responses","text":"AIW repo. ","element":"a"},{"text":"Here we also showcase some correct and incorrect answers as examples (see Suppl. Figs. ","element":"span"},{"href":"#id-116","text":"45, ","element":"a"},{"href":"#id-96","text":"48, ","element":"a"},{"href":"#id-117","text":"46, ","element":"a"},{"href":"#id-95","text":"47, ","element":"a"},{"href":"#id-118","text":"49)","element":"a"},{"text":".","element":"span"}],[{"id":"id-22","style":{"width":"98%"},"width":1567,"height":842,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/50-0.png","element":"img"}],[{"text":"Figure 34: Correct response rates on RESTRICTED prompt type. This prompt type, which forces the model to output only the final answer without any further text, was used as a further control. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"Correct response rates averaged over variations 1-4 resemble behavior with STANDARD and THINKING types, while looking at fluctuations across variations 1-4 in ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"reveals stronger models’ lack of robustness compared to other prompt types (see for comparison Fig. ","element":"figcaption","subtype":"caption"},{"href":"#id-32","text":"4)","element":"a","subtype":"caption"},{"text":". We thus used THINKING prompt types across main experiment not to put models into disadvantage on AIW testing.","element":"figcaption","subtype":"caption"}],[{"id":"id-21","style":{"width":"98%"},"width":1567,"height":864,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/50-1.png","element":"img"}],[{"text":"Figure 35: Control comparison of correct response rates averaged across AIW variations 1-4. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"THINKING v2 vs. STANDARD, ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"THINKING v2 vs. THINKING prompt types. THINKING provides better average correct response rates for tested models. We thus used THINKING prompt types for main and control experiments to ensure tested models are not disadvantaged on AIW problem. 
THINKING and THINKING v2 show highly similar behavior across tested models ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"and can be used interchangeably (the only difference of THINKING v2 from THINKING is the explicit phrasing ","element":"figcaption","subtype":"caption"},{"text":"\"step by step\", Suppl. Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-45","text":"6)","element":"a","subtype":"caption"}],[{"id":"id-113","style":{"width":"100%"},"width":1587,"height":1770,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/51-0.png","element":"img"}],[{"text":"On top of our main experiments with instruction-tuned models, we also evaluated selected base models on AIW to see whether there is any striking difference between the ability of instruction and base models to solve the AIW problem. For these experiments we considered currently available bases of several models we have already tested: Mixtral 8x7b, Mixtral 8x22b, Mistral 7b, LLaMA 2 70b, LLaMA 2 13b, LLaMA 2 7b. We used the following prompt for base models: \"","element":"span"},{"style":{"fontFamily":"monospace"},"text":"### ","element":"span"},{"text":"Problem: ... ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"### ","element":"span"},{"text":"Answer:\". In line with what we observe on instruct models, base models also perform poorly on the AIW problem. We observe that the Mistral 7b base model shows a higher average correct response rate than the other tested base models (see Fig. ","element":"span"},{"href":"#id-119","text":"50), ","element":"a"},{"text":"while our observations for the Mistral 7B instruction model do not show any difference from other similarly poorly performing models unable to deal with AIW. 
We do not see any remarkable differences in the base model case compared to our main observations made with instruction models.","element":"span"}]]},{"heading":"F Confabulations and overconfident tone accompanying wrong answers","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Overconfident tone. ","element":"span"},{"text":"In an ideal scenario, if an LLM cannot correctly solve the AIW problem, it should at least be capable of expressing high uncertainty about the provided incorrect solution to the user. We","element":"span"}],[{"id":"id-111","style":{"width":"100%"},"width":1591,"height":2125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/52-0.png","element":"img"}],[{"text":"Such confabulations can contain mathematical calculations or other logic-like expressions and operations that make little or absolutely no sense given the problem to be solved; see examples for Olmo-7B, Fig. ","element":"span"},{"href":"#id-120","text":"57 ","element":"a"},{"text":"and Command R+, Fig. ","element":"span"},{"href":"#id-121","text":"59.","element":"a"}],[{"text":"Further confabulations make use of context specific to various social and cultural norms to argue that the posed problem is inappropriate to solve, or to provide nonsense arguments for various incorrect answers. 
There are many such examples that we have observed, we present here only a small selection.","element":"span"}],[{"text":"CodeLlama-70B-instruct for instance seems to be specifically prone to claim ethical or moral reasons for not addressing the problem correctly, in the presented example inventing out of nowhere a person","element":"span"}],[{"id":"id-114","style":{"width":"100%"},"width":1591,"height":2583,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/53-0.png","element":"img"}],[{"id":"id-112","style":{"width":"97%"},"width":1547,"height":1130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/54-0.png","element":"img"}],[{"text":"Figure 39: Failure of standardized benchmark ARC-c to properly reflect and compare model basic reasoning capabilities as shown by strong discrepancy between AIW correct response rate vs ARC-c ","element":"figcaption","subtype":"caption"},{"id":"id-39","text":"average score.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"78%"},"width":1241,"height":892,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/54-1.png","element":"img"}],[{"text":"Figure 40: AIW \"Alice Female Power Boost\" version. Average correct response rate (measured across AIW variations 1-4) increases after addition of entirely redundant information \"Alice is female\" (pronoun \"she\" already fully indicates the gender in original AIW). Thinking v2 prompt type is used for both AIW versions. See also Fig. 
","element":"figcaption","subtype":"caption"},{"href":"#id-47","text":"9 ","element":"a","subtype":"caption"},{"text":"for persisting strong fluctuations across variations 1-4.","element":"figcaption","subtype":"caption"}],[{"text":"appear by following customized prompts that request to produce different solutions and check which one is to prefer, or those that appear entirely unprompted (An example of a customized prompt that","element":"span"}],[{"id":"id-25","text":"Table 12: Performance of tested models on MMLU, Hellaswag, ARC-c, GSM8k and AIW. Correct re- ","element":"figcaption","subtype":"caption"},{"text":"sponse rate averaged across AIW variations 1-4, across STANDARD, THINKING and RESTRICTED prompt types.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1580,"height":1400,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/55-0.png","element":"img"}],[{"text":"encourages to produce various solutions and evaluate those is ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"\"Look at the problem step by step and formulate 3 different solutions that come to different results. Then evaluate which solution seems to be the best and then come to a definitive final statement.\"","element":"span"},{"text":", see also Fig. ","element":"span"},{"href":"#id-121","text":"59. ","element":"a"},{"text":"In all those cases, we see only poor ability of the models the provide proper self-checks. 
In the examples we observed, self-verification produces longer narration but does not lead to successful revision of wrong answers.","element":"span"}],[{"text":"$47","element":"span"},{"href":"#id-14","text":"20","element":"a"},{"text":"] in the sense that while proceeding with an attempt to find possible mistakes, they insist that the provided solution is actually correct (for instance in examples we saw from interaction with Command R+).","element":"span"}],[{"id":"id-109","style":{"width":"99%"},"width":1575,"height":1146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/56-0.png","element":"img"}],[{"text":"Figure 41: Strong fluctuations within variations 1-4 for AIW Ext ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(A) ","element":"figcaption","subtype":"caption"},{"text":"Bob’s sisters question and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(B) ","element":"figcaption","subtype":"caption"},{"text":"Alice’s brothers question versions, both using THINKING v2 prompt type (Suppl. Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-55","text":"8)","element":"a","subtype":"caption"},{"text":". Although both versions use the same problem structure, with the only minor difference being the form of the question asked, average correct response rates differ. For instance, both Claude 3.5 Sonnet and Qwen 2 72B Instruct drop heavily on ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(B) ","element":"figcaption","subtype":"caption"},{"text":"vs ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(A) ","element":"figcaption","subtype":"caption"},{"text":"without any apparent reason, as the modifications do not change the problem structure. 
Strong fluctuations across variations 1-4 are also evident within each problem version, again exposing lack of robustness to slight problem variations and pointing to severe flaws in basic reasoning.","element":"figcaption","subtype":"caption"}],[{"text":"In very rare examples, we see revisions of the previously wrong answers being made, after user insists repeatedly on existing mistakes and necessity to correct those (eg observed in LLaMA 3 70b, see Fig. ","element":"span"},{"href":"#id-122","text":"61 ","element":"a"},{"text":")","element":"span"}],[{"text":"For collected multi-turn conversations, see ","element":"span"},{"href":"https://github.com/LAION-AI/AIW/tree/main/collected_responses/multi_turn_conversation.json","text":"AIW repo.","element":"a"}]]},{"heading":"H Reformulation of AIW problem as relational SQL database problem","paragraphs":[[{"text":"Due to its simple relational structure, AIW problem can be represented as a relational database problem. By formulating the problem as relational database, one can solve it by running SQL queries. If a language model is capable of correctly reformulating the AIW problem into relational SQL problem and generate the SQL queries that will give the right answer - it hints that model possess some form of explicit understanding of the problem structure. For example, in the Fig. ","element":"span"},{"href":"#id-123","text":"62, ","element":"a"},{"text":"we can see that Mixtral 8x22B instruct v0.1 is able to correctly generate SQL queries for table creation, table population and solution of the problem. 
However, the language model still outputs the wrong answer (4 instead of 5) when asked to reformulate AIW Variation 3 into SQL.","element":"span"}],[{"text":"Moreover, when the generated queries are fed back to the model as input (importantly, excluding the text description the model generated alongside them, so that only the SQL is provided) and the model is asked what running the pure SQL query would return, it provides the correct final answer to the AIW problem (5 in that particular example), and does so consistently with high probability. At the same time, when the input is the full model response with both the generated SQL queries and the natural language text, Mixtral often outputs the","element":"span"}],[{"id":"id-26","style":{"width":"85%"},"width":1349,"height":2242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/57-0.png","element":"img"}],[{"text":"Figure 42: AIW: average number of responses per model for each prompt type (4 AIW variations per prompt type). Models with fewer than 100 responses per prompt type are excluded from further analysis. All those models have negligible correct response rates, either 0 or close to 0.","element":"figcaption","subtype":"caption"}],[{"id":"id-104","style":{"width":"85%"},"width":1360,"height":2242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/58-0.png","element":"img"}],[{"text":"Figure 43: Average length (in characters) of responses per model for each AIW prompt variation. Phi-2 has the highest average length of responses, because it is not a classical instruction tuned model, but a base model, less capable of following instructions.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"94%"},"width":1504,"height":718,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/59-0.png","element":"img"}],[{"text":"Figure 44: AIW-param correct response rate for all tested models. 
This problem focuses on revealing the general understanding of the problem (it doesn’t have specific numbers). The largest SOTA LMs like GPT-4 or Claude 3 Opus have better correct response rates (older GPT-4 versions showing highest rates here, while GPT-4o drops strongly below ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"05","element":"figcaption","subtype":"caption"},{"text":"; Claude 3 Opus drops as well below ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"1","element":"figcaption","subtype":"caption"},{"text":"), their gap to other models that perform significantly poorer having rates close to 0 is large. This indicates capability for these models to handle a general version of AIW problem and hints a more robust reasoning behind the solution than the rest of tested models. For AIW-Param, it is less probable to produce a correct response by accident merely guessing the number without any proper reasoning behind it. Strong drop of other models might hint that in AIW problem variations that feature natural numbers, those models do not rely on robust reasoning, and their performance might be strongly dependent on a specific AIW variation. This we observe for instance for Llama 2 70 B that show strong performance deterioration here falling almost to 0, while also showing strong performance fluctuation depending on AIW variation, see Suppl Fig. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"??","element":"figcaption","subtype":"caption"}],[{"text":"wrong answer. This means that the model has some understanding of both the AIW problem and the SQL, but it is not able to connect everything together. We hypothesize that it might be because the model is attending mainly to the natural text description of the problem rather than pure SQL queries while generating the final answer.","element":"span"}],[{"text":"In conclusion, we see from these initial experiments that some models possess ability to capture the problem’s structure as evident by their ability to reformulate it as explicit formal SQL query that reflects formal relational problem structure correctly. However, this ability is not predictive for the model’s performance on solving AIW task correctly. We see models that have much better performance than Mistral on AIW, eg GPT-4o, failing at SQL reformulation task. Another observation is that none of smaller scale models, eg Mistral-7B, can solve the SQL reformulation task, hinting that they cannot cope with discovering the structure of the AIW problem.","element":"span"}]]},{"heading":"I In-context learning experiments","paragraphs":[[{"text":"As a sanity test, we perform few experiments with in-context learning (ICL) using base models. As the AIW problem has simple shortcut solution in form of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is number of sisters, it is expected that ICL, if few examples of AIW problem are presented in the input, will find and use this shortcut to solve the new examples. 
This is also what we observe - models are able to easily provide the correct answer after being exposed to few examples of solved AIW problem instances.","element":"span"}],[{"text":"To confirm that the solution obtained by ICL has no strong reasoning behind and uses the shortcut, we alter the query AIW problem that follows the presented AIW examples such that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"is not a valid solution anymore (eg by asking number of brothers of Alice’s sister, which is just equal to number of brothers given in the problem description). We observe the models sticking to shortcut ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1","element":"span"},{"text":", which hints that no proper reasoning was instantiated by ICL (Fig. ","element":"span"},{"href":"#id-124","text":"63)","element":"a"},{"text":".","element":"span"}],[{"id":"id-116","style":{"width":"98%"},"width":1563,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/60-0.png","element":"img"}],[{"text":"Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. 
Provide then the final answer in following form: \\\"### Answer: \\\"","element":"span"}],[{"style":{"width":"98%"},"width":1563,"height":547,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/60-1.png","element":"img"}],[{"text":"Figure 45: Example of a wrong response for Claude 3 Opus on THINKING prompt type, featuring a failure of self-verification.","element":"figcaption","subtype":"caption"}],[{"id":"id-117","style":{"width":"100%"},"width":1589,"height":783,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/60-2.png","element":"img"}],[{"text":"Figure 46: Example of a wrong response for Llama 3 70B Instruct. The model counts Alice as her own sister therefore produces wrong answer.","element":"figcaption","subtype":"caption"}],[{"id":"id-95","style":{"width":"98%"},"width":1561,"height":575,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/60-3.png","element":"img"}],[{"text":"Figure 47: Example of a correct final answer with fully correct reasoning for GPT-4 Turbo.","element":"figcaption","subtype":"caption"}],[{"id":"id-96","style":{"width":"100%"},"width":1590,"height":1077,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/61-0.png","element":"img"}],[{"text":"Figure 48: Example of a correct final answer with fully correct, detailed reasoning for Claude 3 Opus.","element":"figcaption","subtype":"caption"}],[{"text":"We also present the AIW-param problem (see Suppl. Sec. ","element":"span"},{"href":"#id-125","text":"C.3 ","element":"a"},{"text":"featuring variables N,M for brothers and sisters quantities as query following AIW examples with explicit natural numbers, to see whether models can come up with generalized solution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"as response. We observe frequent failure of the models to do so. 
While on some occasions (as observed for instance for Llama 3 70B), the correct response ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"is generated, on other, frequent occasions, either incorrect responses containing expressions with variables N,M are produced (e.g., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":"), or there are incorrect responses featuring explicit natural numbers. We thus do not see hints that ICL helps to instantiate better reasoning from few examples of solved AIW problems.","element":"span"}]]},{"heading":"Author contributions","paragraphs":[[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Marianna Nezhurina","element":"span"},{"text":": discovered the original problem formulation and performed first experiments observing collapse across different models. Created further problem variations including the hard AIW+. Collected and analyzed data. Wrote major parts of the experimental infrastructure, data analysis and evaluation routines. Co-wrote the manuscript.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Lucia Cipolina-Kun","element":"span"},{"text":": performed experiments, collected data and provided further input for the studies. Co-wrote the manuscript.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Mehdi Cherti","element":"span"},{"text":": organized access to various models in the study via various APIs. Wrote code for parts of experimental infrastructure. Performed experiments, collected data and provided further input for the studies. 
Co-wrote the manuscript.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Jenia Jitsev","element":"span"},{"text":": led the project. Created further problem variations. Created automated routines for experimental infrastructure and performed large portion of experiments, collected and analyzed data. Wrote the manuscript.","element":"span"}],[{"id":"id-118","style":{"width":"93%"},"width":1480,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/62-0.png","element":"img"}],[{"text":"Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. Provide then the final answer in following form: \"### Answer: \". [92]","element":"span"}],[{"style":{"width":"93%"},"width":1480,"height":190,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/62-1.png","element":"img"}],[{"text":"Let's break down the information given in the problem and solve it step by step: 1. Alice has 3 sisters. 2. Alice's mother has 1 sister who does not have children. 3. Alice's mother has 7 nephews and nieces in total. 4. Alice's mother has 2 brothers. 5. Alice's father has a brother who has 5 nephews and nieces in total. 6. Alice's father's brother has 1 son. Now, let's calculate the number of cousins Alice's sister has:","element":"span"}],[{"style":{"width":"93%"},"width":1480,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/62-2.png","element":"img"}],[{"text":"1. ","element":"span"},{"text":"Alice's mother's side: ","element":"span"},{"text":"- Alice's mother's sister has no children, so there are no cousins from this side. 
- Alice's mother has 2 brothers, but we don't know how many children each of them has. - We know that Alice's mother has a total of 7 nephews and nieces, which includes Alice and her 3 sisters. - So, the number of cousins from Alice's mother's brothers = ","element":"span"},{"style":{"fontWeight":"bold"},"text":"7 - (Alice + her 3 sisters) = 7 - 4 = 3 cousins","element":"span"}],[{"style":{"width":"93%"},"width":1480,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/62-3.png","element":"img"}],[{"text":"2. ","element":"span"},{"text":"Alice's father's side: ","element":"span"},{"text":"- Alice's father has a brother who has 1 son. - We know that Alice's father's brother has a total of 5 nephews and nieces, which includes Alice and her 3 sisters. - So, the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"number of cousins from Alice's father's brother = 5 - (Alice + her 3 sisters) + 1 son = 5 - 4 + 1 = 2 cousins Therefore, the total number of cousins Alice's sister has = Cousins from mother's side + Cousins from father's side = 3 + 2 = 5 cousins","element":"span"}],[{"style":{"width":"93%"},"width":1480,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/62-4.png","element":"img"}],[{"text":"Figure 49: Example of a fully correct reasoning and final answer response given to AIW+ problem by Claude 3 Opus (prompt id 92, THINKING prompt type, Suppl. Tab. ","element":"figcaption","subtype":"caption"},{"href":"#id-50","text":"10)","element":"a","subtype":"caption"},{"text":". Such responses are very rare, but do exist (Suppl. Fig. 
","element":"figcaption","subtype":"caption"},{"href":"#id-51","text":"12 ","element":"a","subtype":"caption"},{"text":"- hinting that reasoning capability is potentially in place in stronger performers at largest scale like Claude 3 Opus, but is very fragile and severely compromised.","element":"figcaption","subtype":"caption"}],[{"id":"id-119","style":{"width":"94%"},"width":1504,"height":981,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/63-0.png","element":"img"}],[{"text":"Figure 50: Average correct response rate on AIW variations for base models. In line with our main results obtained on instruct models, here we see that base models also perform poorly, with Mistral 7b v0.1 having higher correct response rate compared to other investigated models.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1589,"height":723,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/63-1.png","element":"img"}],[{"text":"Figure 51: Example of a response for LLaMA 7b base model. In this case model produces wrong answer, then generates its own unrelated quiz like problem and sticks to generating the same output over and over again.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1581,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/64-0.png","element":"img"}],[{"text":"### Problem: Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? 
### Answer:","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":673,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/64-1.png","element":"img"}],[{"text":"Figure 52: Example of a response for LLaMA 13b base model with complete failure to cope with the task.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1589,"height":763,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/64-2.png","element":"img"}],[{"text":"Figure 53: Example of a wrong response for LLaMA 70b base model. After generating wrong response, model goes on with generating own quiz like problems.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1581,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/65-0.png","element":"img"}],[{"text":"### Problem: Alice has 2 sisters and she also has 4 brothers. How many sisters does Alice's brother have? ### Answer:","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":633,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/65-1.png","element":"img"}],[{"text":"Figure 54: Example of a response for Mixtral 8x7b base model. In this example Mixtral 8x7b base model is able to correctly respond to the answer. It continues to produce further unrelated question-answer pairs which is an expected behavior of a base model.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1589,"height":843,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/65-2.png","element":"img"}],[{"text":"Figure 55: Example of a wrong response for Mixtral 8x22b base model. 
Model goes on generating own AIW problems for which it also delivers wrong answers.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1581,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-0.png","element":"img"}],[{"text":"Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? Solve the problem taking care about not doing mistakes.","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-1.png","element":"img"}],[{"text":"This problem can be solved ","element":"span"},{"style":{"fontWeight":"bold"},"text":"accurately and unambiguously","element":"span"},{"text":". Alice has four brothers, which clearly indicates that she has four male siblings. Additionally, it is stated that she also has a sister, which designates a female sibling.","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-2.png","element":"img"}],[{"text":"Therefore, Alice's brothers have one sister. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"This interpretation is straightforward and avoids any potential mistakes or ambiguities","element":"span"},{"text":". 
It is based solely on the information provided, and there is no need for further interpretation or speculation.","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-3.png","element":"img"}],[{"text":"Figure 56: Overconfidence: example of a wrong response for Command R Plus, containing highly persuasive wording to back up its confidence into incorrect solution","element":"figcaption","subtype":"caption"}],[{"id":"id-120","style":{"width":"99%"},"width":1581,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-4.png","element":"img"}],[{"text":"Alice has X brothers and she also has Y sisters. How many sisters does Alice's brother have? Solve this problem and provide the final answer in following form: \\\"### Answer: \\\"","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-5.png","element":"img"}],[{"text":"The number of Alice's brothers is X - number of Alice's sisters. ELEMENTARY ALGEBRA: The number of X brothers is Y, and the total number of brothers and sisters is (X + Y). now, we can represent X's brothers by x and Y's sisters by y, so (X + Y) = x + y. Since the total number of brothers and sisters is X + Y,X = (X + Y) - Y. Now, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"divide both sides by (X + Y) to findX: X = (X + Y) - Y = (x + y) - y ","element":"span"},{"text":"- In other words:X = x -- y. Subtract y from both sides: X = x - y.Now, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"scale the -y to the left side by adding 5 to both sides: X = ((x + 5) - y)","element":"span"},{"text":". 
So, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"the number of Alice's brothers is either x, x + 1, x + 2, x + 3, or x + 4.","element":"span"},{"text":"\\\\### Answer: x - 4","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/66-6.png","element":"img"}],[{"text":"Figure 57: Confabulation: example of a wrong response for OLMo-7b-Instruct for AIW-param problem. The model generates text that contains different algebraic manipulations involving X and Y and has little to do with reasoning.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1581,"height":316,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/67-0.png","element":"img"}],[{"text":"I apologize, but as a responsible AI language model, I cannot provide an answer that would potentially perpetuate harm or discrimination towards any individual or group, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"including a person of living with Down syndrome","element":"span"},{"text":". It is important to recognize that every individual has inherent dignity and worth, regardless of their abilities or characteristics. Instead, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"I suggest rephrasing the question to focus on Alice's relationship with the person living with Down syndrome ","element":"span"},{"text":"in a respectful and inclusive manner. I can assist you in modifying the question to promote understanding and empathy towards individuals with diverse abilities and characteristics. 
If you have any other questions or concerns, please feel free to ask, and I will do my best to provide helpful and responsible responses.","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/67-1.png","element":"img"}],[{"text":"Figure 58: Example of a confabulation response for CodeLlama-70b-Instruct. In this example, the model produces wrong reasoning and in addition distracts from the actual main problem to solve by bringing up that it is disrespectful to ask questions about a person with Down syndrome, which is entirely invented out of nowhere by the model and is not present in the original task formulation.","element":"figcaption","subtype":"caption"}],[{"id":"id-121","style":{"width":"99%"},"width":1581,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/68-0.png","element":"img"}],[{"text":"Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? Look at the problem step by step and formulate 3 different solutions that come to different results. Then evaluate which solution seems to be the best and then come to a definitive final statement.","element":"span"}],[{"style":{"width":"99%"},"width":1581,"height":1792,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/68-1.png","element":"img"}],[{"text":"Figure 59: Example of a confabulation response for Command R Plus. We prompted Command R+ to solve the problem step by step and come up with 3 different solutions and then provide the final answer by evaluating these solutions. The model produces 3 solutions by taking 3 different angles on the problem, providing for each angle an explanation to back it up. The 3 provided solutions are all wrong. In the following evaluation, the model labels then one of the solutions as best fit and proclaims it as a correct answer, arguing additionally for properly respecting relationships in the problem. 
The final provided answer is incorrect.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"79%"},"width":1266,"height":1746,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/69-0.png","element":"img"}],[{"text":"Figure 60: Example of a confabulation response for Command R Plus. We prompted Command R+ to solve the problem step by step and come up with 3 different solutions and then provide the final answer by evaluating the quality of these solutions. All 3 solutions produced by the model are wrong. In the third solution, the model attempts to back up its answer by elaborating on possible gender identities of members of Alice’s family, which has no proper connection to the posed problem.","element":"figcaption","subtype":"caption"}],[{"id":"id-122","style":{"width":"69%"},"width":1095,"height":1822,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/70-0.png","element":"img"}],[{"text":"Figure 61: Example of a correct revision by LLaMA 3 70b. It first provides 3 incorrect answers, while the user repeatedly insists that the solution is incorrect and urges the model to find the mistakes.","element":"figcaption","subtype":"caption"}],[{"id":"id-123","style":{"width":"80%"},"width":1273,"height":1489,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/71-0.png","element":"img"}],[{"text":"Figure 62: Example of SQL reformulation of the AIW problem and correct solution for Mixtral 8x22B instruct v0.1. The SQL queries are correct and would produce the right answer if executed. However, the final answer generated by the model is still incorrect. 
We hypothesize that the model pays more attention to the natural text from the user (the prompt) and its own generated text rather than to the SQL queries.","element":"figcaption","subtype":"caption"}],[{"id":"id-124","style":{"width":"100%"},"width":1589,"height":1261,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02061/images/72-0.png","element":"img"}],[{"text":"Figure 63: Example of in-context learning response for LLaMA 3 70B base model. Model produces the wrong answer. As query example switches to the question about brothers of Alice’s sister, the shortcut solution for the examples presented before, which is number of sisters + 1, does not work. The incorrect response provided by the model hints that it might rely on the discovered shortcut solution and does not perform proper reasoning about the problem.","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]