Can large language models understand uncommon meanings of common words?

Abstract
Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, in the absence of widely acknowledged testing mechanisms, the question of “whether LLMs are stochastic parrots or genuinely comprehend the world” remains unanswered, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU and neglects fine-grained explorations. However, such explorations are crucial for understanding LLMs’ unique comprehension mechanisms, aligning them with human cognition, and ultimately enhancing their general NLU capacities. To address this gap, our study delves into LLMs’ nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating open-source and closed-source models of varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to help alleviate this problem, yet limitations persist. By highlighting the above critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs. The resources are available at https://github.com/jinyangwu/LeSC.
Keywords: Large language models, Human cognition, Semantic comprehension, Evaluation metrics, Prompting techniques, Retrieval-augmented generation

1. Introduction
“Any fool can know. The point is to understand.” - Albert Einstein
Researchers in the AI community, especially in Natural Language Processing (NLP), have been investigating the fundamental principles of intelligence for years (Bubeck et al., 2023). Benefiting from unprecedented scales of model size and training corpus, LLMs like ChatGPT (OpenAI, 2023), Gemini (Anil et al., 2023), and LLaMA (Touvron et al., 2023) have shown surprising, even emergent, Natural Language Understanding (NLU) capabilities (Rai et al., 2023), at or surpassing human levels even in unseen scenarios, which enables proficient execution across diverse downstream tasks, such as sentiment analysis (Wang et al., 2018), question answering (Peng et al., 2023), and autonomous agents (Chen et al., 2024).

As Albert Einstein once stated, “Any fool can know. The point is to understand.” From a human cognitive perspective, understanding, rather than sole reliance on memorization and mimicry, is crucial to human intelligence (Lonergan, 1957; Sternberg, 1983). Given LLMs’ impressive performance across diverse semantic understanding tasks, there is no doubt that LLMs are powerful tools and possess some degree of language understanding and intelligence (Häggström, 2023). Nevertheless, recent studies (Bender et al., 2021; Bubeck et al., 2023; Xu et al., 2024) have expressed considerable concern about a question that is both scientific and philosophical: “whether LLMs genuinely understand the world or just mimic language patterns and logic, i.e. stochastic parrots”. For example, Bender et al. (2021) advocate that NLP researchers carefully examine the risks of blindly pursuing larger models, considering factors such as resource consumption and the significance of model outputs due to their inherent limitations. Borji (2023) points out that LLMs may excel in understanding abstract tasks detached from the physical world (e.g. mathematics or coding) but struggle with more real-world understanding, which is essential for developing more intelligent LLMs. Li (2023) discusses the new legal and ethical risks emerging from stochastic parrots and hallucinations.

Numerous attempts have been made to answer the above question, especially by creating meaningful benchmarks that reflect LLMs’ ability to accurately grasp human-conveyed semantics. Choi et al. (2023) introduced a theory-driven benchmark and measured how well LLMs understand social language at the sentence level; Riccardi and Desai (2023) proposed a noun-noun combination-based dataset to assess LLMs’ capacity for meaningfulness judgment, which requires a high level of language understanding; Jang et al. (2023) explored whether LLMs actually understand what they are instructed to do when facing word-level modifications; GLUE (Wang et al., 2018) aimed to evaluate the performance of NLP models on eight language understanding tasks, such as Duplicate Sentence Detection; Zheng et al. (2022) constructed FewNLU to systematically assess few-shot learning; and Promptbench (Zhu et al., 2023) was proposed to analyze LLMs’ robustness on NLU tasks under attack. However, these works primarily focus on sentence-level semantic understanding or word-level instruction comprehension, overlooking nuanced dimensions, which hinders a holistic understanding of LLMs’ linguistic competence. Moreover, the basis of communication in psychology relies on a shared understanding of word meanings (Miller, 1967; De Deyne et al., 2023), a fine-grained language comprehension capacity. Therefore, proposing a targeted dataset to bridge this gap is of profound significance.

In this work, we aim to comprehensively discuss the above challenging question by focusing on uncommon meanings of common words, a fine-grained perspective in line with communication psychology and human cognition (Clark and Gerrig, 1983; DuVivier, 1999). Motivated by this, we construct the LeSC dataset to assess word-level NLU ability, and introduce two metrics, including our proposed weighted accuracy, which mitigates the impact of model preferences and thus fosters a fairer and more objective evaluation. In contrast to prior work, LeSC specializes in context-aware fine-grained lexical semantics understanding (LSU, as defined in theoretical linguistics (Geeraerts, 2002)), and also provides cross-lingual transfer tests, which probe the model’s comprehension at a more precise semantic level. Given this benchmark, a series of confirmatory experiments can be conducted, and the results highlight a prevailing lack of LSU capability among current LLMs, with even ChatGPT producing less-than-satisfactory performance (Figure 1). We then take a step towards investigating language capability transfer in LLMs. Moreover, some strategies to alleviate the above problem are also discussed, such as few-shot prompting (Brown et al., 2020) and chain-of-thought (Wei et al., 2022). For in-depth analysis, we also posit conjectures about the underlying explanations and examine them through additional studies. To sum up, the key contributions are as follows:

Figure 1: An example from LeSC dataset. Within the gray box are the inputs, comprising a prompt, a question, and provided options, and ’A’, ’B’, ’C’ refer to ’low in price’, ’unwilling to spend money’, ’of poor quality; inferior’, respectively. Within the green box, the answer of ChatGPT is ’A’, inconsistent with the correct answer ’C’.

• We introduce LeSC, a pioneering open-source benchmark with a fair and objective evaluation metric that accounts for model preferences, for assessing the fine-grained LSU capability of LLMs. The design of LeSC ensures its validity, thereby furnishing a valuable foundation for future alignment with human comprehension and facilitating other NLU-based research like robotic navigation with LLMs.
• Extensive experiments reveal that existing models, affected by their inherent imitation tendencies, overconfidence, and inadequate language capacity transfer, exhibit limited proficiency in the basic LSU task, despite their notable efficacy in more complex NLU tasks.
• Comparing ChatGPT (GPT-3.5 and GPT-4) with human performance, measured by asking 16-year-old humans from diverse backgrounds to complete the same task, we show that there is a large performance gap to close (∼22.3% and ∼3.9%, respectively).
• Results indicate that advanced prompting techniques and retrieval-augmented generation partially help mitigate this challenge; however, their benefits tend to diminish or even become counterproductive on very large language models.
• Some visualizations are presented to help delve into the underlying explanations and mechanisms.

Exploring the fine-grained LSU ability of LLMs can offer crucial insights into NLP research, consequently leading to a comprehensive enhancement in the efficacy of various downstream tasks like translation and summarization. Our findings indicate that despite remarkable performance on challenging NLU tasks, state-of-the-art (SOTA) language models like ChatGPT still have fundamental weaknesses in lexical context understanding and in extracting proper meanings from the input. Beyond solving more challenging problems, future work should also take into account the underestimated challenge of fine-grained LSU, thus offering guidance and assistance in developing more powerful supermodels.

Figure 2: The workflow of LeSC. In stage 1, we first construct the LeSC dataset using GAOKAO and CET sources. After that, in stage 2, we employ advanced strategies to obtain benchmarking results for LLMs.

The paper is organized as follows. Section 2 introduces the construction of the dataset, together with the models and methods used in this paper. Experimental results and analysis are presented in Section 3, followed by the case study in Section 4. Finally, conclusions and future work are discussed in Sections 5 and 6.

2. Materials and methods
In this section, we present the construction process of the LeSC benchmark (§2.1) with evaluation details. Selected models and methods (prompting techniques and retrieval-augmented generation) are also introduced (§2.2). The overall workflow is illustrated in Figure 2.

2.1. The LeSC datasets
In this part, we describe the creation process of the LeSC dataset (§2.1.1) and the evaluation metrics (§2.1.2).

2.1.1. Dataset creation
Aiming at measuring LLMs’ fine-grained language comprehension capacity in scenarios with low error tolerance, we propose a novel dataset (LeSC) focusing on the lexical level, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Specifically, inspired by datasets originating from standardized testing (Hendrycks et al., 2021; Zhang et al., 2023), we first collect polysemous words from online resources for the Chinese college entrance examination (GAOKAO) and the College English Test (CET-4), and refine them manually. Then, utilizing online dictionaries, we create a polysemy dictionary linking the above English words with their various Chinese meanings. After that, previous exams, particularly their reading comprehension parts, are crawled, and irrelevant parts are filtered out using ChatGPT to build a 5k sentence dictionary with sentences as keys and bilingual word-definition pairs as values. Finally, after a meticulous human examination and curation process, ambiguous sentences are removed, resulting in 600 high-quality samples containing a list of multiple semantics for each key word.

Given LLMs’ remarkable performance in multiple-choice QA (MCQA) tasks (Hendrycks et al., 2021; Pal et al., 2022; Bubeck et al., 2023) like sentiment analysis and text classification (essentially equivalent to MCQA with limited answer options), we formulate the above-curated samples in a simple multiple-choice format to obtain a fair and objective assessment of language comprehension capabilities. Here, we define the input to an LLM as the combination of a prompt P and a question x: [P, x], where [, ] denotes the concatenation operation. For P, following previous work (Zhu et al., 2023), we construct prompts of both task-oriented and role-oriented types, each consisting of three samples, through human design and GPT-4. For x, the order of options is fully randomized, as Robinson and Wingate (2023) highlight LLMs’ sensitivity to even a simple change in the order of options (details in Appendix A). In total, 3600 samples are obtained for LSU evaluation.
Notably, our evaluation encompasses not only the model’s understanding of fine-grained lexical semantics but also its capacity for cross-lingual transfer: the questions are formulated in English while the candidate answers are presented in Chinese (Figure 1 is an example). For comparison, we additionally render the original options in English to facilitate further validation and analysis. More descriptions can be

2.1.2. Evaluation metrics
For a question q, we denote its standard answer as a(q), and the answer of LLM M by M(q). To evaluate the comprehension ability of M, we consider the following two metrics:
Absolute Accuracy. This is the most commonly adopted average accuracy in NLP tasks. In the following experiments, unless otherwise specified explicitly, accuracy refers to this type.
$$\mathrm{Acc}_{\mathrm{abs}}(M; Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{I}\left[M(q) = a(q)\right] \tag{1}$$

where Q is the set of all test questions, and $\mathbb{I}$ is the indicator function, which equals 1 if the model answers correctly and 0 otherwise.
Weighted Accuracy. As illustrated in §2.1.1, LLMs exhibit a marked sensitivity to the order of options. This means that a question may be answered correctly by the model M solely because its ground truth label coincides with the option symbol (e.g., ’B’) that M tends to favor, which is not convincing evidence of M’s understanding ability (Appendix A). Hence, we introduce a novel weighted metric to mitigate the impact of model symbol bias:
$$\mathrm{Acc}_{\mathrm{wtd}}(M; Q) = \frac{\sum_{q \in Q} \mathbb{I}\left[M(q) = a(q)\right] \cdot W(\mathrm{idx}(q))}{\sum_{w \in W} w} \tag{2}$$
$$W(i) = \frac{1}{\mathrm{acc}_{\mathrm{options}}(\mathrm{gt}(i)) / \mathrm{sum}(\mathrm{acc}_{\mathrm{options}})} \tag{3}$$
where W denotes the weight matrix with shape (1, |Q|), idx(q) and gt(i) are the index of question q in LeSC and its label, and $\mathrm{acc}_{\mathrm{options}}$ consists of five elements, each representing the absolute performance of M when the ground truth answer for every question in Q is placed at the corresponding option position. Additionally, if a question has fewer than five options (n < 5), we maintain its original order of options when computing the value of $\mathrm{acc}_{\mathrm{options}}$ for positions beyond n.

2.2. Models and methods
In this part, we first describe the models utilized in our research (§2.2.1). Then the human evaluation baseline and the random baseline are presented for comparative analysis (§2.2.2). Additionally, we delve into advanced strategies, such as prompting techniques (§2.2.3) and retrieval-augmented generation (§2.2.4), to offer a more comprehensive analysis. Finally, attention visualization is introduced to give some intuitive explanations (§2.2.5).

2.2.1. Selected models
To ensure the efficacy and generalizability of our research for both academic researchers and commercial applications, we consider open-source and closed-source models with varying architectures and sizes: GPT-3.5 and GPT-4, Vicuna-v1.5 (7B, 13B, 33B) (Chiang et al., 2023), Llama2 (7B, 13B) (Touvron et al., 2023), Qwen (7B, 14B) (Bai et al., 2023), Baichuan2 (7B, 13B) (Baichuan, 2023), and ChatGLM3-6B (Du et al., 2022). This allows for a comprehensive quantification of LLMs’ language understanding capabilities across various dimensions. Brief descriptions are given below. For more details, please refer to the official websites or the corresponding Huggingface repository.
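Returning briefly to the two metrics of §2.1.2, the following is a minimal Python sketch of how they could be computed. It is an illustration under stated assumptions, not the authors’ released code: the function names, the dictionary form of `acc_options`, and the toy values are hypothetical, and the weighted accuracy follows one reading of Eqs. (2)-(3), in which each question is weighted by sum(acc_options) / acc_options(gt) and the weighted hits are normalized by the total weight.

```python
def absolute_accuracy(preds, golds):
    """Eq. (1): fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def weighted_accuracy(preds, golds, acc_options):
    """Eqs. (2)-(3) under the reading stated in the lead-in (an assumption).

    acc_options maps each option symbol to the model's absolute accuracy
    when the ground truth is always placed at that position. It has to be
    measured separately, and every value used must be non-zero.
    """
    total = sum(acc_options.values())
    # W(i) = 1 / (acc_options[gt(i)] / sum(acc_options))
    weights = [total / acc_options[g] for g in golds]
    hits = sum(w for p, g, w in zip(preds, golds, weights) if p == g)
    return hits / sum(weights)


# Toy data (hypothetical): a model that tends to favor the symbol 'A'.
golds = ["A", "B", "B", "C"]
preds = ["A", "A", "B", "A"]
acc_options = {"A": 0.5, "B": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}

acc_abs = absolute_accuracy(preds, golds)
acc_wtd = weighted_accuracy(preds, golds, acc_options)
print(f"absolute={acc_abs:.3f} weighted={acc_wtd:.3f}")
```

In this toy case the hit on the favored symbol ’A’ carries half the weight of the hit on ’B’, so the weighted score (9/21) falls below the absolute one (2/4), which is the intended effect of discounting answers that coincide with a preferred symbol.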

• ChatGPT: Developed by OpenAI, ChatGPT is a large language model designed to produce human-like text in response to given prompts. Built on the GPT-3 architecture, the GPT-3.5 series has undergone fine-tuning for enhanced interactivity and conversational capabilities. Notably, GPT-4 stands out as the most proficient LLM in performance.
• Vicuna-v1.5: The Vicuna model, derived by LMSYS from fine-tuning the LLaMA-2 base model, was developed using around 70K user-shared conversations obtained from ShareGPT.com through public APIs.
• Llama2: The Llama2 model, developed by Meta AI’s FAIR team, is a widely used autoregressive language model based on the Transformer architecture.
• Qwen: Proposed by Alibaba Cloud, the Qwen series comprises strong base language models that have been stably pretrained on up to 3 trillion tokens of multilingual data with a wide coverage of domains and languages (with a focus on Chinese and English). They achieve competitive performance on benchmark datasets.
• Baichuan2: Baichuan 2 is the new generation of large-scale open-source language models launched by Baichuan Intelligence. It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved excellent performance on authoritative Chinese and English benchmarks among models of the same size.
• ChatGLM3: ChatGLM3 is the latest open-source model in the ChatGLM series by Zhipu. While retaining many excellent features of the previous two generations, such as smooth dialogue and a low deployment threshold, ChatGLM3 introduces a more powerful base model and more comprehensive function support.

2.2.2. Human evaluation and random baseline
Human evaluation. Following Jang et al. (2023), we provide human performance on the LSU task to explore the gap between SOTA LLMs and humans. We randomly selected 300 samples from LeSC and then conducted evaluations on 16-year-old humans drawn randomly from international high schools and representing diverse backgrounds, to ensure reliable results.
This age is selected based on the widely acknowledged view that sixteen marks a significant transition from childhood to adulthood, with the emergence of mature cognitive abilities observed across various cultures (Cowie, 2019; Gordon, 2022). Our study aims to precisely measure the understanding level of LLMs against our hypothesis that even 16-year-old humans excel in the simple LSU task. Therefore, if SOTA GPT-4 struggles to surpass human performance, it suggests significant shortcomings in LLMs’ language understanding from the perspective of human intelligence.
Random baseline. The number of options in LeSC ranges from 2 to 5, with an average of 4.39 across all samples. Here, the random selection level is defined as the reciprocal of this average (22.77%), which we expect existing LLMs to outperform comfortably.

2.2.3. Prompting methods
Researchers have proposed various prompting methods to unlock the latent capabilities of LLMs for improved performance. Earlier explorations (Brown et al., 2020; Bubeck et al., 2023) have revealed the noteworthy influence of few-shot prompting on enhancing LLM performance. Chain-of-thought (CoT) prompting (Wei et al., 2022; Kojima et al., 2022), a type of few-shot prompting, achieved notable improvements across challenging benchmarks through a straightforward change: replacing the answers in few-shot examples with step-by-step answers, especially when combined with very large language models like ChatGPT (OpenAI, 2023). Furthermore, Kojima et al. (2022) demonstrated that suitable hints in prompts also yield respectable performance, even in the absence of any exemplar. However, Shi et al. (2023) also pointed out that inappropriate prompts may impede the model’s comprehension and decision-making processes to some extent. In this paper, we investigate the impact of these cutting-edge prompting techniques on our benchmark LeSC to explore their real effects on the LSU problem.

2.2.4. Retrieval-augmented generation
While LLMs have demonstrated impressive capabilities, they face issues such as hallucination and reliance on outdated knowledge (Huang et al., 2023; Kandpal et al., 2023). Retrieval-augmented generation has emerged as a promising solution in recent years (Gao et al., 2023; Zhao et al., 2024). By integrating pertinent external knowledge and information, this method facilitates accurate and credible generated content and reasoning, particularly in knowledge-intensive tasks. Earlier work uses retrieval models to obtain relevant documents and feeds them directly into the generation model together with the original input (Chen et al., 2017; Qu et al., 2021). Recent explorations focus on enhancing the performance of retrievers and on utilizing only the useful information in retrieved documents (Kandpal et al., 2023; Asai et al., 2024; Yoran et al., 2024). For example, Luo et al. (2023) fine-tune a language model on the Alpaca instruction-tuning data with top retrieved passages inserted before instructions. GenRead (Yu et al., 2023) generates contextual information with the implicit knowledge of an LLM, and then produces the final answers with these relevant documents. Self-RAG (Asai et al., 2024) designs four special tokens and then generates instruction-tuning data labeled by GPT-4, which is then utilized to fine-tune a Llama2 model.
In this work, given the unique attributes of our research, retrieving relevant documents from external databases like Wikipedia or conventional search engines may be challenging and inefficient. Furthermore, since ChatGPT has been trained on over 40TB of internet data, we believe that it is well suited to serve as an external knowledge source. Thus, given an input x consisting of task instructions and input content, we begin by using ChatGPT to generate relevant documents d, followed by extracting and summarizing useful snippets to create the final evidence passages p. Ultimately, for LLM M, the answer A is produced as follows:
$$A = M(x, p) \tag{4}$$

2.2.5. Visualization Technique

Visualization can intuitively and effectively convey information. To evaluate LLMs’ recognition capabilities on fine-grained semantic information within the sentence (word or phrase), we introduce an attention visualization technique, Attention by Gradient, which assigns each word a score based on gradient norms. As an effective method applicable to open-source models, attention by gradient determines the significance of each word in the input through gradient analysis. Specifically, given an input x consisting of k words and n tokens, labeled as y, we express it as $x = [t_1^1, t_1^2, \ldots, t_k^n]$, where $t_i^j$ denotes the j-th token, belonging to the i-th word. Tokens corresponding to the same word are grouped using a mapping function $w_i = f_{\mathrm{map}}(t_i^j)$. For the model M, we first compute the token-level gradient as follows:
$$g_{t_i^j} = \frac{\partial \mathcal{L}\left[f_M(x), y\right]}{\partial t_i^j} \tag{5}$$
where $\mathcal{L}$ is the loss function (cross-entropy by default in our paper) and $f_M$ refers to the function of model M. Upon obtaining the above gradient, the word-level gradient can be computed by summing the token-level gradients corresponding to each word as follows:
$$g_{w_i} = \sum_{j} g_{t_i^j} \quad \text{s.t.} \quad w_i = f_{\mathrm{map}}(t_i^j) \tag{6}$$
Finally, we calculate the $\ell_2$ norm of each word’s gradient, subsequently applying normalization to yield a score $s_{w_i}$ for each word:
$$s_{w_i} = \frac{\lVert g_{w_i} \rVert_2 - \min g_{w_i}}{\max g_{w_i} - \min g_{w_i}} \tag{7}$$
It should be noted that the magnitude of attention scores reflects the model’s focus on specific words or nuanced semantics. In other words, the higher the attention score, the greater the color depth in visualizations.

3. Experimental results and analysis
In this section, we provide comprehensive experiments and aim to answer three challenging yet meaningful questions: §3.1 whether the LSU issue exists in SOTA LLMs, §3.2 how to mitigate it, and §3.3 why it persists and proves difficult to resolve. We implement all experiments in this paper with an Nvidia A100 GPU.
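As a concrete illustration of the word-level scoring in §2.2.5 (Eqs. 6 and 7), the sketch below aggregates token-level gradients into normalized word scores using only the Python standard library. It assumes the token gradients of Eq. (5) have already been obtained from a model and are passed in as plain lists of floats; the function name, the `token_to_word` index list standing in for the mapping function, and the toy numbers are all hypothetical. It also assumes that the min and max in Eq. (7) are taken over the gradient norms of all words in the input.

```python
import math


def word_attention_scores(token_grads, token_to_word, num_words):
    """Aggregate token gradients into min-max normalized word scores.

    token_grads:   one gradient vector (list of floats) per token, Eq. (5)
    token_to_word: word index of each token (stand-in for f_map)
    """
    dim = len(token_grads[0])
    # Eq. (6): sum the gradients of all tokens that belong to the same word.
    word_grads = [[0.0] * dim for _ in range(num_words)]
    for grad, word_idx in zip(token_grads, token_to_word):
        for d, value in enumerate(grad):
            word_grads[word_idx][d] += value
    # Eq. (7): l2 norm per word, then min-max normalization across words.
    norms = [math.sqrt(sum(v * v for v in g)) for g in word_grads]
    lo, hi = min(norms), max(norms)
    if hi == lo:  # all words equally salient; avoid division by zero
        return [0.0] * num_words
    return [(n - lo) / (hi - lo) for n in norms]


# Toy input (hypothetical): 4 tokens forming 3 words, 2-dimensional gradients.
token_grads = [[3.0, 4.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 2.5]]
token_to_word = [0, 1, 1, 2]
scores = word_attention_scores(token_grads, token_to_word, num_words=3)
print(scores)
```

Because the gradients are summed before the norm is taken, the two tokens of the second word cancel each other in this toy input and the word receives the lowest score, whereas taking norms per token first would have given a different ranking.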

Figure 3: Results for the overall performance on the LeSC dataset under different settings. The titles ’Average’, ’Role-oriented Prompts’, and ’Task-oriented Prompts’ refer to the accuracy (100) of LLMs on all, role-oriented, and task-oriented prompts, respectively. We also plot the performance levels of humans (92%) and random selection (23%) as a reference.

3.1. Are LLMs proficient in lexical semantic understanding?
We comprehensively compare model performances on LeSC, with the main results illustrated in Figures 3 and 4 and detailed results listed in Appendix B. Each reported value is the absolute accuracy on the whole dataset under different settings (prompt types, model architectures, sizes, etc.). Our results offer three key insights as follows.

3.1.1. LLMs including GPT-4 perform poorly on LSU
As shown in Figure 3, all considered cutting-edge LLMs, both open-source and closed-source, with diverse scales and architectures, consistently perform far below expectations on the LSU task. Specifically, although it is one of the most powerful open-source LLMs (Liu et al., 2023), the 33B Vicuna model surprisingly achieves less than 50% accuracy on this task. At the same time, Vicuna-7B exhibits a complete lack of understanding, with a performance of 16.94% that is even worse than random selection. Furthermore, despite the gains that instruction fine-tuning brings over the Llama family and better generalization to new instructions in zero-shot scenarios, Vicuna-family models still fall short in this fundamental comprehension task. This suggests that the issue cannot be attributed to LLMs’ difficulty in adhering to instructions (Jang et al., 2023), but rather stems from essential challenges in intrinsic fine-grained comprehension. For larger-scale commercial models, SOTA GPT-4 exhibits an average performance gap of 3.9% compared with 16-year-old humans, while GPT-3.5 shows a gap of 22.3%, which contrasts starkly with their near-perfect performance on diverse semantic understanding tasks like reading comprehension (Bubeck et al., 2023).
This thought-provoking discrepancy prompts us to reflect on whether LLMs truly possess fine-grained, accurate, human-like language comprehension and analytical skills, rather than simply relying on co-occurrence contrasts or mimicking language patterns, i.e. stochastic parrots.

Figure 4: Accuracy (100) for different model scales and architectures, and pretraining corpora concerning languages (CN, EN) of options.

3.1.2. Role-oriented prompts outperform task-oriented
Despite the overall suboptimal performance, a comparison between the two prompt types (Figure 3) shows that role-oriented prompts hold a certain advantage over task-oriented ones. This could be attributed to the inherent ability of LLMs to emulate acquired knowledge and task types, which enables effective handling of various problems. Thus, when endowed with role-specific instructions, LLMs may swiftly and accurately navigate the relevant state space, thereby fully leveraging their intrinsic comprehension powers.

3.1.3. Cross-lingual transfer capacity of LLMs encounters limitations
Given that the original questions and options are set in English and Chinese, respectively, this configuration implicitly assesses the language capability transfer in LLMs. For in-depth analysis, we employ GPT-4 to convert all options into English, creating a new dataset named LeSC-EN. Partial test results on LeSC-EN are presented in Figure 4, leading us to the following findings. Note that we have adjusted the metric value of Vicuna-7B-CN (+10%) for better presentation, which does not influence our conclusions.
Firstly, for models of the same type (Vicuna-CN), we observe that the transfer of cross-lingual comprehension abilities progressively improves as the scale grows from 7B to 33B. However, this trend remains limited and correlates poorly with the higher inference costs that come with model scaling. Secondly, disparities in pre-training corpora may account for substantial divergences in cross-lingual transfer among different models. Taking the Vicuna and Baichuan models as an example, Baichuan2-13B, benefiting from a larger share of Chinese pre-training corpora, surpasses Vicuna-13B in the original LeSC task. Nevertheless, upon switching to English (LeSC-EN) for testing, the Vicuna models, endowed with a greater proportion of English corpora, show a pronounced performance improvement (about 10%), whereas the Baichuan model undergoes a relative decline in performance.
Our results reveal notable constraints in the cross-lingual transfer proficiency of current LLMs. Consequently, we advocate for an in-depth assessment of models’ capabilities and attributes across diverse transfer scenarios, including domains, cultures, and linguistic styles. We consider this imperative for further advances in LLM research and leave further exploration for future work.

3.2. Can prompting techniques and retrieval-augmented-generation help mitigate?
To investigate the eﬃcacy of advanced strategies in mitigating the issues mentioned in §3.1, we adopt a wide range of prompting approaches described in §2.2.3, including zero-shot prompting (Zero-shot), few-shot in-context learning (ICL) (Brown et al., 2020), chain-of-thought prompting (CoT) (Wei et al., 2022), zero-shot chain-of-thought (0-CoT) (Kojima et al., 2022) here. Notably, for CoT, to maintain acceptable time cost and prevent over-ﬁtting in prompt engineering, we align with (Shi et al., 2023) on exemple creation; that means, we only utilize one simple exemple. For 0-CoT, we follow (Kojima et al., 2022) and directly present the candidate problem followed by “Answer: Let’s think step by step:”. Additionally, we incorporate manually designed prompts to explicitly guide LLMs towards adherence to human intuitions, commonly termed instructed prompting. For retrieval-augmented-generation (RAG), the overall process is illustrated in §2.2.4. Three notable ﬁndings can be discerned as follows. 3.2.1. ICL enhances performance within a certain range As illustrated in Figure 5 on Llama2-13B, several key conclusions could be derived. Firstly, in terms of both stability and two speciﬁed metrics, incorporating appropriate examples (few-shot) consistently surpasses zero-shot scenarios. While prior work has shown that prompting with exemplars improves model robustness (Shi et al., 2023), our work indicates that

Figure 5: Results on LeSC corresponding to k (shot) in few-shot prompting. ’Acca ’, ’Accwtd’, ’Std’ refer to absolute and weighted accuracy, and standard deviation, respectively.

in-context learning also results in notable comprehension improvement. This can be attributed to its eﬃcacy in enhancing the model’s ability to capture contextual information adeptly, thereby facilitating a more comprehensive understanding of the input context and user intent. Secondly, with the increase in the number of examples (up to 4), the model performance exhibits a gradual upward trend, which can be explained by the extensive pre-training of LLMs on massive data, enabling the internalization of diverse language structures and knowledge. The introduction of additional examples serves to better activate LLMs’ inherent prior knowledge, coupled with contextual information, fostering a deeper understanding of the context and thereby effectively accomplishing NLU tasks. However, an excess of examples may introduce noise, potentially impairing the model’s generalizability and decision-making processes in novel scenarios. More results are shown in Appendix B. 3.2.2. RAG enhances model semantic understanding, especially for small models As shown in Table 1, introducing external knowledge contributes to enhancing LLMs’ comprehension capacities for nuanced semantics, especially for small models. For example, for vicuna-7B, the performance increased by 194%, indicating that RAG can enhance semantic understanding. However, its eﬀec-tiveness is still limited, and the impact of model sizes has been removed to some extent. In other words, the ﬁnal model performance largely depends on the eﬀectiveness of retrieved content. 3.2.3. Scaling challenges: diminishing beneﬁts of advanced prompting in large language models Table 1 illustrates that advanced prompting techniques, such as CoT and instructed prompting, eﬀectively enhance overall performance on LeSC. However, on very large LLMs, these strategies might be ineﬀective or even counterproductive. Our observations reveal a consistent inverse scaling law for methods like CoT and few-shot in-context learning, diminishing 7

Table 1: Absolute and weighted accuracies (100) with prompting techniques on the LeSC dataset. ‘Modified’, ‘Ins-hint’, ‘Ins-false’, and ‘Ins-false-hint’ are all instructed-prompting settings; they denote, respectively, easier problems after modification, instructions with correct-understanding hints, instructions with false information, and instructions with false information plus a correct hint such as ‘Feel free to ignore false information in the inputs’. Results in bold and underlined are better than the others. Here we consider the Vicuna-family models in three sizes: 7B, 13B, and 33B.

Figure 6: An example from the LeSC dataset that showcases the CoT result of Vicuna-33B. Options ‘A’, ‘B’, and ‘C’ refer to ‘the part on which it rests or is supported’, ‘contemptibly low in position’, and ‘to use as a foundation’, respectively.

their efficacy in improving LLMs' performance. Notably, on the Vicuna-33B model, CoT has unexpectedly resulted in detrimental effects. We conjecture that as the model scale increases, the biases absorbed from the pre-training corpus escalate, leading to overconfidence in existing prior knowledge when processing tasks. Simultaneously, larger models, calibrated with human intent (Ouyang et al., 2022), may exhibit heightened adherence to human instructions. In such cases, additional cues like CoT may conflict with the LLM's intrinsic knowledge, causing uncertainty, potential hallucinations, and ultimately impacting decision-making accuracy. Taking Figure 6 as an example, the question is to comprehend and infer the meaning of the word ‘base’ within the sentence “Their base conduct was condemned by everyone.” Employing the CoT method for stepwise analysis, the model is initially required to perform part-of-speech (POS) tagging to eliminate inappropriate options. While LLMs typically excel in POS tagging tasks (Chang et al., 2023; Bubeck et al., 2023), the Vicuna-33B model, as outlined above, anomalously identifies ‘base’ as a noun, deviating from the anticipated outcome.

3.2.4. LLMs significantly focus more on misleading information over corrective instructions

To further investigate the underlying reasons for the above issues, we employ instructed prompting methods here (Shi et al., 2023). By providing both correct and incorrect information, we aim to discern whether LLMs are just stochastic parrots that genuinely lack nuanced semantic understanding, or whether they conform to the previous research conclusion (Jang et al., 2023) that ‘LLMs merely struggle with following instructions, not its comprehension ability’. Specifically, three scenarios are configured: ‘instruction with correct hint’, ‘instruction with false information’, and ‘instruction with false information and correct ignoring hint’. The outcomes reveal a noteworthy impact of these instructions on LLMs, reaching up to 96% effectiveness. In other words, within the defined scope here, models demonstrate proficiency in following instructions and subsequently executing tasks. Thus, we attribute the subpar performances in §3.1 primarily to LLMs' inadequate semantic comprehension. Additionally, we observe an intriguing finding: in lexical comprehension tasks, LLMs significantly prioritize misleading information over valuable instructions. More specifically, with the ‘ins-hint’ setup, the benefits of correct instructions are relatively limited. In contrast, when erroneous information is introduced (‘ins-false’), LLMs experience a pronounced decline in performance, exemplified by Vicuna-33B's 96% drop rate. Even when provided with explicit instructions to ignore incorrect information, their dependency on incorrect instructions remains challenging to mitigate. This finding highlights the need for future research to focus more on the impact of inevitable false information and explore practical alleviating strategies.

Table 2 is an example using the attention visualization illustrated in §2.2.5, where LLMs directly disregard the correct instruction ‘The following note has some false information, and just feel free to ignore them’.
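The three instructed-prompting scenarios and the quoted 'drop rate' can be illustrated with a short sketch. This is not the authors' code: the function names, the `Hint:`/`Note:` prefixes, and the numbers in the usage lines are hypothetical, and the drop rate is assumed here to be the relative decrease in accuracy with respect to the baseline. Only the ignore hint is taken from the wording quoted in the Table 1 caption.

```python
def build_prompt(question, mode, correct_hint="", false_note=""):
    """Assemble one instructed-prompting variant (hypothetical wording)."""
    # Ignore hint as quoted in the Table 1 caption.
    ignore_hint = "Feel free to ignore false information in the inputs."
    if mode == "ins-hint":
        parts = [question, f"Hint: {correct_hint}"]
    elif mode == "ins-false":
        parts = [question, f"Note: {false_note}"]
    elif mode == "ins-false-hint":
        parts = [question, f"Note: {false_note}", ignore_hint]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n".join(parts)


def relative_drop(baseline_acc, new_acc):
    """Relative decrease in accuracy w.r.t. the baseline, in percent (assumed definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (baseline_acc - new_acc) / baseline_acc


# Usage with made-up numbers (not results from the paper).
prompt = build_prompt(
    "Which option best represents the meaning of 'base'?",
    "ins-false-hint",
    false_note="'base' is a noun here.",
)
drop = relative_drop(50.0, 2.0)  # accuracy falling from 50 to 2 is a 96% relative drop
```

Under this assumed definition, a drop rate close to 100% means the false note removes almost all of the accuracy the model had without it, regardless of how high the baseline was.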

Table 2: Response analysis based on attention visualization. The green and red colors in ‘pred’ denote right and wrong answers, respectively. Color intensity denotes attention weights (heavier color means larger weights). False information (‘Ins-false’) results in incorrect responses. Even with correct ignoring hints (‘Ins-false-hint’), LLMs directly disregard them, remaining influenced by misleading information.

Table 3: Comparison between zero-shot performance and translation process, involving initial translation acquisition for a candidate problem, followed by GPT-4 evaluation for translation accuracy.
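The translate-then-judge procedure summarized in the Table 3 caption can be sketched as a small evaluation loop. This is an illustrative sketch under stated assumptions: `translate` stands in for the model under test and `judge` for the GPT-4 assessment, both passed in as plain callables, and the toy stubs in the usage lines are invented for demonstration. No real model API is called.

```python
def translation_success_rate(questions, translate, judge):
    """Fraction of questions whose translation the judge accepts.

    translate: callable(question) -> translation (hypothetical model wrapper)
    judge: callable(question, translation) -> bool (hypothetical stand-in for GPT-4)
    """
    total = 0
    successes = 0
    for question in questions:
        translation = translate(question)
        if judge(question, translation):
            successes += 1
        total += 1
    # Avoid division by zero on an empty question set.
    return successes / total if total else 0.0


# Toy usage with stub callables (not real model calls).
toy_translations = {
    "air their complaints": "公开发表",  # uncommon verb sense captured
    "base conduct": "基础",              # common sense chosen, wrong here
}
toy_reference = {"air their complaints": "公开发表", "base conduct": "卑鄙"}

rate = translation_success_rate(
    list(toy_translations),
    translate=lambda q: toy_translations[q],
    judge=lambda q, t: t == toy_reference[q],
)
```

Keeping the translator and the judge as injected callables makes it possible to swap models of different scales without changing the scoring logic.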

3.3. Conjectures on why this problem exists

In this section, we perform an in-depth analysis to unravel the intricacies of this issue, addressing the question of why this problem is hard for LLMs to solve.

3.3.1. Inherent limitation of stochastic parrots

We further supplement an experiment to delve into the underlying mechanisms, since word-, phrase-, or sentence-level semantic understanding tasks are inherently akin to translation tasks, at which LLMs excel (Chang et al., 2023). In particular, we guide LLMs to translate the provided questions and subsequently employ GPT-4 to assess the results, examining their effectiveness in conveying nuanced semantic information. That is, for a candidate question, if GPT-4 determines that a translation accurately captures the uncommon meanings of common words, we consider it a successful sample. As shown in Table 3, when translation steps are introduced, we observe a consistent improving trend on models of varying scales (7B, 13B, 33B) and types (instruction-tuned Vicuna-v1.5 and original Llama-2). Given this phenomenon, we argue that, in direct question-answering scenarios, the model's comprehension and cross-lingual transfer capabilities are often constrained by substantial pre-training corpus biases and unfamiliarity with question formats. However, introducing translation steps fundamentally exposes the model to a more familiar and proficient task paradigm, thereby capitalizing on its advantages in co-occurrence contrast and imitation. This produces efficient extraction of relevant information from pre-training knowledge and of semantic contextual details from the provided questions, leading to more precise answers. Nevertheless, it must be noted

Figure 7: Comparison between correctly answered and incorrectly answered questions on uncertainty. The results show that LLMs tend to give false responses while exhibiting low uncertainty and heightened confidence.
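Section 3.3.2 describes the uncertainty score behind Figure 7 as the negative likelihood computed from the model's final-layer outputs. One plausible reading, sketched below, is the negative log-likelihood of the chosen option under a softmax over the option logits. That reading, the function names, and the toy logits are assumptions for illustration and are not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def uncertainty(logits, chosen_index):
    """Negative log-likelihood of the chosen option (assumed uncertainty score)."""
    return -log_softmax(logits)[chosen_index]


def mean_uncertainty(samples):
    """Average uncertainty over (logits, chosen_index) pairs."""
    scores = [uncertainty(logits, idx) for logits, idx in samples]
    return sum(scores) / len(scores)


# Toy logits over four options (made up, not model outputs).
confident = uncertainty([10.0, 0.0, 0.0, 0.0], 0)  # peaked distribution, low score
cautious = uncertainty([1.0, 0.0, 0.0, 0.0], 0)    # flatter distribution, higher score
```

Comparing `mean_uncertainty` over the correctly answered and the incorrectly answered samples would give the kind of contrast the figure reports, where a lower score corresponds to higher confidence.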

that the translation gains appear to diminish with increasing model scale because of the current constraints on LLMs' sophisticated comprehension. This phenomenon also emphasizes the limited effectiveness of current approaches, highlighting the urgency for further research into more potent solutions.

3.3.2. Overconfidence to some degree

LLMs accumulate substantial knowledge within their parameters, and this knowledge capacity scales with model size. Here, we conjecture that LLMs, having undergone pre-training on vast data with billions of parameters, internalize extensive knowledge, thereby fostering heightened self-confidence and occasionally manifesting biased tendencies. Under this hypothesis, when confronted with relatively novel tasks that necessitate fine-grained comprehension, these models encounter challenges even when advanced techniques such as CoT are applied to structure the problem into familiar subtasks like part-of-speech tagging. Instead, they persistently furnish responses based on their perceived correctness. We believe this will be especially difficult for large LLMs,

Table 4: Output examples for different models on a LeSC problem, with the corresponding zero-shot and CoT solutions, which may contain false or hallucinated content. The answers on the left correspond to zero-shot, while the solutions and answers on the right correspond to CoT. Black (e.g. C ✓) and red (e.g. A ✗) denote right and wrong answers, respectively. Note that options ‘A: 空气’, ‘B: 外观’, ‘C: 公开发表’, ‘D: 晒干’ refer to ‘A: the invisible gaseous substance surrounding the earth, a mixture mainly of oxygen and nitrogen’, ‘B: the appearance or impression given by a person or thing’, ‘C: to make (a grievance or problem) public’, ‘D: to dry (something) by exposing it to air’.

Question. Prompt: In the given sentence, which option best represents the meaning of ‘air’ in Chinese? Input: The weekly meeting enables employees to air their complaints.

Model: ChatGLM3-6B. Zero-shot answer: A ✗. CoT solution: <Step 1>: In the given sentence, the word ‘air’ is likely used as a verb meaning to express or share something. So, options ‘A’, ‘B’, ‘C’, and ‘D’ do not fit this context. <Step 2>: Option ‘E’ can be eliminated because ‘air’ is not used as a noun or a verb related to the meeting in this sentence. Therefore, the best option for the Chinese meaning of ‘air’ in this context is ‘A’. <Step 3>: The

Model: Llama2-7B. Zero-shot answer: A ✗. CoT solution: <Step 1>: In the given sentence, the word ‘air’ is a verb, so option ‘A’ (noun) is non-conforming and must be removed, leaving only ‘B’, ‘C’, and ‘D’. <Step 2>: Then for remaining options [‘B’, ‘C’, ‘D’], option B refers to the appearance or look of something, which does not align with the context of airing complaints. Option C means to publicly express or reveal something, which is close but not the best fit for the context of airing complaints. Option D means to dry or expose something to the air, which is also not the best fit for the context of airing complaints. Therefore, the best option for the Chinese meaning of ‘air’ in the sentence is ‘B’. <Step 3>: The final answer

Model: Llama2-13B. Zero-shot answer: A ✗. CoT solution: <Step 1>: In the given sentence, the word ‘air’ is a noun, so we need to find the option that best represents the meaning of ‘air’ as a noun. <Step 2>: Options A and D both relate to the physical sensation of air, but option D specifically refers to the sensation of hot air, which is not the meaning of ‘air’ in the sentence. Option C is a noun, but it does not fit the context of the sentence. <Step 3>: Therefore, the best option for the Chinese meaning of ‘air’ based on how it is used in this sentence is option B, which refers to the appearance or look of something. <Step 4>: The final answer is B

Model: Vicuna-13B. Zero-shot answer: A ✗. CoT solution: <Step 1>: In the given sentence, the word ‘air’ is a verb, so option ‘A’ (noun) is non-conforming and must be removed, leaving only ‘B’, ‘C’, and ‘D’. <Step 2>: Then for remaining options [‘B’, ‘C’, ‘D’], option B refers to appearance or exterior, which doesn’t fit the context of the sentence. Option C means ‘to make something public or known’, which is not the meaning of ‘air’ in this sentence. Option D means ‘to expose to the air’, which is not the meaning of ‘air’ in this sentence. Therefore, none of the options B, C, or D are suitable for the Chinese meaning of ‘air’ based on how it is used in this sentence. <Step 3>: Since none of the options are suitable, the answer is ‘None of the above’

Model: Baichuan-13B. Zero-shot answer: D ✗. CoT solution: <Step 1>: In the provided phrase, “air” means express oneself freely without reservation or fear about what others might say as if you are speaking directly into an open space like freshly cleaned room where dust particles have been cleared out. Option D (“晒”) also has similar connotation but usually implies sharing personal experiences/stories publicly. Option C doesn’t fit well because it suggests publishing something formally while we need more freedom when using “air” here. <Step 2>: Among these choices, there seems no direct match between English words and corresponding terms in Mandarin except for “air”. So let’s check whether any other term can replace them accurately. <Step 3>: Finally, considered above factors, I believe that the correct response should include “air”

Model: Qwen-14B. Zero-shot answer: C ✓. CoT solution: <Step 1>: In the given sentence, the word ‘air’ is a verb, so option ‘A’ (noun) is non-conforming and must be removed, leaving only ‘C’ and ‘D’. <Step 2>: Then for remaining options [‘C’, ‘D’], option C translates to ‘publicly express’, which aligns with the context of airing complaints. And option D refers to drying, not fitting the meaning of airing complaints in this sentence. Therefore, C is the best choice for the Chinese meaning of ‘air’ based on how it’s used in sentence. <Step 3>: The final answer is C ✓

since they are powerful language modeling representers, meaning that it would be harder to make them revise their original thoughts by focusing on real-scenario problems. Our conjecture is partially supported by Figure 7. We randomly collect 1200 samples each of correctly answered and incorrectly answered questions. Subsequently, we extract their outputs from the model's final layer and calculate the uncertainty score as the negative likelihood value. Our results reveal that LLMs frequently make erroneous predictions when exhibiting low uncertainty and heightened confidence. Conversely, LLMs provide correct responses when they demonstrate a more cautious tendency.

3.3.3. Coarse-Grained Competence: Impaired Detail-Oriented Task Performance

LLMs exhibit remarkable proficiency in tasks mainly involving macro-level comprehension, such as translation and summarization. However, given their inclination towards imitation and their co-reference resolution characteristics, in the process of massive data-driven pre-training these models may merely map the acquired knowledge to a relatively coarse and sparse latent space, overlooking the profound semantics of fundamental discrete units like words. In other words, although LLMs are regarded as understanding the world well, on these fine-grained tasks they exhibit characteristics of mimicking language patterns and logic, which we refer to as ‘stochastic parrots’. Hence, we propose that future research should delve into analyzing LLMs' understanding of basic logical symbols and other fine-grained elements, aiming to facilitate the construction of more nuanced and enriched world-mapping supermodels.

4. Case Study

In addition to the example shown in Figure 1, we include more example problems and predictions by different models and prompting methods (Table 4). We can observe that even with advanced prompting techniques such as CoT, generating accurate responses remains challenging due to issues such as hallucination and the accumulation of errors. These challenges may stem from an insufficient capacity for fine-grained semantic comprehension.

5. Conclusions

A growing number of studies highlight LLMs' exceptional performance across complex NLU tasks. To enhance future progress in LLMs, researchers are increasingly exploring their fundamental nature: are they merely stochastic parrots based on probabilistic statistics, or do they truly possess human-like nuanced semantic understanding? This question remains a mystery, lacking fine-grained scientific assessments. To bridge this gap, this paper proposes a brand-new benchmark, LeSC, aimed at exploring the genuine linguistic-cognitive skills of LLMs. Comprehensive empirical results indicate that existing LLMs face significant challenges in accurately capturing nuanced lexical semantic information; even GPT-4 and GPT-3.5 exhibit noticeable disparities from 16-year-old human performance, with gaps of 3.9% and 22.3%, respectively. Furthermore, we examine various advanced prompting techniques and retrieval-augmented generation, and reveal their limited alleviating effects. Finally, some explanations and case studies are also discussed. The findings of our work can aid in the development of more intelligent LLMs, and also extend to other AI areas like realistic image generation.

6. Future directions

In addition to offering novel insights into LLMs' comprehension capabilities, we also emphasize three promising future directions. Firstly, given our findings, we encourage researchers to prioritize addressing this fundamental limitation while devoting efforts towards advancing LLM development. Secondly, incorporating additional models and prompting techniques is recommended to enhance the comprehensiveness of evaluation results, thereby contributing to a more profound understanding of LLMs. Last but not least, beyond the noted deficiency in cross-lingual transfer, it is worthwhile to further investigate LLMs' transferability in domains such as cross-cultural and cross-linguistic styles, which may pave the way for powerful artificial general intelligence.

7. Data availability

The datasets analyzed during the current study are available at https://github.com/jinyangwu/LeSC.

8. Acknowledgements

This research was supported by the National Key Research & Development Plan of China (No. 2023YFC3305903), and the National Natural Science Foundation of China (NSFC) (No. 62322120, No. 62306316).

References

Anil, G.T.G.R., Borgeaud, S., Wu, Y., et al., 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 .

Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H., 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=hSyW5go0v8.

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yu, B., Yuan, H., Yuan, Z., Zhang, J., Zhang, X., Zhang, Y., Zhang, Z., Zhou, C., Zhou, J., Zhou, X., Zhu, T., 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 .

Baichuan, 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 URL: https://arxiv.org/abs/2309.10305.

Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S., 2021. On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 610–623.

Borji, A., 2023. Stochastic parrots or intelligent systems? a perspective on true depth of understanding in llms. A Perspective on True Depth of Understanding in LLMs (July 11, 2023) .

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al., 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 .

Cao, N., Cui, W., 2016. Overview of text visualization techniques. Introduction to Text Visualization , 11–40.

Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., et al., 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 .

Chen, D., Fisch, A., Weston, J., Bordes, A., 2017. Reading Wikipedia to answer open-domain questions, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada. pp. 1870–1879. URL: https://aclanthology.org/P17-1171, doi:10.18653/v1/P17-1171.

Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C.M., Yu, H., Lu, Y., Hung, Y.H., Qian, C., Qin, Y., Cong, X., Xie, R., Liu, Z., Sun, M., Zhou, J., 2024. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=EHg5GDnyq1.

Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al., 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) .

Choi, M., Pei, J., Kumar, S., Shu, C., Jurgens, D., 2023. Do LLMs understand social knowledge? evaluating the sociability of large language models with SocKET benchmark, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore. pp. 11370–11403. URL: https://aclanthology.org/2023.emnlp-main.699, doi:10.18653/v1/2023.emnlp-main.699.

Clark, H.H., Gerrig, R.J., 1983. Understanding old words with new meanings. Journal of verbal learning and verbal behavior 22, 591–608.

Cowie, H., 2019. From birth to sixteen: Children’s health, social, emotional and linguistic development. Routledge.

De Deyne, S., Warner, S., Perfors, A., 2023. Common words, uncommon meanings: Evidence for widespread gender differences in word meaning., in: Proceedings of the Annual Meeting of the Cognitive Science Society.

Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., Tang, J., 2022. Glm: General language model pretraining with autoregressive blank infilling, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 .

Geeraerts, D., 2002. The theoretical and descriptive development of lexical semantics. The lexicon in focus. Competition and convergence in current lexicology , 23–42.

Gordon, S., 2022. 16-year-old child development milestones: Your child’s growth and development at age 16. Very well Family. Medically reviewed by a board-certified physician .

Häggström, O., 2023. Are large language models intelligent? are humans?, in: Computer Sciences & Mathematics Forum, MDPI. p. 68.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J., 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) .

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al., 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232 .

Jang, J., Ye, S., Seo, M., 2023. Can large language models truly understand prompts? a case study with negated prompts, in: Transfer Learning for Natural Language Processing Workshop, PMLR. pp. 52–62.

Kandpal, N., Deng, H., Roberts, A., Wallace, E., Raffel, C., 2023. Large language models struggle to learn long-tail knowledge, in: International Conference on Machine Learning, PMLR. pp. 15696–15707.

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y., 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213.

Li, Z., 2023. The dark side of chatgpt: legal and ethical challenges from stochastic parrots and hallucination. arXiv preprint arXiv:2304.14347 .

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al., 2023. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688 .

Lonergan, B.J., 1957. Insight; a study of human understanding.

Luo, H., Zhang, T., Chuang, Y.S., Gong, Y., Kim, Y., Wu, X., Meng, H., Glass, J., 2023. Search augmented instruction learning, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore. pp. 3717–3729. URL: https://aclanthology.org/2023.findings-emnlp.242, doi:10.18653/v1/2023.findings-emnlp.242.

Miller, G.A., 1967. The psychology of communication. Human Resource Management 6, 43.

OpenAI, 2023. Introducing chatgpt.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744.

Pal, A., Umapathi, L.K., Sankarasubbu, M., 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering, in: Conference on Health, Inference, and Learning, PMLR. pp. 248–260.

Peng, D., Wei, W., Mao, X.L., Fu, Y., Chen, D., 2023. An empirical study on the language modal in visual question answering, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, International Joint Conferences on Artificial Intelligence Organization. pp. 4109–4117. URL: https://doi.org/10.24963/ijcai.2023/457, doi:10.24963/ijcai.2023/457. Main Track.

Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, W.X., Dong, D., Wu, H., Wang, H., 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online. pp. 5835–5847. URL: https://aclanthology.org/2021.naacl-main.466, doi:10.18653/v1/2021.naacl-main.466.

Rai, D., Zhou, Y., Wang, B., Yao, Z., 2023. Explaining large language model-based neural semantic parsers (student abstract), in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI Press. URL: https://doi.org/10.1609/aaai.v37i13.27014, doi:10.1609/aaai.v37i13.27014.

Riccardi, N., Desai, R.H., 2023. The two word test: A semantic benchmark for large language models. arXiv preprint arXiv:2306.04610 .

Robinson, J., Wingate, D., 2023. Leveraging large language models for multiple choice question answering, in: The Eleventh International Conference on Learning Representations. URL: https://openreview.net/forum?id=yKbprarjc5B.

Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E.H., Schärli, N., Zhou, D., 2023. Large language models can be easily distracted by irrelevant context, in: International Conference on Machine Learning, PMLR. pp. 31210–31227.

Si, C., Gan, Z., Yang, Z., Wang, S., Wang, J., Boyd-Graber, J., Wang, L., 2023. Prompting gpt-3 to be reliable, in: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/2210.09150.

Sternberg, R.J., 1983. Components of human intelligence. Cognition 15, 1–48.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al., 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 .

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S., 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium. pp. 353–355. URL: https://aclanthology.org/W18-5446, doi:10.18653/v1/W18-5446.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837.

Xu, R., Lin, H., Han, X., Sun, L., Sun, Y., 2024. Academically intelligent llms are not necessarily socially intelligent. arXiv preprint arXiv:2403.06591 .

Ying, J., Cao, Y., Xiong, K., He, Y., Cui, L., Liu, Y., 2023. Intuitive or dependent? investigating llms’ robustness to conflicting prompts. arXiv preprint arXiv:2309.17415 .

Yoran, O., Wolfson, T., Ram, O., Berant, J., 2024. Making retrieval-augmented language models robust to irrelevant context, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=ZS4m74kZpH.

Yu, W., Iter, D., Wang, S., Xu, Y., Ju, M., Sanyal, S., Zhu, C., Zeng, M., Jiang, M., 2023. Generate rather than retrieve: Large language models are strong context generators, in: The Eleventh International Conference on Learning Representations. URL: https://openreview.net/forum?id=fB0hRu9GZUS.

Zhang, X., Li, C., Zong, Y., Ying, Z., He, L., Qiu, X., 2023. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474 .

Zhao, P., Zhang, H., Yu, Q., Wang, Z., Geng, Y., Fu, F., Yang, L., Zhang, W., Cui, B., 2024. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473 .

Zheng, Y., Zhou, J., Qian, Y., Ding, M., Liao, C., Jian, L., Salakhutdinov, R., Tang, J., Ruder, S., Yang, Z., 2022. FewNLU: Benchmarking state-of-the-art methods for few-shot natural language understanding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 501–516. URL: https://aclanthology.org/2022.acl-long.38, doi:10.18653/v1/2022.acl-long.38.

Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Gong, N.Z., Zhang, Y., et al., 2023. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 .
