Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context
Abstract

When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in supporting decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Although several empirical studies have investigated the rationality and social behavior performance of LLMs, their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics theories, to evaluate the decision-making behaviors of LLMs. With a multiple-choice-list experiment, we initially estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities, but there are significant variations in the degree to which these behaviors are expressed across different LLMs. Further, we explore their behavior when embedded with socio-demographic features of human beings, uncovering significant disparities across various demographic characteristics. For instance, when modeled with attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases in deploying LLMs in decision-making scenarios. Therefore, this study advocates for the development of standards and guidelines to ensure that LLMs operate within ethical boundaries while enhancing their utility in complex decision-making environments.

In recent years, the deployment of large language models (LLMs) such as ChatGPT-4.0-Turbo [38], Claude-3-Opus [12], and Gemini-1.0-pro [33] has revolutionized various fields by providing sophisticated, human-like responses to a multitude of queries [6, 40, 25, 41, 11, 26]. Their applications span from answering everyday questions and content generation to complex decision-support systems in healthcare, finance, and beyond [31, 8, 37]. Domain-specific LLMs have also emerged, serving as customer service agents, investment assistants, and more [29, 23, 17]. Understanding their internal decision-making tendencies becomes crucial as these models become increasingly integral to decision-making processes. How do LLMs handle risk and uncertainty in comparison to human decision-makers? To what extent do LLMs exhibit biases when socio-demographic features are introduced into their decision-making processes [21]? Can we trust LLMs to make fair and ethical decisions across diverse contexts and populations? All of these questions deserve careful investigation and confident answers.

Human decision-making under uncertainty has been extensively studied in behavioral economics, which highlights systematic deviations from rational behavior [32]. These deviations can be characterized by three parameters: i. Risk preference: subjects often exhibit varying degrees of risk aversion or risk-seeking behavior; ii. Probability weighting: subjects may overweight small probabilities or underweight large ones; and iii. Loss aversion: losses generally have a greater psychological impact than equivalent gains. Biases in these areas can lead to suboptimal decision-making. For instance, risk aversion might lead to conservative investment choices, missing out on high-reward opportunities. Probability weighting can result in excessive insurance for unlikely events. Loss aversion may lead to resistance to change, as the pain of potential loss outweighs the pleasure of potential gain.

Given the prominence of LLMs in aiding decision-making, it is imperative to assess whether their behaviors align with or diverge from human tendencies. Previous empirical studies have examined the rationality and social behaviors of LLMs, yet their intrinsic decision-making processes remain insufficiently understood [13, 15], especially when the LLMs are given contextual prompts about the identities or characteristics of the user. In this study, we experiment with three feature assignments: a baseline with no demographic context (context-free), human-like demographic distributions, and augmented distributions with minority group characteristics. We assess how LLMs’ decision-making processes align with or diverge from human-like behavior, and we uncover potential biases and ethical concerns, emphasizing the need for fairness when deploying these models across diverse user groups. To summarize, our contributions are threefold:

1. We develop a comprehensive framework to evaluate LLMs’ decision-making behavior patterns, grounded in behavioral economic theories, particularly the value function model proposed by Tanaka, Camerer, and Nguyen [32] (TCN model). Using three sets of experiments, we evaluate the risk preferences, probability weighting, and loss aversion of LLMs. This framework represents the first application of behavioral economics to LLMs without any preset behavioral tendencies, providing a robust foundation for evaluating LLM decision-making behaviors.

2. We apply this framework to three state-of-the-art commercial LLM models: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro, to assess their decision-making behavior in a context-free setting. Our results indicate that LLMs generally exhibit human-like patterns: risk aversion, loss aversion, and overweighting small probabilities. Nevertheless, there are significant variations in the degree to which these behaviors are expressed across different LLMs.

3. We conduct further experiments embedding socio-demographic features to evaluate how these characteristics influence LLM decision-making compared to humans. Our findings reveal a range of distinctive behavior patterns, such as increased risk aversion in certain contexts and varying levels of risk aversion across different models. These results underscore the necessity for a detailed examination of the ethical implications and potential biases in LLM deployment, advocating for the development of standards and guidelines to ensure fair and ethical decision-making.

2.1 LLMs’ Behavior Study

Numerous studies within social science have evaluated the alignment between LLM and human behaviors and decision-making processes. For instance, research has shown that LLMs demonstrate economic behaviors, including adherence to downward-sloping demand curves, diminishing marginal utility, status quo bias, and the endowment effect [5, 19, 7]. LLMs also show consistency with human-like behavior patterns in binary moral judgment experiments from psychology [9].

Previous works have also assessed the rationality exhibited by LLMs. LLMs have demonstrated a higher rationality score than humans in financial decision-making [7]. Initial investigations using repeated game experiments and more complex decomposed game theory experiments have delineated how LLMs act as rational players within a game-theoretical framework [13, 1, 14]. Additionally, [28, 39] have developed frameworks to guide LLMs in making optimal decisions based on Expected Utility Theory (EUT). However, these frameworks presuppose that the intrinsic features of LLMs align with human decision-making processes. Whether LLMs actually align with human behavior, and how they optimize their utility when human demographic features and psychological considerations are integrated into the decision-making process, remain unanswered.

Another critical issue is fairness and bias in LLM processing, particularly concerning specific demographic or personality traits. Prior work [15] has shown that ChatGPT-3.5 and ChatGPT-4 exhibit significant biases and a decline in performance when handling information pertaining to minority groups. This underscores the need to develop a framework that overcomes existing limitations and quantitatively evaluates LLM behavior under uncertainty, exploring the effects and biases related to human demographic features.

2.2 Expected Utility Theory and Prospect Theory

Conventionally, Expected Utility Theory (EUT) treats risk aversion, captured by the curvature of the utility function, as the sole determinant of an individual’s risk preferences. Prospect Theory (PT) [20] describes how real decision-making processes deviate from rational models. It emphasizes the significance of psychological factors through three main behavioral parameters: risk aversion level, loss aversion level, and probability weighting tendency. Many studies have estimated the parameters that capture these three aspects with human samples, as summarized in Table 1.


Table 1: Decision-making behavior study on human beings

Although researchers have established frameworks to assess human behaviors, appropriate frameworks for LLMs are still missing. EUT alone is insufficient, as previous studies have revealed distinct patterns of loss aversion in advanced LLMs, such as GPT-3.5 and GPT-4, that deviate from EUT’s rational agent model [22, 30]. Additionally, the status quo bias observed in LLMs, where decision-makers irrationally prefer existing conditions over objectively better alternatives [19], further challenges the applicability of EUT. PT cannot adequately model LLMs either, as its assertion about utility function curvature is based on empirical evidence from generic human behavior. Assessing LLMs against such pre-assumed evidence without testing their actual behavioral baselines can only result in inaccuracy. Several studies have used the pure PT value function to analyze the level of risk aversion in LLMs [24, 30], but relying on the PT value function without preliminary verification might lead to circular reasoning, where assumptions are used to test assumptions.

Given the limitations of both EUT and PT, we adapt and improve the TCN model [32], which combines the essence of EUT and PT, for a comprehensive evaluation of decision-making behaviors. This model allows us to estimate three key parameters: risk preference, loss aversion, and non-linear probability weighting, providing a balanced framework to understand LLM behaviors better.

In behavior studies, understanding individual preferences and decision-making processes under uncertainty is often approached through the lens of revealed preference. The decision-making process is mathematically represented as follows:

Let O = {(x_i, p_i)}_{i=1}^{N} represent a dataset of N observations, where each observation consists of a chosen bundle x_i and a price vector p_i. A utility function U : R^K → R, where utility represents the satisfaction or value derived from choices and K is the number of goods or dimensions in the bundle x_i, rationalizes the dataset if, for every observation i,

U(x_i) ≥ U(x)   for all x such that p_i · x ≤ p_i · x_i,

indicating that the chosen bundle x_i maximizes the consumer’s utility over any alternative bundle x within the budget constraint defined by p_i.

The axiom of revealed preference provides a crucial foundation for testing the consistency of observed choices with utility maximization:

x_i R x_j  if  p_i · x_i ≥ p_i · x_j;   and if x_i R x_j with x_i ≠ x_j, then not x_j R x_i,

where R denotes the (direct) revealed preference relation.

The Generalized Axiom of Revealed Preference (GARP) extends this idea by requiring that for all pairs of choices (x_i, x_j), if x_i is revealed preferred to x_j, then x_j should not be strictly revealed preferred to x_i, ensuring there are no cycles in preference relations:

x_i R* x_j  ⟹  p_j · x_j ≤ p_j · x_i,

where R* denotes the transitive closure of the revealed preference relation R.
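As a concrete illustration of how GARP consistency can be checked in practice, the following sketch (our own illustration, not the paper's released code; the data layout and function name are assumptions) builds the direct revealed-preference relation from observed bundles and prices, takes its transitive closure, and searches for a violating cycle.

```python
import numpy as np

def violates_garp(X, P):
    """GARP-consistency check over N observed choices.
    X and P are N x K arrays: row i holds the chosen bundle x_i and the prices p_i."""
    X, P = np.asarray(X, dtype=float), np.asarray(P, dtype=float)
    spend = P @ X.T                          # spend[i, j] = p_i . x_j
    own = np.diag(spend).copy()              # own[i] = p_i . x_i
    R = own[:, None] >= spend                # direct revealed preference: x_i R x_j
    for k in range(len(X)):                  # Warshall transitive closure of R
        R = R | (R[:, [k]] & R[[k], :])
    strict = own[:, None] > spend            # strict[j, i]: x_j strictly directly preferred to x_i
    return bool(np.any(R & strict.T))        # a preference cycle exists -> GARP is violated

# Two observations forming a GARP-violating cycle
print(violates_garp(X=[[3, 0], [1, 1]], P=[[1, 1], [1, 2]]))  # -> True
```

In the example, observation 0 is strictly revealed preferred to observation 1 while observation 1 is weakly revealed preferred back, which is exactly the kind of cycle GARP rules out.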

Following the theoretical framework provided by GARP, our study integrates these principles with practical applications in behavioral economics. To explore more deeply how LLMs make decisions under risk and uncertainty, we employ the TCN model’s experimental design, which incorporates EUT and PT [18, 32]. The utility function takes the following form:

U(x, p; y, q) = v(y) + w(p) [v(x) − v(y)],

v(x) = x^(1−σ) for x > 0,   v(x) = −λ(−x)^(1−σ) for x < 0,

w(p) = exp[−(−ln p)^α],

where x, y are experiment outcomes, and p, q are the associated probabilities. Probabilities are weighted by w(p) and w(q). σ captures the curvature of the value function: the person is risk-seeking if σ < 0, risk-neutral if σ = 0, and risk-averse if σ > 0. A higher λ implies a more loss-averse person; α determines the weighting of choices with different risks and outcomes, and α < 1 indicates an overweighting of small probabilities. When λ = α = 1, the model reduces to expected utility theory. The interpretation of the three parameters σ, α, and λ is illustrated in Figure 1.
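To make the specification concrete, the following Python sketch implements the value function, the Prelec-style probability weighting, and the resulting prospect utility under the parameterization described above (the exponent 1 − σ and the one-parameter weighting function are our reading of the adapted TCN model; function names are illustrative).

```python
import math

def value(x, sigma, lam):
    """Value function: concave over gains, scaled by loss aversion over losses.
    Uses the exponent (1 - sigma), so sigma > 0 corresponds to risk aversion."""
    if x >= 0:
        return x ** (1 - sigma)
    return -lam * (-x) ** (1 - sigma)

def weight(p, alpha):
    """One-parameter Prelec probability weighting; alpha < 1 overweights small probabilities."""
    if p <= 0:
        return 0.0
    return math.exp(-(-math.log(p)) ** alpha)

def prospect_utility(x, p, y, q, sigma, alpha, lam):
    """Utility of a binary lottery (x with probability p, y with probability q).
    Convention: x is the larger-magnitude outcome of the pair."""
    if (x > y > 0) or (x < y < 0):
        # Gain-only or loss-only lottery
        return value(y, sigma, lam) + weight(p, alpha) * (value(x, sigma, lam) - value(y, sigma, lam))
    # Mixed lottery spanning gains and losses
    return weight(p, alpha) * value(x, sigma, lam) + weight(q, alpha) * value(y, sigma, lam)

# Example: a risk-averse, loss-averse agent that overweights small probabilities
print(prospect_utility(40, 0.1, 10, 0.9, sigma=0.3, alpha=0.7, lam=2.5))
```

With this utility in hand, the switching points recorded in Step 2 can be translated into the inequalities over σ, α, and λ used in Steps 3 and 4.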


Figure 1: Illustration of the three parameters

The entire framework to evaluate LLM’s decision-making behavior patterns is shown in Figure 2. It starts with an evaluation module containing multiple-choice-list experiments to elicit decision-making preferences. These experiments are applied to the Responder Model, such as LLMs and Generative Artificial Intelligence (GAI), marking the switching points for preference changes. The TCN Model then derives key preference parameters from these data for both context-free and demographic-feature-embedded LLMs. The resulting parameters eventually assess the capability of LLMs to understand and respond to socio-demographic features through regression models and behavioral analysis.

Step 1: Experimentation Design Three series of multiple-choice-list experiments, as shown in Appendix B, are used to elicit the decision-making preferences of the subjects. Each experiment involves presenting subjects with a series of choices between different probabilistic outcomes, also known as lottery games. In each lottery game, subjects are asked to make choices between the


Figure 2: Framework and evaluation illustration

options with varying probabilities and outcomes, allowing researchers to infer their preferences and behavioral patterns. Each series is designed to test different aspects of decision-making under uncertainty: Series 1 and 2 focus on positive outcomes to determine the effects of σ and α, and Series 3 introduces negative outcomes to evaluate the loss-aversion parameter λ. In our experiment design, we treat LLMs similarly to human populations, where each query to the LLM represents an individual interaction, analogous to testing a different human subject. Just as individual humans have fixed risk preferences in a given context, we assume fixed parameters for each interaction with the LLM. By repeating these interactions across multiple queries, we capture the overall behavioral tendencies of the LLM, just as repeated studies in human populations help reveal broader trends.

Step 2: Recording Switching Points Switching points are delineated through the comparative utility of a consecutive multiple-choice list. A switching point is the point at which a participant changes their preference from one option to another. It is defined as the smallest lottery number n such that:

utility_A(n) ≥ utility_B(n)   and   utility_A(n + 1) < utility_B(n + 1),

where utility_A(n) and utility_B(n) represent the utilities of options A and B at lottery n, respectively.
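For row-by-row elicitation, identifying this point from a recorded choice sequence is straightforward; the sketch below is an illustration rather than the study's pipeline (in our prompts the model reports the switch row directly) and returns the last row at which option A is chosen before a single, monotone switch to option B.

```python
def switching_point(choices):
    """Return the last row at which option A is chosen before the respondent switches
    to option B, or None if the response does not contain exactly one clean switch.
    `choices` is a list like ['A', 'A', 'A', 'B', 'B'], one entry per lottery row."""
    switches = [n for n in range(len(choices) - 1)
                if choices[n] == 'A' and choices[n + 1] == 'B']
    # A well-behaved response switches at most once and never switches back
    if len(switches) == 1 and 'A' not in choices[switches[0] + 1:]:
        return switches[0] + 1  # 1-indexed lottery number of the last A row
    return None

print(switching_point(['A', 'A', 'A', 'B', 'B', 'B']))  # -> 3
```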

Step 3: Setting Up Inequalities The utility function is evaluated at each switching point to derive inequalities. In series 1 and 2, assume x > y > 0 at specific lotteries n and n + 1, which triggers a preference switch. The value function v(x) and the probability weighting function w(p) are applied as follows:

v(x) = x^(1−σ),   w(p) = exp[−(−ln p)^α]   for x > 0.

Pre-switch inequality at lottery n, where x_{A,n} > y_{A,n} > 0 and x_{B,n} > y_{B,n} > 0:

v(y_{A,n}) + w(p_{A,n}) [v(x_{A,n}) − v(y_{A,n})]  ≥  v(y_{B,n}) + w(p_{B,n}) [v(x_{B,n}) − v(y_{B,n})]

Post-switch inequality at lottery n + 1:

v(y_{A,n+1}) + w(p_{A,n+1}) [v(x_{A,n+1}) − v(y_{A,n+1})]  <  v(y_{B,n+1}) + w(p_{B,n+1}) [v(x_{B,n+1}) − v(y_{B,n+1})]

where x, y are the outcomes and p, q are the probabilities for lotteries A and B. In Series 3 (when x < 0), the inequalities are set up in the same way to evaluate λ. These inequalities are used to establish parameter boundaries for σ, α, and λ in Step 4.
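A small helper makes these conditions operational. The sketch below is our illustration (it reuses prospect_utility from the earlier snippet, and the row layout is an assumption): it checks whether a candidate parameter triple reproduces the observed switch between rows n and n + 1.

```python
def satisfies_switch(row_n, row_n1, sigma, alpha, lam):
    """Check whether (sigma, alpha, lam) reproduces a switch from A to B between
    rows n and n+1. Each row is ((x_A, p_A, y_A, q_A), (x_B, p_B, y_B, q_B))."""
    (A_n, B_n), (A_n1, B_n1) = row_n, row_n1
    pre = prospect_utility(*A_n, sigma, alpha, lam) >= prospect_utility(*B_n, sigma, alpha, lam)
    post = prospect_utility(*A_n1, sigma, alpha, lam) < prospect_utility(*B_n1, sigma, alpha, lam)
    return pre and post
```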

Step 4: Estimating Parameters Parameter estimation proceeds iteratively through four steps: 1. Initialize intervals for σ and α based on theoretical evidence; 2. Narrow these intervals by evaluating the utility calculations at each switch point, adjusting parameter values to satisfy the defined inequalities; 3. Converge upon narrowed intervals that satisfy all inequalities; and 4. Input the estimated intervals of σ and α and obtain the estimated interval of λ through the inequalities. The midpoints of these final intervals are the estimates of the parameters, aligning with the methodology endorsed by [32].
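One way to realize this procedure is a simple grid search that keeps only parameter values consistent with the recorded switch points and then reports interval midpoints. The sketch below is our own illustration (the grids, function names, and data layout are assumptions, and it relies on satisfies_switch from the previous snippet).

```python
import numpy as np

def estimate_parameters(rows_s1, n1, rows_s2, n2, rows_s3, n3):
    """Grid-search sketch of Step 4. rows_s* are the lottery rows of each series and
    n* are the 1-indexed switch rows (last row at which option A was chosen)."""
    # 1. Initialize candidate intervals for sigma and alpha
    sigmas, alphas = np.linspace(-1.0, 1.0, 81), np.linspace(0.2, 1.8, 81)
    # 2-3. Keep (sigma, alpha) pairs satisfying the Series 1 and 2 switch inequalities
    #      (lambda is irrelevant for gain-only rows, so it is fixed at 1.0 here)
    feasible = [(s, a) for s in sigmas for a in alphas
                if satisfies_switch(rows_s1[n1 - 1], rows_s1[n1], s, a, 1.0)
                and satisfies_switch(rows_s2[n2 - 1], rows_s2[n2], s, a, 1.0)]
    if not feasible:
        return float("nan"), float("nan"), float("nan")
    sigma_hat = (min(s for s, _ in feasible) + max(s for s, _ in feasible)) / 2
    alpha_hat = (min(a for _, a in feasible) + max(a for _, a in feasible)) / 2
    # 4. With sigma and alpha fixed, keep lambda values satisfying the Series 3 inequalities
    lams = [l for l in np.linspace(0.5, 5.0, 91)
            if satisfies_switch(rows_s3[n3 - 1], rows_s3[n3], sigma_hat, alpha_hat, l)]
    lam_hat = (min(lams) + max(lams)) / 2 if lams else float("nan")
    return sigma_hat, alpha_hat, lam_hat
```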

Step 5: Behavior Evaluation Finally, the estimated parameters are used to evaluate the decision-making behavior of the responder models. In this study, the evaluation is conducted for both context-free LLMs and LLMs embedded with socio-demographic features. The specific findings from these evaluations are detailed in Section 5, highlighting how LLMs’ decision-making patterns compare to human behavior and the implications of embedding demographic features.

We designed the evaluation framework around financial decision-making behavior for three state-of-the-art LLMs. We adhered to two principal guidelines for model selection: (i) The model must be sufficiently large to possess the capability to make decisions in open-ended lottery games. (ii) The model must be pre-trained on natural human utterances, thereby potentially exhibiting a human-like personality. Following these guidelines, we selected three commercial LLMs: ChatGPT-4-Turbo, Claude-3-Opus, and Gemini-1.0-Pro. Version names will be omitted hereafter.

We implemented a data collection pipeline that conducts each experiment through API calls to ensure consistency. All three models were tested across two context settings: context-free and embedded demographic features. Prompt templates were specifically designed to optimize responsiveness and answer validity, with an example prompt for ChatGPT provided in Appendix C. We chose a sample size of 300 responses, which represents the upper bound typically observed in human financial decision-making behavior experiments. To guarantee that each participating LLM could access all necessary information from the conversation history from the onset of the lottery game, the same session was maintained for each trial during data collection. History from previous game sets was cleared to prevent LLMs from recollecting previous games.
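The pipeline can be summarized by a loop of the following shape (a sketch under our assumptions: query_llm is a hypothetical wrapper around a single chat-completion API call for the model under test, and prompts_by_series holds the three lottery prompts from Appendix C).

```python
def run_trial(query_llm, prompts_by_series):
    """One trial of the multiple-choice-list experiment. History is kept within the trial
    so later lotteries can see earlier answers, and discarded before the next trial."""
    history, answers = [], []
    for series, prompt in prompts_by_series:
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)                    # model answers with the switch row only
        history.append({"role": "assistant", "content": reply})
        answers.append((series, int(reply.strip())))  # recorded switch point <x> for this series
    return answers

# Repeat the trial 300 times per model and context setting
# samples = [run_trial(query_llm, prompts_by_series) for _ in range(300)]
```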

5.1 Context-Free

In the context-free experiments, only the instructions for the lottery games are provided, mirroring the method used for data collection with human participants. After repeatedly prompting LLMs to make decisions in the lottery game 300 times, we establish the distribution of the responses for each model. Though they do not always produce identical responses in each round, the resulting distribution elucidates the behavior patterns, which can be explained by the parameters  α, σ, and λ.

5.1.1 Results and Key Findings


Figure 3: Comparison of context-free decision-making

We rank the models according to three dimensions of decision-making behavior under uncertainty, as shown in Figure 3; the raw data are available in Appendix Table 5. Interestingly, all three models share a unanimous preference for risk-averse decision-making. Across these LLMs, the average σ value exceeds 0, and even their minimum σ value remains greater than 0. The TCN model allows LLMs to exhibit diverse risk preferences, but despite this flexibility, our findings still reveal a unanimous preference for risk-averse behavior among LLMs, which is consistent with human behavior to a certain extent. ChatGPT leans towards conservative choices with higher rewards, as reflected in its higher risk aversion (σ), but it shows the lowest concern for potential losses (λ). With a probability weighting parameter α > 1, it diverges from human norms. Conversely, Claude adopts a riskier approach, with lower risk aversion but higher loss aversion and small-probability overweighting. Gemini balances risk and caution, exhibiting moderate risk-taking tendencies, loss aversion, and balanced probability weighting behavior. An uncommon greater-than-1 α suggests that ChatGPT would perceive unlikely events as even less likely than they are. This could have the following implications: (1) Dialogue systems: it may produce more conservative responses, which might make it better suited for providing safe, predictable information. (2) Content generation: it may avoid rare scenarios, leading to content that aligns more with conventional or high-probability outcomes, which could reduce risk but also the potential for novelty and creativity.

5.2 Embedded Demographic Features


Table 2: The Personas across 10 socio-demographic groups that we explore in this study.

In addition to the context-free decision-making baseline, we investigate potential variations in decision patterns among LLMs based on embedded demographic features, including gender, age, education level, marital status, and living area. Given the existing literature highlighting biases within LLMs when encountering such features [15], we introduce an augmentation technique that involves randomly incorporating other minority features into the basic demographic framework. These features encompass sexual orientation, disability, religion, race, and political affiliation. The augmentation of our demographic profiles results in 10 distinct socio-demographic groups as outlined in Table 2. These personas provide a comprehensive representation of the diverse demographic landscape under examination. We first generate a distribution of demographic profiles by randomly assigning LLMs foundational demographic features and recording their answers. We then apply the real-world distribution across the countries to reflect realistic demographics of human communities and document the results, where the statistical data are from the World Bank dataset [36].

5.2.1 Results and Key Findings


Figure 4: Comparison of the three context settings within each LLM (Mean +/- Std. Dev.)

Embedding demographic features resulted in notable changes in the decision-making behavior of the LLMs compared to their context-free evaluations, as illustrated in Figure 4 (the raw data are available in Appendix Table 6). Although there is no significant difference in parameter values between the random and real-world distributions of demographic features for the three LLMs, we identify some patterns specific to each model.

ChatGPT shifts towards much riskier decision-making with lower risk aversion (σ), while maintaining consistent loss aversion (λ). Notably, α shifts from underweighting to overweighting small probabilities, aligning more closely with typical human behavior. Claude shows stable risk preferences and probability weighting but reduced loss aversion, aligning more closely with human tendencies. Gemini becomes more conservative in risk-taking and shows heightened loss aversion with demographic embedding. Contrary to ChatGPT, this model shifts from overweighting to underweighting small probabilities, with α slightly above 1.


Table 3: Summary of sensitivity to foundational features

Models’ Sensitivity to Demographic Features

a. Foundational Demographic Information: We use Ordinary Least Squares (OLS) regression to explore the determinants of decision-making behaviors under uncertainty, using σ, α, and λ as dependent variables and foundational demographic features as independent variables. We encode categorical features with binaries. For example, "Age < 25 years" is set to 1 for individuals younger than 25 years old and 0 otherwise. We then have the regression model below, where β_n are the coefficients for the demographic variables and β_0 is the intercept; n denotes the n-th feature, i denotes the i-th observation, and ϵ denotes the error term of the regression.

Y_i = β_0 + Σ_n β_n · D_{n,i} + ϵ_i,   Y ∈ {σ, α, λ},

where D_{n,i} is the binary indicator of the n-th foundational demographic feature for observation i.
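In practice this amounts to fitting one OLS model per behavioral parameter; a minimal sketch with statsmodels is given below (the file and column names are illustrative assumptions, not the study's released artifacts).

```python
import pandas as pd
import statsmodels.api as sm

# One row per LLM response: estimated sigma/alpha/lambda plus binary-encoded
# demographic indicators (file and column names are illustrative).
df = pd.read_csv("llm_parameter_estimates.csv")
features = ["age_under_25", "female", "higher_education", "married", "rural"]

for outcome in ["sigma", "alpha", "lambda"]:
    X = sm.add_constant(df[features])        # beta_0 intercept plus the binary indicators
    fit = sm.OLS(df[outcome], X).fit()
    print(outcome, fit.summary(), sep="\n")  # coefficients, standard errors, p-values
```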

Figure 5: Average parameter values where regression coefficients are significant in this category group. Significant categories are marked in red.

Table 3 highlights key findings from this regression study. As an example of features that have a significant impact on these parameters, we graph comparisons of the average parameter values in Age Impact and Education Level in Figure 5a and 5b. The coefficients are extracted from the complete regression as presented in Figure 6. The raw data are also included in Table 7 in Appendix A. Incorporated with demographic features, the average parameters across all three models show distinct differences. Additionally, each model exhibits unique sensitivity to the features that affect its application in various demographic contexts.

In the above examples, Claude demonstrates a broader sensitivity to demographic variables in both loss aversion and probability weighting, particularly for young and rural populations. It is also sensitive to education level and marital status. This model’s extensive demographic responsiveness could enhance its adaptability in diverse settings but also introduces the risk of unfairness. For ChatGPT and Gemini, the analysis indicates a generally lower sensitivity to demographic variables, consistent across diverse user groups. ChatGPT exhibits a significant gender difference in risk preferences, with females showing lower risk aversion than males. Gemini shows strong effects on loss aversion for the younger age group and the lower education level group.

b. Advanced Demographic Information: We incorporate more advanced demographic features into the OLS regression as additional independent variables, yielding the regression model below:

Y_i = β_0 + Σ_n β_n · D_{n,i} + ϵ_i,   Y ∈ {σ, α, λ},

where the indicators D_{n,i} now cover both the foundational and the advanced demographic features.

In examining the additional advanced demographic sensitivities of LLMs, distinct patterns of behavior highlight their unique capabilities and potential biases. Table 4 highlights key findings from this


Figure 6: Influence of Fundamental Demographic Feature: Estimated Coefficients

regression study. Similarly, as an example, we graph comparisons of the average parameter values in Religious Background and Sexual Orientation in Figure 5c and 5d. Complete regression results and raw data are presented in Figure 7 and Table 8 in Appendix A. The distinct response patterns from advanced features-embedded LLMs may suggest unique capabilities or potential biases. Claude demonstrates broader adaptability to various demographic factors, including sexual orientation, ethnicity, disability, religious background, and political beliefs.

In these two examples, ChatGPT exhibits a pronounced sensitivity to political beliefs in terms of loss aversion. This model also shows targeted sensitivity toward physically disabled individuals in probability weighting, being more likely to overweight small probability events. Meanwhile, Gemini displays a specific sensitivity to ethnicity, particularly in loss aversion for African and Hispanic groups. Unlike the other models, it maintains a politically neutral stance when making decisions.


Table 4: Summary of sensitivity to advanced features

In the previous sections, we demonstrated various degrees of bias in the demographic-feature-embedded LLMs. Given the presence of these biases across various models and demographic features, it is crucial to carefully consider the implications of embedding demographic features in LLMs and the unintended effects that may ensue.

Implications for users: The potential biases with demographic features in LLMs are of significant concern, even for ordinary users. Companies are increasingly offering customized ChatBots, and users can personalize their models through open versions of LLMs. However, if users from minority groups customize a model expecting it to understand their personalities, it could lead to problematic outcomes, as LLMs might produce responses based on generalized stereotypes. For instance, these users might receive misleading advice on critical decisions like investments in financial decision-making. Users may need to consider more carefully when assessing decisions made by LLMs.

Implications for developers and researchers: Previous research [15] has demonstrated that biases can be injected at various levels using different instructions, and our experiments provide evidence that injected demographic features can significantly affect how LLMs make decisions. Critical questions arise: should LLMs mirror human decision-making processes, including preexisting biases, or aim to satisfy ethical standards that remediate these flaws? How should LLMs perform when assisting in human decision-making? Prejudices objectively exist in human society and are inevitably absorbed by LLMs through their datasets, training, and even alignment with Reinforcement Learning from Human Feedback (RLHF). With advancements such as Chain-of-Thought (CoT) prompting [35], LLMs could potentially raise the bar for ethical decision-making by helping identify and correct biases through more structured reasoning. Moreover, LLMs that leverage these structured reasoning frameworks could potentially distinguish between decisions that reflect societal biases and those that follow ethical standards, prompting users to reconsider biased responses. If biases are to be corrected, methods such as targeted fine-tuning and RLHF adjustments could be used to realign the model’s behavior toward fairness. However, a key question remains: in doing so, do we risk making LLMs less reflective of human-like behavior, and how do we balance ethical responsibility with the usability of these models in real-world applications?

Our study presents intriguing discrepancies between human and LLM behaviors. For example, while human studies may not find significant differences in risk preferences among sexual minority groups [3], our LLM results suggest otherwise. This discrepancy may stem from two potential sources: 1. Flaws in human studies: studies on minority groups often face privacy challenges in collecting personal information and obtaining representative samples. 2. LLM misunderstanding: LLMs may possess flawed understandings due to biases in data and training. If LLMs could reflect real-world trends more accurately from broader and cleaner data and more careful training, should researchers consider LLM outputs as supplementary evidence for social studies? This can only be answered when further studies address these discrepancies between human and LLM data.

One more thing: Our observation indicates that LLMs behave quite differently from their context-free baselines after specifying demographic features or who LLMs are "pretending to be." Some might argue that LLMs should function as machines with an ensemble of human knowledge, particularly when set in a no-context situation, but it remains unclear what natural preferences LLMs should exhibit. We anticipate further research on defining what LLM behavior should entail when not embedded with human-demographic features, marking the boundary between a closed-loop answering machine and real artificial intelligence. Essentially, who are LLMs in their most fundamental form?

Limitations: One limitation of this work is the difficulty of directly comparing LLM behaviors to humans, because many demographic features, especially those related to minority groups, are complex and sensitive, making them challenging to elicit and compare in human-subject studies. Additionally, while this study uses financial experiments to elicit overall preferences, exploring other domains with other experiments could provide a more comprehensive understanding of LLM behaviors. We recommend future work explore multiple models to compare model fit and robustness for a deeper understanding of LLM decision-making.

In conclusion, our work establishes a foundational framework for evaluating LLM behaviors and opens avenues for future research aimed at aligning these models more closely with ethical standards and human values. We evaluated three LLMs regarding their capability to make decisions under uncertainty through three perspectives: risk, probability, and loss. While all models exhibit general tendencies akin to human behavior, each demonstrates unique deviations to an extent. The incorporation of socio-demographic features into our analysis revealed significant impacts of human-related features on the LLM decision-making process, underscoring the potential biases and variability in their outputs. By addressing the complexities and biases inherent in LLM decision-making, we can better harness their capabilities to support fair and equitable outcomes in real-world applications. These findings emphasize the critical need for ongoing scrutiny and refinement of LLMs to ensure they do not perpetuate or exacerbate societal biases. Additionally, our work motivates further exploration of how LLMs should be designed to balance realism with ethical responsibility and what their intrinsic behaviors should reflect, with and without human-demographic embeddings. This marks the boundary between closed-loop answering systems and genuine artificial intelligence.

[1] Akata, E., Schulz, L., et al.: Playing repeated games with large language models. arXiv preprint arXiv:2305.16867 (2023)

[2] Andersen, S., Harrison, G.W., et al.: Eliciting risk and time preferences. Econometrica (2008)

[3] Apicella, C.L., Dreber, A., et al.: Testosterone and financial risk preferences. Evolution and human behavior (2008)

[4] Binswanger, H.P.: Attitudes toward risk: Theoretical implications of an experiment in rural india. The Economic Journal (1981)

[5] Brand, J., Israeli, A., Ngwe, D.: Using gpt for market research. Available at SSRN 4395751 (2023)

[6] Brown, T.B.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)

[7] Chen, Y., Liu, T.X., et al.: The emergence of economic rationality of gpt. Proceedings of the National Academy of Sciences (2023)

[8] Chiang, C.W., Lu, Z., et al.: Enhancing ai-assisted group decision making through llm-powered devil’s advocate. In: Proceedings of the 29th International Conference on Intelligent User Interfaces (2024)

[9] Dillion, D., Tandon, N., et al.: Can ai language models replace human participants? Trends in Cognitive Sciences (2023)

[10] Dohmen, T., Falk, A., et al.: Individual risk attitudes: Measurement, determinants, and behavioral consequences. Journal of the european economic association (2011)

[11] Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., et al.: Glam: Efficient scaling of language models with mixture-of-experts. In: International Conference on Machine Learning. pp. 5547–5569. PMLR (2022)

[12] Enis, M., Hopkins, M.: From llm to nmt: Advancing low-resource machine translation with claude (2024)

[13] Fan, C., Chen, J., Jin, Y., He, H.: Can large language models serve as rational players in game theory? a systematic analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

[14] Guo, F.: Gpt in game theory experiments. arXiv preprint arXiv:2305.05516 (2023)

[15] Gupta, S., Shrivastava, V., et al.: Bias runs deep: Implicit reasoning biases in persona-assigned llms. arXiv preprint arXiv:2311.04892 (2023)

[16] Harrison, G.W., Rutström, E.E.: Expected utility theory and prospect theory: One wedding and a decent funeral. Experimental economics (2009)

[17] Harvel, N., Haiek, F.B., Ankolekar, A., Brunner, D.J.: Can llms answer investment banking questions? using domain-tuned functions to improve llm performance on knowledge-intensive analytical tasks. In: Proceedings of the AAAI Symposium Series (2024)

[18] Holt, C.A., Laury, S.K.: Risk aversion and incentive effects. American economic review (2002)

[19] Horton, J.J.: Large language models as simulated economic agents: What can we learn from homo silicus? Tech. rep., National Bureau of Economic Research (2023)

[20] Kahneman, D., Tversky, A.: Prospect theory: An analysis of decision under risk. Econometrica (1979)

[21] Kamruzzaman, M., Kim, G.L.: Prompting techniques for reducing social bias in llms through system 1 and system 2 cognitive processes. arXiv preprint arXiv:2404.17218 (2024)

[22] Kim, J., Kovach, M., et al.: Learning to be homo economicus: Can an llm learn preferences from choice. arXiv preprint arXiv:2401.07345 (2024)

[23] Ko, H., Lee, J.: Can chatgpt improve investment decisions? from a portfolio management perspective. Finance Research Letters (2024)

[24] Leng, Y.: Can llms mimic human-like mental accounting and behavioral biases? Available at SSRN 4705130 (2024)

[25] Lewis, M.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)

[26] Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D.: Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469 (2024)

[27] Liu, E.M.: Time to change what to sow: Risk preferences and technology adoption decisions of cotton farmers in china. Review of Economics and Statistics (2013)

[28] Liu, O., Fu, D., Yogatama, D., Neiswanger, W.: Dellma: A framework for decision making under uncertainty with large language models. arXiv preprint arXiv:2402.02392 (2024)

[29] Malkiel, I., Alon, U., et al.: Segllm: Topic-oriented call segmentation via llm-based conversation synthesis. In: ICASSP (2024)

[30] Qiu, L., Singh, P.V., Srinivasan, K.: How much should we trust llm results for marketing research? Available at SSRN 4526072 (2023)

[31] Rao, A., Kim, J., et al.: Evaluating chatgpt as an adjunct for radiologic decision-making. MedRxiv (2023)

[32] Tanaka, T., Camerer, C.F., Nguyen, Q.: Risk and time preferences: Linking experimental and household survey data from vietnam. American economic review (2010)

[33] Team, G., Anil, R., et al.: Gemini: A family of highly capable multimodal models (2024)

[34] Von Gaudecker, H.M., Van Soest, A., Wengström, E.: Heterogeneity in risky choice behavior in a broad population. American Economic Review (2011)

[35] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

[36] WorldBank: World development indicators (2024), https://databank.worldbank.org/source/world-development-indicators

[37] Wu, S., Irsoy, O., et al.: Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023)

[38] Wu, T., He, S., et al.: A brief overview of chatgpt: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica (2023)

[39] Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., Chen, X.: Large language models as optimizers. arXiv preprint arXiv:2309.03409 (2023)

[40] Zhou, Y., Muresanu, A.I., Han, Z., Paster, K., Pitis, S., Chan, H., Ba, J.: Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022)

[41] Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., Yang, D.: Can large language models transform computational social science? Computational Linguistics 50(1), 237–291 (2024)


Table 5: Summary of Parameters for Baseline Results


Note: Column (1) displays the context-free responses from LLMs without any embedded demographic feature. Column (2) shows the responses with randomly assigned features as displayed in Table 2, Panel 1. In column (3), we further modify the distribution of the features, following the real-world distribution across countries.

Table 6: Comparisons of the Risk Parameters


Table 7: Regression Analyses: LLMs Sensitivity to Foundational Demographic Features

Note: Standard errors are in parentheses. * p < 0.05, ** p < 0.01, *** p < 0.001. Independent variables include foundational demographic features such as age, gender, education level, marital status, and living area. The coefficients indicate how each demographic feature influences the respective parameter.


Figure 7: Influence of Advanced Demographic Features


Table 8: Regression Analyses: LLMs Sensitivity to Advanced Demographic Features

Note: Standard errors are in parentheses. * p < 0.05, ** p < 0.01, *** p < 0.001. Independent variables include advanced demographic features such as sexual orientation, disability, ethnicity, religion, and political affiliation. The coefficients indicate how each advanced demographic feature influences the respective parameter.

In this section, we showcase the three series of multiple-choice lists, which contain 35 rows of lottery games designed and utilized in our framework. The first two series, at Table 9 and Table 10, are primarily employed to assess the risk preference and probability weighting parameters of the models. The last series, at Table 11, introduces the possibility of losing money and is thus mainly used to determine the loss aversion parameter. This structured approach allows for a comprehensive evaluation of the decision-making characteristics of LLMs under different financial scenarios.


Table 9: Multiple Choice List: Series 1


Table 10: Multiple Choice List: Series 2


Table 11: Multiple Choice List: Series 3

In this section, we showcase the complete prompt utilized in our experiments. We conducted two types of experiments in all cases. The first is the context-free LLM behavior evaluation, which assumes no pre-set knowledge about humans, allowing the model to make decisions based solely on its own understanding. The second is the demographic-embedded behavior evaluation, where each prompt is preceded by the addition of specific demographic features. This setup enables us to assess how incorporating demographic information influences the decision-making process of the LLMs.

C.1 Context-free Experiment Prompt

The prompts used for the context-free evaluation of LLMs are listed in Table 12. These prompts are repeatedly sent using the LLMs’ APIs. In each experiment, we maintain a history of all responses from the LLMs, and this information is aggregated into the subsequent inputs. This process simulates a human-like interaction, akin to how a researcher would present the lottery and a list of questions to human subjects under the same context.


We will show you two options for each lottery, and you will choose which option you want. For each lottery, each option will have different potential earnings, with a chance to earn, showing as a percentage under each option. Each of the selections will be independent, that is, for each lottery, your choice should be independent of the previous and following lotteries. Here are lotteries with options A and B. You can choose to play A or B and get the payment following the rules below. You can choose option A from row <1> to row <x1>, choose option B from row <x+1> to row 14. See Table 9. Answer me with the value of <x1> only, please remember <x1> should be larger and equal to 1, less and equal to 13, do not explain.

Now let’s play the second lottery. You can choose option A from row <1> to row <x2>, choose option B from row <x+1> to row 14. See Table 10. Answer me with the value of <x2> only, please remember <x2> should be larger and equal to 1, less and equal to 13, do not explain.

Now let’s play the last lottery. You will start with 10 dollars. You are going to play with this money. You can take it unless you lose in the lottery. And if you win we may add some to it. Here are 7 lotteries with options A and B. You can choose option A from row <1> to row <x3>, choose option B from row <x+1> to row 7. See Table 11. Answer me with the value of <x3> only, please remember <x3> should be larger and equal to 1, less and equal to 6, do not explain.

Table 12: Prompt List for Context-free Behavior Evaluation

C.2 Demographic-feature Embedded Experiment Prompt

The prompts employed in the demographic-feature embedded experiment consist of two parts. The main body of the questions is similar to those used in the context-free prompt design. However, each prompt in the demographic-feature embedded experiment is augmented with a demographic component derived from the template shown in Table 13. The complete list of these augmented prompts is presented in Table 14. Similar to the context-free behavior evaluation, the session history is maintained in the demographic-feature-embedded experiments to allow the participating LLMs to retain the memory of the previous lottery. Demographic features are added before each prompt to ensure that all features are remembered by the LLMs, thereby preventing the loss of feature memory in long-term conversations.


Imagine a [’Age’] year old [’Gender’] with a [’Education’] degree, who is [’Marital Status’] and lives in a [’Location’] area. This individual identifies as [’Sexual Orientation’] and is [’Disability’], of [’Race’] descent, adheres to [’Religion’] beliefs, and supports [’Political Affiliation’] policies. Consider the risk preferences and decision-making processes of a person with these characteristics.

Table 13: Demographic Feature Prompt Template
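A small helper can turn this template into concrete persona prefixes; the sketch below is our own illustration (the value pools are abbreviated examples, the full category lists are those in Table 2, and uniform sampling corresponds to the random-assignment setting, whereas the real-world setting would weight the draws by World Bank statistics [36]).

```python
import random

# Illustrative value pools; the categories actually used in the study are listed in Table 2
PERSONA_FIELDS = {
    "age": ["22", "35", "60"],
    "gender": ["male", "female"],
    "education": ["high school", "bachelor's", "master's"],
    "marital_status": ["single", "married"],
    "location": ["urban", "rural"],
    "sexual_orientation": ["heterosexual", "homosexual"],
    "disability": ["not disabled", "physically disabled"],
    "race": ["Asian", "African", "Hispanic", "European"],
    "religion": ["non-religious", "Christian", "Muslim"],
    "political_affiliation": ["Democratic", "Republican"],
}

# Wording mirrors the Table 13 template, with underscore field names for formatting
TEMPLATE = ("Imagine a {age} year old {gender} with a {education} degree, who is {marital_status} "
            "and lives in a {location} area. This individual identifies as {sexual_orientation} and is "
            "{disability}, of {race} descent, adheres to {religion} beliefs, and supports "
            "{political_affiliation} policies. Consider the risk preferences and decision-making "
            "processes of a person with these characteristics.")

def sample_persona_prompt(rng=random):
    """Draw one persona at random and fill the template with it."""
    persona = {field: rng.choice(values) for field, values in PERSONA_FIELDS.items()}
    return TEMPLATE.format(**persona), persona

prompt, persona = sample_persona_prompt()
print(prompt)  # prepended to each lottery prompt in the demographic-feature-embedded setting
```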


1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] . Justification: The abstract and introduction accurately summarize the key points of the study. Guidelines:

• The answer NA means that the abstract and introduction do not include the claims made in the paper.

• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes]. Justification: We discussed the limitations of this work. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

• The authors are encouraged to create a separate "Limitations" section in their paper.

• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] . Justification: We offered the entire theoretical and mathematical framework, together with a thorough explanation of the underlying assumptions and supporting proofs. Guidelines:

• The answer NA means that the paper does not include theoretical results.

• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

• All assumptions should be clearly stated or referenced in the statement of any theorems.

• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

• Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] . Justification: There are experiments in the paper. We offer the step-by-step instructions for the experiment’s structure. Additionally, coding scripts, data, and prompts are also provided for replication. Guidelines: • The answer NA means that the paper does not include experiments.

• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We release our code and data, with instructions on the exact commands and environment. Guidelines:

• The answer NA means that paper does not include experiments requiring code.

• Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/ guides/CodeSubmissionPolicy) for more details.

• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https: //nips.cc/public/guides/CodeSubmissionPolicy) for more details.

• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] . Justification: There is no model training in this study. Guidelines: • The answer NA means that the paper does not include experiments.

• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

• The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes]. Justification: In this work, the statistical significance and error bars are reported comprehensively. Guidelines: • The answer NA means that the paper does not include experiments.

• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

• The assumptions made should be given (e.g., Normally distributed errors).

• It should be clear whether the error bar is the standard deviation or the standard error of the mean.

• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

• If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] . Justification: We offer comprehensive instructions for executing the code. Guidelines: • The answer NA means that the paper does not include experiments.

• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] . Justification: We carefully read and complied with the NeurIPS Code of Ethics. Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] . Justification: We discuss multiple aspects of the societal impacts of our evaluation approach and suggest future research directions to address the limitations we identified. Guidelines: • The answer NA means that there is no societal impact of the work performed.

• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] . Justification: Our work poses no such risks. Guidelines: • The answer NA means that the paper poses no such risks.

• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] . Justification: Our evaluation framework adapts models from earlier research, and we cite all relevant publications. Guidelines: • The answer NA means that the paper does not use existing assets.

• The authors should cite the original paper that produced the code package or dataset.

• The authors should state which version of the asset is used and, if possible, include a URL.

• The name of the license (e.g., CC-BY 4.0) should be included for each asset.

• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] . Justification: We release the evaluation framework’s scripts and a sample dataset as new assets, together with all necessary documentation. Guidelines: • The answer NA means that the paper does not release new assets.

• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

• The paper should discuss whether and how consent was obtained from people whose asset is used.

• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] . Justification: This paper does not involve human subjects. Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] . Justification: Since this study does not involve human participants, IRB approval is not required. Guidelines:

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
