b

DiscoverModelsSearch
About
GTA: A Benchmark for General Tool Agents
4 weeks ago
·
NeurIPS
Abstract

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs’ tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AIgenerated queries, single-step tasks, dummy tools, and text-only interactions, failing to effectively reveal the agents’ real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents’ actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. Dataset and code are available at https://github.com/open-compass/GTA.

Integrating tools with large language models (LLMs) has attracted broad research interest as a potential approach towards general AI assistants. Notable works include LangChain [5], AutoGPT [8], and ChatGPT Plugins [19]. These systems decompose workflow into two interactive parts: planning and execution, respectively handled by LLM controllers and callable tools. Solving complex real-world tasks requires multiple types of tools, including perception, operation, logic, and creativity, posing great challenges to LLMs’ tool-use proficiency. Consequently, evaluating the models’ tool-use capabilities for real-world tasks is crucial for enhancing the effectiveness of agent systems.

Despite the progress on benchmarking the tool-use capability of LLMs made by recent works, especially on collecting massive APIs and AI-generated user queries to enable scalable testing, there remain noticeable gaps regarding real-world scenarios, as shown in Table 1. First, AI-generated user queries, limited by the generative model, often result in overly brief or monotonous solutions.

Table 1: Comparison of benchmarks for the LLM-based agent system. *Real-world means solving the queries is helpful for humans in real life while step-implicit and tool-implicit for LLMs.

image

image

Figure 1: Some samples in the GTA benchmark. The user queries are human-designed, step-implicit, tool-implicit, and settled in real-world scenarios. Multimodal context inputs are provided. Solving these queries is helpful for users and complex for a LLM-based tool agent. The agent must use a combination of executable tools in perception, operation, logic, and creativity categories.

This is unsuitable for evaluating agent systems’ reasoning and planning capability, as shown in Table 2. Second, existing tool-use benchmarks mainly focus on text-formed user-agent interaction, lacking assessment of multimodal capabilities, thus falling short of aligning with real-world scenarios effectively. Third, existing tool-use evaluation approaches build up virtual tools. They can only evaluate isolated steps in the tool invocation chains, thus unable to reflect the agents’ capability to accomplish complex tasks end-to-end.

To ensure the evaluation closely reflects real-world scenarios, we consider the authenticity of user queries, tools, and interaction modalities. We propose a comprehensive tool-use evaluation with real-world user queries. The primary features of the evaluation are:

i. Real user queries. The user queries are designed by humans, rather than generated by AI, to reflect real-world tasks accurately. These queries describe tasks with clear objectives, but the tool-use steps are implicit. Thus, the LLM must use reasoning to deduce the suitable tools to address the given tasks. In this way, we avoid the drawbacks of using AI-generated queries in which the tool invocation steps are often explicitly hinted at. Moreover, each query requires multiple steps to resolve, necessitating the model to plan the sequence of tool invocations.

ii. Real deployed tools. We provide an evaluation platform deployed with tools across various categories, such as perception, operation, logic, and creativity. All tools are executable rather than simulated by text description. A detailed and executable ground truth tool chain is provided for each task, including each tool-use step and the final answer. Each step includes the tool name,

Table 2: Comparison of GTA queries with AI-generated queries. The steps and tool types for queries in ToolBench and m&m’s are explicitly stated, as marked in red and blue. The queries in APIBench are simple and only contain one step. Our GTA’s queries are both step-implicit and tool-implicit.

image

argument value, and return value. The detailed tool chains enable a fine-grained evaluation of the actual problem-solving abilities of tool agents.

iii. Real multimodal inputs. Each query is accompanied by one or two authentic image files, including spatial scenes, webpage screenshots, tables, code snippets, printed/handwritten materials, etc., to serve as the context for the user queries. The LLM is required to solve the problem based on the multimodal context and user queries. This setting closely aligns with the multimodal real-world problem-solving scenarios.

We manually design 229 real-world tasks and corresponding executable tool chains to evaluate mainstream LLMs. We build a platform covering a total of 14 tools across perception, operation, logic, and creation categories. Tools and some data samples are illustrated in Figure 1. We design fine-grained tool evaluation metrics that cover the entire process of tool invocation. Our findings indicate that real-world scenario queries present challenges to existing LLMs, with GPT-4 completing fewer than 50% of the tasks and most LLMs managing less than 25%.

In summary, our contributions are as follows:

• A tool-use benchmark for general tool agents. The user queries are human-designed, step-implicit, and settled in real-world scenarios. Multimodal contextual inputs are provided. Each query has a corresponding executable tool chain to enable a fine-grained tool-use evaluation.

• An evaluation platform equipped with a wide variety of executable tools covering the categories of perception, operation, logic, and creativity. Fine-grained metrics are designed for tool use, unveiling tool-augmented LLMs’ reasoning and planning capabilities in real-world scenarios.

• Evaluation and analysis of mainstream large language models. We evaluate the tool-use ability of 16 LLMs in multiple dimensions. Our findings reflect the tool-use bottleneck of existing LLMs in real-world scenarios, providing suggestions for the development path of general tool agents.

LLM-based agents. In the pursuit of developing general-purpose agents, there has been considerable focus on integrating LLMs with external tools. These LLM-based agents enable powerful capabilities in environment interaction, decision-making, and task execution. Open-source platforms have been proposed, such as LangChain [5], AutoGPT [8], and BabyAGI [17]. Moreover, several efforts have been made to achieve specialized capabilities by integrating specialized tools into LLMs. WebGPT [18], WebCPM [22], WebShop [32] are proposed to enhance the model’s web search ability. RestGPT [26] combines LLM with RESTful APIs to enable web service development. In the visual domain, Visual ChatGPT [29], MM-ReAct [31], MLLMtool [28], and LLaVA-Plus [13] prompt or finetune LLMs to interact with visual models. In the data analysis domain, DataCopilot [36] manages and processes massive data autonomously by invoking data analysis tools. HuggingGPT [24], ModelScopeAgent [11] build agent systems using LLMs integrated with massive machine learning models. In the field of human-computer interaction, AppAgent [35] allows LLMs to mimic human stapping and swiping operations to operate smartphones. In these works, the LLM serves as a central controller, invoking a certain class of tools to accomplish specialized tasks. In real-world scenarios, the environment is more complex. This requires LLMs to engage in planning and coordination among various types of tools, thereby posing a challenge to their tool-use capabilities.

Tool-use evaluations. With the rise of LLM-based agents, many studies have been conducted to evaluate the tool-use capabilities of LLMs. ToolBench [23] collects RESTful APIs and leverages ChatGPT [1] to design tool-use tasks and corresponding tool chains. Two metrics, Pass Rate and Win Rate, are devised to evaluate the efficacy of tool use. APIBench [21] is a comprehensive dataset that includes APIs from HuggingFace, TorchHub, and TensorHub, with evaluation metrics focusing on Abstract Syntax Tree (AST) accuracy. API-Bank [12] comprises 53 commonly utilized APIs, such as SearchEngine, PlayMusic, BookHotel, and ImageCaption, along with a comprehensive tool-augmented LLM workflow to evaluate the API calling, retrieving, and planning abilities. m&m’s [14] is a benchmark to evaluate tool use for multi-step multimodal tasks. It aims to evaluate different planning strategies for LLMs as planning agents. Most of the benchmarks above, however, rely on AI-generated queries. The tool-use steps are explicitly and rigidly included. Thus, these queries do not accurately represent real-world scenarios. Among many previous studies, GAIA [16] is renowned for its real-world scenario based benchmark aiming at evaluating general AI assistants, which is closer to our work. It designs conceptually simple questions for humans yet is challenging for most advanced AIs. However, GAIA focuses on artificial general intelligence (AGI). In contrast, GTA is designed to evaluate tool agents specifically, offering real-deployed tools and executable tool chains for a fine-grained evaluation in real-world scenarios. Osworld [30] is also a real-world benchmark featuring multi-step, complex tasks inspired by authentic user cases. Still, it is specifically tailored for computer environments, whereas GTA is devised for tool agents operating in more generalized real-world scenarios.

In this section, we describe the design and content of GTA. The whole dataset construction pipeline is shown in Figure 2. We first present the composition of each sample in the dataset in Section 3.1. The construction method of queries and tool chains are depicted in Section 3.2 and Section 3.3, respectively. We then present the dataset’s statistics in Section 3.4.

3.1 Dataset Formulation

Given a set of tools  Tc = {tk}Nk=1, a sample in GTA is composed of five parts (F, Q, T , C, A). Among these parts, F is a set of files containing one or two images. Q is a query based on F. It is a real-world scenario based problem of simple form but needs to be solved through multiple steps with tools in  Tc. Which tools need to be used, and in what steps are not explicitly included in the query. They require reasoning and planning by the LLM, which serves as a central controller. This procedure is given in the reference tool chain  C = {si}mi=1. The tool chain contains m steps. Each step is  si = (ti, ai, ri), where  tiis the tool used in step  i. aiand  riindicate arguments and return values.  T = �mj=1{tj} ⊆ Tcnotes the set of tools involved in this query. A is the final answer yielded by the LLM after reasoning with tools.

In our setting,  Tccontains 14 tools across four categories, including perception, operation, logic, and creativity. The full list of tools is shown in Figure 1, and more detailed information can be found in Appendix B.1. The queries Q are classified into three types: subjective, objective, and image generation. Examples of the three types of queries are shown in Appendix B.2. For a subjective query Qs, the final answer A is usually some descriptive text. It is not unique, but the general idea is the same. In this case, A contains a list of three reference answers. For an objective query  Qo, Ais a uniquely determined number or phrase. For an image generation query  Qg, we do not measure the generated image directly. In this situation,  A = ∅.

3.2 Query Construction

To construct (F, Q, T ), we first gather human-designed queries that meet three main principles: i) Given  T ⊆ Tc, the task (F, Q) can be solved with the capabilities enabled by tools in T . ii) To evaluate LLMs’ reasoning and planning abilities, the tool invocation steps should not be explicitly

image

Figure 2: Two steps are performed in the dataset construction pipeline.  ➊ Duringquery construction, initial exemplars and instruction documents are designed by experts and given to human annotators. Annotators brainstorm and design more samples based on the exemplars.  ➋During tool chain construction, annotators manually call the deployed tools to check the executability of each query in the query set. Then they annotate the ground truth tool chains for each query.

stated in the queries. iii) The queries are meaningful and based on real-world scenarios. Satisfying all the principles simultaneously is challenging. It requires F, Q, and T to match each other sensibly and logically. We use a query construction pipeline based on exemplar expansion, as shown in the first part of Figure 2. We first give some initial exemplars with diverse scenarios and tool combinations. Then, we instruct annotators to create more queries based on the exemplars.

Exemplar designed by experts. We first design some initial questions as exemplars, which are provided in Appendix C.1. These example questions are of diverse scenarios and contain different tool combinations. Every sample should comprise six components: F (image files), Q (queries), T (involved tools), S (solution steps), A (answers), and E (evidence). Image files F could be obtained from the internet, and their URLs must be recorded. F could also be a photo taken or a diagram drawn by the annotators. The query Q must avoid obvious references to a specific tool. For example, the query please describe the image for me is unqualified since it obviously refers to the tool ImageDescription. The components S, A, and E will not appear in the final dataset but are utilized to assist annotators in meeting the annotation requirements. S represents the steps required to solve the problem. Annotators should note down the steps, ensuring their number exceeds two. The answer A of objective queries should be given to guarantee a unique answer. To ensure the uniqueness, the answer should not depend on the images generated in previous steps. For example, the question what kind of animal is in the picture should not be asked after generate an image of an animal, as the answer is uncertain. For queries utilizing the Google Search tool, E should include the answer’s URL and a screenshot pinpointing the answer’s location to verify the query’s searchability with the tool.

Diversified expansion by annotators. After the initial exemplars are given, we instruct annotators to create more samples based on each exemplar. We adopt a diversified expansion strategy for the annotators to expand the questions based on the exemplars. The general idea is to keep the tool set T of the template unchanged or slightly modify it. Then, annotators brainstorm scenarios different from the template. Further information on the diversified expansion approach is detailed in Appendix C.2. For each sample, we have crafted a manual expansion example to serve as guidance for the annotators. After the expansion process, we perform a quality check and manually filter out the questions that do not satisfy the expansion requirements. The instruction documents for annotators are reported in Appendix C.3.

Considerations for the search tool. Web search tools like Google Search may return variable results over time, harming the accuracy and stability of evaluation. To address this problem, we perform two constraints. First, the question must be answered using web search rather than relying on an LLM’s internal knowledge. This can be achieved by designing time-sensitive questions, such as what is the 2023 QS ranking of Tsinghua University rather than general inquiries like where is Tsinghua University located in China. We may also direct the query to a specific information source, for instance, asking what is the recipe for Ma Po Tofu according to the BBC Good Food website instead

image

Table 3: Basic statistics of GTA.

image

Figure 3: Other statistics of GTA. (a) Step number per query. (b) Frequency of different tool combination.

of a broad question like what is the recipe for Ma Po Tofu. Second, it is crucial to ensure that answers remain constant over time within an evaluation dataset. To fulfill this criterion, we can specify a time frame, web page, or organization within the question. An example would be what is the 2024 QS ranking of Tsinghua University, rather than what is the QS ranking of Tsinghua University.

3.3 Tool Chain Construction

Based on the (F, Q, T ) samples constructed in Section 3.2, we instruct three annotators majoring in computer science to manually construct the corresponding tool chain C and the final answer A. We design a JSON file structure containing the query-related tool list, image paths, and ReAct [33] style dialog sequences. The dialog sequences include the user query, the executable tool chain, and the final answer. Initially, (T , F, Q) are put into the associated sections for tools, images, and user queries. Subsequently, we deploy all tools in  Tc. The annotators utilize the tools according to the reference steps S and get the outcomes. They record this process in the tool chain section of the dialog sequences, alongside the final answer. Since we do not evaluate the tools’ efficacy, when a tool fails to provide accurate recognition for a query (for instance, OCR inaccuracies in text recognition within diagrams), we discard the query. Through the above process, we ensure the feasibility of the questions, the executability of the tool chains, as well as the precision of the final answers. The structure of the tool chain is provided in Appendix C.4.

3.4 Dataset Statistics

GTA comprises a total of 229 questions, with the basic dataset statistics presented in Table 3. The dataset involves 252 images and 14 distinct tools. It includes 156 objective, 16 subjective, and 57 image generation queries. The number of tools involved in each question varies from 1 to 4, with most questions using 2 or 3 tools. The steps to resolve the questions range from 2 to 8, with most questions requiring 2 to 4 steps, as depicted in Figure 3(a). The detailed frequency distribution of different tool combinations is listed in Figure 3(b). P, O, L, C are short for Perception, Operation, Logic, Creativity, respectively. Perception+Logic and Perception+Operation are the most frequently appearing tool combination types.

4.1 Experiment Settings

We evaluate 16 LLMs on GTA. For API-based models, we select GPT-3.5 [20], GPT-4 [1], GPT-4o, Claude-3 [2], and Mistral-Large [9]. For open-source models, we select Llama-3 [15] series, Qwen1.5 [3] series, Mistral [9], Mixtral [10], Yi [34] series, Deepseek [4] series. Experiments are conducted using NVIDIA A100 GPU within OpenCompass [7] evaluation platform. We adopt Lagent [27] as the agent framework. ReAct [33] is used as the tool invocation prompt schema. More experiment information can be found in Appendix D.1 and D.2.

We evaluate the models in two modes. Step-by-step mode is designed to evaluate the model’s fine-grained tool-use capabilities. In this mode, the model is provided with the initial n steps of the reference tool chain as prompts, with the expectation to predict the action in step n + 1. This method does not involve the actual use of the tool, and the prediction of each step does not depend on the

Table 4: Main results of GTA. Inst., Tool., Arg., Summ., Ans., Ans.+I denote InstAcc, ToolAcc, ArgAcc SummAcc, AnsAcc, and AnsAcc w/ ImgGen respectively. P., O., L., C. denote the F1 score of tool selection in Perception, Operation, Logic, and Creativity categories. Bold denotes the best score among all models. Underline denotes the best score under the same model scale. AnsAcc reflects the overall performance.

image

model’s preceding outputs. This enables an alignment comparison between the model’s output with each step of the ground truth tool chain. End-to-end mode is designed to reflect the tool agent’s actual task executing performance dynamically. In this mode, the model actually calls the tools and solves the problem by itself. Each step relies on the preceding step’s output. We compare the tools selected and the execution result with the ground-truth tool set and result under this mode.

4.2 Evaluation Metrics

We design fine-grained metrics spanning from the LLM’s tool invocation process to execution results. To evaluate the tool invocation process, we devise four metrics under step-by-step mode: InstAcc, ToolAcc, ArgAcc, and SummAcc. InstAcc is instruction-following accuracy, which quantifies the percentage of steps executed without errors. ToolAcc measures the accuracy of tool selection. ArgAcc accesses the accuracy of argument name prediction. SummAcc reflects how accurately the model can summarize the final answers considering all previous tool-use steps. For end-to-end mode, we use AnsAcc to measure the accuracy of the execution result. Besides, we calculate the F1 scores of tool selection in perception, operation, logic, and creativity categories. The four F1 scores compare the model’s tool selection with the ground truth tool set, measuring its tool selection ability.

In calculating the metric AnsAcc, we exclude image generation queries and focus solely on queries with pure text answers, including subjective and objective queries. For objective queries, the ground truth contains both a whitelist and a blacklist of phrases. An answer is considered correct if it includes all terms from the whitelist and excludes all terms from the blacklist. In the case of subjective queries, the ground truth contains three manually labeled responses from distinct annotators. We compute the cosine similarity (ranging from 0 to 1) between the model’s prediction and each of the three ground truth answers, ultimately considering the highest score obtained. We also design a metric AnsAcc w/ ImgGen, to take image generation queries into account indirectly. Given that the outcome of the image generation is determined solely by the input parameters, we evaluate the accuracy of these parameter predictions. If the predicted parameters are correct, the images produced should align with the specified task objectives. The specific score calculation formulas of subjective and image generation queries are shown in Appendix D.3.

image

Figure 4: Performance of models with various size. Larger models within the same series perform better than their smaller counterparts, but larger models from different series do not necessarily outperform the smaller ones.

image

Figure 5: The Pearson correlation coefficient between AnsAcc and four metrics.

image

Model Figure 6: The number of successful and failed tool calls of each model.

4.3 Main Results

Real-world tool-use tasks are challenging for existing LLMs. Current LLMs are struggling to accurately invoke tools to solve these real-world tasks. As shown in Table 4, the best-performing models, GPT-4 and GPT-4o can only correctly solve fewer than 50% of the problems, while the rest of the models solve less than 25%. This shows that real-world problems with implicit steps, real tool invocations, and multimodal contextual inputs impose high requirements on the tool-use capabilities of LLMs. Regarding model performance comparisons, API-based models outperform open-source ones. Among open-source models, Qwen1.5-72B-Chat has the highest result accuracy. Larger models within the same series perform better than their smaller counterparts, but larger models from different series do not necessarily outperform the smaller ones, as shown in Figure 4. For example, the AnsAcc of Llama-3-70B-Instruct is higher than that of Llama-3-8B-Instruct, but lower than Qwen1.5-7B-Chat.

The current bottleneck mainly lies on argument prediction. From the results, we observe that the overall performance of the system is affected by the lowest metric. We argue that the four metrics in the step-by-step mode follow the buckets effect. To verify this observation, we calculate the Pearson correlation coefficients between four metrics (InstAcc, ToolAcc, ArgAcc, SummAcc) and AnsAcc, the result is shown in Figure 5. We find that the correlation coefficient for ArgAcc with AnsAcc is the highest. ArgAcc is low for most models, indicating that the four metrics follow the buckets effect. For example, the scores of Llama-3-70B-Instruct in InstAcc, ToolAcc, and SummAcc are higher than those of Qwen1.5-14B-Chat, but its ArgAcc is lower than Qwen1.5-14B-Chat, resulting in a lower final answer accuracy. The scores of GPT-4o in InstAcc and ToolAcc are higher than GPT-4, but its weaker argument prediction capability leads to a lower accuracy rate in the final result. The reason for the buckets effect is that under our evaluation framework, the model needs to follow user instructions, invoke tools multiple times in the correct format, and summarize the answer based on the returned results. Any error in this process can lead to an incorrect conclusion. Currently, argument prediction is the weakest capability for most models, suggesting that to enhance their general tool-use capabilities, researchers can focus on argument prediction capabilities. This concerns both the value and the format correctness of an argument.

Different series of LLMs exhibit distinct behavioral patterns. We count the number of successful and failed tool calls, illustrated in Figure 6. Successful means there are not any errors in the tool call. GPT-4o has the highest number of successful tool calls, while GPT-4 has the highest successful tool call rate. We find that models from different series exhibit dis-

Table 6: Detailed error distribution of GPT-4-1106-Preview and Llama-3-8B-Instruct on argument prediction.

image

tinct behavioral tendencies. Yi and Deepseek series tend to be aggressive, invoking tools frequently but lacks sufficient instruction-following ability to invoke tools in a correct format. The Qwen series is conservative, preferring to invoke tools less often, yet it has stronger instruction-following capabilities, resulting in a higher success rate of tool calls. The GPT series is neutral, tending to invoke tools moderately and possessing robust instruction-following abilities, which leads to the highest final answer accuracy. This suggests that to improve the performance of Yi or Deepseek, focus should be given to enhancing their instruction-following ability.

Conversely, to enhance the Qwen series, reducing its conservative behavior to tool calls could be beneficial.

Table 5: The percentage of different error types.

image

Models favor either format errors or argument format errors, not both equally. We count the percentage of error types when calling tools, including format error, argument format error, and N/A (other errors, mainly containing the tools’ internal error). Most models exhibit a clear tendency toward either format errors or argument format errors, rather than making both types of mistakes in nearly equal numbers. For example, Claude-3’s errors are predominantly argument format-related, amounting to 82.86%, while format errors account for a mere 4.29%. This indicates that Claude-3 can follow the toolcall format well, but fails to pass the argument in a correct format.

4.4 Further Analysis and Exploration

Detailed Error Analysis. In Section 4.3, we discuss the bottleneck in task performance arising from most models’ inability to generate responses or predict arguments in the correct format. To understand the reason behind these model failures, we conduct a detailed analysis of the predictions generated by GPT-4-1106-Preview and Llama-3-8B-Instruct. We systematically categorize seven primary error types. The statistical outcomes are presented in Table 6. Detailed error cases of each type can be found in Appendix D.4.

Our analysis reveals distinct error distributions between GPT-4 and Llama-3. GPT-4 consistently adheres to the given prompts when executing actions, in contrast to Llama-3, which often fails to maintain the prescribed format. However, GPT-4 is prone to generating passive thought processes or attempting interaction with the user, rather than taking decisive action.

For the GPT-4 model, the predominant type of error is No Action, wherein the model neither utilizes tools nor produces a final answer. In 38.7% of erroneous responses, GPT-4 attempts to engage with

Table 7: Comparison of Llama-2-Chat-7B with Agent-Flan-7B (which is fine-tuned from Llama-2- Chat-7B on ReAct and JSON format data) on GTA.

image

the user, mistakenly assuming the query lacks clarity and requesting additional information, despite the query and input images supplying sufficient details for task resolution. Furthermore, 50% of the error responses consist solely of the model’s internal thought without any corresponding action.

For the Llama-3 model, most errors are related to formatting during action sequences, such as invoking tools or generating the final answer. Specifically, 45.4% of errors originate from argument predictions not adhering to a valid JSON format. Additionally, in 16.5% of the flawed responses, the model attempts to invoke multiple tools simultaneously, which is not supported by the agent system. Moreover, 19.6% of the errors occur when the model disregards the prompt and generates redundant information after argument prediction, leading to incorrect argument parsing. Finally, in 15.8% of the cases, the model fails to perform the correct action, merely repeating content from the prompt.

Further Exploration to Enhance Model Performance. Since the LLM functions as the central controller of an agent system, producing responses that strictly comply with the agent protocol is important. Fine-tuning on ReAct and JSON format may mitigate format-related errors during action execution. To verify this, we further compare Llama-2-Chat-7B with Agent-Flan-7B on GTA benchmark. AgentFLAN[6] is a popular instruction tuning method that fine-tunes LLM-based agents using ReAct and JSON instruction-following data. Agent-Flan-7B is fine-tuned from Llama-2-Chat-7B using AgentFLAN method. The results are shown in Table 7.

We discovered that Agent-Flan-7B’s InstAcc and ToolAcc metrics were significantly higher than those of Llama-2-Chat-7B. The responses of Agent-Flan-7B follow the format of ’Thought-Action-Action Input’ that specified in the prompt. But most responses of Llama-2-Chat-7B fail to follow the format. This further suggests the improved instruction following capability of Agent-Flan-7B.

However, we note that the ArgAcc of Agent-Flan-7B is still low. We compare the response of the two models, as shown in Appendix D.5. We find that although Agent-Flan-7B follows the format of ’Thought-Action-Action Input’, it sometimes fails to generate the argument (Action Input) in a correct JSON format, or summarizes the final answer incorrectly. Thus, how to further enhance the model’s capabilities on GTA through instruction fine-tuning is still an open problem.

We propose GTA, a real-world tool-use benchmark for general-purpose agents. The user queries are human-designed, step-implicit, and settled in real-world scenarios. Multimodal contextual inputs are provided. We build an evaluation platform equipped with executable tools in the categories of perception, operation, logic, and creation. Fine-grained metrics are designed for the tool-use capabilities of LLMs in real-world scenarios. We evaluate the tool-use capabilities of 16 LLMs. The evaluation results show that GTA is challenging for current LLMs, with advanced models like GPT-4 struggling with these real-world tasks, completing less than 50% of them. Based on our findings, we give takeaways and further suggestions on tool-use capability improvement. We believe that the GTA benchmark will advance further research in identifying the model’s tool-use capabilities and contribute to realizing general-purpose tool agents.

Our benchmark lacks language diversity since all queries are in English. Multilingual queries can be added in future work to assess the capability of tool agents in non-English environments. Moreover, to achieve high data quality, both the user queries and the tool chains are human-written. So the cost of a data piece is higher than that of AI-generated counterparts.

This work is supported by the National Key R&D Program of China (No. 2022ZD0161600), and the National Natural Science Foundation of China under Grants 62422311 and 62176152.

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.

[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[4] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

[5] Harrison Chase. Langchain, October 2022.

[6] Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-FLAN: Designing data and methods of effective agent tuning for large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 9354–9366, August 2024.

[7] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[8] Significant Gravitas. Autogpt, 2023.

[9] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[10] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

[11] Chenliang Li, He Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, et al. Modelscope-agent: Building your customizable agent system with open-source large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 566–578, 2023.

[12] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li API-bank. A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, 2023.

[13] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023.

[14] Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, and Ranjay Krishna. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Synthetic Data for Computer Vision Workshop@ CVPR 2024, 2024.

[15] AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI Blog (accessed 2024–04–20). There is no corresponding record for this reference, 2024.

[16] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.

[17] Yohei Nakajima. Babyagi, 2023.

[18] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

[19] OpenAI. Chatgpt plugins. https://openai.com/index/chatgpt-plugins/, 2023.

[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[21] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.

[22] Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8968–8988, 2023.

[23] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

[24] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.

[25] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33:16857–16867, 2020.

[26] Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis. arXiv preprint arXiv:2306.06624, 2023.

[27] Lagent Developer Team. Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents. https://github.com/ InternLM/lagent, 2023.

[28] C Wang, W Luo, Q Chen, H Mai, J Guo, S Dong, XM Xuan, Z Li, L Ma, and S Gao. Mllm-tool: A multimodal large language model for tool agent learning. arXiv preprint arXiv:2401.10727, 4, 2024.

[29] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.

[30] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.

[31] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.

[32] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

[33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.

[34] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.

[35] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023.

[36] Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Data-copilot: Bridging billions of data and humans with autonomous workflow, 2024.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See Introduction.

(b) Did you describe the limitations of your work? [Yes] See Limitation.

(c) Did you discuss any potential negative societal impacts of your work? [N/A] Our paper proposes a dataset to evaluate LLM-based tool agents in real-world scenarios. There is currently no negative social impact.

(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] See supplementary materials.

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments (e.g. for benchmarks)...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] https: //github.com/open-compass/GTA

(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] https://github.com/open-compass/GTA

(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]

(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 4.1.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] See supplementary materials.

(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

image

A.1 Motivation

image

We create GTA (a benchmark for General Tool Agents) to evaluate the general tool-use ability of LLMs in real-world scenarios. The benchmark has human-written queries with simple real-world objectives but implicit tool-use, an evaluation platform equipped with executable tools across diverse categories, and authentic image files as context input. These features bridge the gap between existing benchmarks and real-world tool-use scenarios.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The authors of this paper.

Who funded the creation of the dataset?

This work is supported by the National Key R&D Program of China (No. 2022ZD0161600), and the National Natural Science Foundation of China under Grants 62422311 and 62176152.

A.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

Each instance in GTA is in the JSON format. It contains natural language queries, image file inputs, tool descriptions, a reference tool chain, and a final answer.

How many instances are there in total (of each type, if appropriate)? There are 229 instances in GTA, with 252 image files.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? We will provide all instances in our GitHub repository for GTA.

What data does each instance consist of? Each instance contains a natural language query, image file inputs, tool descriptions, a reference tool chain, and a final answer.

Is there a label or target associated with each instance? The correct tool chain and final answer is provided for each query.

Is any information missing from individual instances? No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? No.

Are there recommended data splits (e.g., training, development/validation, testing)? The whole dataset is a test set.

Are there any errors, sources of noise, or redundancies in the dataset? The dataset are created and verified by human. The noise may come from human error in writing.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)?

image

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? No.

A.3 Collection Process

How was the data associated with each instance acquired? The queries are all human designed. The image inputs are collected from the Internet or created by annotators (such as diagrams drawn by annotators).

What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? We use Google Images to collect image inputs. Queries are written by human.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? The data are created by researchers and student annotators. The annotators were paid about $ 40 per day.

Over what timeframe was the data collected? The data were constructed in 2023 and 2024.

Were any ethical review processes conducted (e.g., by an institutional review board)? Yes. All images within GTA are available for academic use. During the collection process, we instruct annotators to document the original URL of each image. Subsequently, we manually review these URLs, eliminating images that are not suitable for academic use. Moreover, should any authors request the removal of their images from GTA, we will promptly comply.

A.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

image

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?

There is no raw data, since the dataset is created from scratch, rather than a cleaned version of existing data.

Is the software that was used to preprocess/clean/label the data available? Excel and VSCode are used for create the data.

A.5 Uses

Has the dataset been used for any tasks already? No.

Is there a repository that links to any or all papers or systems that use the dataset? No.

What (other) tasks could the dataset be used for? GTA is used for evaluating the general tool-use ability of LLMs in real-world scenarios.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? No.

Are there tasks for which the dataset should not be used? No.

Are there any potential negative social impacts?

The GTA benchmark may have potential negative societal impacts. These include copyright concerns related to image data collection. The presence of images involving people in our dataset also raises privacy concerns. Additionally, during the evaluation of GTA, the agent system could potentially experience hallucinations and generate harmful information. Besides, given the inclusion of coding questions in GTA, the agent system might produce malicious code.

A.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? No.

How will the dataset will be distributed (e.g., tarball on website, API, GitHub)? The dataset will be released at https://github.com/open-compass/GTA.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? The dataset is released under the Apache License.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? No.

A.7 Maintenance

Who will be supporting/hosting/maintaining the dataset? The authors of this paper.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Please contact with authors through emails in the paper.

Is there an erratum? No.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? Yes, users can propose issues and the dataset will be updated on Github.

Will older versions of the dataset continue to be supported/hosted/maintained?

Primarily, we plan to maintain only the most recent version of the dataset. However, under certain circumstances, such as significant updates to our dataset or the need for validation of previous research work using older versions, we will exceptionally preserve previous versions of the dataset for up to one year.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? Contact the authors of the paper.

B.1 Tool Definition

The detailed definition of 14 tools across perception, operation, logic, and creativity categories are shown in Table 8.

Table 8: Detailed definition of 14 tools across four categories.

image

B.2 Examples of Three Query Types

The examples of objective queries  Qo, subjective queries  Qs, and image generation queries  Qgare shown in Figure 7, Figure 8, and Figure 9, respectively. In the supplementary material, we provide the complete data sample, which is in the JSON format, including the involved tools, files, query, tool chain, and the final answer. To facilitate automatic evaluation, we design different final answer format for the three query types. For objective queries, the final answer contains both a whitelist and a blacklist of phrases. An answer is considered correct if it includes all terms from the whitelist and excludes all terms from the blacklist. In the case of subjective queries, the final answer contains three manually labeled responses from distinct annotators. We compute the cosine similarity (ranging from 0 to 1) between the model’s prediction and each of the three ground truth answers, ultimately considering the highest score obtained. For image generation queries, the final answer is none, since we evaluate the execution accuracy through measuring the argument accuracy of image generation tools.

image

Query Type: Objective Query: I need to prepare twelve servings of this dish. How many boxes of eggs will I need in total? Involved Tools: ImageDescription, CountGivenObject, OCR

image

Figure 7: An example of objective query  Qo. The final answer is a uniquely determined number or phrase.

image

Figure 8: An example of subjective query  Qs. The final answer is usually some descriptive text. It is not unique, but the general idea is the same.

C.1 Query Exemplars

We design several initial queries as query exemplars, as shown from Figure 10 to 24. The annotators brainstorm and design new questions that have the same tool chain as the exemplar but with different scenarios. We provide an expansion example for most exemplars for annotators to refer to.

image

1. Identify the ratings of each restaurant in the map using OCR tool. 2. Identify the restaurant with the highest rating and its coordinate from the OCR result.

3. Circle the restaurant in the graph using DrawBox tool.

image

Figure 9: An example of image generation query  Qi. The final answer is none since we do not evaluate the generated image directly.

image

Query: How much should I pay for the beer on the table according to the price on the menu? Involved Tools: ImageDescription, CountGivenObject, OCR, Calculator

image

Query: I need to prepare twelve servings of this dish. How many boxes of eggs will I need in total? Involved Tools: ImageDescription, CountGivenObject, OCR, Calculator

image

Figure 10: Query exemplar 1.

image

Figure 11: Query exemplar 2.

image

Figure 12: Query exemplar 3.

image

Figure 13: Query exemplar 4.

image

Figure 14: Query exemplar 5.

image

Figure 15: Query exemplar 6.

image

Figure 16: Query exemplar 7.

image

Figure 17: Query exemplar 8.

image

Figure 18: Query exemplar 9.

image

Figure 19: Query exemplar 10.

image

Figure 20: Query exemplar 11.

image

Query: As of December 31, 2023, how many Boeing 787-8 Dreamliner airplanes does the airline shown in the image own? Involved Tools: OCR, GoogleSearch Files:

image

1. Identify the airline name. 2. Search for the number of aircraft of the type owned by the airline company. Answer: 36 Evidence: https://en.wikipedia.org/wiki/All_Nippon_Airways

image

Evidence: https://www.amd.com/en/products/cpu/amd-ryzen-9-7950x

image

Figure 21: Query exemplar 12.

image

Figure 22: Query exemplar 13.

image

Figure 23: Query exemplar 14.

image

Figure 24: Query exemplar 15.

C.2 Diversified Expansion Approach

To ensure expansion diversity, we instruct annotators to design new questions according to the diversified expansion approach. Rules of the approach are shown in Figure 25. We also provide an example, shown in Figure 26.

image

Figure 25: Diversified expansion approach.

image

Figure 26: An example for the diversified expansion approach. Changes to the tool set are highlighted in blue. The evidence part is omitted for clarity of illustration.

C.3 Instruction for Annotators

The detailed instruction for annotators during the query construction stage is provided in Figure 27. The instruction during the tool chain construction stage is provided in Figure 28.

image

Figure 27: Annotation instruction document for query construction stage.

image

Figure 28: Annotation instruction document for tool chain construction stage.

C.4 Illustration of Executable Tool Chains

An illustration on each part of the tool chain is shown in Figure 29. It is in the JSON format. It contains the involved tool list, file list, and dialog list. There are three roles in the dialog list: user, assistant, and tool. In the user’s dialog, the query content is recorded. In the assistant’s dialog, the correct tool call including the tool name and arguments is recorded. In the tool’s dialog, the tool’s return value is recorded. You can refer to Figure 7 to ??, Figure 8 to ??, and Figure 9 to ?? for JSON-format tool chain examples.

image

Figure 29: An illustration of each part of the tool chain.

D.1 Build an LLM-Based Agent System

We build the LLM-based agent system using Lagent 2 framework. It equips an LLM with some action & planning schema, using action executor to let it interact with external tools. To build such an agent system, we should consider three parts: LLM, action & planning schema, and tools. In our experiment, we use ReAct as the action & planning schema. As for tools, we have implemented the 14 tools using AgentLego 3, which is a platform supporting tool serving and remote accessing. When evaluating different LLMs, we replace different LLMs into the Lagent framework, and evaluate this system on the Opencompass 4 evaluation platform.

D.2 ReAct-Style Prompts

The ReAct-style prompt template using for the agent system is shown in Figure 30.

image

Figure 30: The ReAct-style prompt template for the agent system.

D.3 Final Answer Evaluation of Subjective and Image Generation Queries

For a subjective query, we use All-MPNet-Base-V2[25] to encode both the prediction and the ground truth. Then we calculate the cosine similarity between the two embeddings. To scale the score from 0 to 1, we consider only positive values:

image

For a query with image answers, the AnsAcc score can be formulated as:

image

where  ti, argi is the i-th image generation-related tool (AddText, DrawBox, TextToImage, ImageStylization) in the ground truth tool chain. I denotes the indicator function.  Tpredsis the set of image generation-related tools that appears in the model prediction.  argpredis the corresponding argument for  tias predicted by the model.

D.4 Detailed Error Cases of Each Type

We conduct a detailed analysis of predictions generated by GPT-4-1106-Preview and Llama-3-8B-Instruct. We systematically categorize seven primary error types. Detailed error cases of each type are shown in Figure 31.

image

An error response of llama3-8b-instruct: "You are a assistant who can utilize external tools.\n[{'name': 'Calculator', 'description': 'A calculator tool. The input must be a single Python expression and you cannot import packages. You can use functions in the `math` package without import.', 'parameters': [{'name': 'expression', 'description': None, 'type': 'STRING'}], 'required': ['expression'], 'parameter_description': 'If you call this tool, you must pass arguments in the JSON format {key: value}, where the key is the parameter name.'}, {'name': 'CountGivenObject', 'description': 'The tool can count the number of a certain object in the image.', 'parameters': [{'name': 'image', 'description': None, 'type': 'STRING'}, {'name': 'text', 'description': 'The object description in English.', 'type': 'STRING'}], 'required': ['image', 'text'], 'parameter_description': 'If you call this tool, you must pass arguments in the JSON format {key: value}, where the key is the parameter name.'}]\nTo use a tool, please use the following format:\n```\nThought:Think what you need to solve, do you need to use tools?\nAction:the tool name, should be one of [['Calculator', 'CountGivenObject']]\nAction Input:the input to the action\n```\nThe response after utilizing tools should using the following format:\n```\nResponse:the results after call the tool.\n```\nIf you already know the answer, or you do not need to use tools,\nplease using the following format to reply:\n```\nThought:the thought process to get the final answer\nFinal Answer:final answer\n```\nBegin!\n" Repeating contents from the prompt.

Prompt for the agent system: "You are a assistant who can utilize external tools. \n[{'name': 'CountGivenObject', 'description': 'The tool can count the number of a certain object in the image.', 'parameters': [{'name': 'image', 'description': None, 'type': 'STRING'}, {'name': 'text', 'description': 'The object description in English.', 'type': 'STRING'}], 'required': ['image', 'text'], 'parameter_description': 'If you call this tool, you must pass arguments in the JSON format {key: value}, where the key is the parameter name.'}] ...<Due to space constraints we omit other tool descriptions here>] To use a tool, please use the following format:\n```Thought: Think what you need to solve, do you need to use tools?\nAction: the tool name, should be one of [{action_names}]\nAction Input: the input to the action```\nThe response after utilizing tools should using the following format:\n```Response: the results after call the tool.```\nIf you already know the answer, or you do not need to use tools, please using the following format to reply:\n```Thought: the thought process to get the final answer\nFinal Answer: final answer```\nBegin!"

image

Figure 31: Detailed error cases of each type in the predictions generated by GPT-4-1106-Preview and Llama-3-8B-Instruct.

D.5 Comparison of Llama-2-Chat-7B and Agent-Flan-7B

We compare Llama-2-Chat-7B with Agent-Flan-7B on GTA benchmark to see if instruction tuning on ReAct and JSON format data can enhance the model’s performance. The comparison of the two models’ responses to a same user query is shown in Figure 32.

image

Figure 32: The comparison of Llama-2-Chat-7B and Agent-Flan-7B responses to a same user query.

Designed for Accessibility and to further Open Science