-
Notifications
You must be signed in to change notification settings - Fork 38
Description
感谢Interns1的杰出工作,我希望测试Interns1在MCP协议下调用工具作为Agent完成科学问题,从模型能力上讲,Interns1是对于科学问题理解能力最先进的模型。
不过在评测过程中我发现Interns1感知到工具的存在对工具进行调用,直接给出了答案。
一个评估的例子为:
"conversations": [
{
"role": "user",
"content": "\nYou are a geoscientist, and you need to use tools to answer multiple-choice questions about Earth observation data analysis. Question: The Chengdu Plain Agricultural Zone in Sichuan Province is a key rice-producing region in southwestern China. On July 12,, 2022, MODIS-derived Land Surface Temperature (LST) and Enhanced Vegetation Index (EVI) data were used to assess drought conditions across the area. First, list the input TIFF files, then calculate the Temperature-Vegetation Dryness Index (TVDI), and finally determine the percentage of the agricultural area where TVDI values exceeded the threshold of 0.75, indicating moderate water stress for rice crops.benchmark/data/question2\nA.12.87%\nB.22.40%\nC.28.07%\nD.36.56%"
},
{
"role": "assistant",
"content": [
{
"type": "text",
"content": "\n\nB"
}
]
}
],
"final_answer": "\n\nB"
而别的模型比如GPT4o,Qwen3都可以完成工具的调用解决问题,如下:
"conversations": [
{
"role": "user",
"content": "\nYou are a geoscientist, and you need to use tools to answer multiple-choice questions about Earth observation data analysis. Note that if a tool returns an error, you can only try again once. Ultimately, you only need to explicitly tell me the correct choice.\nATTENTION:\n1. When a tool returns "Result saved at /path/to/file", you must use the full returned path "/path/to/file" in all subsequent tool calls.\n2. For each question, you must provide the choice you think is most appropriate.Don't gibe me another format. Your final answer format must be:\nYour choice\n\n\nQuestion: The Chengdu Plain Agricultural Zone in Sichuan Province is a key rice-producing region in southwestern China. On July 12,, 2022, MODIS-derived Land Surface Temperature (LST) and Enhanced Vegetation Index (EVI) data were used to assess drought conditions across the area. First, list the input TIFF files, then calculate the Temperature-Vegetation Dryness Index (TVDI), and finally determine the percentage of the agricultural area where TVDI values exceeded the threshold of 0.75, indicating moderate water stress for rice crops.benchmark/data/question2\nA.12.87%\nB.22.40%\nC.28.07%\nD.36.56%"
},
{
"role": "assistant",
"content": [
{
"name": "get_filelist",
"input": {
"dir_path": "benchmark/data/question2"
}
}
]
},
{
"role": "tool",
"name": "get_filelist",
"content": [
{
"output": [
{
"text": "["Sichuan_2021-07-12_LST.tif","Sichuan_2021-07-12_NDVI.tif"]"
}
]
}
]
},
{
"role": "assistant",
"content": [
{
"name": "compute_tvdi",
"input": {
"ndvi_path": "benchmark/data/question2/Sichuan_2021-07-12_NDVI.tif",
"lst_path": "benchmark/data/question2/Sichuan_2021-07-12_LST.tif",
"output_path": "benchmark/data/question2/tvdi_result.tif"
}
}
]
},
最终给出了正确的答案:
"final_answer": "C.28.07%"