LongVT-RL makes very few tool calls on VideoMME #15

@HiFiChang

The script is below. I evaluated the pretrained longvideotool/LongVT-RL model on VideoMME and found that it makes almost no tool calls.

#!/bin/bash

# Environment variables
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_MODEL_NAME="judge"
export OPENAI_BASE_URL="http://your-judge-server-ip:8000/v1"
export OPENAI_API_KEY="EMPTY"
export USE_LLM_JUDGE=False
export DECORD_EOF_RETRY_MAX=409600

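# Checkpoint to serve and evaluate (override by exporting CKPT_PATH)
CKPT_PATH=${CKPT_PATH:-longvideotool/LongVT-RL}
TASK_NAME=videomme_reward_tool   # Evaluation task name
IS_QWEN3_VL=True                 # Whether using Qwen3-VL model (True/False)
MAX_FRAME_NUM=${4:-768}          # Number of frames (Default: 768)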

# Path to MCP server for tool calling
MCP_PATH="./examples/video_tools/mcp_server.py"

# Activate conda environment
source /opt/conda/etc/profile.d/conda.sh
conda activate eval

# Start vLLM server
# Qwen3 VL does not need additional chat template
if [ "$IS_QWEN3_VL" == "False" ]; then
    vllm serve $CKPT_PATH \
        --chat-template ./examples/eval/tool_call_qwen2_5_vl.jinja \
        --tool-call-parser hermes \
        --enable-auto-tool-choice \
        --data-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --trust-remote-code &
else
    vllm serve $CKPT_PATH \
        --tool-call-parser hermes \
        --enable-auto-tool-choice \
        --data-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --trust-remote-code &
fi
sleep 240

# Run evaluation
accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
    --model async_openai \
    --model_args model_version=$CKPT_PATH,mcp_server_path=$MCP_PATH,fps=1,max_frames=$MAX_FRAME_NUM,max_pixels=50176,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY,num_cpus=1,timeout=12000,is_qwen3_vl=$IS_QWEN3_VL \
    --tasks $TASK_NAME \
    --batch_size 1 \
    --output_path ./eval_logs \
    --log_samples \
    --include_path ./lmms_eval_tasks \
    --limit 100

Details are in the attached files:
20260204_234358_results.json

20260204_234358_samples_videomme_reward_tool.json
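
To put a number on "almost no tool calls", a rough count over the samples log may help. The sketch below is only an illustration: `count_tool_calls` is a hypothetical helper, and it assumes the logged responses keep the Hermes-style `<tool_call>` tag verbatim. If the server's parser moves calls into a structured field instead, this count would be zero regardless of what the model did.

```shell
#!/bin/bash

# Hypothetical helper: count literal <tool_call> tags in a log file.
# Assumption: responses are logged with the raw Hermes-style tag intact.
count_tool_calls() {
    local file="$1"
    # grep -o prints each match on its own line, so wc -l gives the match count.
    # grep exits non-zero when nothing matches; "|| true" keeps the pipeline usable under set -e.
    { grep -o '<tool_call>' "$file" || true; } | wc -l | tr -d ' '
}
```

For example, `count_tool_calls 20260204_234358_samples_videomme_reward_tool.json` prints the total number of tags, which can be compared against the 100 samples selected by `--limit 100`.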
