
Followed the official code, got the error: name 'global_poolverl_group_2:0' already exists #57

Open
ArlanCooper opened this issue Feb 10, 2025 · 6 comments

Comments

@ArlanCooper

I ran the code as described in the README:


# set parameters
export CUDA_VISIBLE_DEVICES=2,3
export N_GPUS=2
export BASE_MODEL=/home/octopus/data/llm_list/Qwen2.5-3B
export DATA_DIR=/home/powerop/work/rwq/myoperate/TinyZero/data_sets
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh


I got this error:


Traceback (most recent call last):
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/home/octopus/work/conda/envs/ds_zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/octopus/work/conda/envs/ds_zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/octopus/work/conda/envs/ds_zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/octopus/work/conda/envs/ds_zero/lib/python3.9/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::main_task() (pid=88844, ip=10.59.144.61)
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/trainer/ppo/ray_trainer.py", line 494, in init_workers
    wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/single_controller/ray/base.py", line 197, in __init__
    self._init_with_resource_pool(resource_pool=resource_pool,
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/single_controller/ray/base.py", line 220, in _init_with_resource_pool
    pgs = resource_pool.get_placement_groups(strategy=strategy)
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/single_controller/ray/base.py", line 80, in get_placement_groups
    pgs = [
  File "/home/powerop/work/rwq/myoperate/TinyZero/verl/single_controller/ray/base.py", line 81, in <listcomp>
    placement_group(bundles=bundles, strategy=strategy, name=pg_name_prefix + str(idx), lifetime=lifetime)
  File "/home/octopus/work/conda/envs/ds_zero/lib/python3.9/site-packages/ray/util/placement_group.py", line 211, in placement_group
    placement_group_id = worker.core_worker.create_placement_group(
  File "python/ray/includes/common.pxi", line 104, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Failed to create placement group '34872bc423461785f8db1e66eac101000000' because name 'global_poolverl_group_2:0' already exists.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.



What is the reason?
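
For reference, the final line of the traceback is Ray refusing to create a placement group because a group with that name is already registered. A minimal standalone sketch that reproduces the same kind of failure outside of verl, reusing the group name from the traceback purely for illustration:

import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)

# The first creation succeeds.
pg_first = placement_group([{"CPU": 1}], name="global_poolverl_group_2:0")
ray.get(pg_first.ready())

# Creating a second group under the same name raises
# ray.exceptions.RaySystemError: ... because name 'global_poolverl_group_2:0' already exists.
pg_second = placement_group([{"CPU": 1}], name="global_poolverl_group_2:0")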

@ArlanCooper
Author

My environment is:

cuda 12.2
torch 2.4.0
vllm 0.6.3

@Hanpx20

Hanpx20 commented Feb 11, 2025

same error

@yangzhch6

Multi-GPU works for me, but the single-GPU script runs into the same error.
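
If the collision is caused by a stale placement group left behind by an earlier or crashed run (an assumption, not confirmed in this thread), clearing the old Ray state before relaunching may help: either restart Ray entirely with "ray stop --force", or attach to the running instance and remove the stale group by name. A minimal sketch, assuming the group is reachable from the default namespace; the group name is copied from the traceback:

import ray
from ray.util import get_placement_group, remove_placement_group

ray.init(address="auto")  # attach to the Ray instance that still holds the group

# Look up the leftover group by name and remove it; if this lookup fails
# (e.g. the group lives in another namespace), "ray stop --force" is the
# blunt alternative before rerunning the training script.
stale = get_placement_group("global_poolverl_group_2:0")
remove_placement_group(stale)
ray.shutdown()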

@ITSXU0

ITSXU0 commented Feb 27, 2025

Have you solved this problem? I have the same problem, thank you.

@anapple-hub

same error

@PinzhengWang322

same error
