Shape matching error #83

Open
lixiaochuan2020 opened this issue Feb 21, 2025 · 0 comments
@lixiaochuan2020

Hi! First, thank you for this valuable repository. When I followed your README and ran the training shell script, I encountered an error while using 2 GPUs with Qwen2.5-3B.

Error executing job with overrides: ['data.train_files=./dataset/train.parquet', 'data.val_files=./dataset/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=PATH', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=PATH', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-0.3b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "PATH/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=1184447, ip=10.66.99.70)
  File "PATH/TinyZero/verl/trainer/main_ppo.py", line 189, in main_task
    trainer.fit()
  File "PATH/TinyZero/verl/trainer/ppo/ray_trainer.py", line 589, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=1187143, ip=10.66.99.70, actor_id=9eb50610df01f7645e69b54901000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f69ae62dc90>)
  File "PATH/TinyZero/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/fsdp_workers.py", line 434, in generate_sequences
    old_log_probs = self.actor.compute_log_prob(data=output)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 191, in compute_log_prob
    _, log_probs = self._forward_micro_batch(micro_batch, temperature=temperature)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 138, in _forward_micro_batch
    log_probs = logprobs_from_logits(logits, micro_batch['responses'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/utils/torch_functional.py", line 67, in logprobs_from_logits
    output = output.view(*batch_dim)
             ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[4, 1024]' is invalid for input of size 1
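For reference, the final reshape failure is easy to reproduce in isolation (a minimal sketch, independent of the repo code): any size-1 tensor viewed as (4, 1024) fails with exactly this message.

```python
import torch

# A 0-dim (size-1) tensor, like the single scalar produced in my run,
# cannot be viewed as a (4, 1024) batch of per-token values.
scalar = torch.tensor(0.0)
scalar.view(4, 1024)
# RuntimeError: shape '[4, 1024]' is invalid for input of size 1
```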

I discovered that the output originates from this function:

def logprobs_from_logits_flash_attn(logits, labels):
    # cross_entropy_loss is flash-attn's fused cross-entropy implementation
    output = -cross_entropy_loss(logits, labels)[0]
    return output

In my environment, cross_entropy_loss(logits, labels) returns a flat per-token loss tensor of shape (4096,) (i.e. 4 × 1024 response tokens), not a tuple, so indexing it with [0] picks out a single scalar element. The subsequent output.view(*batch_dim) then fails because a size-1 tensor cannot be reshaped to [4, 1024].
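My guess is that this line expects cross_entropy_loss to return a tuple (so that [0] selects the per-token losses), while the flash-attn build I have installed returns the loss tensor directly; I'm not sure which flash-attn version the repo assumes. For debugging I tried a defensive variant that handles both return conventions (just a sketch based on my own assumptions about the import path and return types, not a proposed fix):

```python
# Assumption: this is the flash-attn Triton kernel that the helper wraps;
# adjust the import if your install exposes it elsewhere.
from flash_attn.ops.triton.cross_entropy import cross_entropy_loss


def logprobs_from_logits_flash_attn_debug(logits, labels):
    """Hypothetical debugging variant: per-token log-probs of `labels`."""
    out = cross_entropy_loss(logits, labels)
    # Some flash-attn builds appear to return a (losses, z_losses) tuple,
    # others a plain per-token loss tensor (this is my assumption).
    losses = out[0] if isinstance(out, tuple) else out
    assert losses.dim() == 1, f"expected per-token losses, got shape {tuple(losses.shape)}"
    return -losses
```

With a per-token loss tensor preserved, the later view(*batch_dim) back to [4, 1024] would at least have the right number of elements, but I don't know whether that matches the intended behaviour.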

I haven't found any other issues related to this function. Is there anything I might have missed or misunderstood?

Thanks.
