Shape matching error #83

Open
lixiaochuan2020 opened this issue Feb 21, 2025 · 0 comments
@lixiaochuan2020

Hi! First, thank you for this valuable repository. When I followed your README and ran the training shell script, I encountered an error while using 2 GPUs with Qwen2.5-3B.

Error executing job with overrides: ['data.train_files=./dataset/train.parquet', 'data.val_files=./dataset/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=PATH', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=PATH', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-0.3b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "PATH/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=1184447, ip=10.66.99.70)
  File "PATH/TinyZero/verl/trainer/main_ppo.py", line 189, in main_task
    trainer.fit()
  File "PATH/TinyZero/verl/trainer/ppo/ray_trainer.py", line 589, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=1187143, ip=10.66.99.70, actor_id=9eb50610df01f7645e69b54901000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f69ae62dc90>)
  File "PATH/TinyZero/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/fsdp_workers.py", line 434, in generate_sequences
    old_log_probs = self.actor.compute_log_prob(data=output)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 191, in compute_log_prob
    _, log_probs = self._forward_micro_batch(micro_batch, temperature=temperature)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 138, in _forward_micro_batch
    log_probs = logprobs_from_logits(logits, micro_batch['responses'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TinyZero/verl/utils/torch_functional.py", line 67, in logprobs_from_logits
    output = output.view(*batch_dim)
             ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[4, 1024]' is invalid for input of size 1
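For reference, the final reshape failure is easy to reproduce in isolation (a minimal sketch, independent of the repo code): any size-1 tensor viewed as (4, 1024) fails with exactly this message.

```python
import torch

# A 0-dim (size-1) tensor, like the single scalar produced in my run,
# cannot be viewed as a (4, 1024) batch of per-token values.
scalar = torch.tensor(0.0)
scalar.view(4, 1024)
# RuntimeError: shape '[4, 1024]' is invalid for input of size 1
```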

I discovered that the output originates from this function:

def logprobs_from_logits_flash_attn(logits, labels):
    # cross_entropy_loss is flash-attn's fused cross-entropy implementation
    output = -cross_entropy_loss(logits, labels)[0]
    return output

In my environment, cross_entropy_loss(logits, labels) returns a flat per-token loss tensor of shape (4096,) (i.e. 4 × 1024 response tokens), not a tuple, so indexing it with [0] picks out a single scalar element. The subsequent output.view(*batch_dim) then fails because a size-1 tensor cannot be reshaped to [4, 1024].
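My guess is that this line expects cross_entropy_loss to return a tuple (so that [0] selects the per-token losses), while the flash-attn build I have installed returns the loss tensor directly; I'm not sure which flash-attn version the repo assumes. For debugging I tried a defensive variant that handles both return conventions (just a sketch based on my own assumptions about the import path and return types, not a proposed fix):

```python
# Assumption: this is the flash-attn Triton kernel that the helper wraps;
# adjust the import if your install exposes it elsewhere.
from flash_attn.ops.triton.cross_entropy import cross_entropy_loss


def logprobs_from_logits_flash_attn_debug(logits, labels):
    """Hypothetical debugging variant: per-token log-probs of `labels`."""
    out = cross_entropy_loss(logits, labels)
    # Some flash-attn builds appear to return a (losses, z_losses) tuple,
    # others a plain per-token loss tensor (this is my assumption).
    losses = out[0] if isinstance(out, tuple) else out
    assert losses.dim() == 1, f"expected per-token losses, got shape {tuple(losses.shape)}"
    return -losses
```

With a per-token loss tensor preserved, the later view(*batch_dim) back to [4, 1024] would at least have the right number of elements, but I don't know whether that matches the intended behaviour.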

I haven't found any other issues related to this function. Is there anything I might have missed or misunderstood?

Thanks.
