Hi! First, thank you for this valuable repository. When I followed your README and tried to run the training shell script, I encountered an error while using 2 GPUs with Qwen2.5-3B.
Error executing job with overrides: ['data.train_files=./dataset/train.parquet', 'data.val_files=./dataset/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=PATH', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=PATH', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-0.3b', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "PATH/TinyZero/verl/trainer/main_ppo.py", line 103, in main
ray.get(main_task.remote(config))
File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/anaconda3/envs/vl/lib/python3.11/site-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=1184447, ip=10.66.99.70)
File "PATH/TinyZero/verl/trainer/main_ppo.py", line 189, in main_task
trainer.fit()
File "PATH/TinyZero/verl/trainer/ppo/ray_trainer.py", line 589, in fit
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=1187143, ip=10.66.99.70, actor_id=9eb50610df01f7645e69b54901000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f69ae62dc90>)
File "PATH/TinyZero/verl/single_controller/ray/base.py", line 399, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/workers/fsdp_workers.py", line 434, in generate_sequences
old_log_probs = self.actor.compute_log_prob(data=output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 191, in compute_log_prob
_, log_probs = self._forward_micro_batch(micro_batch, temperature=temperature)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/workers/actor/dp_actor.py", line 138, in _forward_micro_batch
log_probs = logprobs_from_logits(logits, micro_batch['responses'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "PATH/TinyZero/verl/utils/torch_functional.py", line 67, in logprobs_from_logits
output = output.view(*batch_dim)
^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[4, 1024]' is invalid for input of size 1
I traced the error to logprobs_from_logits in verl/utils/torch_functional.py (the output.view(*batch_dim) call at line 67 in the traceback). The expression -cross_entropy_loss(logits, labels) produces a tensor of shape (4096,), and indexing it at [0] collapses it to a single scalar value, which then fails the reshape with the mismatch shown above.
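For context, here is a minimal self-contained sketch of how I understand the failure mode. It is not the repository's actual code: the helper name logprobs_from_logits_sketch, the use of torch.nn.functional.cross_entropy as a stand-in for the cross_entropy_loss call, and the toy shapes are all my own assumptions. It only illustrates that once the per-token loss comes back as a flat (4096,) tensor, taking element [0] leaves a single scalar, so view(4, 1024) has only one element to work with.

```python
import torch
import torch.nn.functional as F


def logprobs_from_logits_sketch(logits, labels):
    # Hypothetical reconstruction of the failing code path, not the repo's exact code.
    batch_dim = logits.shape[:-1]                 # e.g. torch.Size([4, 1024])
    vocab_size = logits.shape[-1]
    flat_logits = logits.reshape(-1, vocab_size)  # (4096, vocab)
    flat_labels = labels.reshape(-1)              # (4096,)

    # Per-token negative log-likelihood, shape (4096,), standing in for cross_entropy_loss.
    loss = F.cross_entropy(flat_logits, flat_labels, reduction="none")

    # If the code expects a tuple here but gets a plain tensor, [0] silently picks
    # the first token's loss, leaving a 0-dim scalar instead of (4096,).
    output = -loss[0]

    # Reshaping a single element back to (4, 1024) then raises:
    # RuntimeError: shape '[4, 1024]' is invalid for input of size 1
    return output.view(*batch_dim)


logits = torch.randn(4, 1024, 32)            # toy batch: 4 sequences, 1024 tokens, vocab 32
labels = torch.randint(0, 32, (4, 1024))
try:
    logprobs_from_logits_sketch(logits, labels)
except RuntimeError as e:
    print(e)  # reproduces the shape mismatch from the traceback
```

In my run, the value coming out of the cross-entropy call appears to be such a plain per-token tensor, so the [0] indexing collapses it to a scalar before the view, which matches the "input of size 1" in the error.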
I haven't found any other issues related to this function. Is there anything I might have missed or misunderstood?
Thanks.