invalid gradient at index 1 - expected device meta but got cuda:0 #2241
> I ran into the same problem.
Updated ooba and my kernel, and now I'm getting this error when training:

```
Traceback (most recent call last):
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ana/oobabooga_linux/text-generation-webui/modules/training.py", line 428, in threaded_run
    trainer.train()
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
    self.scaler.scale(loss).backward()
  File "/home/ana/.local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/ana/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function MmBackward0 returned an invalid gradient at index 1 - expected device meta but got cuda:0
```

I'm guessing my kernel is not compatible with ooba or one of its components, such as pytorch. So I wonder: what kernel should I use with ooba?
Or perhaps it has nothing to do with the kernel. Any help would be appreciated.
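To narrow down whether it's the kernel or the python side, a first sanity check is to confirm which torch build is installed and whether it can see the GPU at all. This is a minimal sketch, not anything ooba-specific, and `torch_cuda_report` is just a name I made up for it:

```python
# Minimal sketch: report the installed torch build and its CUDA support
# without crashing on machines where torch isn't installed.
# (torch_cuda_report is a made-up helper name, not part of ooba.)
import importlib.util

def torch_cuda_report():
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    return (f"torch {torch.__version__}, "
            f"built against CUDA {torch.version.cuda}, "
            f"GPU visible: {torch.cuda.is_available()}")

print(torch_cuda_report())
```

If GPU visibility comes back `False`, that would point at the driver rather than ooba; if it's `True`, the kernel theory looks less likely.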
update: I tried a bunch of kernels and got the same issue, so I don't think it's the kernel version. Perhaps it's another dependency version issue?
update: I get the error while training in any mode except cpu. For example:
- if I choose cpu alone, training works
- if I choose 8-bit, I get the "got cuda:0" error
- if I choose auto-device, I get the "got cuda:0" error
- if I choose auto-device and 8-bit, I get the "got cuda:0" error

The model will run under all these combinations; it just won't train.
It would be nice to figure it out, since cpu training is extremely slow.
And I should point out, if I haven't already, that 8-bit with cpu offloading was working a while ago.
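Since the error message literally says a gradient was expected on the `meta` device, one hedged guess is that some weights got left on `meta` (never materialized) during 8-bit / auto-device loading. A quick way to check, assuming a loaded model object; `param_device_counts` is my own hypothetical helper, not part of ooba or transformers:

```python
# Sketch: count which devices a model's parameters live on.
# Works with anything exposing .parameters() that yields objects with a
# .device attribute (e.g. a torch.nn.Module).
# param_device_counts is a hypothetical helper, not an ooba API.
from collections import Counter

def param_device_counts(model):
    return Counter(str(p.device) for p in model.parameters())

# With the model loaded inside ooba's environment you would run:
#   print(param_device_counts(model))
# A single-GPU setup should show only 'cuda:0' (and possibly 'cpu' for
# offloaded layers); any 'meta' entries mean those weights were never
# moved off the placeholder device, which matches the error message.
```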
update: So the error comes from pytorch, and perhaps it can't find the gpu, which is cuda-controlled. My system has pytorch 2 and cuda 11.5. The pytorch website says it wants cuda 11.6, 11.7 or 11.8. Unfortunately, I'm running the latest kernel in linux mint 19.whatever, and I guess it comes with cuda 11.5.
OK, fine. I don't know if cuda comes with the kernel or if it lives in pip world. Pip has all these cudas:

```
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-ml-py3             7.352.0
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
```

apt has cuda too....
How do I upgrade cuda? Or should I downgrade pytorch?
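For the upgrade-vs-downgrade question, it helps that there are three separate "CUDA versions" in play: the system toolkit (from apt), the driver, and the runtime libraries bundled inside the pip wheels. A hedged shell sketch to see all three; each command falls back gracefully if that piece is missing:

```shell
# 1. System toolkit (what apt installed, e.g. 11.5 on Mint):
nvcc --version 2>/dev/null || echo "no system CUDA toolkit on PATH"

# 2. Driver: the 'CUDA Version' in nvidia-smi's header is the *maximum*
#    CUDA the driver supports, not what is installed:
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi | head -n 4 \
  || echo "nvidia-smi not available"

# 3. Runtime shipped inside the pip wheels (the nvidia-*-cu11 packages):
pip list 2>/dev/null | grep -i nvidia || echo "no nvidia pip packages found"
```

As far as I can tell, the pip-installed torch wheels bundle their own CUDA runtime (that's what all those nvidia-*-cu11 packages are), so the apt toolkit staying at 11.5 shouldn't by itself break torch; it's the driver that has to be new enough.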
update: Does this thing want cuda-toolkit? Or cuda-the-driver? I'm not super comfy using my work computer for experimental cuda drivers. The repos stop at 11.5 for a reason, and that reason might be stability, which I approve of.
I did update my os to mint 21, but the cuda-toolkit remained at 11.5. I uninstalled cuda-toolkit 11.5, but I'm not sure whether, if I install 11.6, it will work with the device driver or whether I will need to update that too. I'm a little spooked, because it would be a real bummer to lose this particular computer for even a day if the display were to go down or x failed to load or something.
And what is the deal with rwkv_cuda_on in the interface tab? Should I be using this? Should I not?
update:
I've got the cpu training blues... do-doo-do-do-do...
ain't got no gpu... do-doo-do-do-do...
my language model's stupid
and it's all because of you! di-di-di-di-di-di-di...
I got the cpu training blues... do-do-di-do...
I got the cpu training blues! do-do-di-do
I got the cpu training blues.... di-do-do
and you know what?,,,, it's all because of you, do-do-do-do-di-do-do-di-do-do-di-do-do-dooooooo
update: Deleted the old ooba, downloaded the installer, and reinstalled. Same issue. Actually, it's worse: now I can't train with cpu either :( Please help me so I don't have to install another ai-bot-frontend! I do like ooba. When it works, it is incredible.