invalid gradient at index 1 - expected device meta but got cuda:0 #2241
> I ran into the same problem.
Updated ooba and my kernel, and now I'm getting this error when training:

```
Traceback (most recent call last):
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ana/oobabooga_linux/text-generation-webui/modules/training.py", line 428, in threaded_run
    trainer.train()
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ana/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
    self.scaler.scale(loss).backward()
  File "/home/ana/.local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/ana/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ana/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function MmBackward0 returned an invalid gradient at index 1 - expected device meta but got cuda:0
```

I'm guessing my kernel is not compatible with ooba or one of its components, such as pytorch. So I wonder: what kernel should I use with ooba?
Or perhaps it has nothing to do with the kernel. Any help would be appreciated.
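To narrow down whether it's the kernel or the python side, a first sanity check is to confirm which torch build is installed and whether it can see the GPU at all. This is a minimal sketch, not anything ooba-specific, and `torch_cuda_report` is just a name I made up for it:

```python
# Minimal sketch: report the installed torch build and its CUDA support
# without crashing on machines where torch isn't installed.
# (torch_cuda_report is a made-up helper name, not part of ooba.)
import importlib.util

def torch_cuda_report():
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    return (f"torch {torch.__version__}, "
            f"built against CUDA {torch.version.cuda}, "
            f"GPU visible: {torch.cuda.is_available()}")

print(torch_cuda_report())
```

If GPU visibility comes back `False`, that would point at the driver rather than ooba; if it's `True`, the kernel theory looks less likely.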
update: I tried a bunch of kernels and got the same issue, so I don't think it's the kernel version. Perhaps it's another dependency version issue?
update: I get the error while training in any mode except cpu. For example:
- if I choose cpu alone, training works
- if I choose 8-bit, I get the "got cuda:0" error
- if I choose auto-device, I get the "got cuda:0" error
- if I choose auto-device and 8-bit, I get the "got cuda:0" error

The model will run under all these combinations; it just won't train.
It would be nice to figure it out, since cpu training is extremely slow.
And I should point out, if I haven't already, that 8-bit with cpu offloading was working a while ago.
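Since the error message literally says a gradient was expected on the `meta` device, one hedged guess is that some weights got left on `meta` (never materialized) during 8-bit / auto-device loading. A quick way to check, assuming a loaded model object; `param_device_counts` is my own hypothetical helper, not part of ooba or transformers:

```python
# Sketch: count which devices a model's parameters live on.
# Works with anything exposing .parameters() that yields objects with a
# .device attribute (e.g. a torch.nn.Module).
# param_device_counts is a hypothetical helper, not an ooba API.
from collections import Counter

def param_device_counts(model):
    return Counter(str(p.device) for p in model.parameters())

# With the model loaded inside ooba's environment you would run:
#   print(param_device_counts(model))
# A single-GPU setup should show only 'cuda:0' (and possibly 'cpu' for
# offloaded layers); any 'meta' entries mean those weights were never
# moved off the placeholder device, which matches the error message.
```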
update: So the error comes from pytorch, and perhaps it can't find the gpu, which is cuda-controlled. My system has pytorch 2 and cuda 11.5. The pytorch website says it wants cuda 11.6, 11.7 or 11.8. Unfortunately, I'm running the latest kernel in linux mint 19.whatever, and I guess it comes with cuda 11.5.
OK, fine. I don't know if cuda comes with the kernel or if it lives in pip world. Pip has all these cudas:

```
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-ml-py3             7.352.0
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
```

apt has cuda too....
How do I upgrade cuda? Or should I downgrade pytorch?
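For the upgrade-vs-downgrade question, it helps that there are three separate "CUDA versions" in play: the system toolkit (from apt), the driver, and the runtime libraries bundled inside the pip wheels. A hedged shell sketch to see all three; each command falls back gracefully if that piece is missing:

```shell
# 1. System toolkit (what apt installed, e.g. 11.5 on Mint):
nvcc --version 2>/dev/null || echo "no system CUDA toolkit on PATH"

# 2. Driver: the 'CUDA Version' in nvidia-smi's header is the *maximum*
#    CUDA the driver supports, not what is installed:
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi | head -n 4 \
  || echo "nvidia-smi not available"

# 3. Runtime shipped inside the pip wheels (the nvidia-*-cu11 packages):
pip list 2>/dev/null | grep -i nvidia || echo "no nvidia pip packages found"
```

As far as I can tell, the pip-installed torch wheels bundle their own CUDA runtime (that's what all those nvidia-*-cu11 packages are), so the apt toolkit staying at 11.5 shouldn't by itself break torch; it's the driver that has to be new enough.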
update: Does this thing want cuda-toolkit? Or cuda-the-driver? I'm not super comfy using my work computer for experimental cuda drivers. The repos stop at 11.5 for a reason, and that reason might be stability, which I approve of.
I did update my os to mint 21, but the cuda-toolkit remained at 11.5. I uninstalled cuda-toolkit 11.5, but I'm not sure whether, if I install 11.6, it will work with the device driver or whether I will need to update that too. I'm a little spooked, because it would be a real bummer to lose this particular computer for even a day if the display were to go down or x failed to load or something.
And what is the deal with rwkv_cuda_on in the interface tab? Should I be using this? Should I not?
update:
I've got the cpu training blues... do-doo-do-do-do...
ain't got no gpu... do-doo-do-do-do...
my language model's stupid
and it's all because of you! di-di-di-di-di-di-di...
I got the cpu training blues... do-do-di-do...
I got the cpu training blues! do-do-di-do
I got the cpu training blues.... di-do-do
and you know what?,,,, it's all because of you, do-do-do-do-di-do-do-di-do-do-di-do-do-dooooooo
update: Deleted the old ooba, downloaded the installer, and reinstalled. Same issue. Actually, it's worse: now I can't train with cpu either :( Please help me so I don't have to install another ai-bot-frontend! I do like ooba. When it works, it is incredible.