
[BUG] Exception in ASGI application when trying inference with an image with Qwen2.5-VL-72B #732

Open
3 tasks done
Lissanro opened this issue Feb 5, 2025 · 2 comments
Labels
bug Something isn't working

Lissanro commented Feb 5, 2025

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

I assume the latest version that ExllamaV2 installs as a requirement

Model

https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct

Describe the bug

I have four 3090 GPUs and I am trying to use Qwen2.5-VL-72B (a 6bpw EXL2 quant, which with 64K Q6 context leaves the fourth GPU's memory completely free and more than 4GB free on the third GPU). The model works fine for text generation. But as soon as I attach an image and ask the model to describe it, I get an "Exception in ASGI application" error, and the end of the log points at the ExllamaV2 library:

ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 367, in forward
ERROR:        hidden_states_out = torch.matmul(hidden_states, matrix)
ERROR:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:    torch.OutOfMemoryError: Allocation on device

I have plenty of VRAM though, so I assume it is a bug. At first I tried an 8bpw quant, but then I created a 6bpw quant and now have almost 30GB of VRAM free, yet I still get this error.

It is worth mentioning that I can use Pixtral-Large-Instruct-2411-exl2-5.0bpw with Q6 cache and 64K context without issues, even though it has 124B parameters while Qwen2.5-VL has only 72B, and, as I mentioned, I have plenty of free VRAM.

Maybe ExllamaV2 forgets to allocate the VRAM required for vision and then tries to allocate it on a GPU whose VRAM is already full, hence the error. But I do not know this for sure, nor how to debug it, so this is just my guess.
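
A small standalone helper like the sketch below (my own debugging snippet, not part of TabbyAPI or ExllamaV2; it only assumes torch.cuda.mem_get_info is available) can be run before and after sending an image request to see which GPU the vision pass actually allocates on:

# Debugging sketch, not part of TabbyAPI/ExllamaV2:
# print free/total VRAM per CUDA device.
import torch

def print_vram_per_gpu():
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # (free_bytes, total_bytes)
        print(f"cuda:{i}: {free / 2**20:.0f} MiB free / {total / 2**20:.0f} MiB total")

if __name__ == "__main__":
    print_vram_per_gpu()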

Reproduction steps

I am using TabbyAPI https://github.com/theroyallab/tabbyAPI and start it like this:

./start.sh --vision True --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq --cache-mode Q6 --max-seq-len 65536

I cannot upload my 6.0bpw quant (due to upload limitations on my Internet connection), but I can share the command I used to generate the quant and include the measurement.json:

python ./exllamav2/convert.py -i /tmp/Qwen2.5-VL-72B-Instruct \
  -m /tmp/Qwen2.5-VL-72B-Instruct-128000seq/measurement.json \
  -o /tmp/Qwen2.5-VL-72B-Instruct-128000seq-convert \
  -cf /tmp/Qwen2.5-VL-72B-Instruct-6.0bpw-exl2 \
  -hb 8 -b 6

measurement.json

Expected behavior

Get a description of the image instead of an error. Also, I think any VRAM needed for vision should be preallocated during model loading, or, if it has to be allocated dynamically, the allocation should at least go to the GPU with the most free memory.
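
If dynamic allocation is kept, picking the target device could look roughly like this (just a sketch of the idea, not ExllamaV2's actual code):

# Sketch only, not ExllamaV2's allocation logic: pick the CUDA device
# with the most free VRAM before placing a dynamically allocated buffer.
import torch

def device_with_most_free_vram() -> torch.device:
    free_per_device = [
        torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())
    ]
    best = max(range(len(free_per_device)), key=free_per_device.__getitem__)
    return torch.device(f"cuda:{best}")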

Logs

Activating venv
pip 24.0 from /home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/pip (python 3.12)
Loaded your saved preferences from start_options.json
Starting TabbyAPI...
INFO: ExllamaV2 version: 0.2.7
INFO: Your API key is: a2cdb0c05aa3bbc7dd749016257e48a3
INFO: Your admin key is: 9f77b1c5bc877f4d4ec6c2ef5194824b
INFO:
INFO: If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO: Generation logging is disabled
WARNING: Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING: The given cache_size (65536) is less than 2 * max_seq_len and may be too small for requests using CFG.
WARNING: Ignore this warning if you do not plan on using CFG.
INFO: Attempting to load a prompt template if present.
INFO: Using template "from_tokenizer_config" for chat completions.
INFO: Loading model: /mnt/neuro/text-generation-webui/models/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq
INFO: Loading with autosplit
INFO: Model successfully loaded.
Loading vision modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 66/66 0:00:00
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 0:00:00
INFO: Developer documentation: http://127.0.0.1:5000/redoc
INFO: Starting OAI API
INFO: Completions: http://127.0.0.1:5000/v1/completions
INFO: Chat completions: http://127.0.0.1:5000/v1/chat/completions
INFO: Started server process [2097040]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:51690 - "POST /v1/chat/completions HTTP/1.1" 200
INFO: Received chat completion streaming request 4cec6fac80cc4a86b6c0d175f0e39c1c
INFO: Finished chat completion streaming request 4cec6fac80cc4a86b6c0d175f0e39c1c
INFO: Metrics (ID: 4cec6fac80cc4a86b6c0d175f0e39c1c): 280 tokens generated in 24.75 seconds (Queue: 0.0 s, Process: 0 cached tokens and 1778 new tokens at 677.72 T/s, Generate: 12.66 T/s, Context: 1778 tokens)
INFO: 127.0.0.1:43296 - "POST /v1/chat/completions HTTP/1.1" 500
ERROR: Exception in ASGI application
ERROR: Traceback (most recent call last):
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
ERROR: result = await app( # type: ignore[func-returns-value]
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
ERROR: return await self.app(scope, receive, send)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in call
ERROR: await super().call(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/applications.py", line 113, in call
ERROR: await self.middleware_stack(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in call
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in call
ERROR: await self.app(scope, receive, _send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in call
ERROR: await self.app(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in call
ERROR: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
ERROR: await app(scope, receive, sender)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 715, in call
ERROR: await self.middleware_stack(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
ERROR: await route.handle(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
ERROR: await self.app(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
ERROR: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
ERROR: await app(scope, receive, sender)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 73, in app
ERROR: response = await f(request)
ERROR: ^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/routing.py", line 301, in app
ERROR: raw_response = await run_endpoint_function(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
ERROR: return await dependant.call(**values)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/router.py", line 126, in chat_completion_request
ERROR: prompt, embeddings = await apply_chat_template(data)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 257, in apply_chat_template
ERROR: prompt, mm_embeddings, template_vars = await format_messages_with_template(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 220, in format_messages_with_template
ERROR: await mm_embeddings.add(content.image_url.url)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/common/multimodal.py", line 26, in add
ERROR: embedding = await get_image_embedding(url)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/async_lru/init.py", line 227, in call
ERROR: return await asyncio.shield(fut)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/vision.py", line 68, in get_image_embedding
ERROR: return model.container.vision_model.get_image_embeddings(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/vlm/vision_tower.py", line 320, in get_image_embeddings
ERROR: embedding_tensor = self.process(
ERROR: ^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/vlm/vision_tower.py", line 244, in process
ERROR: hidden_states = module.forward(
ERROR: ^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/mlp.py", line 333, in forward
ERROR: return self.forward_torch(hidden_states, cache, attn_params, past_len, intermediates, loras = loras, **kwargs)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/mlp.py", line 471, in forward_torch
ERROR: up = self.up_proj.forward(post_norm, loras = loras)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 367, in forward
ERROR: hidden_states_out = torch.matmul(hidden_states, matrix)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: torch.OutOfMemoryError: Allocation on device

Additional context

Qwen2.5-VL is a huge step forward, so it would be great to be able to run it. I would greatly appreciate any help - please let me know if I did something wrong or if I need to provide more information.

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
Lissanro added the bug label on Feb 5, 2025
@remichu-ai

For Qwen 2.5 VL you will need to use the dev branch, as support is not merged into master yet. Qwen 2.5 VL 72B ran fine on my system with the dev branch.

Lissanro (Author) commented Feb 5, 2025

I am running the latest dev version. I just tried updating both TabbyAPI and ExllamaV2 again, and I still get the same issue.

I tried to load it differently to reserve some spare memory on each GPU:

./start.sh --vision True --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq \
  --cache-mode Q6 --max-seq-len 65536 \
  --gpu-split 16 16 16 16

Notice the new --gpu-split 16 16 16 16 argument; it leaves a lot of free memory on each GPU. With this, image inference worked without issues. So it seems my guess is correct: when using Qwen 2.5 VL 72B, ExllamaV2 does not preallocate the VRAM needed for vision, and when it allocates it dynamically it just uses the first GPU instead of the GPU with the most free memory.

Here is the VRAM usage after loading the model with the --gpu-split 16 16 16 16 argument:

19065MiB /  24576MiB
17561MiB /  24576MiB
18269MiB /  24576MiB
13177MiB /  24576MiB

Then I asked it to describe an image, and memory usage on the first GPU increased by 0.4GB (this is the allocation that fails when using the default auto split). I omitted the other GPUs since their memory usage did not change:

19465MiB /  24576MiB

Then I asked it to describe a second image, and the memory usage was almost unchanged:

19475MiB /  24576MiB

Asking about one more image did not change the memory usage further:

19475MiB /  24576MiB

Based on this, I think the best fix would be to simply preallocate all the needed memory, since that would be the most reliable solution, unless there is a good reason why the memory has to be allocated dynamically.

If it does have to be allocated dynamically, then doing it on the GPU with the most free memory would also fix this issue.

In the meantime, the workaround is to either use --gpu-split and manually choose how many GB to use on each GPU, or to use --autosplit-reserve 512 to reserve 0.5 GiB (512 MiB).
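
For example, mirroring the start command above (same flags as before, with only the reserve added):

./start.sh --vision True --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq \
  --cache-mode Q6 --max-seq-len 65536 \
  --autosplit-reserve 512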

So, this bug is about Qwen2.5 VL 72B not working with the default --gpu-split-auto True when using multiple GPUs.
