[BUG] Exception in ASGI application when trying inference with an image with Qwen2.5-VL-72B #732
Labels
bug
Something isn't working
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
I assume the latest version that ExllamaV2 installs as a requirement
Model
https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
Describe the bug
I have four 3090 GPUs and am trying to run Qwen2.5-VL-72B (a 6bpw EXL2 quant which, with 64K Q6 context, leaves the fourth GPU's memory completely free and more than 4GB free on the third GPU). The model works just fine for text generation, but as soon as I attach an image and ask the model to describe it, I get an "Exception in ASGI application" error, and the end of the traceback (full log below) shows the failure comes from the ExllamaV2 library.
I have plenty of VRAM though, so I assume it is a bug. At first I tried an 8bpw quant, but then I created a 6bpw quant; now I have almost 30GB of VRAM free and still get the same error.
It is worth mentioning that I can run Pixtral-Large-Instruct-2411-exl2-5.0bpw with Q6 cache and 64K context without issues, even though it has 124B parameters while Qwen2.5-VL has only 72B, and, as mentioned, I have plenty of free VRAM.
My guess is that ExllamaV2 does not preallocate the VRAM the vision tower needs and instead allocates it dynamically, possibly landing on a GPU whose VRAM is already full, hence the error. But I do not know this for sure or how to debug it, so this is just a guess.
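To narrow this down, one way to watch per-GPU free memory around the failing call is torch.cuda.mem_get_info. This is just a diagnostic sketch I put together (report_vram is my own helper, not part of TabbyAPI or ExllamaV2); it could be called right before the image embedding call:

    import torch

    def report_vram():
        # Print free vs. total VRAM for every CUDA device (values in GiB).
        for i in range(torch.cuda.device_count()):
            free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
            print(f"cuda:{i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

If one device reports near-zero free memory right before the crash, that would support the misplaced-allocation theory.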
Reproduction steps
I am using TabbyAPI (https://github.com/theroyallab/tabbyAPI) and I start it like this:
./start.sh --vision True --model-name Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq --cache-mode Q6 --max-seq-len 65536
I cannot upload my 6.0bpw quant (due to the upload limits of my Internet connection), but I can share the command I used to generate it, along with the measurement.json:
python ./exllamav2/convert.py -i /tmp/Qwen2.5-VL-72B-Instruct \
    -m /tmp/Qwen2.5-VL-72B-Instruct-128000seq/measurement.json \
    -o /tmp/Qwen2.5-VL-72B-Instruct-128000seq-convert \
    -cf /tmp/Qwen2.5-VL-72B-Instruct-6.0bpw-exl2 \
    -hb 8 -b 6
measurement.json
Expected behavior
Get a description of the image instead of an error. Also, I think any VRAM needed for vision should be preallocated during model loading, or, if it must be allocated dynamically, the allocation should at least go to the GPU with the most free memory.
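For illustration, the fallback policy I have in mind would look roughly like this (a sketch of what I would expect, not ExllamaV2's actual code; device_with_most_free_vram is a hypothetical helper name):

    import torch

    def device_with_most_free_vram() -> torch.device:
        # Pick the CUDA device that currently reports the most free memory.
        free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
        return torch.device(f"cuda:{free.index(max(free))}")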
Logs
Activating venv
pip 24.0 from /home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/pip (python 3.12)
Loaded your saved preferences from start_options.json
Starting TabbyAPI...
INFO: ExllamaV2 version: 0.2.7
INFO: Your API key is: a2cdb0c05aa3bbc7dd749016257e48a3
INFO: Your admin key is: 9f77b1c5bc877f4d4ec6c2ef5194824b
INFO:
INFO: If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO: Generation logging is disabled
WARNING: Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING: The given cache_size (65536) is less than 2 * max_seq_len and may be too small for requests using CFG.
WARNING: Ignore this warning if you do not plan on using CFG.
INFO: Attempting to load a prompt template if present.
INFO: Using template "from_tokenizer_config" for chat completions.
INFO: Loading model: /mnt/neuro/text-generation-webui/models/Qwen2.5-VL-72B-Instruct-exl2-6.0bpw-128000seq
INFO: Loading with autosplit
INFO: Model successfully loaded.
Loading vision modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 66/66 0:00:00
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 0:00:00
INFO: Developer documentation: http://127.0.0.1:5000/redoc
INFO: Starting OAI API
INFO: Completions: http://127.0.0.1:5000/v1/completions
INFO: Chat completions: http://127.0.0.1:5000/v1/chat/completions
INFO: Started server process [2097040]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:51690 - "POST /v1/chat/completions HTTP/1.1" 200
INFO: Received chat completion streaming request 4cec6fac80cc4a86b6c0d175f0e39c1c
INFO: Finished chat completion streaming request 4cec6fac80cc4a86b6c0d175f0e39c1c
INFO: Metrics (ID: 4cec6fac80cc4a86b6c0d175f0e39c1c): 280 tokens generated in 24.75 seconds (Queue: 0.0 s, Process: 0 cached tokens and 1778 new tokens at 677.72 T/s, Generate: 12.66 T/s, Context: 1778 tokens)
INFO: 127.0.0.1:43296 - "POST /v1/chat/completions HTTP/1.1" 500
ERROR: Exception in ASGI application
ERROR: Traceback (most recent call last):
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
ERROR: result = await app( # type: ignore[func-returns-value]
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
ERROR: return await self.app(scope, receive, send)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
ERROR: await super().__call__(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
ERROR: await self.middleware_stack(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
ERROR: await self.app(scope, receive, _send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
ERROR: await self.app(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
ERROR: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
ERROR: await app(scope, receive, sender)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
ERROR: await self.middleware_stack(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
ERROR: await route.handle(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
ERROR: await self.app(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
ERROR: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR: raise exc
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
ERROR: await app(scope, receive, sender)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/starlette/routing.py", line 73, in app
ERROR: response = await f(request)
ERROR: ^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/routing.py", line 301, in app
ERROR: raw_response = await run_endpoint_function(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
ERROR: return await dependant.call(**values)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/router.py", line 126, in chat_completion_request
ERROR: prompt, embeddings = await apply_chat_template(data)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 257, in apply_chat_template
ERROR: prompt, mm_embeddings, template_vars = await format_messages_with_template(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 220, in format_messages_with_template
ERROR: await mm_embeddings.add(content.image_url.url)
ERROR: File "/home/lissanro/pkgs/tabbyAPI/common/multimodal.py", line 26, in add
ERROR: embedding = await get_image_embedding(url)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/async_lru/__init__.py", line 227, in __call__
ERROR: return await asyncio.shield(fut)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/vision.py", line 68, in get_image_embedding
ERROR: return model.container.vision_model.get_image_embeddings(
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/vlm/vision_tower.py", line 320, in get_image_embeddings
ERROR: embedding_tensor = self.process(
ERROR: ^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/vlm/vision_tower.py", line 244, in process
ERROR: hidden_states = module.forward(
ERROR: ^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/mlp.py", line 333, in forward
ERROR: return self.forward_torch(hidden_states, cache, attn_params, past_len, intermediates, loras = loras, **kwargs)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/mlp.py", line 471, in forward_torch
ERROR: up = self.up_proj.forward(post_norm, loras = loras)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 367, in forward
ERROR: hidden_states_out = torch.matmul(hidden_states, matrix)
ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR: torch.OutOfMemoryError: Allocation on device
Additional context
Qwen2.5-VL is a huge step forward, so it would be great to be able to run it. I would greatly appreciate any help; please let me know if I did something wrong or if I need to provide more information.
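In case it helps anyone hitting the same torch.OutOfMemoryError, one mitigation I plan to try (just an assumption on my part that it applies here, since the OOM happens despite plenty of free VRAM and so might be fragmentation-related) is enabling expandable segments in PyTorch's caching allocator:

    import os

    # Must be set before torch initializes CUDA; alternatively export
    # PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before running start.sh.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported only after the allocator config is set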