Description
When running the test.py script on a Jetson Orin Nano board inside a vLLM container, the system freezes and reboots. The issue occurs consistently regardless of the gpu_memory_utilization value (tested with 0.3, 0.5, and 0.8).
Steps to Reproduce
Use a Jetson Orin Nano board with vLLM installed inside a Docker container.
Run the following script (test.py):
#!/usr/bin/env python3
print('testing vLLM...')

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import xgrammar


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
    system_message = "You are a friendly chatbot who always responds in the style of a pirate."
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    sampling_params = SamplingParams(temperature=0, max_tokens=128)
    llm = LLM(model=model_path,
              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              gpu_memory_utilization=0.3)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
    filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
    print(xgrammar)
    print('vLLM OK\n')
Observe that the system freezes and reboots.
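For reference, here is the same LLM(...) call with the memory-related arguments written out explicitly. This is a sketch only, not an additional configuration that was tested: the values mirror what the engine log below reports for this run, and swap_space=4 GiB is vLLM's default, which matches the swap-space warning in that log.

# Sketch only: the constructor call from test.py with the memory-related
# arguments spelled out; values mirror the engine log below.
from vllm import LLM

model_path = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"   # GGUF file downloaded in test.py

llm = LLM(
    model=model_path,
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.3,   # 0.3, 0.5, and 0.8 were all tested
    swap_space=4,                 # GiB of CPU swap for the KV cache; 4 is the default
    enforce_eager=False,          # as in the log config (CUDA graph capture enabled)
    max_model_len=2048,           # matches max_seq_len=2048 in the log
)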
Expected Behavior
The script should execute without causing the system to freeze or reboot.
Actual Behavior
The system freezes and reboots during execution, regardless of the gpu_memory_utilization value.
Logs
Here are the logs captured before the system freezes:
testing vLLM...
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
tinyllama-1.1b-chat-v1.0.Q4_0.gguf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 638M/638M [01:17<00:00, 8.19MB/s]
INFO 01-27 12:38:07 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-27 12:38:27 config.py:510] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 01-27 12:38:27 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 01-27 12:38:27 config.py:1051] Possibly too large swap space. 4.00 GiB out of the 7.44 GiB total CPU memory is allocated for the swap space.
INFO 01-27 12:38:29 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<00:00, 2.91MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 10.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.43MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 1.19MB/s]
INFO 01-27 12:38:50 selector.py:120] Using Flash Attention backend.
INFO 01-27 12:38:51 model_runner.py:1094] Starting to load model /data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-27 12:39:09 model_runner.py:1099] Loading model weights took 0.5974 GB
INFO 01-27 12:39:16 worker.py:241] Memory profiling takes 7.06 seconds
INFO 01-27 12:39:16 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.50) = 3.72GiB
INFO 01-27 12:39:16 worker.py:241] model weights take 0.60GiB; non_torch_memory takes 0.78GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 2.05GiB.
INFO 01-27 12:39:17 gpu_executor.py:76] # GPU blocks: 6096, # CPU blocks: 11915
INFO 01-27 12:39:17 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 47.62x
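As a quick sanity check on the memory accounting above (the only assumption here is vLLM's default KV-cache block size of 16 tokens; every other number is taken verbatim from the log lines):

# Rough re-derivation of the figures reported by worker.py and gpu_executor.py
# above. Assumes the default KV-cache block size of 16 tokens; all other
# numbers are copied from the log.
total_gpu_gib = 7.44          # total_gpu_memory on the Orin Nano (shared CPU/GPU)
gpu_mem_util = 0.50           # gpu_memory_utilization for this particular run
budget_gib = total_gpu_gib * gpu_mem_util                    # 3.72 GiB

weights_gib, non_torch_gib, activation_gib = 0.60, 0.78, 0.30
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_gib
print(f"KV cache budget: {kv_cache_gib:.2f} GiB")            # ~2.04 GiB (log: 2.05 GiB)

block_size = 16               # assumed vLLM default (tokens per KV-cache block)
gpu_blocks = 6096             # "# GPU blocks" from the log
max_seq_len = 2048
concurrency = gpu_blocks * block_size / max_seq_len
print(f"Max concurrency: {concurrency:.2f}x")                # ~47.6x (log: 47.62x)

Note that the 7.44 GiB in the swap-space warning and the 7.44 GiB of total_gpu_memory are the same physical pool: the Orin Nano's CPU and integrated GPU share memory, so the 4 GiB reserved for KV-cache swap (vLLM's default swap_space) comes out of the same budget the GPU allocations are drawn from.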
Environment
Jetson Orin Nano (7.44 GiB of memory shared between CPU and GPU), vLLM v0.6.6.post1 running inside a Docker container, Python 3.10, Flash Attention backend (as reported in the logs above).