Description
When running the test.py script on a Jetson Orin Nano board inside a vLLM container, the system freezes and reboots. The issue occurs consistently regardless of the gpu_memory_utilization value (tested with 0.3, 0.5, and 0.8).
Steps to Reproduce
Use a Jetson Orin Nano board with vLLM installed inside a Docker container.
Run the following script (test.py):
#!/usr/bin/env python3
print('testing vLLM...')

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import xgrammar


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
    system_message = "You are a friendly chatbot who always responds in the style of a pirate."
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    sampling_params = SamplingParams(temperature=0, max_tokens=128)
    llm = LLM(model=model_path,
              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              gpu_memory_utilization=0.3)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
    filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
    print(xgrammar)
    print('vLLM OK\n')
Observe that the system freezes and reboots.
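For reference, here is the same LLM(...) call with the memory-related arguments written out explicitly. This is a sketch only, not an additional configuration that was tested: the values mirror what the engine log below reports for this run, and swap_space=4 GiB is vLLM's default, which matches the swap-space warning in that log.

# Sketch only: the constructor call from test.py with the memory-related
# arguments spelled out; values mirror the engine log below.
from vllm import LLM

model_path = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"   # GGUF file downloaded in test.py

llm = LLM(
    model=model_path,
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.3,   # 0.3, 0.5, and 0.8 were all tested
    swap_space=4,                 # GiB of CPU swap for the KV cache; 4 is the default
    enforce_eager=False,          # as in the log config (CUDA graph capture enabled)
    max_model_len=2048,           # matches max_seq_len=2048 in the log
)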
Expected Behavior
The script should execute without causing the system to freeze or reboot.
Actual Behavior
The system freezes and reboots during execution, regardless of the gpu_memory_utilization value.
Logs
Here are the logs captured before the system freezes:
testing vLLM...
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
tinyllama-1.1b-chat-v1.0.Q4_0.gguf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 638M/638M [01:17<00:00, 8.19MB/s]
INFO 01-27 12:38:07 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-27 12:38:27 config.py:510] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
WARNING 01-27 12:38:27 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 01-27 12:38:27 config.py:1051] Possibly too large swap space. 4.00 GiB out of the 7.44 GiB total CPU memory is allocated for the swap space.
INFO 01-27 12:38:29 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<00:00, 2.91MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 10.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.43MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 1.19MB/s]
INFO 01-27 12:38:50 selector.py:120] Using Flash Attention backend.
INFO 01-27 12:38:51 model_runner.py:1094] Starting to load model /data/models/huggingface/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_0.gguf...
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-27 12:39:09 model_runner.py:1099] Loading model weights took 0.5974 GB
INFO 01-27 12:39:16 worker.py:241] Memory profiling takes 7.06 seconds
INFO 01-27 12:39:16 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.50) = 3.72GiB
INFO 01-27 12:39:16 worker.py:241] model weights take 0.60GiB; non_torch_memory takes 0.78GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 2.05GiB.
INFO 01-27 12:39:17 gpu_executor.py:76] # GPU blocks: 6096, # CPU blocks: 11915
INFO 01-27 12:39:17 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 47.62x
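As a quick sanity check on the memory accounting above (the only assumption here is vLLM's default KV-cache block size of 16 tokens; every other number is taken verbatim from the log lines):

# Rough re-derivation of the figures reported by worker.py and gpu_executor.py
# above. Assumes the default KV-cache block size of 16 tokens; all other
# numbers are copied from the log.
total_gpu_gib = 7.44          # total_gpu_memory on the Orin Nano (shared CPU/GPU)
gpu_mem_util = 0.50           # gpu_memory_utilization for this particular run
budget_gib = total_gpu_gib * gpu_mem_util                    # 3.72 GiB

weights_gib, non_torch_gib, activation_gib = 0.60, 0.78, 0.30
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_gib
print(f"KV cache budget: {kv_cache_gib:.2f} GiB")            # ~2.04 GiB (log: 2.05 GiB)

block_size = 16               # assumed vLLM default (tokens per KV-cache block)
gpu_blocks = 6096             # "# GPU blocks" from the log
max_seq_len = 2048
concurrency = gpu_blocks * block_size / max_seq_len
print(f"Max concurrency: {concurrency:.2f}x")                # ~47.6x (log: 47.62x)

Note that the 7.44 GiB in the swap-space warning and the 7.44 GiB of total_gpu_memory are the same physical pool: the Orin Nano's CPU and integrated GPU share memory, so the 4 GiB reserved for KV-cache swap (vLLM's default swap_space) comes out of the same budget the GPU allocations are drawn from.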
Environment
Jetson Orin Nano (7.44 GiB of memory shared between CPU and GPU), vLLM v0.6.6.post1 running inside a Docker container, Python 3.10, Flash Attention backend (as reported in the logs above).