
[Bug] SLM running serve on known-good chat model crashes on _print_kv_cache_metadata_in_json #1921

Closed
@Sing-Li

Description

🐛 Bug

When I invoke serve.server with Mistral-7B, supplying the two required arguments, the server crashes during startup: the async engine spawns a model-metadata subprocess, which fails inside _print_kv_cache_metadata_in_json, so the engine never starts.

To Reproduce

Steps to reproduce the behavior:

  1. Use mlc_chat chat to JIT-compile (or locate the cached) model lib for Mistral-7B on your system:
$ python3 -m mlc_chat chat  HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
...
[2024-03-10 07:19:10] INFO download.py:124: Weights already downloaded: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:19:10] INFO chat_module.py:765: Model lib not found. Now compiling model lib on device...
[2024-03-10 07:19:10] INFO jit.py:34: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-03-10 07:19:10] INFO jit.py:116: Using cached model lib: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
  2. Invoke the server, passing the discovered weights directory and model lib path as arguments to serve:
$ python3 -m mlc_chat.serve.server  --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC"  --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"

The server crashes in _print_kv_cache_metadata_in_json.
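
The failing step can also be reproduced in isolation by running the exact metadata subprocess that serve spawns (command and lib path copied verbatim from the trace in Additional context below; this is only an isolation sketch, not a workaround):

import subprocess
import sys

# Same command that serve builds in _estimate_max_total_sequence_length (see trace below).
lib_path = "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
cmd = [
    sys.executable,
    "-m",
    "mlc_chat.cli.model_metadata",
    lib_path,
    "--print-kv-cache-metadata-in-json",
]
result = subprocess.run(cmd, capture_output=True, universal_newlines=True)
print(result.returncode)  # non-zero exit on the affected lib
print(result.stderr)      # on my setup this ends with: KeyError: 'kv_cache'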

Expected behavior

The server starts and services requests.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04 LTS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060 12GB
  • How you installed MLC-LLM (conda, source): conda
  • How you installed TVM-Unity (pip, source): pip
  • Python version (e.g. 3.10): 3.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

This is the traceback from the crashing run:

$ python3 -m mlc_chat.serve.server  --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC"  --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
[2024-03-10 07:21:25] INFO auto_device.py:76: Found device: cuda:0
[2024-03-10 07:21:26] INFO auto_device.py:85: Not found device: rocm:0
[2024-03-10 07:21:27] INFO auto_device.py:85: Not found device: metal:0
[2024-03-10 07:21:27] INFO auto_device.py:76: Found device: vulkan:0
[2024-03-10 07:21:28] INFO auto_device.py:85: Not found device: opencl:0
[2024-03-10 07:21:28] INFO auto_device.py:33: Using device: cuda:0
[2024-03-10 07:21:28] INFO chat_module.py:373: Using model folder: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:21:28] INFO chat_module.py:374: Using mlc chat config: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json
[2024-03-10 07:21:28] INFO chat_module.py:516: Using library model: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 194, in <module>
    main()
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 188, in main
    _print_kv_cache_metadata_in_json(metadata)
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 125, in _print_kv_cache_metadata_in_json
    print(json.dumps(metadata["kv_cache"]))
KeyError: 'kv_cache'
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 56, in <module>
    args: argparse.Namespace = parse_args_and_initialize()
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 46, in parse_args_and_initialize
    engine = async_engine.AsyncThreadedEngine(
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/async_engine.py", line 151, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
  File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/engine.py", line 176, in _estimate_max_total_sequence_length
    kv_cache_metadata_str = subprocess.check_output(cmd, universal_newlines=True)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'mlc_chat.cli.model_metadata', '/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so', '--print-kv-cache-metadata-in-json']' returned non-zero exit status 1.
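
For reference, the failing call at model_metadata.py line 125 is an unguarded dictionary lookup. The standalone sketch below reconstructs the failure: the function body is taken from the trace, while the contents of metadata are only an assumption about what the JIT-compiled chat lib embeds (apparently no "kv_cache" entry):

import json

def _print_kv_cache_metadata_in_json(metadata):
    # Matches model_metadata.py line 125 in the trace above.
    print(json.dumps(metadata["kv_cache"]))

# Assumed metadata shape read from the chat-compiled lib; the point is the missing "kv_cache" key.
metadata = {"model_type": "mistral", "quantization": "q4f16_1"}

try:
    _print_kv_cache_metadata_in_json(metadata)
except KeyError as err:
    print("KeyError:", err)  # mirrors the exit-status-1 failure that serve turns into CalledProcessError

A guard such as metadata.get("kv_cache"), with an error message saying the compiled lib carries no KV-cache metadata, would at least make the failure self-explanatory instead of crashing the subprocess.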
