🐛 Bug
When I invoke serve.server with Mistral-7B, supplying the two required arguments, it crashes while starting the AsyncThreadedEngine: the engine shells out to mlc_chat.cli.model_metadata, which fails inside _print_kv_cache_metadata_in_json.
To Reproduce
Steps to reproduce the behavior:
- Use mlc_chat chat to generate/discover the compiled model lib for Mistral-7B on your system:
$ python3 -m mlc_chat chat HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
...
[2024-03-10 07:19:10] INFO download.py:124: Weights already downloaded: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:19:10] INFO chat_module.py:765: Model lib not found. Now compiling model lib on device...
[2024-03-10 07:19:10] INFO jit.py:34: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-03-10 07:19:10] INFO jit.py:116: Using cached model lib: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
- Invoke the server, passing the discovered weight and lib paths as arguments:
$ python3 -m mlc_chat.serve.server --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC" --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
The server then crashes in _print_kv_cache_metadata_in_json (full traceback under Additional context below).
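The failure can also be reproduced in isolation by running the metadata subcommand that the engine shells out to (the exact command is visible at the bottom of the traceback); it should exit non-zero with the same KeyError:
$ python3 -m mlc_chat.cli.model_metadata /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so --print-kv-cache-metadata-in-json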
Expected behavior
Server starts and services requests.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04 LTS
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060 12GB
- How you installed MLC-LLM (conda, source): conda
- How you installed TVM-Unity (pip, source): pip
- Python version (e.g. 3.10): 3.10
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information:
Additional context
This is the traceback from the crashing run:
$ python3 -m mlc_chat.serve.server --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC" --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
[2024-03-10 07:21:25] INFO auto_device.py:76: Found device: cuda:0
[2024-03-10 07:21:26] INFO auto_device.py:85: Not found device: rocm:0
[2024-03-10 07:21:27] INFO auto_device.py:85: Not found device: metal:0
[2024-03-10 07:21:27] INFO auto_device.py:76: Found device: vulkan:0
[2024-03-10 07:21:28] INFO auto_device.py:85: Not found device: opencl:0
[2024-03-10 07:21:28] INFO auto_device.py:33: Using device: cuda:0
[2024-03-10 07:21:28] INFO chat_module.py:373: Using model folder: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:21:28] INFO chat_module.py:374: Using mlc chat config: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json
[2024-03-10 07:21:28] INFO chat_module.py:516: Using library model: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 194, in <module>
main()
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 188, in main
_print_kv_cache_metadata_in_json(metadata)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 125, in _print_kv_cache_metadata_in_json
print(json.dumps(metadata["kv_cache"]))
KeyError: 'kv_cache'
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 56, in <module>
args: argparse.Namespace = parse_args_and_initialize()
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 46, in parse_args_and_initialize
engine = async_engine.AsyncThreadedEngine(
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/async_engine.py", line 151, in __init__
kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/engine.py", line 176, in _estimate_max_total_sequence_length
kv_cache_metadata_str = subprocess.check_output(cmd, universal_newlines=True)
File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'mlc_chat.cli.model_metadata', '/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so', '--print-kv-cache-metadata-in-json']' returned non-zero exit status 1.
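Not proposing a fix here, but to illustrate where it falls over: model_metadata.py line 125 does print(json.dumps(metadata["kv_cache"])), and the metadata read from this JIT-compiled lib apparently has no "kv_cache" entry. A purely hypothetical defensive sketch of that helper (the error message and exit behavior are my own, not the project's) would look roughly like:

import json
import sys

def _print_kv_cache_metadata_in_json(metadata: dict) -> None:
    # Hypothetical sketch: guard the lookup that currently raises KeyError.
    kv_cache = metadata.get("kv_cache")
    if kv_cache is None:
        sys.exit('Model lib metadata has no "kv_cache" entry; was the lib compiled for serving?')
    print(json.dumps(kv_cache))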