🐛 Bug
When I invoke serve.server with Mistral-7B, supplying the two required arguments, it crashes while starting the AsyncThreadedEngine: the engine shells out to mlc_chat.cli.model_metadata, which fails inside _print_kv_cache_metadata_in_json.
To Reproduce
Steps to reproduce the behavior:
- Use mlc_chat chat to generate/discover the compiled model lib for Mistral-7B on your system:
$ python3 -m mlc_chat chat HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
...
[2024-03-10 07:19:10] INFO download.py:124: Weights already downloaded: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:19:10] INFO chat_module.py:765: Model lib not found. Now compiling model lib on device...
[2024-03-10 07:19:10] INFO jit.py:34: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-03-10 07:19:10] INFO jit.py:116: Using cached model lib: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
- Invoke the server, passing the discovered weight and lib paths as arguments:
$ python3 -m mlc_chat.serve.server --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC" --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
The server then crashes in _print_kv_cache_metadata_in_json (full traceback under Additional context below).
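The failure can also be reproduced in isolation by running the metadata subcommand that the engine shells out to (the exact command is visible at the bottom of the traceback); it should exit non-zero with the same KeyError:
$ python3 -m mlc_chat.cli.model_metadata /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so --print-kv-cache-metadata-in-json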
Expected behavior
Server starts and services requests.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04 LTS
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060 12GB
- How you installed MLC-LLM (conda, source): conda
- How you installed TVM-Unity (pip, source): pip
- Python version (e.g. 3.10): 3.10
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information:
Additional context
This is the traceback from the crashing run:
$ python3 -m mlc_chat.serve.server --model "/home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC" --model-lib-path "/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so"
[2024-03-10 07:21:25] INFO auto_device.py:76: Found device: cuda:0
[2024-03-10 07:21:26] INFO auto_device.py:85: Not found device: rocm:0
[2024-03-10 07:21:27] INFO auto_device.py:85: Not found device: metal:0
[2024-03-10 07:21:27] INFO auto_device.py:76: Found device: vulkan:0
[2024-03-10 07:21:28] INFO auto_device.py:85: Not found device: opencl:0
[2024-03-10 07:21:28] INFO auto_device.py:33: Using device: cuda:0
[2024-03-10 07:21:28] INFO chat_module.py:373: Using model folder: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC
[2024-03-10 07:21:28] INFO chat_module.py:374: Using mlc chat config: /home/autoqa/.cache/mlc_chat/model_weights/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json
[2024-03-10 07:21:28] INFO chat_module.py:516: Using library model: /home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 194, in <module>
main()
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 188, in main
_print_kv_cache_metadata_in_json(metadata)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/cli/model_metadata.py", line 125, in _print_kv_cache_metadata_in_json
print(json.dumps(metadata["kv_cache"]))
KeyError: 'kv_cache'
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 56, in <module>
args: argparse.Namespace = parse_args_and_initialize()
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/server/__main__.py", line 46, in parse_args_and_initialize
engine = async_engine.AsyncThreadedEngine(
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/async_engine.py", line 151, in __init__
kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
File "/home/autoqa/.local/lib/python3.10/site-packages/mlc_chat/serve/engine.py", line 176, in _estimate_max_total_sequence_length
kv_cache_metadata_str = subprocess.check_output(cmd, universal_newlines=True)
File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'mlc_chat.cli.model_metadata', '/home/autoqa/.cache/mlc_chat/model_lib/635eeba1d562a81be3d2d543b8cf71dd.so', '--print-kv-cache-metadata-in-json']' returned non-zero exit status 1.
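Not proposing a fix here, but to illustrate where it falls over: model_metadata.py line 125 does print(json.dumps(metadata["kv_cache"])), and the metadata read from this JIT-compiled lib apparently has no "kv_cache" entry. A purely hypothetical defensive sketch of that helper (the error message and exit behavior are my own, not the project's) would look roughly like:

import json
import sys

def _print_kv_cache_metadata_in_json(metadata: dict) -> None:
    # Hypothetical sketch: guard the lookup that currently raises KeyError.
    kv_cache = metadata.get("kv_cache")
    if kv_cache is None:
        sys.exit('Model lib metadata has no "kv_cache" entry; was the lib compiled for serving?')
    print(json.dumps(kv_cache))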