
Commit 2996ecf

stbaione authored and eagarvey-amd committed

Update user docs for running llm server + upgrade gguf to 0.11.0 (#676)
# Description

Did a pass through and made updates + fixes to the user docs for `e2e_llama8b_mi300x.md`:

1. Update install instructions for `shark-ai`
2. Update nightly install instructions for `shortfin` and `sharktank`
3. Update paths for model artifacts to ensure they work with `llama3.1-8b-fp16-instruct`
4. Remove steps to `write edited config`. No longer needed after #487

Added back `sentencepiece` as a requirement for `sharktank`. Not having it caused `export_paged_llm_v1` to break when installing nightly:

```text
ModuleNotFoundError: No module named 'sentencepiece'
```

This was masked when building from source, because `shortfin` includes `sentencepiece` in `requirements-tests.txt`.
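For anyone who hit the `ModuleNotFoundError` above, a quick pre-flight check of the environment is easy to script. A minimal sketch (not part of the docs change; the module names are just the ones this commit touches):

```python
# Illustrative environment check (not part of the docs change):
# confirm the modules this commit touches are importable.
import importlib.util

for mod in ("sentencepiece", "gguf"):
    if importlib.util.find_spec(mod) is None:
        print(f"{mod}: missing (pip install {mod})")
    else:
        print(f"{mod}: ok")
```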
1 parent 214ce10 commit 2996ecf

File tree

2 files changed: +23 -64 lines changed


docs/shortfin/llm/user/e2e_llama8b_mi300x.md (+22 -60)
````diff
@@ -22,32 +22,28 @@ python -m venv --prompt shark-ai .venv
 source .venv/bin/activate
 ```
 
-### Install `shark-ai`
+## Install stable shark-ai packages
 
-You can install either the `latest stable` version of `shark-ai`
-or the `nightly` version:
-
-#### Stable
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->
 
 ```bash
-pip install shark-ai
+pip install shark-ai[apps] sharktank
 ```
 
-#### Nightly
-
-```bash
-pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-```
+### Nightly packages
 
-#### Install dataclasses-json
+To install nightly packages:
 
-<!-- TODO: This should be included in release: -->
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->
 
 ```bash
-pip install dataclasses-json
+pip install shark-ai[apps] sharktank \
+  --pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
 ```
 
+See also the
+[instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).
+
 ### Define a directory for export files
 
 Create a new directory for us to export files like
````
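Reviewer note: a quick way to confirm which install path actually landed in the venv is to query package metadata. A minimal sketch, assuming the package names used in the commands above:

```python
# Minimal sketch: report installed versions of the packages the
# updated install instructions mention.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("shark-ai", "sharktank", "shortfin"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```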
````diff
@@ -78,8 +74,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
 that were downloaded in the previous step.
 
 ```bash
-export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
-export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
+export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
+export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
 ```
 
 #### General env vars
````
````diff
@@ -91,8 +87,6 @@ The following env vars can be copy + pasted directly:
 export MLIR_PATH=$EXPORT_DIR/model.mlir
 # Path to export config.json file
 export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
-# Path to export edited_config.json file
-export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
 # Path to export model.vmfb file
 export VMFB_PATH=$EXPORT_DIR/model.vmfb
 # Batch size for kvcache
````
````diff
@@ -108,7 +102,7 @@ to export our model to `.mlir` format.
 
 ```bash
 python -m sharktank.examples.export_paged_llm_v1 \
-  --irpa-file=$MODEL_PARAMS_PATH \
+  --gguf-file=$MODEL_PARAMS_PATH \
   --output-mlir=$MLIR_PATH \
   --output-config=$OUTPUT_CONFIG_PATH \
   --bs=$BS
````
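The flag switch here tracks the artifact format: the docs now point `MODEL_PARAMS_PATH` at a `.gguf` file, so the export uses `--gguf-file` instead of `--irpa-file`. A purely illustrative sketch of picking the flag from the file extension (both flags appear in this hunk; the dispatch logic is an assumption):

```python
# Illustrative: choose the export_paged_llm_v1 input flag by extension.
import os

model_path = os.environ["MODEL_PARAMS_PATH"]
flag = "--gguf-file" if model_path.endswith(".gguf") else "--irpa-file"
print(f"{flag}={model_path}")
```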
````diff
@@ -137,37 +131,6 @@ iree-compile $MLIR_PATH \
   -o $VMFB_PATH
 ```
 
-## Write an edited config
-
-We need to write a config for our model with a slightly edited structure
-to run with shortfin. This will work for the example in our docs.
-You may need to modify some of the parameters for a specific model.
-
-### Write edited config
-
-```bash
-cat > $EDITED_CONFIG_PATH << EOF
-{
-    "module_name": "module",
-    "module_abi_version": 1,
-    "max_seq_len": 131072,
-    "attn_head_count": 8,
-    "attn_head_dim": 128,
-    "prefill_batch_sizes": [
-        $BS
-    ],
-    "decode_batch_sizes": [
-        $BS
-    ],
-    "transformer_block_count": 32,
-    "paged_kv_cache": {
-        "block_seq_stride": 16,
-        "device_block_count": 256
-    }
-}
-EOF
-```
-
 ## Running the `shortfin` LLM server
 
 We should now have all of the files that we need to run the shortfin LLM server.
````
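Since the hand-written `edited_config.json` step is gone (per #487, the `config.json` emitted by `export_paged_llm_v1` is used directly), a quick way to eyeball the generated config is a sketch like the following. The keys shown are taken from the removed example above and may differ per model; `EXPORT_DIR` is assumed to be set as in the docs:

```python
# Sketch: inspect the config.json written by export_paged_llm_v1.
import json
import os

with open(os.path.join(os.environ["EXPORT_DIR"], "config.json")) as f:
    cfg = json.load(f)

for key in ("max_seq_len", "prefill_batch_sizes", "decode_batch_sizes"):
    print(key, "=", cfg.get(key))
```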
````diff
@@ -178,15 +141,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
 ls $EXPORT_DIR
 ```
 
-- edited_config.json
+- config.json
+- meta-llama-3.1-8b-instruct.f16.gguf
+- model.mlir
 - model.vmfb
+- tokenizer_config.json
+- tokenizer.json
 
-### Launch server:
-
-<!-- #### Set the target device
-
-TODO: Add instructions on targeting different devices,
-when `--device=hip://$DEVICE` is supported -->
+### Launch server
 
 #### Run the shortfin server
 
````
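The `ls $EXPORT_DIR` check above can also be scripted. A sketch using the exact file list from the updated docs:

```python
# Sketch: verify the export directory contains the files the docs list.
import os

expected = [
    "config.json",
    "meta-llama-3.1-8b-instruct.f16.gguf",
    "model.mlir",
    "model.vmfb",
    "tokenizer_config.json",
    "tokenizer.json",
]
export_dir = os.environ["EXPORT_DIR"]
missing = [f for f in expected if not os.path.exists(os.path.join(export_dir, f))]
print("missing:", missing or "none")
```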
````diff
@@ -209,7 +171,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
 ```bash
 python -m shortfin_apps.llm.server \
   --tokenizer_json=$TOKENIZER_PATH \
-  --model_config=$EDITED_CONFIG_PATH \
+  --model_config=$OUTPUT_CONFIG_PATH \
   --vmfb=$VMFB_PATH \
   --parameters=$MODEL_PARAMS_PATH \
   --device=hip > shortfin_llm_server.log 2>&1 &
````
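Because the server is launched in the background with its output redirected to `shortfin_llm_server.log`, it helps to poll until it is accepting connections before sending requests. A sketch, where port 8000 and the `/health` route are assumptions for illustration rather than documented guarantees:

```python
# Sketch: wait for the background shortfin server to accept requests.
# Port 8000 and the /health route are assumptions for illustration.
import time

import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:8000/health", timeout=1).ok:
            print("server is up")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(1)
else:
    print("server did not come up; check shortfin_llm_server.log")
```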
````diff
@@ -252,7 +214,7 @@ port = 8000 # Change if running on a different port
 generate_url = f"http://localhost:{port}/generate"
 
 def generation_request():
-    payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
+    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
     try:
         resp = requests.post(generate_url, json=payload)
         resp.raise_for_status() # Raises an HTTPError for bad responses
````
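For completeness, here is a self-contained version of the client snippet this hunk edits. The hunk cuts off after `raise_for_status()`, so the response handling and error branch at the end are assumptions added for illustration:

```python
# Self-contained sketch of the docs' client snippet; printing resp.text
# and the except branch are illustrative (the diff ends before them).
import requests

port = 8000  # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
    payload = {
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    }
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")

generation_request()
```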

sharktank/requirements.txt (+1 -4)
````diff
@@ -1,12 +1,9 @@
 iree-turbine
 
 # Runtime deps.
-gguf==0.10.0
+gguf>=0.11.0
 numpy<2.0
 
-# Needed for newer gguf versions (TODO: remove when gguf package includes this)
-# sentencepiece>=0.1.98,<=0.2.0
-
 # Model deps.
 huggingface-hub==0.22.2
 transformers==4.40.0
````
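After installing, a short check that the loosened pin resolved to a new enough `gguf` (and that `sentencepiece` is present, per the description above) might look like this sketch:

```python
# Sketch: check installed versions against the updated requirements.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("gguf", "sentencepiece"):
    try:
        print(pkg, version(pkg))  # gguf should resolve to >= 0.11.0
    except PackageNotFoundError:
        print(pkg, "not installed")
```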
