
Commit 2996ecf

stbaione authored and eagarvey-amd committed

Update user docs for running llm server + upgrade gguf to 0.11.0 (#676)
# Description

Did a pass through and made updates + fixes to the user docs for `e2e_llama8b_mi300x.md`:

1. Update install instructions for `shark-ai`
2. Update nightly install instructions for `shortfin` and `sharktank`
3. Update paths for model artifacts to ensure they work with `llama3.1-8b-fp16-instruct`
4. Remove steps to `write edited config`. No longer needed after #487

Added back `sentencepiece` as a requirement for `sharktank`. Not having it caused `export_paged_llm_v1` to break when installing nightly:

```text
ModuleNotFoundError: No module named 'sentencepiece'
```

This was masked when building from source, because `shortfin` includes `sentencepiece` in `requirements-tests.txt`.
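For anyone who hit the `ModuleNotFoundError` above, a quick pre-flight check of the environment is easy to script. A minimal sketch (not part of the docs change; the module names are just the ones this commit touches):

```python
# Illustrative environment check (not part of the docs change):
# confirm the modules this commit touches are importable.
import importlib.util

for mod in ("sentencepiece", "gguf"):
    if importlib.util.find_spec(mod) is None:
        print(f"{mod}: missing (pip install {mod})")
    else:
        print(f"{mod}: ok")
```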
1 parent 214ce10 commit 2996ecf

File tree

2 files changed: +23 -64 lines changed


docs/shortfin/llm/user/e2e_llama8b_mi300x.md (+22 -60)
````diff
@@ -22,32 +22,28 @@ python -m venv --prompt shark-ai .venv
 source .venv/bin/activate
 ```
 
-### Install `shark-ai`
+## Install stable shark-ai packages
 
-You can install either the `latest stable` version of `shark-ai`
-or the `nightly` version:
-
-#### Stable
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->
 
 ```bash
-pip install shark-ai
+pip install shark-ai[apps] sharktank
 ```
 
-#### Nightly
-
-```bash
-pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
-```
+### Nightly packages
 
-#### Install dataclasses-json
+To install nightly packages:
 
-<!-- TODO: This should be included in release: -->
+<!-- TODO: Add `sharktank` to `shark-ai` meta package -->
 
 ```bash
-pip install dataclasses-json
+pip install shark-ai[apps] sharktank \
+  --pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
 ```
 
+See also the
+[instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).
+
 ### Define a directory for export files
 
 Create a new directory for us to export files like
````
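Reviewer note: a quick way to confirm which install path actually landed in the venv is to query package metadata. A minimal sketch, assuming the package names used in the commands above:

```python
# Minimal sketch: report installed versions of the packages the
# updated install instructions mention.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("shark-ai", "sharktank", "shortfin"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```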
````diff
@@ -78,8 +74,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
 that were downloaded in the previous step.
 
 ```bash
-export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
-export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
+export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
+export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
 ```
 
 #### General env vars
````
````diff
@@ -91,8 +87,6 @@ The following env vars can be copy + pasted directly:
 export MLIR_PATH=$EXPORT_DIR/model.mlir
 # Path to export config.json file
 export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
-# Path to export edited_config.json file
-export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
 # Path to export model.vmfb file
 export VMFB_PATH=$EXPORT_DIR/model.vmfb
 # Batch size for kvcache
````
````diff
@@ -108,7 +102,7 @@ to export our model to `.mlir` format.
 
 ```bash
 python -m sharktank.examples.export_paged_llm_v1 \
-  --irpa-file=$MODEL_PARAMS_PATH \
+  --gguf-file=$MODEL_PARAMS_PATH \
   --output-mlir=$MLIR_PATH \
   --output-config=$OUTPUT_CONFIG_PATH \
   --bs=$BS
````
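The flag switch here tracks the artifact format: the docs now point `MODEL_PARAMS_PATH` at a `.gguf` file, so the export uses `--gguf-file` instead of `--irpa-file`. A purely illustrative sketch of picking the flag from the file extension (both flags appear in this hunk; the dispatch logic is an assumption):

```python
# Illustrative: choose the export_paged_llm_v1 input flag by extension.
import os

model_path = os.environ["MODEL_PARAMS_PATH"]
flag = "--gguf-file" if model_path.endswith(".gguf") else "--irpa-file"
print(f"{flag}={model_path}")
```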
````diff
@@ -137,37 +131,6 @@ iree-compile $MLIR_PATH \
   -o $VMFB_PATH
 ```
 
-## Write an edited config
-
-We need to write a config for our model with a slightly edited structure
-to run with shortfin. This will work for the example in our docs.
-You may need to modify some of the parameters for a specific model.
-
-### Write edited config
-
-```bash
-cat > $EDITED_CONFIG_PATH << EOF
-{
-    "module_name": "module",
-    "module_abi_version": 1,
-    "max_seq_len": 131072,
-    "attn_head_count": 8,
-    "attn_head_dim": 128,
-    "prefill_batch_sizes": [
-        $BS
-    ],
-    "decode_batch_sizes": [
-        $BS
-    ],
-    "transformer_block_count": 32,
-    "paged_kv_cache": {
-        "block_seq_stride": 16,
-        "device_block_count": 256
-    }
-}
-EOF
-```
-
 ## Running the `shortfin` LLM server
 
 We should now have all of the files that we need to run the shortfin LLM server.
````
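Since the hand-written `edited_config.json` step is gone (per #487, the `config.json` emitted by `export_paged_llm_v1` is used directly), a quick way to eyeball the generated config is a sketch like the following. The keys shown are taken from the removed example above and may differ per model; `EXPORT_DIR` is assumed to be set as in the docs:

```python
# Sketch: inspect the config.json written by export_paged_llm_v1.
import json
import os

with open(os.path.join(os.environ["EXPORT_DIR"], "config.json")) as f:
    cfg = json.load(f)

for key in ("max_seq_len", "prefill_batch_sizes", "decode_batch_sizes"):
    print(key, "=", cfg.get(key))
```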
````diff
@@ -178,15 +141,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
 ls $EXPORT_DIR
 ```
 
-- edited_config.json
+- config.json
+- meta-llama-3.1-8b-instruct.f16.gguf
+- model.mlir
 - model.vmfb
+- tokenizer_config.json
+- tokenizer.json
 
-### Launch server:
-
-<!-- #### Set the target device
-
-TODO: Add instructions on targeting different devices,
-when `--device=hip://$DEVICE` is supported -->
+### Launch server
 
 #### Run the shortfin server
 
````
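The `ls $EXPORT_DIR` check above can also be scripted. A sketch using the exact file list from the updated docs:

```python
# Sketch: verify the export directory contains the files the docs list.
import os

expected = [
    "config.json",
    "meta-llama-3.1-8b-instruct.f16.gguf",
    "model.mlir",
    "model.vmfb",
    "tokenizer_config.json",
    "tokenizer.json",
]
export_dir = os.environ["EXPORT_DIR"]
missing = [f for f in expected if not os.path.exists(os.path.join(export_dir, f))]
print("missing:", missing or "none")
```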
````diff
@@ -209,7 +171,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
 ```bash
 python -m shortfin_apps.llm.server \
   --tokenizer_json=$TOKENIZER_PATH \
-  --model_config=$EDITED_CONFIG_PATH \
+  --model_config=$OUTPUT_CONFIG_PATH \
   --vmfb=$VMFB_PATH \
   --parameters=$MODEL_PARAMS_PATH \
   --device=hip > shortfin_llm_server.log 2>&1 &
````
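Because the server is launched in the background with its output redirected to `shortfin_llm_server.log`, it helps to poll until it is accepting connections before sending requests. A sketch, where port 8000 and the `/health` route are assumptions for illustration rather than documented guarantees:

```python
# Sketch: wait for the background shortfin server to accept requests.
# Port 8000 and the /health route are assumptions for illustration.
import time

import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:8000/health", timeout=1).ok:
            print("server is up")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(1)
else:
    print("server did not come up; check shortfin_llm_server.log")
```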
````diff
@@ -252,7 +214,7 @@ port = 8000 # Change if running on a different port
 generate_url = f"http://localhost:{port}/generate"
 
 def generation_request():
-    payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
+    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
     try:
         resp = requests.post(generate_url, json=payload)
         resp.raise_for_status() # Raises an HTTPError for bad responses
````
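For completeness, here is a self-contained version of the client snippet this hunk edits. The hunk cuts off after `raise_for_status()`, so the response handling and error branch at the end are assumptions added for illustration:

```python
# Self-contained sketch of the docs' client snippet; printing resp.text
# and the except branch are illustrative (the diff ends before them).
import requests

port = 8000  # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
    payload = {
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    }
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")

generation_request()
```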

sharktank/requirements.txt (+1 -4)
````diff
@@ -1,12 +1,9 @@
 iree-turbine
 
 # Runtime deps.
-gguf==0.10.0
+gguf>=0.11.0
 numpy<2.0
 
-# Needed for newer gguf versions (TODO: remove when gguf package includes this)
-# sentencepiece>=0.1.98,<=0.2.0
-
 # Model deps.
 huggingface-hub==0.22.2
 transformers==4.40.0
````
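After installing, a short check that the loosened pin resolved to a new enough `gguf` (and that `sentencepiece` is present, per the description above) might look like this sketch:

```python
# Sketch: check installed versions against the updated requirements.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("gguf", "sentencepiece"):
    try:
        print(pkg, version(pkg))  # gguf should resolve to >= 0.11.0
    except PackageNotFoundError:
        print(pkg, "not installed")
```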
