@@ -22,32 +22,28 @@ python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```
- ### Install `shark-ai`
+ ## Install stable shark-ai packages
- You can install either the `latest stable` version of `shark-ai`
- or the `nightly` version:
-
- #### Stable
+ <!-- TODO: Add `sharktank` to `shark-ai` meta package -->
```bash
- pip install shark-ai
+ pip install shark-ai[apps] sharktank
```
- #### Nightly
-
- ```bash
- pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
- pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
- ```
+ ### Nightly packages
- #### Install dataclasses-json
+ To install nightly packages:
- <!-- TODO: This should be included in release: -->
+ <!-- TODO: Add `sharktank` to `shark-ai` meta package -->
```bash
- pip install dataclasses-json
+ pip install shark-ai[apps] sharktank \
+ --pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```
+ See also the
+ [instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).
+
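Whichever channel you install from, a quick import check helps confirm the packages resolved before moving on. A minimal sketch, assuming the `sharktank` and `shortfin` Python modules are the ones provided by the packages installed above:

```bash
# Both imports should succeed without errors.
python -c "import sharktank, shortfin; print('shark-ai packages OK')"
```
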
### Define a directory for export files
Create a new directory for us to export files like
@@ -78,8 +74,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step.
```bash
- export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
- export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
+ export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
+ export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```
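
As a guard against path typos, you can check that both files exist before exporting anything. A minimal sketch, using only the variables defined above:

```bash
# Both files should be listed; an error here means the download or the paths need fixing.
ls -lh "$MODEL_PARAMS_PATH" "$TOKENIZER_PATH"
```
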
#### General env vars
@@ -91,8 +87,6 @@ The following env vars can be copy + pasted directly:
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
- # Path to export edited_config.json file
- export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
@@ -108,7 +102,7 @@ to export our model to `.mlir` format.
```bash
python -m sharktank.examples.export_paged_llm_v1 \
- --irpa-file=$MODEL_PARAMS_PATH \
+ --gguf-file=$MODEL_PARAMS_PATH \
--output-mlir=$MLIR_PATH \
--output-config=$OUTPUT_CONFIG_PATH \
--bs=$BS
@@ -137,37 +131,6 @@ iree-compile $MLIR_PATH \
-o $VMFB_PATH
```
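
Compilation can take a few minutes. Once `iree-compile` returns, the compiled module should exist at the path exported earlier; a quick check, using only the variables defined above:

```bash
# A non-empty model.vmfb indicates compilation succeeded.
ls -lh "$VMFB_PATH"
```
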
- ## Write an edited config
-
- We need to write a config for our model with a slightly edited structure
- to run with shortfin. This will work for the example in our docs.
- You may need to modify some of the parameters for a specific model.
-
- ### Write edited config
-
- ```bash
- cat > $EDITED_CONFIG_PATH << EOF
- {
-   "module_name": "module",
-   "module_abi_version": 1,
-   "max_seq_len": 131072,
-   "attn_head_count": 8,
-   "attn_head_dim": 128,
-   "prefill_batch_sizes": [
-     $BS
-   ],
-   "decode_batch_sizes": [
-     $BS
-   ],
-   "transformer_block_count": 32,
-   "paged_kv_cache": {
-     "block_seq_stride": 16,
-     "device_block_count": 256
-   }
- }
- EOF
- ```
-
## Running the `shortfin` LLM server
We should now have all of the files that we need to run the shortfin LLM server.
@@ -178,15 +141,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
ls $EXPORT_DIR
```
- - edited_config.json
+ - config.json
+ - meta-llama-3.1-8b-instruct.f16.gguf
+ - model.mlir
- model.vmfb
+ - tokenizer_config.json
+ - tokenizer.json
- ### Launch server:
-
- <!-- #### Set the target device
-
- TODO: Add instructions on targeting different devices,
- when `--device=hip://$DEVICE` is supported -->
+ ### Launch server
#### Run the shortfin server
@@ -209,7 +171,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
```bash
python -m shortfin_apps.llm.server \
--tokenizer_json=$TOKENIZER_PATH \
- --model_config=$EDITED_CONFIG_PATH \
+ --model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip > shortfin_llm_server.log 2>&1 &
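# Optional: the server runs in the background with its output redirected, so the log
# file set above is the place to confirm it started cleanly before sending requests.
tail -n 50 shortfin_llm_server.log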
@@ -252,7 +214,7 @@ port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
-    payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
+    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status() # Raises an HTTPError for bad responses