
Commit e2cbcb4

Add user instructions for converting safetensors to gguf (#772)
Adds a note to [llama_serving.md](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md) that instructs the user on how to convert a collection of `.safetensors` weight files into a single `.gguf` file that can be used in the instructions that follow.
1 parent c71a250 commit e2cbcb4

File tree

2 files changed: +14 -2 lines changed


docs/shortfin/llm/user/llama_serving.md

+10
@@ -87,6 +87,16 @@ LLama3.1 8b f16 model.
 python -m sharktank.utils.hf_datasets llama3_8B_fp16 --local-dir $EXPORT_DIR
 ```
 
+> [!NOTE]
+> If you have the model weights as a collection of `.safetensors` files (downloaded from HuggingFace Model Hub, for example), you can use the `convert_hf_to_gguf.py` script from the [llama.cpp repository](https://github.com/ggerganov/llama.cpp) to convert them to a single `.gguf` file.
+> ```bash
+> export WEIGHTS_DIR=/path/to/safetensors/weights_directory/
+> git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
+> cd llama.cpp
+> python3 convert_hf_to_gguf.py $WEIGHTS_DIR --outtype f16 --outfile $EXPORT_DIR/<output_gguf_name>.gguf
+> ```
+> Now this GGUF file can be used in the instructions ahead.
+
 ### Define environment variables
 
 We'll first define some environment variables that are shared between the
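For context on the new note's `.safetensors` case: those weight files usually come from a HuggingFace Hub download. A minimal sketch of that step, assuming the `huggingface_hub` CLI is installed and using an illustrative repo id (the note itself does not prescribe one):

```bash
# Illustrative sketch: fetch a model's .safetensors weights from the
# HuggingFace Hub into the directory used as $WEIGHTS_DIR in the note.
# The repo id is a placeholder; substitute the model you actually want.
pip install -U "huggingface_hub[cli]"
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir "$WEIGHTS_DIR"
```

`convert_hf_to_gguf.py` can then be pointed at `$WEIGHTS_DIR` exactly as shown in the added note.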

docs/user_guide.md

+4 -2
@@ -78,8 +78,10 @@ To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python
 
 ### Llama 3.1
 
-To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).
+To get started with Llama 3.1, please follow the [Llama User Guide][1].
 
 * Once you've set up the Llama server in the guide above, we recommend that you use [SGLang Frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang` guide](shortfin/llm/user/shortfin_with_sglang_frontend_language.md)
 * If you would like to deploy LLama on a Kubernetes cluster we also provide a simple set of instructions and deployment configuration to do so [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
-* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant. In order to do this leverage the [HuggingFace](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) in combination with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s convert_hf_to_gguf.py. In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.
+* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant (explained in the [user guide][1]). In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.
+
+[1]: shortfin/llm/user/llama_serving.md
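To make the last bullet concrete, here is one possible end-to-end sketch for a different Llama 3.1 variant, combining `huggingface-cli` with llama.cpp's `convert_hf_to_gguf.py` as the replaced bullet text described; the repo id, paths, and output filename are illustrative assumptions, not values from this commit:

```bash
# Illustrative sketch: generate a gguf dataset for another Llama 3.1 variant.
# The repo id and file names below are placeholders.
export VARIANT_DIR=/tmp/llama3.1_70b_weights
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir "$VARIANT_DIR"

# Reuses the llama.cpp checkout from the note added in llama_serving.md.
cd llama.cpp
python3 convert_hf_to_gguf.py "$VARIANT_DIR" --outtype f16 \
  --outfile "$EXPORT_DIR/llama3.1_70b_instruct_f16.gguf"
```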
