We will use a Llama 3.1 8B f16 model as an example to describe the process of
exporting a model for use in the shortfin LLM server on an MI300 GPU.
- Python >= 3.11 is recommended for this flow.
- pyenv is a good tool for managing multiple Python versions on the same system (see the sketch after this list).
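If you choose to use pyenv, a minimal setup might look like the following. The version string (`3.11`) is an assumption for this flow; older pyenv releases may require a full version such as `3.11.9`:

```bash
# Hypothetical pyenv workflow for pinning Python 3.11 in this project directory.
pyenv install 3.11   # build/install a Python 3.11 interpreter (one-time)
pyenv local 3.11     # pin this directory to that interpreter
python --version     # confirm a 3.11.x interpreter is active
```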
To start, create and activate a new virtual environment:

```bash
python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```
First, install a torch version that fulfills your needs:

```bash
# Fast installation of torch with just CPU support.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

For other options, see https://pytorch.org/get-started/locally/.
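As an optional sanity check (not part of the required flow), you can confirm that torch imports cleanly; with the CPU-only wheel above, `torch.cuda.is_available()` is expected to report `False`:

```bash
# Print the installed torch version and whether a GPU backend is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```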
Next, install shark-ai:

```bash
pip install shark-ai[apps]
```

> **Tip:** To switch from the stable release channel to the nightly release channel,
> see nightly_releases.md.
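Optionally, you can confirm the packages landed in the active virtual environment; the exact package names reported by `pip list` may vary between releases:

```bash
# List installed packages whose names mention shark or shortfin.
pip list | grep -iE "shark|shortfin"
```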
Create a new directory to hold exported files like `model.mlir`, `model.vmfb`, etc.:

```bash
mkdir $PWD/export
export EXPORT_DIR=$PWD/export
```
We will use the `hf_datasets` module in `sharktank` to download a Llama 3.1 8B f16 model:

```bash
python -m sharktank.utils.hf_datasets llama3_8B_fp16 --local-dir $EXPORT_DIR
```
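Before moving on, it can be worth confirming that the download completed. The `.gguf` file for an 8B f16 model is large (on the order of 16 GB), and exact filenames may differ depending on the dataset revision:

```bash
# Expect a *.gguf file plus tokenizer.json / tokenizer_config.json.
ls -lh $EXPORT_DIR
```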
Define the following environment variables to make running this example a bit easier.
This example uses the `meta-llama-3.1-8b-instruct.f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step:

```bash
export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```
The following environment variables can be copied and pasted directly:

```bash
# Path to export the model.mlir file
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export the config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
# Path to export the model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch sizes for the KV cache
export BS=1,4
# NOTE: This is temporary, until multi-device is fixed
export ROCR_VISIBLE_DEVICES=1
```
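If you want to double-check which AMD GPUs are visible before exporting and compiling, and assuming the ROCm tools are installed on the host, something like the following can help:

```bash
echo $ROCR_VISIBLE_DEVICES   # the device index we restricted ourselves to above
rocm-smi                     # lists the GPUs ROCm can see (requires ROCm to be installed)
```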
We will now use the `sharktank.examples.export_paged_llm_v1` script to export our model
to `.mlir` format:

```bash
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=$MODEL_PARAMS_PATH \
  --output-mlir=$MLIR_PATH \
  --output-config=$OUTPUT_CONFIG_PATH \
  --bs=$BS
```
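It can be useful to confirm that the export step produced both artifacts before compiling; this is just a quick optional check:

```bash
# Both files should exist and be non-empty after a successful export.
ls -lh $MLIR_PATH $OUTPUT_CONFIG_PATH
```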
Now that we have generated a `model.mlir` file, we can compile it to `.vmfb` format,
which is required for running the shortfin LLM server.

We will use the `iree-compile` tool to compile our model.

> **Note:** This command is specific to MI300 GPUs. For other `--iree-hip-target` GPU
> options, see the IREE documentation on supported HIP targets.

```bash
iree-compile $MLIR_PATH \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o $VMFB_PATH
```
We should now have all of the files that we need to run the shortfin LLM server.

Verify that you have the following files in your specified directory (`$EXPORT_DIR`):

```bash
ls $EXPORT_DIR
```

- config.json
- meta-llama-3.1-8b-instruct.f16.gguf
- model.mlir
- model.vmfb
- tokenizer_config.json
- tokenizer.json
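If you prefer an explicit check over eyeballing the listing, a small loop like the one below reports anything missing; the filename list simply mirrors the expected files above:

```bash
# Report any expected artifact that is missing from $EXPORT_DIR.
for f in config.json meta-llama-3.1-8b-instruct.f16.gguf model.mlir model.vmfb \
         tokenizer_config.json tokenizer.json; do
  [ -f "$EXPORT_DIR/$f" ] || echo "missing: $f"
done
```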
Now that we are finished with setup, we can start the Shortfin LLM Server.
Run the following command to launch the Shortfin LLM Server in the background:
> **Note:** By default, our server will start at `http://localhost:8000`.
> You can specify the `--host` and/or `--port` arguments to run at a different address.
>
> If you receive an error similar to `[errno 98] address already in use`, you can confirm
> that the port is in use with `ss -ntl | grep 8000` and either kill the process running
> at that port or start the shortfin server at a different port.
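For example, one way to resolve an `address already in use` error is to identify and stop the owning process, or simply to pick another port; the exact PID shown will depend on your system:

```bash
ss -ntlp | grep 8000   # shows the PID owning port 8000 (may require elevated permissions)
# kill <PID>           # stop that process, or...
# ...pass --port 8001 to the server command below to use a different port.
```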
```bash
python -m shortfin_apps.llm.server \
  --tokenizer_json=$TOKENIZER_PATH \
  --model_config=$OUTPUT_CONFIG_PATH \
  --vmfb=$VMFB_PATH \
  --parameters=$MODEL_PARAMS_PATH \
  --device=hip > shortfin_llm_server.log 2>&1 &
shortfin_process=$!
```
You can verify that the server launched successfully when you see logs like the following in `shortfin_llm_server.log`:

```bash
cat shortfin_llm_server.log
```

```text
[2024-10-24 15:40:27.440] [info] [on.py:62] Application startup complete.
[2024-10-24 15:40:27.444] [info] [server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
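Startup can take a little while since the model weights are large; if you would rather not watch the log, a simple polling loop against the health endpoint (introduced below) works too. The 120-second budget here is an arbitrary choice:

```bash
# Poll the health endpoint until the server responds (or give up after ~120s).
for i in $(seq 1 120); do
  curl -sf http://localhost:8000/health > /dev/null && echo "server is up" && break
  sleep 1
done
```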
We can now test our LLM server.

First, let's confirm that it is running:

```bash
curl -i http://localhost:8000/health

# HTTP/1.1 200 OK
# date: Thu, 19 Dec 2024 19:40:43 GMT
# server: uvicorn
# content-length: 0
```
Next, let's send a generation request:

```bash
curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Name the capital of the United States.",
    "sampling_params": {"max_completion_tokens": 50}
  }'
```
You can also send HTTP requests from Python like so:

```python
import requests

port = 8000  # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"


def generation_request():
    payload = {
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    }
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")


generation_request()
```
When done, you can stop the shortfin LLM server by killing the process:

```bash
kill -9 $shortfin_process
```

If you want to find the process again:

```bash
ps -f | grep shortfin
```
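If you prefer a gentler shutdown than `kill -9`, you can try SIGTERM first and only escalate if the process lingers; this is just one possible approach:

```bash
kill $shortfin_process                      # ask the server to shut down (SIGTERM)
sleep 2
kill -0 $shortfin_process 2>/dev/null \
  && kill -9 $shortfin_process              # force-kill only if it is still running
```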