Llama 8b GPU instructions on MI300X

Setup

We will use the llama_8b_f16 model as an example to describe the process of exporting a model for use in the shortfin LLM server with an MI300X GPU.

Prerequisites

  • Python >= 3.11 is recommended for this flow; a quick way to confirm your version is shown below.
    • You can check out pyenv as a good tool for managing multiple versions of Python on the same system.
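
As a quick sanity check (optional, and assuming python on your PATH is the interpreter you intend to use):

python --version  # should report Python 3.11 or newer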

Create virtual environment

To start, create a new virtual environment:

python -m venv --prompt shark-ai .venv
source .venv/bin/activate
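
If you want to confirm the virtual environment is active, the interpreter on your PATH should now resolve inside .venv (an optional check):

which python  # should print a path ending in .venv/bin/python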

Install stable shark-ai packages

First install a torch version that fulfills your needs:

# Fast installation of torch with just CPU support.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

For other options, see https://pytorch.org/get-started/locally/.

Next install shark-ai:

pip install shark-ai[apps]

Tip

To switch from the stable release channel to the nightly release channel, see nightly_releases.md.
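
To confirm the install succeeded, you can list the relevant packages (an optional check; the exact set of packages pulled in may vary between releases):

pip list | grep -E "shark-ai|sharktank|shortfin"  # should show the installed versions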

Define a directory for export files

Create a new directory to hold exported files such as model.mlir, model.vmfb, etc.

mkdir $PWD/export
export EXPORT_DIR=$PWD/export

Download llama3_8b_fp16.gguf

We will use the hf_datasets module in sharktank to download a Llama 3.1 8b f16 model.

python -m sharktank.utils.hf_datasets llama3_8B_fp16 --local-dir $EXPORT_DIR

Define environment variables

Define the following environment variables to make running this example a bit easier:

Model/Tokenizer vars

This example uses the meta-llama-3.1-8b-instruct.f16.gguf and tokenizer.json files that were downloaded in the previous step.

export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
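
Before moving on, you can verify that both paths point at real files (an optional check):

ls -lh "$MODEL_PARAMS_PATH" "$TOKENIZER_PATH"  # both files should be listed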

General env vars

The following env vars can be copy + pasted directly:

# Path to export model.mlir file
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
export BS=1,4
# NOTE: This is temporary, until multi-device is fixed
export ROCR_VISIBLE_DEVICES=1
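
Optionally, you can confirm which AMD GPUs are present on the system, assuming the ROCm tools are installed. Note that ROCR_VISIBLE_DEVICES only restricts which device the runtime uses; it does not change this listing:

rocm-smi  # lists the AMD GPUs on the system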

Export to MLIR

We will now use the sharktank.examples.export_paged_llm_v1 script to export our model to .mlir format.

python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=$MODEL_PARAMS_PATH \
  --output-mlir=$MLIR_PATH \
  --output-config=$OUTPUT_CONFIG_PATH \
  --bs=$BS
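
If the export succeeds, both output files should now exist in $EXPORT_DIR (a quick optional check):

ls -lh "$MLIR_PATH" "$OUTPUT_CONFIG_PATH"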

Compiling to .vmfb

Now that we have generated a model.mlir file, we can compile it to .vmfb format, which is required for running the shortfin LLM server.

We will use the iree-compile tool for compiling our model.

Compile for MI300

NOTE: This command is specific to MI300 GPUs. For other --iree-hip-target GPU options, look here

iree-compile $MLIR_PATH \
 --iree-hal-target-backends=rocm \
 --iree-hip-target=gfx942 \
 -o $VMFB_PATH

Running the shortfin LLM server

We should now have all of the files that we need to run the shortfin LLM server.

Verify that you have the following in your specified directory ($EXPORT_DIR):

ls $EXPORT_DIR
  • config.json
  • meta-llama-3.1-8b-instruct.f16.gguf
  • model.mlir
  • model.vmfb
  • tokenizer_config.json
  • tokenizer.json

Launch server

Run the shortfin server

Now that we are finished with setup, we can start the Shortfin LLM Server.

Run the following command to launch the Shortfin LLM Server in the background:

Note By default, the server will start at http://localhost:8000. You can specify the --host and/or --port arguments to run at a different address.

If you receive an error similar to the following:

[errno 98] address already in use

Then, you can confirm the port is in use with ss -ntl | grep 8000 and either kill the process running at that port, or start the shortfin server at a different port.

python -m shortfin_apps.llm.server \
   --tokenizer_json=$TOKENIZER_PATH \
   --model_config=$OUTPUT_CONFIG_PATH \
   --vmfb=$VMFB_PATH \
   --parameters=$MODEL_PARAMS_PATH \
   --device=hip > shortfin_llm_server.log 2>&1 &
shortfin_process=$!

You can verify the server launched successfully when you see the following logs in shortfin_llm_server.log:

cat shortfin_llm_server.log

Expected output

[2024-10-24 15:40:27.440] [info] [on.py:62] Application startup complete.
[2024-10-24 15:40:27.444] [info] [server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
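
Because the server runs in the background, it may take a few seconds before it starts accepting connections. A minimal way to wait for readiness (optional; assumes the default host and port):

# Poll the health endpoint until the server responds.
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 1
done
echo "shortfin server is ready"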

Test the server

We can now test our LLM server.

First let's confirm that it is running:

curl -i http://localhost:8000/health

# HTTP/1.1 200 OK
# date: Thu, 19 Dec 2024 19:40:43 GMT
# server: uvicorn
# content-length: 0

Next, let's send a generation request:

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'

Send requests from Python

You can also send HTTP requests from Python like so:

import requests

port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

generation_request()

Cleanup

When done, you can stop the shortfin_llm_server by killing the process:

kill -9 $shortfin_process

If you want to find the process again:

ps -f | grep shortfin
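
If you no longer have the $shortfin_process PID (for example, in a new shell), you can match on the module name instead. Be careful if other shortfin servers are running that you want to keep:

pkill -f shortfin_apps.llm.server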