Commit 52a2135

Replace ipex with ipex-llm (#10554)
* fix ipex with ipex_llm
* fix ipex with ipex_llm
* update
* update
* update
* update
* update
* update
* update
* update
1 parent 0a2e820 commit 52a2135

106 files changed (+127 -122 lines)


docker/llm/README.md

+1 -1

@@ -62,7 +62,7 @@ After the container is booted, you could get into the container through `docker
 docker exec -it my_container bash
 ```
 
-To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm#cpu-int4).
+To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#cpu-int4).
 
 
 #### Getting started with chat

docker/llm/finetune/qlora/cpu/kubernetes/Chart.yaml

+1 -1

@@ -1,5 +1,5 @@
 apiVersion: v2
-name: ipex-fintune-service
+name: ipex_llm-fintune-service
 description: A Helm chart for IPEX-LLM Finetune Service on Kubernetes
 type: application
 version: 1.1.27

docker/llm/serving/cpu/docker/README.md

+1 -1

@@ -30,7 +30,7 @@ sudo docker run -itd \
 
 After the container is booted, you could get into the container through `docker exec`.
 
-To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex/llm/serving).
+To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
 Also you can set environment variables and start arguments while running a container to get serving started initially. You may need to boot several containers to support. One controller container and at least one worker container are needed. The api server address(host and port) and controller address are set in controller container, and you need to set the same controller address as above, model path on your machine and worker address in worker container.
 
 To start a controller container:

docker/llm/serving/cpu/kubernetes/README.md

+1 -1

@@ -10,7 +10,7 @@ To deploy IPEX-LLM-serving cpu in Kubernetes environment, please use this image:
 
 In this document, we will use `vicuna-7b-v1.5` as the deployment model.
 
-After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.
+After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex-llm` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.
 
 You can download the model from [here](https://huggingface.co/lmsys/vicuna-7b-v1.5).
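Since the hunk above states that the `ipex-llm` backend is selected by a substring match on the model path, the requested rename is just a directory move. A minimal shell sketch, assuming the model was downloaded into the current working directory:

```bash
# Any model path containing "ipex-llm" selects the ipex-llm backend;
# otherwise the original transformer backend is used.
mv vicuna-7b-v1.5 vicuna-7b-v1.5-ipex-llm
```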

python/llm/example/CPU/Deepspeed-AutoTP/deepspeed_autotp.py

+1 -1

@@ -102,7 +102,7 @@
     # Batch tokenizing
     prompt = args.prompt
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'cpu:{local_rank}')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex-llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict,
                             use_cache=True)
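The comment changed in this hunk refers to the warm-up convention used throughout these examples: the first `generate` call pays one-time setup cost, so it is discarded and only a later call is timed. A minimal sketch of that pattern, assuming `model`, `input_ids`, and `args` are already prepared as in the script above:

```python
import time

import torch

with torch.inference_mode():
    # Warm-up run: the first generation includes one-time setup cost, so it is not timed.
    model.generate(input_ids, max_new_tokens=args.n_predict, use_cache=True)

    # Timed run: this measurement reflects steady-state inference latency.
    st = time.time()
    output = model.generate(input_ids, max_new_tokens=args.n_predict, use_cache=True)
    end = time.time()
    print(f"Inference time: {end - st:.2f} s")
```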

python/llm/example/CPU/LangChain/README.md

+8 -3

@@ -1,8 +1,8 @@
 ## Langchain Examples
 
-This folder contains examples showcasing how to use `langchain` with `ipex`.
+This folder contains examples showcasing how to use `langchain` with `ipex-llm`.
 
-### Install IPEX
+### Install-IPEX LLM
 
 Ensure `ipex-llm` is installed by following the [IPEX-LLM Installation Guide](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#install).
 
@@ -36,7 +36,7 @@ To run the example, execute the following command in the current directory:
 ```bash
 python transformers_int4/rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
 ```
-> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. if `-q` is not specified, `What is IPEX?` will be used by default.
+> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. if `-q` is not specified, `What is IPEX LLM?` will be used by default.
 
 
 ### Example: Math
@@ -66,3 +66,8 @@ python transformers_int4/voiceassistant.py -m <path_to_model> [-q <your_question
 - `-x MAX_NEW_TOKENS`: the max new tokens of model tokens input
 - `-l LANGUAGE`: you can specify a language such as "english" or "chinese"
 - `-d True|False`: whether the model path specified in -m is saved low bit model.
+
+### Legacy (Native INT4 examples)
+
+IPEX-LLM also provides langchain integrations using native INT4 mode. Those examples can be foud in [native_int4](./native_int4/) folder. For detailed instructions of settting up and running `native_int4` examples, refer to [Native INT4 Examples README](./README_nativeint4.md).
+
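As a usage note for the `rag.py` command shown above, a concrete invocation could look like the following; the model path is purely hypothetical, and `-q` falls back to `What is IPEX LLM?` when omitted, per the note in the diff:

```bash
# Hypothetical local checkpoint path; replace with a real model folder.
python transformers_int4/rag.py -m ./models/vicuna-7b-v1.5 -q "What is IPEX LLM?"
```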

python/llm/example/CPU/PyTorch-Models/Model/mixtral/generate.py

+1 -1

@@ -54,7 +54,7 @@
 with torch.inference_mode():
     prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex-llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/CPU/QLoRA-FineTuning/alpaca-qlora/README.md

+2 -2

@@ -28,7 +28,7 @@ Example usage:
 python ./alpaca_qlora_finetuning_cpu.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"
 ```
 
 **Note**: You could also specify `--base_model` to the local path of the huggingface model checkpoint folder and `--data_path` to the local path of the dataset JSON file.
@@ -109,7 +109,7 @@ def generate_and_tokenize_prompt(data_point):
 python ./quotes_qlora_finetuning_cpu.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "./english_quotes" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --prompt_template_name "english_quotes"
 ```

python/llm/example/CPU/QLoRA-FineTuning/alpaca-qlora/finetune_one_node_two_sockets.sh

+1 -1

@@ -14,5 +14,5 @@ mpirun -n 2 \
     --max_steps -1 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"

python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py

+1 -1

@@ -109,7 +109,7 @@ def get_int_from_env(env_keys, default):
 with torch.inference_mode():
     prompt = args.prompt
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'xpu:{local_rank}')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict,
                             use_cache=True)

python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/generate.py

+1 -1

@@ -64,7 +64,7 @@
 with torch.inference_mode():
     prompt = PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)
     st = time.time()

python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py

+1 -1

@@ -55,7 +55,7 @@
 with torch.inference_mode():
     prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py

+1 -1

@@ -61,7 +61,7 @@
 with torch.inference_mode():
     prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py

+1 -1

@@ -55,7 +55,7 @@
 with torch.inference_mode():
     prompt = BLUELM_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/generate.py

+1 -1

@@ -58,7 +58,7 @@
 with torch.inference_mode():
     prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/streamchat.py

+1 -1

@@ -54,7 +54,7 @@
 with torch.inference_mode():
     prompt = args.question
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=32)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py

+1 -1

@@ -58,7 +58,7 @@
 with torch.inference_mode():
     prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py

+1 -1

@@ -54,7 +54,7 @@
 with torch.inference_mode():
     prompt = args.question
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=32)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py

+1 -1

@@ -74,7 +74,7 @@ def get_prompt(message: str, chat_history: list[tuple[str, str]],
 with torch.inference_mode():
     prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py

+1 -1

@@ -58,7 +58,7 @@
     prompt = CODELLAMA_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py

+1 -1

@@ -58,7 +58,7 @@
     prompt = FALCON_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py

+1 -1

@@ -60,7 +60,7 @@
 with torch.inference_mode():
     prompt = FLAN_T5_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/generate.py

+1 -1

@@ -59,7 +59,7 @@
     chat[0]['content'] = args.prompt
     prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py

+1 -1

@@ -57,7 +57,7 @@
     prompt = GptJ_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py

+1 -1

@@ -57,7 +57,7 @@
 with torch.inference_mode():
     prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/generate.py

+1 -1

@@ -62,7 +62,7 @@
 with torch.inference_mode():
     prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py

+1 -1

@@ -70,7 +70,7 @@ def get_prompt(message: str, chat_history: list[tuple[str, str]],
 with torch.inference_mode():
     prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py

+1 -1

@@ -56,7 +56,7 @@
 with torch.inference_mode():
     prompt = MISTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py

+1 -1

@@ -56,7 +56,7 @@
 with torch.inference_mode():
     prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py

+1 -1

@@ -58,7 +58,7 @@
 with torch.inference_mode():
     prompt = MPT_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py

+1 -1

@@ -59,7 +59,7 @@
     prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict,
                             generation_config = generation_config)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py

+1 -1

@@ -60,7 +60,7 @@
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
     model.generation_config.pad_token_id = model.generation_config.eos_token_id
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict,
                             generation_config = generation_config)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/generate.py

+1 -1

@@ -61,7 +61,7 @@
     prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict,
                             generation_config = generation_config)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py

+1 -1

@@ -64,7 +64,7 @@
 with torch.inference_mode():
     prompt = QWEN_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/generate.py

+1 -1

@@ -56,7 +56,7 @@
     prompt = RedPajama_PROMPT_FORMAT.format(prompt=args.prompt)
     inputs = tokenizer(prompt, return_tensors='pt').to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(**inputs,
                             max_new_tokens=args.n_predict,
                             do_sample=True,

python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py

+1 -1

@@ -57,7 +57,7 @@
     prompt = REPLIT_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/generate.py

+1 -1

@@ -70,7 +70,7 @@ def generate_prompt(instruction):
 with torch.inference_mode():
     prompt = generate_prompt(instruction=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/generate.py

+1 -1

@@ -67,7 +67,7 @@ def generate_prompt(instruction):
 with torch.inference_mode():
     prompt = generate_prompt(instruction=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
    output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py

+1 -1

@@ -58,7 +58,7 @@
 with torch.inference_mode():
     prompt = SOLAR_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)

python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py

+1 -1

@@ -57,7 +57,7 @@
     prompt = StarCoder_PROMPT_FORMAT.format(prompt=args.prompt)
     input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-    # ipex model needs a warmup, then inference time can be accurate
+    # ipex_llm model needs a warmup, then inference time can be accurate
     output = model.generate(input_ids,
                             max_new_tokens=args.n_predict)
