
Commit 9b75806

Update Windows GPU quickstart regarding demo (#12124)
* use Qwen2-1.5B-Instruct in demo
* update
* add reference link
* update
* update
1 parent 17c23cd commit 9b75806

File tree

1 file changed: +42 −27 lines changed
docs/mddocs/Quickstart/install_windows_gpu.md

Lines changed: 42 additions & 27 deletions
@@ -123,21 +123,15 @@ To monitor your GPU's performance and status (e.g. memory consumption, utilizati
 
 ## A Quick Example
 
-Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM for this demonstration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?".
+Now let's play with a real LLM. We'll be using the [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model, a 1.5 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
 
 - Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment.
 
-- Step 2: Install additional package required for Qwen-1.8B-Chat to conduct:
-
-```cmd
-pip install tiktoken transformers_stream_generator einops
-```
-
-- Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements.
+- Step 2: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements.
 
 - For **loading model from Hugging Face**:
 
-Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model with IPEX-LLM optimizations.
+Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model with IPEX-LLM optimizations.
 
 ```python
 # Copy/Paste the contents to a new file demo.py
@@ -147,24 +141,34 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 generation_config = GenerationConfig(use_cache=True)
 
 print('Now start loading Tokenizer and optimizing Model...')
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                           trust_remote_code=True)
 
 # Load Model using ipex-llm and load it to GPU
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                              load_in_4bit=True,
                                              cpu_embedding=True,
                                              trust_remote_code=True)
 model = model.to('xpu')
 print('Successfully loaded Tokenizer and optimized Model!')
 
 # Format the prompt
+# you could tune the prompt based on your own model,
+# here the prompt tuning refers to https://huggingface.co/Qwen/Qwen2-1.5B-Instruct#quickstart
 question = "What is AI?"
-prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": question}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
 
 # Generate predicted tokens
 with torch.inference_mode():
-    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
+    input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')
 
     print('--------------------------------------Note-----------------------------------------')
     print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
@@ -185,7 +189,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
                             do_sample=False,
                             max_new_tokens=32,
                             generation_config=generation_config).cpu()
-    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+    output_str = tokenizer.decode(output[0], skip_special_tokens=False)
     print(output_str)
 ```
 - For **loading model from ModelScope**:
@@ -195,10 +199,9 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 pip install modelscope==1.11.0
 ```
 
-Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) model with IPEX-LLM optimizations.
+Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://www.modelscope.cn/models/qwen/Qwen2-1.5B-Instruct/summary) model with IPEX-LLM optimizations.
 
 ```python
-
 # Copy/Paste the contents to a new file demo.py
 import torch
 from ipex_llm.transformers import AutoModelForCausalLM
@@ -207,11 +210,11 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 generation_config = GenerationConfig(use_cache=True)
 
 print('Now start loading Tokenizer and optimizing Model...')
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                           trust_remote_code=True)
 
 # Load Model using ipex-llm and load it to GPU
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                              load_in_4bit=True,
                                              cpu_embedding=True,
                                              trust_remote_code=True,
@@ -220,13 +223,22 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 print('Successfully loaded Tokenizer and optimized Model!')
 
 # Format the prompt
+# you could tune the prompt based on your own model,
+# here the prompt tuning refers to https://huggingface.co/Qwen/Qwen2-1.5B-Instruct#quickstart
 question = "What is AI?"
-prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
-
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": question}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
 # Generate predicted tokens
 with torch.inference_mode():
-    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-
+    input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')
     print('--------------------------------------Note-----------------------------------------')
     print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
     print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
@@ -246,7 +258,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
                             do_sample=False,
                             max_new_tokens=32,
                             generation_config=generation_config).cpu()
-    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+    output_str = tokenizer.decode(output[0], skip_special_tokens=False)
     print(output_str)
 ```
 > **Note**:
@@ -257,7 +269,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 > When running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
 > This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
 
-- Step 4. Run `demo.py` within the activated Python environment using the following command:
+- Step 3. Run `demo.py` within the activated Python environment using the following command:
 
 ```cmd
 python demo.py
@@ -267,9 +279,12 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
 
 Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:
 ```
-user: What is AI?
-
-assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
+<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+What is AI?<|im_end|>
+<|im_start|>assistant
+Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and act like humans. It involves the development of algorithms,
 ```
 
 ## Tips & Troubleshooting
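
Since this commit switches to `skip_special_tokens=False`, the decoded output keeps the Qwen2 chat-template markers (`<|im_start|>`, `<|im_end|>`), which is why they appear in the example output above. If only the assistant's reply is wanted, one possible variation (not part of this commit) is to decode just the newly generated tokens and skip the special tokens; a minimal sketch, reusing the `output` and `input_ids` variables from the demo:

```python
# Hypothetical variation, not part of this commit: print only the assistant's reply.
generated = output[0][input_ids.shape[1]:]          # drop the prompt tokens from the generated sequence
answer = tokenizer.decode(generated, skip_special_tokens=True)  # strip <|im_start|>/<|im_end|> markers
print(answer)
```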

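For reference, below is a minimal sketch of what the updated `demo.py` (Hugging Face loading path) looks like when the hunks above are assembled. The `from transformers import ...` line and the opening of the `model.generate(...)` call are not visible in this diff view, so they are reconstructed here as assumptions, and lines elided from the view are marked with a comment:

```python
# Minimal sketch assembled from the diff hunks above (Hugging Face loading path).
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig  # assumed; this import is not shown in the diff

generation_config = GenerationConfig(use_cache=True)

print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                          trust_remote_code=True)

# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                             load_in_4bit=True,
                                             cpu_embedding=True,
                                             trust_remote_code=True)
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')

# Format the prompt with the Qwen2 chat template
# (see https://huggingface.co/Qwen/Qwen2-1.5B-Instruct#quickstart)
question = "What is AI?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')
    # ... note prints and other lines elided in this diff view ...
    # Assumed: the opening of the generate call is not shown in the diff.
    output = model.generate(input_ids,
                            do_sample=False,
                            max_new_tokens=32,
                            generation_config=generation_config).cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=False)
    print(output_str)
```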