Commit 4cb9a47

Add index page for API doc & links update in mddocs (#11393)
* Small fixes
* Add initial api doc index
* Change index.md -> README.md
* Fix on API links
1 parent b200e11 commit 4cb9a47

8 files changed: +104 -88 lines changed

docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md

+2 -2
@@ -34,7 +34,7 @@ You could choose to use [PyTorch API](./optimize_model.md) or [`transformers`-st
 >
 > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `optimize_model` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
 >
-> See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for ``optimize_model`` to find more information.
+> See the [API doc](../../PythonAPI/optimize.md) for ``optimize_model`` to find more information.
 
 Especially, if you have saved the optimized model following setps [here](./optimize_model.md#save), the loading process on Intel GPUs maybe as follows:

@@ -70,7 +70,7 @@ You could choose to use [PyTorch API](./optimize_model.md) or [`transformers`-st
 >
 > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
 >
-> See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html) to find more information.
+> See the [API doc](../../PythonAPI/transformers.md) to find more information.
 
 Especially, if you have saved the optimized model following setps [here](./hugging_face_format.md#save--load), the loading process on Intel GPUs maybe as follows:

docs/mddocs/Overview/KeyFeatures/optimize_model.md

+1 -1
@@ -61,6 +61,6 @@ model = load_low_bit(model, saved_dir) # Load the optimized model
 
 
 > [!NOTE]
-> - Please refer to the [API documentation](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for more details.
+> - Please refer to the [API documentation](../../PythonAPI/optimize.md) for more details.
 > - We also provide detailed examples on how to run PyTorch models (e.g., Openai Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using IPEX-LLM. See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models)
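
The three link changes in the hunks above appear to follow one pattern: a hosted `.../doc/PythonAPI/LLM/<page>.html` URL becomes a relative link to `docs/mddocs/PythonAPI/<page>.md`. Below is a minimal sketch of that mapping, assuming the pattern holds for every API link. The helper names `relative_api_link` and `resolve_link` are hypothetical and are not part of IPEX-LLM or of this commit.

```python
import posixpath

# Prefix of the old hosted API pages, as seen in the removed lines.
OLD_PREFIX = "https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/"
# Directory the commit adds the Markdown API pages to.
NEW_API_DIR = "docs/mddocs/PythonAPI"


def relative_api_link(old_url: str, source_file: str) -> str:
    """Map an old hosted API URL to a link relative to source_file (hypothetical helper)."""
    if not old_url.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected URL: {old_url}")
    page = old_url[len(OLD_PREFIX):]            # e.g. "optimize.html"
    name = posixpath.splitext(page)[0]          # e.g. "optimize"
    target = posixpath.join(NEW_API_DIR, name + ".md")
    return posixpath.relpath(target, start=posixpath.dirname(source_file))


def resolve_link(source_file: str, link: str) -> str:
    """Resolve a relative Markdown link to a repository path (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(source_file), link))


if __name__ == "__main__":
    src = "docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md"
    link = relative_api_link(OLD_PREFIX + "optimize.html", src)
    print(link)
    print(resolve_link(src, link))
```

Resolving each rewritten link back to a repository path is a quick way to check that it points at a file this commit actually adds.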

docs/mddocs/PythonAPI/PyTorch-API.md

-85
This file was deleted.

docs/mddocs/PythonAPI/README.md

+22
@@ -0,0 +1,22 @@
+# IPEX-LLM API
+
+- [IPEX-LLM `transformers`-style API](./transformers.md)
+
+  - [Hugging Face `transformers` AutoModel](./transformers.md#hugging-face-transformers-automodel)
+
+    - AutoModelForCausalLM
+    - AutoModel
+    - AutoModelForSpeechSeq2Seq
+    - AutoModelForSeq2SeqLM
+    - AutoModelForSequenceClassification
+    - AutoModelForMaskedLM
+    - AutoModelForQuestionAnswering
+    - AutoModelForNextSentencePrediction
+    - AutoModelForMultipleChoice
+    - AutoModelForTokenClassification
+
+- [IPEX-LLM PyTorch API](./optimize.md)
+
+  - [Optimize Model](./optimize.md#optimize-model)
+
+  - [Load Optimized Model](./optimize.md#load-optimized-model)
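
The index links to headings through fragments such as `#optimize-model` and `#hugging-face-transformers-automodel`. Below is a rough sketch of how such fragments can be derived from heading text, which is useful for checking that each index entry matches a heading in the target file. It only approximates GitHub's anchor generation (lowercase, drop punctuation, turn spaces into hyphens) and is not GitHub's actual implementation. The function name `github_anchor` is hypothetical.

```python
import re


def github_anchor(heading: str) -> str:
    """Approximate the anchor GitHub derives from a Markdown heading (hypothetical helper)."""
    text = heading.strip().lower()
    # Drop everything except word characters, hyphens and spaces
    # (this removes backticks, '&', parentheses, and similar punctuation).
    text = re.sub(r"[^\w\- ]", "", text)
    # Each remaining space becomes one hyphen, so "a & b" yields "a--b".
    return text.replace(" ", "-")


if __name__ == "__main__":
    for heading in (
        "Hugging Face `transformers` AutoModel",
        "Optimize Model",
        "Load Optimized Model",
    ):
        print(f"#{github_anchor(heading)}")
```

Under the same rule, a heading written as `Save & Load` (an assumed heading text, not shown in this diff) would yield the `save--load` fragment seen in the `hugging_face_format.md` link earlier in the commit.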

docs/mddocs/PythonAPI/optimize.md

+79
@@ -0,0 +1,79 @@
+# IPEX-LLM PyTorch API
+
+## Optimize Model
+You can run any PyTorch model with `optimize_model` through only one-line code change to benefit from IPEX-LLM optimization, regardless of the library or API you are using.
+
+### `ipex_llm.optimize_model`_`(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_convert=None, cpu_embedding=False, lightweight_bmm=False, **kwargs)`_
+
+A method to optimize any pytorch model.
+
+- **Parameters**:
+
+  - **model**: The original PyTorch model (nn.module)
+
+  - **low_bit**: str value, options are `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'`, `'sym_int8'`, `'nf3'`, `'nf4'`, `'fp4'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'`, `'fp16'` or `'bf16'`, `'sym_int4'` means symmetric int 4, `'asym_int4'` means asymmetric int 4, `'nf4'` means 4-bit NormalFloat, etc. Relevant low bit optimizations will be applied to the model.
+
+  - **optimize_llm**: Whether to further optimize llm model. Default to be `True`.
+
+  - **modules_to_not_convert**: list of str value, modules (`nn.Module`) that are skipped when conducting model optimizations. Default to be `None`.
+
+  - **cpu_embedding**: Whether to replace the Embedding layer, may need to set it to `True` when running IPEX-LLM on GPU. Default to be `False`.
+
+  - **lightweight_bmm**: Whether to replace the `torch.bmm` ops, may need to set it to `True` when running IPEX-LLM on GPU on Windows. Default to be `False`.
+
+- **Returns**: The optimized model.
+
+- **Example**:
+
+```python
+# Take OpenAI Whisper model as an example
+from ipex_llm import optimize_model
+model = whisper.load_model('tiny') # Load whisper model under pytorch framework
+model = optimize_model(model) # With only one line code change
+# Use the optimized model without other API change
+result = model.transcribe(audio, verbose=True, language="English")
+# (Optional) you can also save the optimized model by calling 'save_low_bit'
+model.save_low_bit(saved_dir)
+```
+
+## Load Optimized Model
+
+To avoid high resource consumption during the loading processes of the original model, we provide save/load API to support the saving of model after low-bit optimization and the loading of the saved low-bit model. Saving and loading operations are platform-independent, regardless of their operating systems.
+
+### `ipex_llm.optimize.load_low_bit`_`(model, model_path)`_
+
+Load the optimized pytorch model.
+
+- **Parameters**:
+
+  - **model**: The PyTorch model instance.
+
+  - **model_path**: The path of saved optimized model.
+
+
+- **Returns**: The optimized model.
+
+- **Example**:
+
+```python
+# Example 1:
+# Take ChatGLM2-6B model as an example
+# Make sure you have saved the optimized model by calling 'save_low_bit'
+from ipex_llm.optimize import low_memory_init, load_low_bit
+with low_memory_init(): # Fast and low cost by loading model on meta device
+    model = AutoModel.from_pretrained(saved_dir,
+                                      torch_dtype="auto",
+                                      trust_remote_code=True)
+model = load_low_bit(model, saved_dir) # Load the optimized model
+```
+
+```python
+# Example 2:
+# If the model doesn't fit 'low_memory_init' method,
+# alternatively, you can obtain the model instance through traditional loading method.
+# Take OpenAI Whisper model as an example
+# Make sure you have saved the optimized model by calling 'save_low_bit'
+from ipex_llm.optimize import load_low_bit
+model = whisper.load_model('tiny') # A model instance through traditional loading method
+model = load_low_bit(model, saved_dir) # Load the optimized model
+```
File renamed without changes.
