# IPEX-LLM PyTorch API

## Optimize Model
You can run any PyTorch model with `optimize_model` through only a one-line code change to benefit from IPEX-LLM optimization, regardless of the library or API you are using.

### `ipex_llm.optimize_model`_`(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_convert=None, cpu_embedding=False, lightweight_bmm=False, **kwargs)`_

A method to optimize any PyTorch model.

- **Parameters**:

  - **model**: The original PyTorch model (`nn.Module`).

  - **low_bit**: str value, options are `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'`, `'sym_int8'`, `'nf3'`, `'nf4'`, `'fp4'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'`, `'fp16'` or `'bf16'`. `'sym_int4'` means symmetric int 4, `'asym_int4'` means asymmetric int 4, `'nf4'` means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.

  - **optimize_llm**: Whether to further optimize the LLM model. Defaults to `True`.

  - **modules_to_not_convert**: list of str values; modules (`nn.Module`) that are skipped when conducting model optimizations. Defaults to `None`.

  - **cpu_embedding**: Whether to replace the Embedding layer; this may need to be set to `True` when running IPEX-LLM on GPU. Defaults to `False`.

  - **lightweight_bmm**: Whether to replace the `torch.bmm` ops; this may need to be set to `True` when running IPEX-LLM on GPU on Windows. Defaults to `False`.

- **Returns**: The optimized model.

- **Example**:

  ```python
  # Take the OpenAI Whisper model as an example
  import whisper
  from ipex_llm import optimize_model

  model = whisper.load_model('tiny')  # Load the whisper model under the PyTorch framework
  model = optimize_model(model)  # With only one line of code change
  # Use the optimized model without any other API change
  result = model.transcribe(audio, verbose=True, language="English")
  # (Optional) you can also save the optimized model by calling 'save_low_bit'
  model.save_low_bit(saved_dir)
  ```
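
  The parameters above can be combined to control how the optimization is applied. Below is a minimal sketch assuming a Hugging Face `transformers` model; the model id and the `'lm_head'` module name are illustrative assumptions, not requirements of the API:

  ```python
  import torch
  from transformers import AutoModelForCausalLM
  from ipex_llm import optimize_model

  # Hypothetical model id; any PyTorch nn.Module can be passed to optimize_model
  model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                               torch_dtype=torch.float16)
  # Quantize with symmetric int8 and skip converting the (assumed) 'lm_head' module
  model = optimize_model(model,
                         low_bit='sym_int8',
                         modules_to_not_convert=['lm_head'])
  ```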

## Load Optimized Model

To avoid the high resource consumption of loading the original model, we provide a save/load API that supports saving a model after low-bit optimization and loading the saved low-bit model. Saving and loading are platform-independent and work across operating systems.
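
For context, the save side of this workflow is a single `save_low_bit` call on the optimized model. A minimal sketch, with `saved_dir` as a placeholder path and the Whisper model reused from the section above:

```python
import whisper
from ipex_llm import optimize_model

saved_dir = './whisper-tiny-low-bit'  # placeholder directory for the low-bit checkpoint

model = whisper.load_model('tiny')
model = optimize_model(model)   # apply low-bit optimization
model.save_low_bit(saved_dir)   # persist the low-bit model for later loading
```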

### `ipex_llm.optimize.load_low_bit`_`(model, model_path)`_

Load the optimized PyTorch model.

- **Parameters**:

  - **model**: The PyTorch model instance.

  - **model_path**: The path to the saved optimized model.

- **Returns**: The optimized model.

- **Example**:

  ```python
  # Example 1:
  # Take the ChatGLM2-6B model as an example
  # Make sure you have saved the optimized model by calling 'save_low_bit'
  from transformers import AutoModel
  from ipex_llm.optimize import low_memory_init, load_low_bit

  with low_memory_init():  # Fast and low-cost loading of the model on the meta device
      model = AutoModel.from_pretrained(saved_dir,
                                        torch_dtype="auto",
                                        trust_remote_code=True)
  model = load_low_bit(model, saved_dir)  # Load the optimized model
  ```

  ```python
  # Example 2:
  # If the model doesn't fit the 'low_memory_init' method,
  # you can alternatively obtain the model instance through the traditional loading method.
  # Take the OpenAI Whisper model as an example
  # Make sure you have saved the optimized model by calling 'save_low_bit'
  import whisper
  from ipex_llm.optimize import load_low_bit

  model = whisper.load_model('tiny')  # A model instance obtained through the traditional loading method
  model = load_low_bit(model, saved_dir)  # Load the optimized model
  ```
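
  In either case, the reloaded low-bit model keeps the original model's API. A minimal sketch continuing the Whisper example above (`"audio.wav"` is a placeholder path, not part of the `load_low_bit` API):

  ```python
  # Run inference with the reloaded low-bit model, exactly as with the original model
  result = model.transcribe("audio.wav", verbose=True, language="English")
  print(result["text"])
  ```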