From 0d22665f6c0f632323f3e32f438ac1b31a67e4b6 Mon Sep 17 00:00:00 2001 From: Ekaterina Aidova Date: Thu, 28 Dec 2023 21:40:34 +0400 Subject: [PATCH] update stable zephyr notebook on the latest stateful model support (#1544) * update stable zephyr notebook on the latest stateful model support * Update 273-stable-zephyr-3b-chatbot.ipynb * update outputs --- README.md | 1 + .../273-stable-zephyr-3b-chatbot.ipynb | 220 ++++++++++++------ .../273-stable-zephyr-3b-chatbot/README.md | 2 +- 3 files changed, 156 insertions(+), 67 deletions(-) diff --git a/README.md b/README.md index 6eeb9282278..dddfcd1e25c 100644 --- a/README.md +++ b/README.md @@ -222,6 +222,7 @@ Demos that demonstrate inference on a particular model. | [270-sound-generation-audioldm2](notebooks/270-sound-generation-audioldm2/)
| Sound Generation with AudioLDM2 and OpenVINO™ | | | [271-sdxl-turbo](notebooks/271-sdxl-turbo/)
| Single-step image generation using SDXL-turbo and OpenVINO | | | [272-paint-by-example](notebooks/272-paint-by-example/)
| Exemplar based image editing using diffusion models, [Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and OpenVINO™ | ui_example | +| [273-stable-zephyr-3b-chatbot](notebooks/273-stable-zephyr-3b-chatbot)
| Use Stable-Zephyr as a chatbot assistant with OpenVINO | | | [274-efficient-sam](notebooks/274-efficient-sam/)<br>
| Object segmentation with EfficientSAM and OpenVINO™ | | | [275-llm-question-answering](notebooks/275-llm-question-answering)
| LLM Instruction following pipeline | |

diff --git a/notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb b/notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb
index 96270e1a7ba..0eeaa84dd23 100644
--- a/notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb
+++ b/notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb
@@ -13,7 +13,7 @@
    "\n",
    "`Stable Zephyr 3B` is a 3 billion parameter model that demonstrated outstanding results on many LLM evaluation benchmarks, outperforming many popular models despite its relatively small size. Inspired by [HuggingFaceH4's Zephyr 7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) training pipeline, this model was trained on a mix of publicly available and synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290), and evaluated with [MT Bench](https://tatsu-lab.github.io/alpaca_eval/) and the [Alpaca Benchmark](https://tatsu-lab.github.io/alpaca_eval/). More details about the model can be found in the [model card](https://huggingface.co/stabilityai/stablelm-zephyr-3b).\n",
    "\n",
-    "In this tutorial, we consider how to optimize and run this model using the OpenVINO toolkit. For the convenience of the conversion step and model performance evaluation, we will use [llm_bench](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python) tool, which provides a unified approach to estimate performance for LLM. It is based on pipelines provided by Optimum-Intel and allows to estimate performance for Pytorch and OpenVINO models using almost the same code. We also demonstrate how to apply BetterTransformer optimization and making model stateful, that provides opportunity for processing model cache state."
+    "In this tutorial, we consider how to optimize and run this model using the OpenVINO toolkit. For the convenience of the conversion step and model performance evaluation, we will use the [llm_bench](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python) tool, which provides a unified approach to estimating LLM performance. It is based on pipelines provided by Optimum-Intel and allows estimating performance for PyTorch and OpenVINO models using almost the same code. We also demonstrate how to make the model stateful, which enables handling the model cache state inside the model."
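As a rough illustration of where this flow ends up (a sketch, not part of this patch): the snippet below assumes the stateful IR directory produced by the conversion cells later in this notebook and the same `stateful` loading flag used there; everything else is standard Optimum-Intel/Transformers API.

```python
# A minimal sketch, assuming the stateful IR has already been produced by
# convert.py as shown later in this notebook (the directory name below follows
# that layout and is an assumption, not something this patch guarantees).
from pathlib import Path

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = Path("stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT")

# stateful=True keeps the key/value cache inside the compiled model as internal
# state instead of passing past_key_values tensors in and out on every step.
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    stateful=True,
)
tok = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

inputs = tok("Tell me story about cats", return_tensors="pt")
out = ov_model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```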
]
  },
  {
@@ -28,7 +28,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "dd1d538e-c22b-4a6e-9f5b-4462b1a2a43f",
    "metadata": {},
    "outputs": [],
@@ -41,9 +41,6 @@
    "\n",
    "if not genai_llm_bench.exists():\n",
    "    !git clone https://github.com/openvinotoolkit/openvino.genai.git\n",
-    "    %cd openvino.genai\n",
-    "    !git checkout e5d7861\n",
-    "    %cd ..\n",
    "\n",
    "sys.path.append(str(genai_llm_bench))"
   ]
@@ -55,9 +52,10 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "%pip uninstall -q -y optimum-intel optimum\n",
    "%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu -r ./openvino.genai/llm_bench/python/requirements.txt\n",
-    "%pip uninstall -q -y openvino-dev openvino openvino-nightly\n",
-    "%pip install openvino-nightly"
+    "%pip uninstall -q -y openvino openvino-dev openvino-nightly\n",
+    "%pip install -q openvino-nightly"
   ]
  },
  {
@@ -67,7 +65,7 @@
   "source": [
    "## Convert model to OpenVINO Intermediate Representation (IR) and compress model weights to INT4 using NNCF\n",
    "\n",
-    "llm_bench provides conversion script for converting LLMS into OpenVINO IR format compatible with Optimum-Intel. It also allows to compress model weights into INT8 or INT4 precision with [NNCF](https://github.com/openvinotoolkit/nncf). For enabling weights compression in INT4 we should use `--compress_weights 4BIT_DEFAULT` argument. The Weights Compression algorithm is aimed at compressing the weights of the models and can be used to optimize the model footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLM). Compared to INT8 compression, INT4 compression improves performance even more but introduces a minor drop in prediction quality. Additionally, it support way to optimize models using [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer) library, a fast path of standard PyTorch Transformer APIs to benefit from interesting speedups on CPU & GPU through sparsity and fused kernels as Flash Attention using `--bettertransformer` flag."
+    "llm_bench provides a conversion script for converting LLMs into the OpenVINO IR format compatible with Optimum-Intel. It also allows compressing model weights to INT8 or INT4 precision with [NNCF](https://github.com/openvinotoolkit/nncf). To enable INT4 weight compression, use the `--compress_weights 4BIT_DEFAULT` argument. The weight compression algorithm is aimed at compressing the weights of a model and can be used to optimize the footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLMs). Compared to INT8 compression, INT4 compression improves performance even more but introduces a minor drop in prediction quality."
   ]
  },
  {
@@ -75,14 +73,58 @@
   "execution_count": 3,
   "id": "aeacd53b-755a-464f-b984-d88c3625d687",
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n",
+      "/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080).
Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n", + " return torch._C._cuda_getDeviceCount() > 0\n", + "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n", + "[ INFO ] openvino runtime version: 2024.0.0-13826-b51c5c0a997\n", + "model.safetensors: 100%|███████████████████| 5.59G/5.59G [04:19<00:00, 21.6MB/s]\n", + "generation_config.json: 100%|██████████████████| 111/111 [00:00<00:00, 13.6kB/s]\n", + "tokenizer_config.json: 100%|████████████████| 5.21k/5.21k [00:00<00:00, 839kB/s]\n", + "tokenizer.json: 100%|██████████████████████| 2.11M/2.11M [00:01<00:00, 2.10MB/s]\n", + "special_tokens_map.json: 100%|██████████████████| 587/587 [00:00<00:00, 429kB/s]\n", + "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", + "Using the export variant default. Available variants are:\n", + " - default: The default ONNX variant.\n", + "Using framework PyTorch: 2.1.2+cu121\n", + "Overriding 1 configuration item(s)\n", + "\t- use_cache -> True\n", + "/home/ea/.cache/huggingface/modules/transformers_modules/stabilityai/stable-zephyr-3b/9974c58a0ec4be4cd6f55e814a2a93b9cf163823/modeling_stablelm_epoch.py:106: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if seq_len > self.max_seq_len_cached:\n", + "/home/ea/.cache/huggingface/modules/transformers_modules/stabilityai/stable-zephyr-3b/9974c58a0ec4be4cd6f55e814a2a93b9cf163823/modeling_stablelm_epoch.py:236: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):\n", + "/home/ea/.cache/huggingface/modules/transformers_modules/stabilityai/stable-zephyr-3b/9974c58a0ec4be4cd6f55e814a2a93b9cf163823/modeling_stablelm_epoch.py:243: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):\n", + "/home/ea/.cache/huggingface/modules/transformers_modules/stabilityai/stable-zephyr-3b/9974c58a0ec4be4cd6f55e814a2a93b9cf163823/modeling_stablelm_epoch.py:253: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n",
+      "  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):\n",
+      "[ INFO ] Compress model weights to 4BIT_DEFAULT\n",
+      "[ INFO ] Compression options:\n",
+      "[ INFO ] {'mode': , 'group_size': 128}\n",
+      "INFO:nncf:Statistics of the bitwidth distribution:\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "| Num bits (N) | % all parameters (layers) | % ratio-defining parameters       |\n",
+      "|              |                           | (layers)                          |\n",
+      "+==============+===========================+===================================+\n",
+      "| 8            | 9% (2 / 226)              | 0% (0 / 224)                      |\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "| 4            | 91% (224 / 226)           | 100% (224 / 224)                  |\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "\u001b[2KApplying Weight Compression \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[35m100%\u001b[0m \u001b[38;2;0;104;181m226/226\u001b[0m • \u001b[38;2;0;104;181m0:02:36\u001b[0m • \u001b[38;2;0;104;181m0:00:00\u001b[0m\n",
+      "\u001b[?25h"
+     ]
+    }
+   ],
   "source": [
    "model_path = Path(\"stable-zephyr-3b/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT\") \n",
    "\n",
    "convert_script = genai_llm_bench / \"convert.py\"\n",
    "\n",
-    "if not (model_path / \"openvino_model.xml\").exists():\n",
-    "    !python $convert_script --model_id stabilityai/stable-zephyr-3b --precision FP16 --compress_weights 4BIT_DEFAULT --bettertransformer --output stable-zephyr-3b"
+    "!python $convert_script --model_id stabilityai/stable-zephyr-3b --precision FP16 --compress_weights 4BIT_DEFAULT --output stable-zephyr-3b --force_convert"
   ]
  },
  {
@@ -106,40 +148,37 @@
    "output_type": "stream",
    "text": [
     "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n",
+      "/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
(Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n", + " return torch._C._cuda_getDeviceCount() > 0\n", "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n", - "[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_name: stable-zephyr-3b\n", + "[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: stable-zephyr-3b\n", "[ INFO ] ov_config={'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'}\n", "OPENVINO_TORCH_BACKEND_DEVICE=CPU\n", - "[ INFO ] model_path=stable-zephyr-3b/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version:2023.3.0-13522-0a7d1d770ff\n", + "[ INFO ] model_path=stable-zephyr-3b/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version: 2024.0.0-13826-b51c5c0a997\n", "Compiling the model to CPU ...\n", - "[ INFO ] From pretrained time: 5.11s\n", + "[ INFO ] From pretrained time: 5.89s\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", "[ INFO ] num_iters=0, num_text_list=1\n", "[ INFO ] input_text=Tell me story about cats\n", "[ INFO ] Input token size:5, max_output_token_size:512\n", "Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.\n", - "/home/ea/work/openvino_notebooks/notebooks/273-stable-zephyr-3b-chatbot/openvino.genai/llm_bench/python/utils/ov_model_classes.py:998: FutureWarning: `shared_memory` is deprecated and will be removed in 2024.0. Value of `shared_memory` is going to override `share_inputs` value. Please use only `share_inputs` explicitly.\n", - " self.request.start_async(inputs, shared_memory=True)\n", "[ INFO ] [warm-up] Input token size: 5\n", - "[ INFO ] [warm-up] Output size: 512\n", + "[ INFO ] [warm-up] Output size: 290\n", "[ INFO ] [warm-up] Infer count: 512\n", - "[ INFO ] [warm-up] Generation Time: 33.60s\n", - "[ INFO ] [warm-up] Latency: 65.62 ms/token\n", + "[ INFO ] [warm-up] Tokenization Time: 2.29ms\n", + "[ INFO ] [warm-up] Detokenization Time: 0.50ms\n", + "[ INFO ] [warm-up] Generation Time: 19.75s\n", + "[ INFO ] [warm-up] Latency: 68.09 ms/token\n", "[ INFO ] [warm-up] Generated:\n", - "Tell me story about cats and dogs\n", - "Once upon a time, in a small village, there lived a kind and gentle old lady named Grandma.\n", - "Grandma had a small garden behind her house, where she grew beautiful flowers and vegetables. She loved to spend her days in the garden, tending to her plants and enjoying the peaceful surroundings.\n", - "One day, a curious little kitten wandered into the garden. It was playful and curious, and it seemed to be friendly towards Grandma. The kitten was so cute that Grandma couldn't resist it and decided to keep it. From that day on, the garden was filled with the joyful sounds of the kitten playing and exploring.\n", - "One day, while Grandma was watering the plants, she heard a loud barking coming from the other side of the garden. It was a big, friendly dog who had wandered into the village. The dog was looking for someone to help him find his owner, as he had been separated from his family during a move.\n", - "Grandma felt sorry for the dog and decided to take care of him. She gave him some food and water and made him a cozy bed in the shed. The dog was grateful for the kind treatment and soon became a regular visitor in the garden.\n", - "The kitten and the dog soon became good friends, playing and exploring together. They would often take long walks through the village, looking for new adventures to share. 
The kitten would chase after the dog, and the dog would bring the kitten back little treasures he found along the way.\n",
-      "One day, while they were walking, they heard a loud noise coming from the other side of the garden. It was a group of stray cats who had come to visit the village. The cats were sneaky and mischievous, and they loved to play tricks on the villagers.\n",
-      "The kitten and the dog were both scared at first, but they soon realized that the cats were not a threat to them. The kitten and the dog decided to make friends with the cats and show them that they were friendly too.\n",
-      "The cats were amazed by the kitten's playful nature and the dog's friendly demeanor. They soon became good friends with the kitten and the dog, and the garden became a lively and colorful place once again.\n",
-      "From that day on, the kitten, the dog, and the cats would often come to visit Grandma in the garden, bringing with them new adventures and memories to cherish. And Grandma would always be happy to see\n",
-      "[ INFO ] [warm-up] Result MD5:['ff3143d4365f5fd1a308832647e3e565']\n",
-      "[ INFO ] [warm-up] First token latency: 821.26 ms/token, other tokens latency: 64.11 ms/token, len of tokens: 512\n",
-      "[ INFO ] [warm-up] First token infer latency: 820.47 ms/token, other tokens infer latency: 63.59 ms/token, len of tokens: 512\n"
+      "Tell me story about cats and dogs.\n",
+      "Once upon a time, in a small village, there lived a young girl named Lily. She had two pets, a cat named Mittens and a dog named Max. Mittens was very playful and loved to chase Max around the house. Max, on the other hand, was a bit timid and would often hide when Mittens came around.\n",
+      "One day, Mittens and Max were playing together in the backyard when a loud thunderstorm came suddenly. Mittens, being afraid of the thunder, ran inside the house, leaving Max behind. The rain was coming down hard, and Max was struggling to find his way back inside.\n",
+      "Lily, who was watching the storm from her bedroom, heard Max's cries and knew she had to help him. She ran down the stairs and found Max standing in the rain, looking lost. Lily knew just what to do. She picked up Max and carried him back to the house, where Mittens was waiting.\n",
+      "Mittens was relieved to see Max safe and sound, and the two of them snuggled up together on the couch for the rest of the storm. From that day on, Max was no longer afraid of Mittens, and the three of them became closer than ever before.\n",
+      "And that, my dear friends, is the story of Mittens, Max, and Lily, and how they overcame their fears and became a true family.<|endoftext|>\n",
+      "[ INFO ] [warm-up] Result MD5:['f5575487f181d7de8e4c095b39fa4180']\n",
+      "[ INFO ] [warm-up] First token latency: 1030.65 ms/token, other tokens latency: 64.68 ms/token, len of tokens: 290\n",
+      "[ INFO ] [warm-up] First token infer latency: 1021.36 ms/token, other tokens infer latency: 64.10 ms/token, len of tokens: 290\n"
     ]
    }
   ],
@@ -160,7 +199,7 @@
    "\n",
    "With increasing model size, as in modern LLMs, the number of attention blocks and the size of the past key/value tensors grow accordingly. The strategy for handling cache state as model inputs and outputs in the inference cycle may become a bottleneck for memory-bounded systems, especially when processing long input sequences, for example in a chatbot scenario. OpenVINO suggests a transformation that removes inputs and corresponding outputs with cache tensors from the model, keeping the cache handling logic inside the model.
Hiding the cache enables storing and updating the cache values in a more device-friendly representation. It helps to reduce memory consumption and additionally optimize model performance.\n",
    "\n",
-    "You can estimate the model performance by adding stateful transformation using `--make_stateful` flag."
+    "You can estimate the model performance with the stateful transformation applied by using the `--stateful` flag at the conversion step."
   ]
  },
  {
@@ -174,43 +213,96 @@
    "output_type": "stream",
    "text": [
     "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n",
+      "/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n",
+      "  return torch._C._cuda_getDeviceCount() > 0\n",
+      "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n",
+      "[ INFO ] openvino runtime version: 2024.0.0-13826-b51c5c0a997\n",
+      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
+      "Using the export variant default. Available variants are:\n",
+      "    - default: The default ONNX variant.\n",
+      "Using framework PyTorch: 2.1.2+cu121\n",
+      "The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.\n",
+      "Overriding 1 configuration item(s)\n",
+      "\t- use_cache -> True\n",
+      "/home/ea/work/openvino_notebooks/notebooks/273-stable-zephyr-3b-chatbot/openvino.genai/llm_bench/python/utils/conversion_utils/better_transformer_patch.py:289: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
+      "  if attention_mask.size(0) > 1:\n",
+      "/home/ea/work/openvino_notebooks/notebooks/273-stable-zephyr-3b-chatbot/openvino.genai/llm_bench/python/utils/conversion_utils/better_transformer_patch.py:290: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
+      "  if input_shape[-1] > 1:\n",
+      "/home/ea/.cache/huggingface/modules/transformers_modules/stabilityai/stable-zephyr-3b/9974c58a0ec4be4cd6f55e814a2a93b9cf163823/modeling_stablelm_epoch.py:106: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future.
This means that the trace might not generalize to other inputs!\n",
+      "  if seq_len > self.max_seq_len_cached:\n",
+      "/home/ea/work/openvino_notebooks/notebooks/273-stable-zephyr-3b-chatbot/openvino.genai/llm_bench/python/utils/conversion_utils/better_transformer_patch.py:380: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
+      "  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):\n",
+      "[ INFO ] Compress model weights to 4BIT_DEFAULT\n",
+      "[ INFO ] Compression options:\n",
+      "[ INFO ] {'mode': , 'group_size': 128}\n",
+      "INFO:nncf:Statistics of the bitwidth distribution:\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "| Num bits (N) | % all parameters (layers) | % ratio-defining parameters       |\n",
+      "|              |                           | (layers)                          |\n",
+      "+==============+===========================+===================================+\n",
+      "| 8            | 9% (2 / 226)              | 0% (0 / 224)                      |\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "| 4            | 91% (224 / 226)           | 100% (224 / 224)                  |\n",
+      "+--------------+---------------------------+-----------------------------------+\n",
+      "\u001b[2KApplying Weight Compression \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[35m100%\u001b[0m \u001b[38;2;0;104;181m226/226\u001b[0m • \u001b[38;2;0;104;181m0:02:35\u001b[0m • \u001b[38;2;0;104;181m0:00:00\u001b[0m\n",
+      "\u001b[?25h"
+     ]
+    }
+   ],
   "source": [
    "stateful_model_path = Path(\"stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT\") \n",
    "\n",
    "!python $convert_script --model_id stabilityai/stable-zephyr-3b --precision FP16 --compress_weights 4BIT_DEFAULT --output stable-zephyr-3b-stateful --force_convert --stateful"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8c12bd81-b88e-426c-822b-01df7749abab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n",
+      "/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
(Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n", + " return torch._C._cuda_getDeviceCount() > 0\n", "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n", - "[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_name: stable-zephyr-3b\n", + "[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: stable-zephyr-3b-stateful\n", "[ INFO ] ov_config={'PERFORMANCE_HINT': 'LATENCY', 'CACHE_DIR': '', 'NUM_STREAMS': '1'}\n", "OPENVINO_TORCH_BACKEND_DEVICE=CPU\n", - "[ INFO ] model_path=stable-zephyr-3b/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version:2023.3.0-13522-0a7d1d770ff\n", + "[ INFO ] model_path=stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT, openvino runtime version: 2024.0.0-13826-b51c5c0a997\n", "Compiling the model to CPU ...\n", - "[ INFO ] From pretrained time: 7.12s\n", + "[ INFO ] From pretrained time: 5.70s\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", "[ INFO ] num_iters=0, num_text_list=1\n", "[ INFO ] input_text=Tell me story about cats\n", "[ INFO ] Input token size:5, max_output_token_size:512\n", "Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.\n", "[ INFO ] [warm-up] Input token size: 5\n", - "[ INFO ] [warm-up] Output size: 512\n", + "[ INFO ] [warm-up] Output size: 290\n", "[ INFO ] [warm-up] Infer count: 512\n", - "[ INFO ] [warm-up] Generation Time: 27.29s\n", - "[ INFO ] [warm-up] Latency: 53.30 ms/token\n", + "[ INFO ] [warm-up] Tokenization Time: 1.99ms\n", + "[ INFO ] [warm-up] Detokenization Time: 0.46ms\n", + "[ INFO ] [warm-up] Generation Time: 16.35s\n", + "[ INFO ] [warm-up] Latency: 56.37 ms/token\n", "[ INFO ] [warm-up] Generated:\n", - "Tell me story about cats and dogs\n", - "Once upon a time, in a small village, there lived a kind and gentle old lady named Grandma.\n", - "Grandma had a small garden behind her house, where she grew beautiful flowers and vegetables. She loved to spend her days in the garden, tending to her plants and enjoying the peaceful surroundings.\n", - "One day, a curious little kitten wandered into the garden. It was playful and curious, and it seemed to be friendly towards Grandma. The kitten was so cute that Grandma couldn't resist it and decided to keep it. From that day on, the garden was filled with the joyful sounds of the kitten playing and exploring.\n", - "One day, while Grandma was watering the plants, she heard a loud barking coming from the other side of the garden. It was a big, friendly dog who had wandered into the village. The dog was looking for someone to help him find his owner, as he had been separated from his family during a move.\n", - "Grandma felt sorry for the dog and decided to take care of him. She gave him some food and water and made him a cozy bed in the shed. The dog was grateful for the kind treatment and soon became a regular visitor in the garden.\n", - "The kitten and the dog soon became good friends, playing and exploring together. They would often take long walks through the village, looking for new adventures to share. The kitten would chase after the dog, and the dog would bring the kitten back little treasures he found along the way.\n", - "One day, while they were walking, they heard a loud noise coming from the other side of the garden. It was a group of stray cats who had come to visit the village. 
The cats were sneaky and mischievous, and they loved to play tricks on the villagers.\n",
-      "The kitten and the dog were both scared at first, but they soon realized that the cats were not a threat to them. The kitten and the dog decided to make friends with the cats and show them that they were friendly too.\n",
-      "The cats were amazed by the kitten's playful nature and the dog's friendly demeanor. They soon became good friends with the kitten and the dog, and the garden became a lively and colorful place once again.\n",
-      "From that day on, the kitten, the dog, and the cats would often come to visit Grandma in the garden, bringing with them new adventures and memories to cherish. And Grandma would always be happy to see\n",
-      "[ INFO ] [warm-up] Result MD5:['ff3143d4365f5fd1a308832647e3e565']\n",
-      "[ INFO ] [warm-up] First token latency: 699.40 ms/token, other tokens latency: 51.99 ms/token, len of tokens: 512\n",
-      "[ INFO ] [warm-up] First token infer latency: 698.34 ms/token, other tokens infer latency: 51.41 ms/token, len of tokens: 512\n"
+      "Tell me story about cats and dogs.\n",
+      "Once upon a time, in a small village, there lived a young girl named Lily. She had two pets, a cat named Mittens and a dog named Max. Mittens was very playful and loved to chase Max around the house. Max, on the other hand, was a bit timid and would often hide when Mittens came around.\n",
+      "One day, Mittens and Max were playing together in the backyard when a loud thunderstorm came suddenly. Mittens, being afraid of the thunder, ran inside the house, leaving Max behind. The rain was coming down hard, and Max was struggling to find his way back inside.\n",
+      "Lily, who was watching the storm from her bedroom, heard Max's cries and knew she had to help him. She ran down the stairs and found Max standing in the rain, looking lost. Lily knew just what to do. She picked up Max and carried him back to the house, where Mittens was waiting.\n",
+      "Mittens was relieved to see Max safe and sound, and the two of them snuggled up together on the couch for the rest of the storm. From that day on, Max was no longer afraid of Mittens, and the three of them became closer than ever before.\n",
+      "And that, my dear friends, is the story of Mittens, Max, and Lily, and how they overcame their fears and became a true family.<|endoftext|>\n",
+      "[ INFO ] [warm-up] Result MD5:['f5575487f181d7de8e4c095b39fa4180']\n",
+      "[ INFO ] [warm-up] First token latency: 1074.80 ms/token, other tokens latency: 52.77 ms/token, len of tokens: 290\n",
+      "[ INFO ] [warm-up] First token infer latency: 1073.78 ms/token, other tokens infer latency: 52.15 ms/token, len of tokens: 290\n"
     ]
    }
   ],
   "source": [
-    "!python $benchmark_script -m $model_path -ic 512 -p \"Tell me story about cats\" --make_stateful "
+    "!python $benchmark_script -m $stateful_model_path -ic 512 -p \"Tell me story about cats\""
   ]
  },
  {
@@ -222,15 +314,14 @@
    "\n",
    "Running the model with the Optimum-Intel API requires the following steps:\n",
    "1. register a normalized config for the model\n",
-    "2. create instance of `OVModelForCausalLM` class using `from_pretrained` method and providing path to the model.\n",
-    "3. patch inter processing for applying `make_stateful=True`\n",
+    "2. 
create an instance of the `OVModelForCausalLM` class using the `from_pretrained` method, providing the path to the model and the `stateful` flag\n",
    "\n",
    "The model text generation interface remains unchanged: the text generation process starts with running the `ov_model.generate` method, passing text encoded by the tokenizer as input. This method returns a sequence of generated token ids that should be decoded using the tokenizer"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 7,
   "id": "433988ee-bda7-4224-9bf5-b013de4fcd65",
   "metadata": {},
   "outputs": [
@@
    "name": "stderr",
    "output_type": "stream",
    "text": [
-     "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n",
-     "Compiling the model to CPU ...\n",
-     "Setting OpenVINO CACHE_DIR to stable-zephyr-3b/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT/model_cache\n"
+     "/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n",
+     "  return torch._C._cuda_getDeviceCount() > 0\n",
+     "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n"
    ]
   }
  ],
  "source": [
   "from utils.ov_model_classes import register_normalized_configs\n",
-    "from utils.ov_utils import patch_inter_processing\n",
   "from optimum.intel.openvino import OVModelForCausalLM\n",
   "from transformers import AutoConfig\n",
   "\n",
   "# Load model into Optimum Interface\n",
   "register_normalized_configs()\n",
   "\n",
-    "ov_model = OVModelForCausalLM.from_pretrained(model_path, compile=False, config=AutoConfig.from_pretrained(model_path, trust_remote_code=True))\n",
-    "patch_inter_processing(ov_model, fuse_decoding_strategy=False, fuse_cache_reorder=False, make_stateful=True, save_prepared_model=None)"
+    "ov_model = OVModelForCausalLM.from_pretrained(stateful_model_path, compile=False, config=AutoConfig.from_pretrained(stateful_model_path, trust_remote_code=True), stateful=True)"
  ]
 },
 {
@@ -315,7 +404,6 @@
    "    TextIteratorStreamer,\n",
    ")\n",
    "\n",
-    "\n",
    "model_name = \"stable-zephyr-3b\"\n",
    "\n",
    "tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
@@ -629,7 +717,7 @@
    "# if you have any issue to launch on your platform, you can pass share=True to launch method:\n",
    "# demo.launch(share=True)\n",
    "# it creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/\n",
-    "demo.launch()"
+    "demo.launch(share=True)"
   ]
  }
],

diff --git a/notebooks/273-stable-zephyr-3b-chatbot/README.md b/notebooks/273-stable-zephyr-3b-chatbot/README.md
index 76d9037ca71..0caa2c1bbe0 100644
--- a/notebooks/273-stable-zephyr-3b-chatbot/README.md
+++ b/notebooks/273-stable-zephyr-3b-chatbot/README.md
@@ -6,7 +6,7 @@ While a decent intent-based chatbot can answer basic, one-touch inquiries like o
`Stable Zephyr 3B` is a 3 billion parameter model that demonstrated outstanding results on many LLM evaluation benchmarks, outperforming many popular models despite its relatively small size.
Inspired by [HuggingFaceH4's Zephyr 7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) training pipeline, this model was trained on a mix of publicly available and synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290), and evaluated with [MT Bench](https://tatsu-lab.github.io/alpaca_eval/) and the [Alpaca Benchmark](https://tatsu-lab.github.io/alpaca_eval/). More details about the model can be found in the [model card](https://huggingface.co/stabilityai/stablelm-zephyr-3b).

-In this tutorial, we consider how to optimize and run this model using the OpenVINO toolkit. For the convenience of the conversion step and model performance evaluation, we will use [llm_bench](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python) tool, which provides a unified approach to estimate performance for LLM. It is based on pipelines provided by [Optimum-Intel](https://github.com/huggingface/optimum-intel) and allows to estimate performance for Pytorch and OpenVINO models using almost the same code. We also demonstrate how to apply [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview) optimization and how to make model stateful using OpenVINO transformations, which improves process of caching model state.
+In this tutorial, we consider how to optimize and run this model using the OpenVINO toolkit. For the convenience of the conversion step and model performance evaluation, we will use the [llm_bench](https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python) tool, which provides a unified approach to estimating LLM performance. It is based on pipelines provided by [Optimum-Intel](https://github.com/huggingface/optimum-intel) and allows estimating performance for PyTorch and OpenVINO models using almost the same code. We also demonstrate how to make the model stateful using OpenVINO transformations, which improves handling of the model cache state.

## Notebook Contents
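As a hedged illustration of what the stateful transformation mentioned above changes at the runtime level, the sketch below inspects the converted IR with the OpenVINO runtime. The directory and `openvino_model.xml` file name follow the layout produced by the notebook's `--stateful` conversion step and are assumptions, not guarantees of this patch.

```python
# Sketch: inspect a stateful IR with the OpenVINO runtime. A stateful IR should
# expose no past_key_values.* inputs, because the KV-cache lives inside the
# model as internal state rather than being passed in and out on every step.
import openvino as ov

core = ov.Core()
model = core.read_model(
    "stable-zephyr-3b-stateful/pytorch/dldt/compressed_weights/OV_FP16-4BIT_DEFAULT/openvino_model.xml"
)
print([port.get_any_name() for port in model.inputs])

compiled = core.compile_model(model, "CPU")
request = compiled.create_infer_request()
# Between unrelated prompts, the accumulated cache state must be cleared:
request.reset_state()
```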