309 | 309 | "print(f\"End to end speedup with FastDraft and speculative decoding is {ar_gen_time / sd_gen_time:.2f}x\")"
310 | 310 | ]
311 | 311 | },
312 | - {
313 | - "cell_type": "markdown",
314 | - "metadata": {},
315 | - "source": [
316 | - "## Evaluate Speculative Decoding Speedup On Multiple Examples\n",
317 | - "\n",
318 | - "In this section, we compare auto-regressive generation and speculative-decoding generation with the DeepSeek-R1-Distill-Llama-8B model on multiple examples.\n",
319 | - "We use 40 example prompts taken from the [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) and [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets.\n",
320 | - "We loop over these examples and measure generation times, first without speculative decoding and then with it. Finally, we compare the generation times of the two methods and compute the average speedup."
321 | - ]
322 | - },
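The cells below read the 40 prompts from a local `prompts.json` file. As a rough illustration of how such a file could be assembled from the two datasets mentioned above, here is a minimal sketch using the Hugging Face `datasets` library; the split name, the field names (`prompt` for MT-Bench, `instruction` for Dolly), and the 20/20 selection are assumptions, not part of the original notebook.

```python
# Sketch only: assemble a prompts.json with 40 prompts. The split name and the
# field names ("prompt" for MT-Bench, "instruction" for Dolly) are assumptions
# taken from the public dataset cards, not from this notebook.
import json
from datasets import load_dataset

mt_bench = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

prompts = []
for ex in mt_bench.select(range(20)):
    p = ex["prompt"]
    prompts.append(p[0] if isinstance(p, list) else p)  # keep only the first turn
for ex in dolly.select(range(20)):
    prompts.append(ex["instruction"])

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```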
323 | - {
324 | - "cell_type": "markdown",
325 | - "metadata": {},
326 | - "source": [
327 | - "### 1. Run target model without speculative decoding\n",
328 | - "As in the previous section, we first run generation without speculative decoding, but this time over all 40 examples."
329 | - ]
330 | - },
331 | - {
332 | - "cell_type": "code",
333 | - "execution_count": null,
334 | - "metadata": {},
335 | - "outputs": [],
336 | - "source": [
337 | - "import openvino_genai as ov_genai\n",
338 | - "import json\n",
339 | - "import time\n",
340 | - "from tqdm import tqdm\n",
341 | - "from llm_pipeline_with_hf_tokenizer import LLMPipelineWithHFTokenizer\n",
342 | - "\n",
343 | - "print(f\"Loading model from {model_dir}\")\n",
344 | - "\n",
345 | - "# Define the scheduler: a 2048-token KV cache split into 16-token blocks\n",
346 | - "scheduler_config = ov_genai.SchedulerConfig()\n",
347 | - "scheduler_config.num_kv_blocks = 2048 // 16\n",
348 | - "scheduler_config.dynamic_split_fuse = False\n",
349 | - "scheduler_config.max_num_batched_tokens = 2048\n",
350 | - "\n",
351 | - "pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config)\n",
352 | - "\n",
353 | - "generation_config = ov_genai.GenerationConfig()\n",
354 | - "generation_config.max_new_tokens = 1024\n",
355 | - "\n",
356 | - "print(\"Loading prompts...\")\n",
357 | - "# Load the 40 benchmark prompts and wrap each one in chat format\n",
358 | - "with open('prompts.json') as f:\n",
359 | - "    prompts = json.load(f)\n",
360 | - "prompts = [[{\"role\": \"user\", \"content\": p}] for p in prompts]\n",
361 | - "\n",
362 | - "times_auto_regressive = []\n",
363 | - "for prompt in tqdm(prompts):\n",
364 | - "    start_time = time.perf_counter()\n",
365 | - "    result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
366 | - "    end_time = time.perf_counter()\n",
367 | - "    times_auto_regressive.append(end_time - start_time)\n",
368 | - "print(\"Done\")\n",
369 | - "\n",
370 | - "import gc\n",
371 | - "# Free the pipeline before building the speculative-decoding pipeline\n",
372 | - "del pipe\n",
373 | - "gc.collect()"
374 | - ]
375 | - },
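A note on the scheduler settings above: `num_kv_blocks = 2048 // 16` sizes the KV cache in fixed-size blocks, so the cache holds a 2048-token sequence if each block covers 16 tokens (the block size implied by that expression; the actual default may vary by device and version). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope KV-cache sizing, assuming 16-token cache blocks
# (the block size implied by the 2048 // 16 expression above).
context_tokens = 2048  # prompt + generated tokens the cache must hold
block_size = 16        # assumed tokens per KV-cache block

num_kv_blocks = context_tokens // block_size
print(num_kv_blocks, "blocks =", num_kv_blocks * block_size, "tokens of KV cache")  # 128 blocks = 2048 tokens
```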
376 | - {
377 | - "cell_type": "markdown",
378 | - "metadata": {},
379 | - "source": [
380 | - "### 2. Run target model with speculative decoding\n",
381 | - "Now we run generation with speculative decoding over the same 40 examples."
382 | - ]
383 | - },
384 | - {
385 | - "cell_type": "code",
386 | - "execution_count": null,
387 | - "metadata": {},
388 | - "outputs": [],
389 | - "source": [
390 | - "print(f\"Loading draft from {draft_model_path}\")\n",
391 | - "\n",
392 | - "# Define a separate scheduler for the draft model\n",
393 | - "\n",
394 | - "draft_scheduler_config = ov_genai.SchedulerConfig()\n",
395 | - "draft_scheduler_config.num_kv_blocks = 2048 // 16\n",
396 | - "draft_scheduler_config.dynamic_split_fuse = False\n",
397 | - "draft_scheduler_config.max_num_batched_tokens = 2048\n",
398 | - "\n",
399 | - "draft_model = ov_genai.draft_model(draft_model_path, device, scheduler_config=draft_scheduler_config)\n",
400 | - "\n",
401 | - "pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config, draft_model=draft_model)\n",
402 | - "\n",
403 | - "# The draft proposes 3 candidate tokens per target-model verification step\n",
404 | - "generation_config = ov_genai.GenerationConfig()\n",
405 | - "generation_config.num_assistant_tokens = 3\n",
406 | - "generation_config.max_new_tokens = 2048\n",
407 | - "\n",
408 | - "times_speculative_decoding = []\n",
409 | - "\n",
410 | - "print(\"Running Speculative Decoding generation...\")\n",
411 | - "for prompt in tqdm(prompts):\n",
412 | - "    start_time = time.perf_counter()\n",
413 | - "    result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
414 | - "    end_time = time.perf_counter()\n",
415 | - "    times_speculative_decoding.append(end_time - start_time)\n",
416 | - "print(\"Done\")"
417 | - ]
418 | - },
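The notebook wraps generation in its own `LLMPipelineWithHFTokenizer` helper. For reference, roughly the same speculative-decoding setup can be expressed with the stock `openvino_genai.LLMPipeline`; the sketch below follows the pattern of the OpenVINO GenAI speculative-decoding sample and reuses the `model_dir`, `draft_model_path`, and `device` variables from earlier cells, so treat it as an illustration rather than a drop-in replacement for the cell above.

```python
import openvino_genai as ov_genai

# Illustration: speculative decoding with the stock LLMPipeline instead of the
# notebook's LLMPipelineWithHFTokenizer wrapper. Reuses model_dir,
# draft_model_path and device from earlier cells.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.num_kv_blocks = 2048 // 16

draft_model = ov_genai.draft_model(draft_model_path, device)
pipe = ov_genai.LLMPipeline(
    model_dir,
    device,
    scheduler_config=scheduler_config,
    draft_model=draft_model,
)

config = ov_genai.GenerationConfig()
config.num_assistant_tokens = 3   # draft proposes 3 tokens per verification step
config.max_new_tokens = 256

print(pipe.generate("Briefly explain speculative decoding.", config))
```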
419 | - {
420 | - "cell_type": "markdown",
421 | - "metadata": {},
422 | - "source": [
423 | - "### 3. Calculate speedup\n"
424 | - ]
425 | - },
426 | - {
427 | - "cell_type": "code",
428 | - "execution_count": null,
429 | - "metadata": {},
430 | - "outputs": [],
431 | - "source": [
432 | - "avg_speedup = sum(x / y for x, y in zip(times_auto_regressive, times_speculative_decoding)) / len(prompts)\n",
433 | - "print(f\"Average speedup: {avg_speedup:.2f}x\")"
434 | - ]
435 | - },
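The mean of the per-prompt ratios can be pulled around by a few unusually long generations, so it is worth also checking the median and the spread of the same measurements. A short sketch that reuses the two timing lists collected above:

```python
import statistics

# Per-prompt speedups from the two timing lists measured above.
speedups = [ar / sd for ar, sd in zip(times_auto_regressive, times_speculative_decoding)]

print(f"mean speedup:   {statistics.mean(speedups):.2f}x")
print(f"median speedup: {statistics.median(speedups):.2f}x")
print(f"min / max:      {min(speedups):.2f}x / {max(speedups):.2f}x")
```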
436 | - {
437 | - "cell_type": "markdown",
438 | - "metadata": {},
439 | - "source": [
440 | - "We see that by using speculative decoding with FastDraft, we can accelerate DeepSeek-R1-Distill-Llama-8B generation by ~1.5x on average."
441 | - ]
442 | - },
443 | 312 | {
444 | 313 | "attachments": {},
445 | 314 | "cell_type": "markdown",