Commit ca48d78

Revert "add batch speedup evaluation to fastdraft_deepseek notebook" (#2736)
Reverts #2734
1 parent ee5a0f3 commit ca48d78

2 files changed, +0 -175 lines changed

supplementary_materials/notebooks/fastdraft-deepseek/fastdraft_deepseek.ipynb

Lines changed: 0 additions & 131 deletions
@@ -309,137 +309,6 @@
 "print(f\"End to end speedup with FastDraft and speculative decoding is {ar_gen_time / sd_gen_time:.2f}x\")"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Evaluate Speculative Decoding Speedup On Multiple Examples\n",
-"\n",
-"In this section we compare auto-regressive generation and speculative-decoding generation with DeepSeek-R1-Distill-Llama-8B model on multiple examples. \n",
-"We use 40 example-prompts taken from [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) and from [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets.\n",
-"We loop over these examples and measure generation times, first without speculative-decoding and later with speculative-decoding. Eventually we compare generation times for both methods and compute the average speedup gain."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### 1. Run target model without speculative decoding\n",
-"As in previous section, we will first run generation without speculative-decoding, but this time we will run it over 40 examples."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"import openvino_genai as ov_genai\n",
-"import sys\n",
-"import time\n",
-"from tqdm import tqdm\n",
-"from llm_pipeline_with_hf_tokenizer import LLMPipelineWithHFTokenizer\n",
-"\n",
-"print(f\"Loading model from {model_dir}\")\n",
-"\n",
-"# Define scheduler\n",
-"scheduler_config = ov_genai.SchedulerConfig()\n",
-"scheduler_config.num_kv_blocks = 2048 // 16\n",
-"scheduler_config.dynamic_split_fuse = False\n",
-"scheduler_config.max_num_batched_tokens = 2048\n",
-"\n",
-"pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config)\n",
-"\n",
-"generation_config = ov_genai.GenerationConfig()\n",
-"generation_config.max_new_tokens = 1024\n",
-"\n",
-"print(\"Loading prompts...\")\n",
-"import json\n",
-"f= open('prompts.json')\n",
-"prompts = json.load(f)\n",
-"prompts = [[{\"role\": \"user\", \"content\": p }] for p in prompts]\n",
-"\n",
-"times_auto_regressive = []\n",
-"for prompt in tqdm(prompts):\n",
-" start_time = time.perf_counter()\n",
-" result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
-" end_time = time.perf_counter()\n",
-" times_auto_regressive.append(end_time - start_time)\n",
-"print(\"Done\")\n",
-"\n",
-"import gc\n",
-"\n",
-"del pipe\n",
-"gc.collect()"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### 2. Run target model with speculative decoding\n",
-"Now we will run generation with speculative-decoding over the same 40 examples."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"print(f\"Loading draft from {draft_model_path}\")\n",
-"\n",
-"# Define scheduler for the draft\n",
-"\n",
-"draft_scheduler_config = ov_genai.SchedulerConfig()\n",
-"draft_scheduler_config.num_kv_blocks = 2048 // 16\n",
-"draft_scheduler_config.dynamic_split_fuse = False\n",
-"draft_scheduler_config.max_num_batched_tokens = 2048\n",
-"\n",
-"draft_model = ov_genai.draft_model(draft_model_path, device, scheduler_config=draft_scheduler_config)\n",
-"\n",
-"pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config, draft_model=draft_model)\n",
-"\n",
-"\n",
-"generation_config = ov_genai.GenerationConfig()\n",
-"generation_config.num_assistant_tokens = 3\n",
-"generation_config.max_new_tokens = 2048\n",
-"\n",
-"times_speculative_decoding = []\n",
-"\n",
-"print(\"Running Speculative Decoding generation...\")\n",
-"for prompt in tqdm(prompts):\n",
-" start_time = time.perf_counter()\n",
-" result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
-" end_time = time.perf_counter()\n",
-" times_speculative_decoding.append((end_time - start_time))\n",
-"print(\"Done\")"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### 3. Calculate speedup\n"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"avg_speedup = sum([x / y for x, y in zip(times_auto_regressive, times_speculative_decoding)]) / len(prompts)\n",
-"print(f\"average speedup: {avg_speedup:.2f}\")"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"We see that by using speculative-decoding with FastDraft we can accelerate DeepSeek-R1-Distill-Llama-8B generation by ~1.5x on avarage."
-]
-},
 {
 "attachments": {},
 "cell_type": "markdown",

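For context on the removed "3. Calculate speedup" cell: it averages the per-prompt speedup ratios rather than dividing total generation times. A minimal sketch of that calculation with made-up timings (illustrative numbers only, not measurements from the notebook):

import statistics

# Hypothetical per-prompt generation times in seconds (not real measurements).
times_auto_regressive = [10.0, 20.0, 16.0]
times_speculative_decoding = [7.0, 13.0, 11.0]

# Average of per-prompt ratios, as in the removed cell.
per_prompt_speedups = [ar / sd for ar, sd in zip(times_auto_regressive, times_speculative_decoding)]
avg_speedup = statistics.mean(per_prompt_speedups)
print(f"average speedup: {avg_speedup:.2f}")  # ~1.47 for these sample numbers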
supplementary_materials/notebooks/fastdraft-deepseek/prompts.json

Lines changed: 0 additions & 44 deletions
This file was deleted.
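The deleted prompts.json is not reproduced in the diff, but the removed notebook code (json.load on the file, then wrapping each entry as a user message) implies it was a flat JSON list of prompt strings. A minimal sketch of that assumed format and how the removed cell consumed it; the file contents below are illustrative placeholders, not the original 40 MT-Bench/Dolly prompts:

import json

# Write a stand-in prompts.json with the assumed structure: a plain list of strings.
example_prompts = [
    "Compose an engaging travel blog post about a recent trip to Hawaii.",
    "Explain the difference between supervised and unsupervised learning.",
]
with open("prompts.json", "w") as f:
    json.dump(example_prompts, f, indent=2)

# Load it the way the removed cell did and wrap each prompt as a chat message.
with open("prompts.json") as f:
    prompts = json.load(f)
prompts = [[{"role": "user", "content": p}] for p in prompts]
print(prompts[0])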

0 commit comments
