309 | 309 | "print(f\"End to end speedup with FastDraft and speculative decoding is {ar_gen_time / sd_gen_time:.2f}x\")"
310 | 310 | ]
311 | 311 | },
312 | - {
313 | - "cell_type": "markdown",
314 | - "metadata": {},
315 | - "source": [
316 | - "## Evaluate Speculative Decoding Speedup On Multiple Examples\n",
317 | - "\n",
318 | - "In this section, we compare auto-regressive generation and speculative-decoding generation with the DeepSeek-R1-Distill-Llama-8B model on multiple examples.\n",
319 | - "We use 40 example prompts taken from the [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) and [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) datasets.\n",
320 | - "We loop over these examples and measure generation times, first without speculative decoding and then with it. Finally, we compare the generation times of the two methods and compute the average speedup."
321 | - ]
322 | - },
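The cells below read the 40 prompts from a local `prompts.json` file. As a rough illustration of how such a file could be assembled from the two datasets mentioned above, here is a minimal sketch using the Hugging Face `datasets` library; the split name, the field names (`prompt` for MT-Bench, `instruction` for Dolly), and the 20/20 selection are assumptions, not part of the original notebook.

```python
# Sketch only: assemble a prompts.json with 40 prompts. The split name and the
# field names ("prompt" for MT-Bench, "instruction" for Dolly) are assumptions
# taken from the public dataset cards, not from this notebook.
import json
from datasets import load_dataset

mt_bench = load_dataset("HuggingFaceH4/mt_bench_prompts", split="train")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

prompts = []
for ex in mt_bench.select(range(20)):
    p = ex["prompt"]
    prompts.append(p[0] if isinstance(p, list) else p)  # keep only the first turn
for ex in dolly.select(range(20)):
    prompts.append(ex["instruction"])

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```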
323 | - {
324 | - "cell_type": "markdown",
325 | - "metadata": {},
326 | - "source": [
327 | - "### 1. Run target model without speculative decoding\n",
328 | - "As in the previous section, we first run generation without speculative decoding, but this time over all 40 examples."
329 | - ]
330 | - },
331 | - {
332 | - "cell_type": "code",
333 | - "execution_count": null,
334 | - "metadata": {},
335 | - "outputs": [],
336 | - "source": [
337 | - "import openvino_genai as ov_genai\n",
338 | - "import json\n",
339 | - "import time\n",
340 | - "from tqdm import tqdm\n",
341 | - "from llm_pipeline_with_hf_tokenizer import LLMPipelineWithHFTokenizer\n",
342 | - "\n",
343 | - "print(f\"Loading model from {model_dir}\")\n",
344 | - "\n",
345 | - "# Define the scheduler: a 2048-token KV cache split into 16-token blocks\n",
346 | - "scheduler_config = ov_genai.SchedulerConfig()\n",
347 | - "scheduler_config.num_kv_blocks = 2048 // 16\n",
348 | - "scheduler_config.dynamic_split_fuse = False\n",
349 | - "scheduler_config.max_num_batched_tokens = 2048\n",
350 | - "\n",
351 | - "pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config)\n",
352 | - "\n",
353 | - "generation_config = ov_genai.GenerationConfig()\n",
354 | - "generation_config.max_new_tokens = 1024\n",
355 | - "\n",
356 | - "print(\"Loading prompts...\")\n",
357 | - "# Load the 40 benchmark prompts and wrap each one in chat format\n",
358 | - "with open('prompts.json') as f:\n",
359 | - "    prompts = json.load(f)\n",
360 | - "prompts = [[{\"role\": \"user\", \"content\": p}] for p in prompts]\n",
361 | - "\n",
362 | - "times_auto_regressive = []\n",
363 | - "for prompt in tqdm(prompts):\n",
364 | - "    start_time = time.perf_counter()\n",
365 | - "    result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
366 | - "    end_time = time.perf_counter()\n",
367 | - "    times_auto_regressive.append(end_time - start_time)\n",
368 | - "print(\"Done\")\n",
369 | - "\n",
370 | - "import gc\n",
371 | - "# Free the pipeline before building the speculative-decoding pipeline\n",
372 | - "del pipe\n",
373 | - "gc.collect()"
374 | - ]
375 | - },
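A note on the scheduler settings above: `num_kv_blocks = 2048 // 16` sizes the KV cache in fixed-size blocks, so the cache holds a 2048-token sequence if each block covers 16 tokens (the block size implied by that expression; the actual default may vary by device and version). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope KV-cache sizing, assuming 16-token cache blocks
# (the block size implied by the 2048 // 16 expression above).
context_tokens = 2048  # prompt + generated tokens the cache must hold
block_size = 16        # assumed tokens per KV-cache block

num_kv_blocks = context_tokens // block_size
print(num_kv_blocks, "blocks =", num_kv_blocks * block_size, "tokens of KV cache")  # 128 blocks = 2048 tokens
```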
376 | - {
377 | - "cell_type": "markdown",
378 | - "metadata": {},
379 | - "source": [
380 | - "### 2. Run target model with speculative decoding\n",
381 | - "Now we run generation with speculative decoding over the same 40 examples."
382 | - ]
383 | - },
384 | - {
385 | - "cell_type": "code",
386 | - "execution_count": null,
387 | - "metadata": {},
388 | - "outputs": [],
389 | - "source": [
390 | - "print(f\"Loading draft from {draft_model_path}\")\n",
391 | - "\n",
392 | - "# Define a separate scheduler for the draft model\n",
393 | - "\n",
394 | - "draft_scheduler_config = ov_genai.SchedulerConfig()\n",
395 | - "draft_scheduler_config.num_kv_blocks = 2048 // 16\n",
396 | - "draft_scheduler_config.dynamic_split_fuse = False\n",
397 | - "draft_scheduler_config.max_num_batched_tokens = 2048\n",
398 | - "\n",
399 | - "draft_model = ov_genai.draft_model(draft_model_path, device, scheduler_config=draft_scheduler_config)\n",
400 | - "\n",
401 | - "pipe = LLMPipelineWithHFTokenizer(model_dir, device, scheduler_config=scheduler_config, draft_model=draft_model)\n",
402 | - "\n",
403 | - "# The draft proposes 3 candidate tokens per target-model verification step\n",
404 | - "generation_config = ov_genai.GenerationConfig()\n",
405 | - "generation_config.num_assistant_tokens = 3\n",
406 | - "generation_config.max_new_tokens = 2048\n",
407 | - "\n",
408 | - "times_speculative_decoding = []\n",
409 | - "\n",
410 | - "print(\"Running Speculative Decoding generation...\")\n",
411 | - "for prompt in tqdm(prompts):\n",
412 | - "    start_time = time.perf_counter()\n",
413 | - "    result = pipe.generate(prompt, generation_config, apply_chat_template=True)\n",
414 | - "    end_time = time.perf_counter()\n",
415 | - "    times_speculative_decoding.append(end_time - start_time)\n",
416 | - "print(\"Done\")"
417 | - ]
418 | - },
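The notebook wraps generation in its own `LLMPipelineWithHFTokenizer` helper. For reference, roughly the same speculative-decoding setup can be expressed with the stock `openvino_genai.LLMPipeline`; the sketch below follows the pattern of the OpenVINO GenAI speculative-decoding sample and reuses the `model_dir`, `draft_model_path`, and `device` variables from earlier cells, so treat it as an illustration rather than a drop-in replacement for the cell above.

```python
import openvino_genai as ov_genai

# Illustration: speculative decoding with the stock LLMPipeline instead of the
# notebook's LLMPipelineWithHFTokenizer wrapper. Reuses model_dir,
# draft_model_path and device from earlier cells.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.num_kv_blocks = 2048 // 16

draft_model = ov_genai.draft_model(draft_model_path, device)
pipe = ov_genai.LLMPipeline(
    model_dir,
    device,
    scheduler_config=scheduler_config,
    draft_model=draft_model,
)

config = ov_genai.GenerationConfig()
config.num_assistant_tokens = 3   # draft proposes 3 tokens per verification step
config.max_new_tokens = 256

print(pipe.generate("Briefly explain speculative decoding.", config))
```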
419 | - {
420 | - "cell_type": "markdown",
421 | - "metadata": {},
422 | - "source": [
423 | - "### 3. Calculate speedup\n"
424 | - ]
425 | - },
426 | - {
427 | - "cell_type": "code",
428 | - "execution_count": null,
429 | - "metadata": {},
430 | - "outputs": [],
431 | - "source": [
432 | - "avg_speedup = sum(x / y for x, y in zip(times_auto_regressive, times_speculative_decoding)) / len(prompts)\n",
433 | - "print(f\"Average speedup: {avg_speedup:.2f}x\")"
434 | - ]
435 | - },
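The mean of the per-prompt ratios can be pulled around by a few unusually long generations, so it is worth also checking the median and the spread of the same measurements. A short sketch that reuses the two timing lists collected above:

```python
import statistics

# Per-prompt speedups from the two timing lists measured above.
speedups = [ar / sd for ar, sd in zip(times_auto_regressive, times_speculative_decoding)]

print(f"mean speedup:   {statistics.mean(speedups):.2f}x")
print(f"median speedup: {statistics.median(speedups):.2f}x")
print(f"min / max:      {min(speedups):.2f}x / {max(speedups):.2f}x")
```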
436 | - {
437 | - "cell_type": "markdown",
438 | - "metadata": {},
439 | - "source": [
440 | - "We see that by using speculative decoding with FastDraft, we can accelerate DeepSeek-R1-Distill-Llama-8B generation by ~1.5x on average."
441 | - ]
442 | - },
443 | 312 | {
444 | 313 | "attachments": {},
445 | 314 | "cell_type": "markdown",