|
67 | 67 | "cell_type": "markdown",
|
68 | 68 | "metadata": {},
|
69 | 69 | "source": [
|
70 |    | - "## Baseline Evaluation of Llama 3.1 8B Instruct with BigBench\n",
   | 70 | + "## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness\n",
71 | 71 | "\n",
|
72 | 72 | "The NeMo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the NeMo Evaluator API.\n",
|
73 | 73 | "\n",
|
74 | 74 | "> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)\n",
|
75 | 75 | "\n",
|
76 |    | - "For this notebook, we will be running the BigBench evaluation (details available [here](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html#bigbench))! This benchmark consists of 200+ tasks for evaluating LLMs."
   | 76 | + "For this notebook, we will be running an evaluation through the LM Evaluation Harness!"
77 | 77 | ]
|
78 | 78 | },
|
79 | 79 | {
|
|
90 | 90 | "outputs": [],
|
91 | 91 | "source": [
|
92 | 92 | "model_config = {\n",
|
93 |    | - " \"llm_type\": \"nvidia-nemo-nim\",\n",
94 | 93 | " \"llm_name\": \"my-customized-model\",\n",
|
95 | 94 | " \"inference_url\": \"MY_NIM_URL/v1\",\n",
|
96 | 95 | " \"use_chat_endpoint\": False,\n",
|
|
103 | 102 | "source": [
|
104 | 103 | "Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. \n",
|
105 | 104 | "\n",
|
106 |     | - "For this evaluation, we'll focus on a small subset of BigBench by choosing the `intent_recognition` task. \n",
    | 105 | + "For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation, which uses EleutherAI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend. \n",
107 | 106 | "\n",
|
108 |     | - "`intent_recognition` is a task specifically tailored to determine if the model is good at recognizing a given utterance's intent. More details available [here](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/intent_recognition).\n",
109 |     | - "\n",
110 |     | - "We'll also select the `tydiqa_goldp.en` task to see how Llama 3 8B Instruct stacks up on the English subset of the `TyDi QA` benchmark. More details available [here](https://github.com/google-research-datasets/tydiqa)."
    | 107 | + "The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!"
111 | 108 | ]
|
112 | 109 | },
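
The "more than 60 benchmarks" claim is easy to verify against the upstream harness itself. The snippet below is a sketch assuming `lm-eval` v0.4.x is installed locally (`pip install lm-eval==0.4.3`); note that the set of tasks exposed through the NeMo Evaluator microservice may differ from the upstream package.

```python
# Enumerate the tasks registered in the upstream LM Evaluation Harness
# and list the GSM8K variants among them.
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
print(f"{len(task_manager.all_tasks)} tasks registered upstream")
print("GSM8K variants:", [t for t in task_manager.all_tasks if "gsm8k" in t])
```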
|
113 | 110 | {
|
|
118 | 115 | "source": [
|
119 | 116 | "evaluation_config = {\n",
|
120 | 117 | " \"eval_type\": \"automatic\",\n",
|
121 |     | - " \"eval_subtype\": \"bigbench\",\n",
122 |     | - " \"standard_tasks\": [\n",
123 |     | - " \"intent_recognition\",\n",
124 |     | - " ],\n",
125 |     | - " \"tydiqa_tasks\": [\n",
126 |     | - " \"tydiqa_goldp.en\",\n",
127 |     | - " ],\n",
128 |     | - " \"standard_tasks_args\": \"--max_length=64 --json_shots='0,2'\",\n",
129 |     | - " \"tydiqa_tasks_args\": \"--max_length=16 --json_shots='1,8'\",\n",
130 |     | - " \"few_shot_example_separator_override\": {\n",
131 |     | - " \"standard_tasks\": {\n",
132 |     | - " \"default\": None\n",
133 |     | - " },\n",
134 |     | - " \"tydiqa_tasks\": {\n",
135 |     | - " \"default\": None\n",
136 |     | - " }\n",
137 |     | - " },\n",
138 |     | - " \"example_input_prefix_override\": {\n",
139 |     | - " \"standard_tasks\": {\n",
140 |     | - " \"default\": None\n",
141 |     | - " },\n",
142 |     | - " \"tydiqa_tasks\": {\n",
143 |     | - " \"default\": None\n",
144 |     | - " }\n",
145 |     | - " },\n",
146 |     | - " \"example_output_prefix_override\": {\n",
147 |     | - " \"standard_tasks\": {\n",
148 |     | - " \"default\": None,\n",
149 |     | - " \"abstract_narrative_understanding\": None\n",
150 |     | - " },\n",
151 |     | - " \"tydiqa_tasks\": {\n",
152 |     | - " \"default\": None\n",
153 |     | - " }\n",
154 |     | - " },\n",
155 |     | - " \"stop_string_override\": {\n",
156 |     | - " \"standard_tasks\": {\n",
157 |     | - " \"default\": None,\n",
158 |     | - " \"abstract_narrative_understanding\": None\n",
159 |     | - " },\n",
160 |     | - " \"tydiqa_tasks\": {\n",
161 |     | - " \"default\": None\n",
    | 118 | + " \"eval_subtype\": \"lm_eval_harness\",\n",
    | 119 | + " \"tasks\": [\n",
    | 120 | + " {\n",
    | 121 | + " \"task_name\" : \"gsm8k\",\n",
    | 122 | + " \"task_config\" : None,\n",
    | 123 | + " \"num_fewshot\" : 5,\n",
    | 124 | + " \"batch_size\" : 16,\n",
    | 125 | + " \"bootstrap_iters\" : 1000,\n",
    | 126 | + " \"limit\" : -1\n",
162 | 127 | " }\n",
|
163 |     | - " }\n",
    | 128 | + " ]\n",
164 | 129 | "}"
|
165 | 130 | ]
|
166 | 131 | },
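
For reference, the same GSM8K settings can be reproduced outside the microservice with the upstream harness. This is a sketch, not part of the notebook's flow: it assumes `lm-eval` v0.4.x is installed and that the NIM is reachable via its OpenAI-compatible completions endpoint; the exact `model_args` keys may vary between harness versions.

```python
import lm_eval

# Run GSM8K 5-shot against an OpenAI-compatible endpoint, mirroring the
# evaluation_config above. "local-completions" is the harness's adapter
# for such endpoints; the model name and URL are placeholders from this
# notebook, and the model_args keys are assumptions to verify upstream.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=my-customized-model,base_url=MY_NIM_URL/v1/completions",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=16,
)
print(results["results"]["gsm8k"])
```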
|
|
375 | 340 | "outputs": [],
|
376 | 341 | "source": [
|
377 | 342 | "model_config = {\n",
|
378 |     | - " \"llm_type\" : \"nvidia-nemo-nim\",\n",
379 | 343 | " \"llm_name\" : \"my-customized-model\",\n",
380 |     | - " \"container\" : \"my-customized-container\",\n",
381 | 344 | " \"inference_url\" : \"my-customized-inference-url\",\n",
|
382 | 345 | " \"use_chat_endpoint\" : False,\n",
|
383 | 346 | "}"
|
|
522 | 485 | "outputs": [],
|
523 | 486 | "source": [
|
524 | 487 | "model_config = {\n",
|
525 |     | - " \"llm_type\" : \"nvidia-nemo-nim\",\n",
526 | 488 | " \"llm_name\" : \"my-customized-model\",\n",
527 |     | - " \"container\" : \"my-customized-container\",\n",
528 | 489 | " \"inference_url\" : \"my-customized-inference-url\",\n",
|
529 | 490 | " \"use_chat_endpoint\" : False,\n",
|
530 | 491 | "}"
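
With `model_config` and `evaluation_config` defined, the job is submitted to the Evaluator microservice over HTTP. The sketch below is hypothetical: the `EVALUATOR_URL` placeholder, the `/v1/evaluations` path, and the payload shape are assumptions to be checked against the Evaluation documentation linked earlier, not confirmed API details.

```python
import requests

# Hypothetical submission call; confirm the endpoint path and payload
# schema against the NeMo Evaluator API docs before use.
EVALUATOR_URL = "MY_EVALUATOR_URL"  # placeholder, like MY_NIM_URL above

payload = {
    "model_config": model_config,
    "evaluation_config": evaluation_config,
}
response = requests.post(f"{EVALUATOR_URL}/v1/evaluations", json=payload, timeout=60)
response.raise_for_status()
print("Evaluation job submitted:", response.json())
```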
|
|