
Commit 9720d61

Updating Evaluator Notebook to LM Eval Harness (#185)
Co-authored-by: Chris Alexiuk <[email protected]>
1 parent e85ddc1 commit 9720d61

File tree

1 file changed (+14, −53 lines)

RAG/notebooks/nemo/Nemo Evaluator Llama 3.1 Workbook/evaluator_notebook.ipynb

Lines changed: 14 additions & 53 deletions
@@ -67,13 +67,13 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "## Baseline Evaluation of Llama 3.1 8B Instruct with BigBench\n",
+     "## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness\n",
      "\n",
      "The Nemo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the Nemo Evaluator API.\n",
      "\n",
      "> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)\n",
      "\n",
-     "For this notebook, we will be running the BigBench evaluation (details available [here](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html#bigbench))! This benchmark consists of 200+ tasks for evaluating LLMs."
+     "For this notebook, we will be running the LM Evaluation Harness evaluation!"
      ]
    },
    {
@@ -90,7 +90,6 @@
     "outputs": [],
     "source": [
      "model_config = {\n",
-     "    \"llm_type\": \"nvidia-nemo-nim\",\n",
      "    \"llm_name\": \"my-customized-model\",\n",
      "    \"inference_url\": \"MY_NIM_URL/v1\",\n",
      "    \"use_chat_endpoint\": False,\n",
@@ -103,11 +102,9 @@
     "source": [
      "Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. \n",
      "\n",
-     "For this evaluation, we'll focus on a small subset of BigBench by choosing the `intent_recognition` task. \n",
+     "For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation which uses Eleuther AI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend. \n",
      "\n",
-     "`intent_recognition` is a task specifically tailored to determine if the model is good at recognizing a given utterance's intent. More details available [here](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/intent_recognition).\n",
-     "\n",
-     "We'll also select the `tydiqa_goldp.en` task to see how Llama 3 8B Instruct stacks up on the English subset of the `TyDi QA` benchmark. More details available [here](https://github.com/google-research-datasets/tydiqa)."
+     "The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!"
      ]
    },
    {
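As a point of reference, the same GSM8K evaluation can be reproduced outside the microservice with the harness's own Python API. The sketch below is illustrative only: the `local-completions` backend choice and the placeholder model name/URL are assumptions, not values the Evaluator service uses internally.

```python
# Minimal sketch: running GSM8K directly via LM Evaluation Harness (v0.4.x),
# pointed at an OpenAI-compatible completions endpoint. The backend choice
# ("local-completions") and the placeholder URL/model name are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=my-customized-model,base_url=http://MY_NIM_URL/v1/completions",
    tasks=["gsm8k"],
    num_fewshot=5,   # mirrors num_fewshot in the evaluation_config below
    batch_size=16,   # mirrors batch_size in the evaluation_config below
)
print(results["results"]["gsm8k"])  # per-task accuracy metrics
```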
@@ -118,49 +115,17 @@
     "source": [
      "evaluation_config = {\n",
      "    \"eval_type\": \"automatic\",\n",
-     "    \"eval_subtype\": \"bigbench\",\n",
-     "    \"standard_tasks\": [\n",
-     "        \"intent_recognition\",\n",
-     "    ],\n",
-     "    \"tydiqa_tasks\": [\n",
-     "        \"tydiqa_goldp.en\",\n",
-     "    ],\n",
-     "    \"standard_tasks_args\": \"--max_length=64 --json_shots='0,2'\",\n",
-     "    \"tydiqa_tasks_args\": \"--max_length=16 --json_shots='1,8'\",\n",
-     "    \"few_shot_example_separator_override\": {\n",
-     "        \"standard_tasks\": {\n",
-     "            \"default\": None\n",
-     "        },\n",
-     "        \"tydiqa_tasks\": {\n",
-     "            \"default\": None\n",
-     "        }\n",
-     "    },\n",
-     "    \"example_input_prefix_override\": {\n",
-     "        \"standard_tasks\": {\n",
-     "            \"default\": None\n",
-     "        },\n",
-     "        \"tydiqa_tasks\": {\n",
-     "            \"default\": None\n",
-     "        }\n",
-     "    },\n",
-     "    \"example_output_prefix_override\": {\n",
-     "        \"standard_tasks\": {\n",
-     "            \"default\": None,\n",
-     "            \"abstract_narrative_understanding\": None\n",
-     "        },\n",
-     "        \"tydiqa_tasks\": {\n",
-     "            \"default\": None\n",
-     "        }\n",
-     "    },\n",
-     "    \"stop_string_override\": {\n",
-     "        \"standard_tasks\": {\n",
-     "            \"default\": None,\n",
-     "            \"abstract_narrative_understanding\": None\n",
-     "        },\n",
-     "        \"tydiqa_tasks\": {\n",
-     "            \"default\": None\n",
+     "    \"eval_subtype\": \"lm_eval_harness\",\n",
+     "    \"tasks\": [\n",
+     "        {\n",
+     "            \"task_name\" : \"gsm8k\",\n",
+     "            \"task_config\" : None,\n",
+     "            \"num_fewshot\" : 5,\n",
+     "            \"batch_size\" : 16,\n",
+     "            \"bootstrap_iters\" : 1000,\n",
+     "            \"limit\" : -1\n",
      "        }\n",
-     "    }\n",
+     "    ]\n",
      "}"
      ]
    },
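With both dictionaries defined, the notebook submits them to the Evaluator service. The sketch below is a hypothetical illustration of that step: the `/v1/evaluations` path and the payload key names are assumptions, not confirmed API details; the notebook itself and the Evaluator API documentation linked above are authoritative. (In the harness, `bootstrap_iters` controls how many resamples are used to estimate metric standard errors, and `limit: -1` presumably runs the full test set.)

```python
# Hypothetical sketch of submitting the evaluation job. The endpoint path
# and payload key names are assumptions for illustration only.
import requests

EVALUATOR_URL = "MY_EVALUATOR_URL"  # placeholder, in the notebook's style

response = requests.post(
    f"{EVALUATOR_URL}/v1/evaluations",       # assumed endpoint path
    json={
        "model_config": model_config,        # assumed payload keys
        "evaluation_config": evaluation_config,
    },
)
response.raise_for_status()
print(response.json())  # expected to include a job ID to poll for results
```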
@@ -375,9 +340,7 @@
     "outputs": [],
     "source": [
      "model_config = {\n",
-     "    \"llm_type\" : \"nvidia-nemo-nim\",\n",
      "    \"llm_name\" : \"my-customized-model\",\n",
-     "    \"container\" : \"my-customized-container\",\n",
      "    \"inference_url\" : \"my-customized-inference-url\",\n",
      "    \"use_chat_endpoint\" : False,\n",
      "}"
@@ -522,9 +485,7 @@
     "outputs": [],
     "source": [
      "model_config = {\n",
-     "    \"llm_type\" : \"nvidia-nemo-nim\",\n",
      "    \"llm_name\" : \"my-customized-model\",\n",
-     "    \"container\" : \"my-customized-container\",\n",
      "    \"inference_url\" : \"my-customized-inference-url\",\n",
      "    \"use_chat_endpoint\" : False,\n",
      "}"
