|
67 | 67 | "cell_type": "markdown",
|
68 | 68 | "metadata": {},
|
69 | 69 | "source": [
|
70 |    | - "## Baseline Evaluation of Llama 3.1 8B Instruct with BigBench\n",
   | 70 | + "## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness\n",
71 | 71 | "\n",
|
72 | 72 | "The NeMo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the NeMo Evaluator API.\n",
|
73 | 73 | "\n",
|
74 | 74 | "> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)\n",
|
75 | 75 | "\n",
|
76 |    | - "For this notebook, we will be running the BigBench evaluation (details available [here](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html#bigbench))! This benchmark consists of 200+ tasks for evaluating LLMs."
   | 76 | + "For this notebook, we will be running an evaluation through the LM Evaluation Harness!"
77 | 77 | ]
|
78 | 78 | },
|
79 | 79 | {
|
|
90 | 90 | "outputs": [],
|
91 | 91 | "source": [
|
92 | 92 | "model_config = {\n",
|
93 |    | - " \"llm_type\": \"nvidia-nemo-nim\",\n",
94 | 93 | " \"llm_name\": \"my-customized-model\",\n",
|
95 | 94 | " \"inference_url\": \"MY_NIM_URL/v1\",\n",
|
96 | 95 | " \"use_chat_endpoint\": False,\n",
|
|
103 | 102 | "source": [
|
104 | 103 | "Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. \n",
|
105 | 104 | "\n",
|
106 |     | - "For this evaluation, we'll focus on a small subset of BigBench by choosing the `intent_recognition` task. \n",
    | 105 | + "For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation, which uses EleutherAI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend. \n",
107 | 106 | "\n",
|
108 |     | - "`intent_recognition` is a task specifically tailored to determine if the model is good at recognizing a given utterance's intent. More details available [here](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/intent_recognition).\n",
109 |     | - "\n",
110 |     | - "We'll also select the `tydiqa_goldp.en` task to see how Llama 3 8B Instruct stacks up on the English subset of the `TyDi QA` benchmark. More details available [here](https://github.com/google-research-datasets/tydiqa)."
    | 107 | + "The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!"
111 | 108 | ]
|
112 | 109 | },
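
The "more than 60 benchmarks" claim is easy to verify against the upstream harness itself. The snippet below is a sketch assuming `lm-eval` v0.4.x is installed locally (`pip install lm-eval==0.4.3`); note that the set of tasks exposed through the NeMo Evaluator microservice may differ from the upstream package.

```python
# Enumerate the tasks registered in the upstream LM Evaluation Harness
# and list the GSM8K variants among them.
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
print(f"{len(task_manager.all_tasks)} tasks registered upstream")
print("GSM8K variants:", [t for t in task_manager.all_tasks if "gsm8k" in t])
```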
|
113 | 110 | {
|
|
118 | 115 | "source": [
|
119 | 116 | "evaluation_config = {\n",
|
120 | 117 | " \"eval_type\": \"automatic\",\n",
|
121 |     | - " \"eval_subtype\": \"bigbench\",\n",
122 |     | - " \"standard_tasks\": [\n",
123 |     | - " \"intent_recognition\",\n",
124 |     | - " ],\n",
125 |     | - " \"tydiqa_tasks\": [\n",
126 |     | - " \"tydiqa_goldp.en\",\n",
127 |     | - " ],\n",
128 |     | - " \"standard_tasks_args\": \"--max_length=64 --json_shots='0,2'\",\n",
129 |     | - " \"tydiqa_tasks_args\": \"--max_length=16 --json_shots='1,8'\",\n",
130 |     | - " \"few_shot_example_separator_override\": {\n",
131 |     | - " \"standard_tasks\": {\n",
132 |     | - " \"default\": None\n",
133 |     | - " },\n",
134 |     | - " \"tydiqa_tasks\": {\n",
135 |     | - " \"default\": None\n",
136 |     | - " }\n",
137 |     | - " },\n",
138 |     | - " \"example_input_prefix_override\": {\n",
139 |     | - " \"standard_tasks\": {\n",
140 |     | - " \"default\": None\n",
141 |     | - " },\n",
142 |     | - " \"tydiqa_tasks\": {\n",
143 |     | - " \"default\": None\n",
144 |     | - " }\n",
145 |     | - " },\n",
146 |     | - " \"example_output_prefix_override\": {\n",
147 |     | - " \"standard_tasks\": {\n",
148 |     | - " \"default\": None,\n",
149 |     | - " \"abstract_narrative_understanding\": None\n",
150 |     | - " },\n",
151 |     | - " \"tydiqa_tasks\": {\n",
152 |     | - " \"default\": None\n",
153 |     | - " }\n",
154 |     | - " },\n",
155 |     | - " \"stop_string_override\": {\n",
156 |     | - " \"standard_tasks\": {\n",
157 |     | - " \"default\": None,\n",
158 |     | - " \"abstract_narrative_understanding\": None\n",
159 |     | - " },\n",
160 |     | - " \"tydiqa_tasks\": {\n",
161 |     | - " \"default\": None\n",
    | 118 | + " \"eval_subtype\": \"lm_eval_harness\",\n",
    | 119 | + " \"tasks\": [\n",
    | 120 | + " {\n",
    | 121 | + " \"task_name\" : \"gsm8k\",\n",
    | 122 | + " \"task_config\" : None,\n",
    | 123 | + " \"num_fewshot\" : 5,\n",
    | 124 | + " \"batch_size\" : 16,\n",
    | 125 | + " \"bootstrap_iters\" : 1000,\n",
    | 126 | + " \"limit\" : -1\n",
162 | 127 | " }\n",
|
163 |     | - " }\n",
    | 128 | + " ]\n",
164 | 129 | "}"
|
165 | 130 | ]
|
166 | 131 | },
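
For reference, the same GSM8K settings can be reproduced outside the microservice with the upstream harness. This is a sketch, not part of the notebook's flow: it assumes `lm-eval` v0.4.x is installed and that the NIM is reachable via its OpenAI-compatible completions endpoint; the exact `model_args` keys may vary between harness versions.

```python
import lm_eval

# Run GSM8K 5-shot against an OpenAI-compatible endpoint, mirroring the
# evaluation_config above. "local-completions" is the harness's adapter
# for such endpoints; the model name and URL are placeholders from this
# notebook, and the model_args keys are assumptions to verify upstream.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=my-customized-model,base_url=MY_NIM_URL/v1/completions",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=16,
)
print(results["results"]["gsm8k"])
```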
|
|
375 | 340 | "outputs": [],
|
376 | 341 | "source": [
|
377 | 342 | "model_config = {\n",
|
378 |     | - " \"llm_type\" : \"nvidia-nemo-nim\",\n",
379 | 343 | " \"llm_name\" : \"my-customized-model\",\n",
380 |     | - " \"container\" : \"my-customized-container\",\n",
381 | 344 | " \"inference_url\" : \"my-customized-inference-url\",\n",
|
382 | 345 | " \"use_chat_endpoint\" : False,\n",
|
383 | 346 | "}"
|
|
522 | 485 | "outputs": [],
|
523 | 486 | "source": [
|
524 | 487 | "model_config = {\n",
|
525 |     | - " \"llm_type\" : \"nvidia-nemo-nim\",\n",
526 | 488 | " \"llm_name\" : \"my-customized-model\",\n",
527 |     | - " \"container\" : \"my-customized-container\",\n",
528 | 489 | " \"inference_url\" : \"my-customized-inference-url\",\n",
|
529 | 490 | " \"use_chat_endpoint\" : False,\n",
|
530 | 491 | "}"
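
With `model_config` and `evaluation_config` defined, the job is submitted to the Evaluator microservice over HTTP. The sketch below is hypothetical: the `EVALUATOR_URL` placeholder, the `/v1/evaluations` path, and the payload shape are assumptions to be checked against the Evaluation documentation linked earlier, not confirmed API details.

```python
import requests

# Hypothetical submission call; confirm the endpoint path and payload
# schema against the NeMo Evaluator API docs before use.
EVALUATOR_URL = "MY_EVALUATOR_URL"  # placeholder, like MY_NIM_URL above

payload = {
    "model_config": model_config,
    "evaluation_config": evaluation_config,
}
response = requests.post(f"{EVALUATOR_URL}/v1/evaluations", json=payload, timeout=60)
response.raise_for_status()
print("Evaluation job submitted:", response.json())
```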
|
|