|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "# liaisons-experiments - Framework Try-out" |
| 7 | + "# liaisons-experiments - Large Language Model Benchmarking for Relation-based Argument Mining\n",
| 8 | + "\n", |
| 9 | + "This notebook evaluates the ability of the current Large Language Model landscape on micro-scale Relation-based Argument Mining tasks.\n",
| 10 | + "\n", |
| 11 | + "## About the Task\n", |
| 12 | + "\n", |
| 13 | + "This work is a modest continuation of previous work (Gorur et al., 2024), limiting computing costs by greatly reducing the size of the dataset.\n",
| 14 | + "\n", |
| 15 | + "The actual task of this evaluation consists in measuring each model's capability to predict the logical relation between two arguments on controversial topics collected from Wikipedia (Bar-Haim et al., 2017). \n",
| 16 | + "The predicted relation can be one of 2 or 3 classes, depending on the relation dimension configuration:\n",
| 17 | + "- When *binary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\") or *attack* (e.g., \"Arg A logically contradicts Arg B\")\n", |
| 18 | + "- When *ternary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\"), *attack* (e.g., \"Arg A logically contradicts Arg B\"), or *unrelated* (e.g., \"Arg A is logically irrelevant to Arg B\") \n",
| 19 | + " \n", |
| 20 | + "For example, the first argument `ASEAN has subscribed to the notion of democratic peace` stands in an `attack` relation to the second argument `This house would disband ASEAN`."
8 | 21 | ]
|
9 | 22 | },
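The two relation dimension configurations described above can be captured in a small mapping. The sketch below is purely illustrative; the variable and function names are hypothetical and do not come from the notebook's own code:

```python
# Illustrative mapping of the two relation dimension configurations;
# the names here are hypothetical, not taken from the notebook.
RELATION_CLASSES = {
    "binary": ["support", "attack"],
    "ternary": ["support", "attack", "unrelated"],
}

def is_valid_label(label, dimension="ternary"):
    # Check that a model prediction belongs to the configured class set.
    return label in RELATION_CLASSES[dimension]
```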
|
10 | 23 | {
|
|
15 | 28 | "source": [
|
16 | 29 | "from dotenv import load_dotenv\n",
|
17 | 30 | "\n",
|
18 |
| - "# Load the .env file\n", |
| 31 | + "# Load the .env file to safely retrieve the HuggingFace token for the task dataset,\n",
| 32 | + "# but also the credentials for the platform-hosted LLMs\n",
19 | 33 | "load_dotenv()"
|
20 | 34 | ]
|
21 | 35 | },
|
| 36 | + { |
| 37 | + "cell_type": "markdown", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "## Selected Models\n", |
| 41 | + "\n", |
| 42 | + "Capitalizing on the growing trend of open-source LLMs, this research investigates models like phi3, gemma2, and llama3 that are accessible even to users without specialized hardware.\n", |
| 43 | + "\n", |
| 44 | + "Expanding the evaluation, this work also includes larger, platform-hosted models (e.g., gpt-3.5-turbo-0125, gemini-1.5-pro...). Their ease of scaling makes them particularly attractive for further macro-scale argument mining feature development.\n",
| 45 | + "\n", |
| 46 | + "### Hyperparameters Configuration\n", |
| 47 | + "\n", |
| 48 | + "Following a previous hyperparameter search (Gorur et al., 2024), `temperature` and `top_p` have respectively been set to 0.7 and 1 for better results. The `max_tokens` hyperparameter has also been set to the minimum needed to generate the expected classes (\"support\"/\"attack\"/\"unrelated\"), enabling a crucial cut in computation cost. However, that minimum depends on the tokenizer used by each model, so the value varies across models.\n",
| 49 | + "\n", |
| 50 | + "### Pipeline Acceleration\n", |
| 51 | + "\n", |
| 52 | + "Taking advantage of the platform-hosted models' infrastructure, the benchmarking framework offers a multithreading feature, configurable through the `num_workers` parameter, enabling a significant performance improvement.\n"
| 53 | + ] |
| 54 | + }, |
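A minimal sketch of how these settings might fit together, assuming a placeholder `classify_relation` call — the framework's real client code is not shown in this diff, so everything except the hyperparameter names (`temperature`, `top_p`, `max_tokens`, `num_workers`) is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hyperparameter values mirror the notebook's configuration;
# the dictionary itself is an illustrative stand-in.
GENERATION_CONFIG = {
    "temperature": 0.7,  # per the Gorur et al. (2024) hyperparameter search
    "top_p": 1,
    "max_tokens": 3,     # minimum needed varies with each model's tokenizer
}

def classify_relation(pair, config=GENERATION_CONFIG):
    # Placeholder for a real model call; returns a dummy label here.
    arg_a, arg_b = pair
    return "attack" if "disband" in arg_b else "support"

def run_benchmark(pairs, num_workers=4):
    # num_workers mirrors the framework's multithreading parameter:
    # argument pairs are classified concurrently across worker threads.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(classify_relation, pairs))
```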
22 | 55 | {
|
23 | 56 | "cell_type": "code",
|
24 | 57 | "execution_count": null,
|
|
325 | 358 | " plt.show()"
|
326 | 359 | ]
|
327 | 360 | },
|
| 361 | + { |
| 362 | + "cell_type": "markdown", |
| 363 | + "metadata": {}, |
| 364 | + "source": [ |
| 365 | + "## Prompting Techniques\n", |
| 366 | + "\n", |
| 367 | + "This benchmark builds upon previous research by Gorur et al. (2024) utilizing \"few-shot\" prompting. This technique involves providing several examples of the desired behavior before presenting the actual prompt. \n",
| 368 | + "\n", |
| 369 | + "However, a significant discrepancy emerged between prior results and our findings. To address this gap, we also explored \"augmented few-shot\" prompting, which incorporates an additional instructional line within the prompt." |
| 370 | + ] |
| 371 | + }, |
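The two prompting variants can be sketched as a small prompt builder. This is an assumed illustration: the helper name, the example format, and the instructional line are all hypothetical, chosen only to show how "augmented few-shot" adds one instruction before the examples:

```python
def build_few_shot_prompt(examples, arg_a, arg_b, augmented=False):
    # Hypothetical helper: each example is a (arg_a, arg_b, label) tuple.
    lines = []
    if augmented:
        # "Augmented few-shot" prepends a single instructional line.
        lines.append("Classify the relation between the two arguments "
                     "as support or attack.")
    for ex_a, ex_b, label in examples:
        lines.append(f"Argument A: {ex_a}\nArgument B: {ex_b}\nRelation: {label}")
    # The actual query is left open for the model to complete.
    lines.append(f"Argument A: {arg_a}\nArgument B: {arg_b}\nRelation:")
    return "\n\n".join(lines)
```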
328 | 372 | {
|
329 | 373 | "cell_type": "code",
|
330 | 374 | "execution_count": null,
|
|
508 | 552 | "\n",
|
509 | 553 | "plot_ternary_results(augmented_few_shot_ternary_results, title=\"Large Language Models for ternary argumentative relation prediction over the IBM Debater preprocessed dataset sample using augmented few shot prompting\")"
|
510 | 554 | ]
|
| 555 | + }, |
| 556 | + { |
| 557 | + "cell_type": "markdown", |
| 558 | + "metadata": {}, |
| 559 | + "source": [ |
| 560 | + "## Bibliography\n", |
| 561 | + "- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance classification of context dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.\n", |
| 562 | + "- Gorur, D., Rago, A. and Toni, F. (2024). Can Large Language Models perform Relation-based Argument Mining? [online] arXiv.org. doi:https://doi.org/10.48550/arXiv.2402.11243." |
| 563 | + ] |
511 | 564 | }
|
512 | 565 | ],
|
513 | 566 | "metadata": {
|
|