|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "# liaisons-experiments - Framework Try-out" |
| 7 | + "# liaisons-experiments - Large Language Model Benchmarking for Relation-based Argument Mining\n",
| 8 | + "\n", |
| 9 | + "This notebook evaluates the ability of the current Large Language Model landscape on micro-scale Relation-based Argument Mining tasks.\n",
| 10 | + "\n", |
| 11 | + "## About the Task\n", |
| 12 | + "\n", |
| 13 | + "This work is a modest continuation of previous work (Gorur et al., 2024), limiting computing costs by greatly reducing the size of the dataset.\n",
| 14 | + "\n", |
| 15 | + "The actual task of this evaluation consists in measuring each model's capability to predict the logical relation between two arguments on controversial topics collected from Wikipedia (Bar-Haim et al., 2017). \n",
| 16 | + "The predicted relation can be one of 2 or 3 classes, depending on the relation dimension configuration:\n",
| 17 | + "- When *binary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\") or *attack* (e.g., \"Arg A logically contradicts Arg B\")\n", |
| 18 | + "- When *ternary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\"), *attack* (e.g., \"Arg A logically contradicts Arg B\"), or *unrelated* (e.g., \"Arg A is logically irrelevant to Arg B\") \n",
| 19 | + " \n", |
| 20 | + "For example, the first argument `ASEAN has subscribed to the notion of democratic peace` stands in an `attack` relation to the second argument `This house would disband ASEAN`."
8 | 21 | ]
|
9 | 22 | },
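The two relation dimension configurations described above can be captured in a small mapping. The sketch below is purely illustrative; the variable and function names are hypothetical and do not come from the notebook's own code:

```python
# Illustrative mapping of the two relation dimension configurations;
# the names here are hypothetical, not taken from the notebook.
RELATION_CLASSES = {
    "binary": ["support", "attack"],
    "ternary": ["support", "attack", "unrelated"],
}

def is_valid_label(label, dimension="ternary"):
    # Check that a model prediction belongs to the configured class set.
    return label in RELATION_CLASSES[dimension]
```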
|
10 | 23 | {
|
|
15 | 28 | "source": [
|
16 | 29 | "from dotenv import load_dotenv\n",
|
17 | 30 | "\n",
|
18 |
| - "# Load the .env file\n", |
| 31 | + "# Load the .env file to safely retrieve the HuggingFace token for the task dataset,\n",
| 32 | + "# but also the credentials for the platform-hosted LLMs\n",
19 | 33 | "load_dotenv()"
|
20 | 34 | ]
|
21 | 35 | },
|
| 36 | + { |
| 37 | + "cell_type": "markdown", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "## Selected Models\n", |
| 41 | + "\n", |
| 42 | + "Capitalizing on the growing trend of open-source LLMs, this research investigates models like phi3, gemma2, and llama3 that are accessible even to users without specialized hardware.\n", |
| 43 | + "\n", |
| 44 | + "Expanding the evaluation, this work also includes larger, platform-hosted models (e.g., gpt-3.5-turbo-0125, gemini-1.5-pro...). Their ease of scaling makes them particularly attractive for further macro-scale argument mining feature development.\n",
| 45 | + "\n", |
| 46 | + "### Hyperparameters Configuration\n", |
| 47 | + "\n", |
| 48 | + "Following a previous hyperparameter search (Gorur et al., 2024), `temperature` and `top_p` have respectively been set to 0.7 and 1 for better results. The `max_tokens` hyperparameter has also been set to the minimum needed to generate the expected classes (\"support\"/\"attack\"/\"unrelated\"), enabling a crucial cut in computation cost. However, that minimum depends on the tokenizer used by each model, so the value varies across models.\n",
| 49 | + "\n", |
| 50 | + "### Pipeline Acceleration\n", |
| 51 | + "\n", |
| 52 | + "Taking advantage of the platform-hosted models' infrastructure, the benchmarking framework offers a multithreading feature, configurable through the `num_workers` parameter, enabling a significant performance improvement.\n"
| 53 | + ] |
| 54 | + }, |
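A minimal sketch of how these settings might fit together, assuming a placeholder `classify_relation` call — the framework's real client code is not shown in this diff, so everything except the hyperparameter names (`temperature`, `top_p`, `max_tokens`, `num_workers`) is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hyperparameter values mirror the notebook's configuration;
# the dictionary itself is an illustrative stand-in.
GENERATION_CONFIG = {
    "temperature": 0.7,  # per the Gorur et al. (2024) hyperparameter search
    "top_p": 1,
    "max_tokens": 3,     # minimum needed varies with each model's tokenizer
}

def classify_relation(pair, config=GENERATION_CONFIG):
    # Placeholder for a real model call; returns a dummy label here.
    arg_a, arg_b = pair
    return "attack" if "disband" in arg_b else "support"

def run_benchmark(pairs, num_workers=4):
    # num_workers mirrors the framework's multithreading parameter:
    # argument pairs are classified concurrently across worker threads.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(classify_relation, pairs))
```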
22 | 55 | {
|
23 | 56 | "cell_type": "code",
|
24 | 57 | "execution_count": null,
|
|
325 | 358 | " plt.show()"
|
326 | 359 | ]
|
327 | 360 | },
|
| 361 | + { |
| 362 | + "cell_type": "markdown", |
| 363 | + "metadata": {}, |
| 364 | + "source": [ |
| 365 | + "## Prompting Techniques\n", |
| 366 | + "\n", |
| 367 | + "This benchmark builds upon previous research by Gorur et al. (2024) utilizing \"few-shot\" prompting. This technique involves providing several examples of the desired behavior before presenting the actual prompt. \n",
| 368 | + "\n", |
| 369 | + "However, a significant discrepancy emerged between prior results and our findings. To address this gap, we also explored \"augmented few-shot\" prompting, which incorporates an additional instructional line within the prompt." |
| 370 | + ] |
| 371 | + }, |
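The two prompting variants can be sketched as a small prompt builder. This is an assumed illustration: the helper name, the example format, and the instructional line are all hypothetical, chosen only to show how "augmented few-shot" adds one instruction before the examples:

```python
def build_few_shot_prompt(examples, arg_a, arg_b, augmented=False):
    # Hypothetical helper: each example is a (arg_a, arg_b, label) tuple.
    lines = []
    if augmented:
        # "Augmented few-shot" prepends a single instructional line.
        lines.append("Classify the relation between the two arguments "
                     "as support or attack.")
    for ex_a, ex_b, label in examples:
        lines.append(f"Argument A: {ex_a}\nArgument B: {ex_b}\nRelation: {label}")
    # The actual query is left open for the model to complete.
    lines.append(f"Argument A: {arg_a}\nArgument B: {arg_b}\nRelation:")
    return "\n\n".join(lines)
```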
328 | 372 | {
|
329 | 373 | "cell_type": "code",
|
330 | 374 | "execution_count": null,
|
|
508 | 552 | "\n",
|
509 | 553 | "plot_ternary_results(augmented_few_shot_ternary_results, title=\"Large Language Models for ternary argumentative relation prediction over the IBM Debater preprocessed dataset sample using augmented few shot prompting\")"
|
510 | 554 | ]
|
| 555 | + }, |
| 556 | + { |
| 557 | + "cell_type": "markdown", |
| 558 | + "metadata": {}, |
| 559 | + "source": [ |
| 560 | + "## Bibliography\n", |
| 561 | + "- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance classification of context dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.\n", |
| 562 | + "- Gorur, D., Rago, A. and Toni, F. (2024). Can Large Language Models perform Relation-based Argument Mining? [online] arXiv.org. doi:https://doi.org/10.48550/arXiv.2402.11243." |
| 563 | + ] |
511 | 564 | }
|
512 | 565 | ],
|
513 | 566 | "metadata": {
|
|