
Commit 1ed7853

Author: Guilhem Santé (committed)
docs: add some documentation in the notebook itself to details the proceeded steps of the evaluation
1 parent c392148 commit 1ed7853

File tree

1 file changed

+55
-2
lines changed


main.ipynb

+55-2
@@ -4,7 +4,20 @@
44
"cell_type": "markdown",
55
"metadata": {},
66
"source": [
7-
"# liaisons-experiments - Framework Try-out"
7+
"# liaisons-experiments - Large Language Models Benchmarkings for Relation-based Argument Mining\n",
8+
"\n",
9+
"This notebook evaluates the ability of the current Large Language Model landscape on micro-scale Relation-based Argument Mining tasks.\n",
10+
"\n",
11+
"## About the Task\n",
12+
"\n",
13+
"This work is a modest continuation of previous work (Gorur et al., 2024), limiting the computing cost by greatly reducing the size of the dataset.\n",
14+
"\n",
15+
"The actual task of this evaluation consists in measuring each model's capability to predict the logical relation between two arguments on controversial topics collected from Wikipedia (Bar-Haim et al., 2017).\n",
16+
"The predicted relation can belong to either 2 or 3 classes, depending on the relation dimension configuration:\n",
17+
"- When *binary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\") or *attack* (e.g., \"Arg A logically contradicts Arg B\")\n",
18+
"- When *ternary*, a relation can either be *support* (e.g., \"Arg A logically supports Arg B\"), *attack* (e.g., \"Arg A logically contradicts Arg B\"), or *unrelated* (e.g., \"Arg A is logically irrelevant to Arg B\")\n",
19+
" \n",
20+
"For example, the first argument `ASEAN has subscribed to the notion of democratic peace` stands in an `attack` relation to the second argument `This house would disband ASEAN`."
821
]
922
},
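The task described above can be sketched as a labeled pair of arguments. This is an illustrative data shape only; the class names come from the notebook, but `ArgumentPair` and its fields are hypothetical, not the notebook's actual API.

```python
from dataclasses import dataclass

@dataclass
class ArgumentPair:
    """One sample of the task: two arguments and their gold relation."""
    arg_a: str
    arg_b: str
    relation: str  # "support", "attack", or (ternary only) "unrelated"

# Class sets for the two relation dimension configurations
BINARY_CLASSES = ("support", "attack")
TERNARY_CLASSES = ("support", "attack", "unrelated")

# The example pair from the notebook's introduction
sample = ArgumentPair(
    arg_a="ASEAN has subscribed to the notion of democratic peace",
    arg_b="This house would disband ASEAN",
    relation="attack",
)
```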
1023
{
@@ -15,10 +28,30 @@
1528
"source": [
1629
"from dotenv import load_dotenv\n",
1730
"\n",
18-
"# Load the .env file\n",
31+
"# Load the .env file to safely retrieve the HuggingFace token for the task dataset,\n",
32+
"# and for the platform-hosted LLMs\n",
1933
"load_dotenv()"
2034
]
2135
},
36+
{
37+
"cell_type": "markdown",
38+
"metadata": {},
39+
"source": [
40+
"## Selected Models\n",
41+
"\n",
42+
"Capitalizing on the growing trend of open-source LLMs, this research investigates models like phi3, gemma2, and llama3 that are accessible even to users without specialized hardware.\n",
43+
"\n",
44+
"Expanding the evaluation, this work also includes larger, platform-hosted models (e.g., gpt-3.5-turbo-0125, gemini-1.5-pro...). Their ease of scaling makes them particularly attractive for further macro-scale argument mining feature development.\n",
45+
"\n",
46+
"### Hyperparameters Configuration\n",
47+
"\n",
48+
"Following a previous hyperparameter search (Gorur et al., 2024), `temperature` and `top_p` have been set to 0.7 and 1 respectively for better results. The `max_tokens` hyperparameter has also been set to the minimum needed to generate the expected classes (\"support\"/\"attack\"/\"unrelated\"), enabling a crucial cut in computation cost. However, this minimum differs with the tokenizer used by each model, so the value varies per model.\n",
49+
"\n",
50+
"### Pipeline Acceleration\n",
51+
"\n",
52+
"Taking advantage of the platform-hosted models' infrastructure, the benchmarking framework proposes a multithreading feature, configurable through the `num_workers` parameter, enabling a significant performance improvement.\n"
53+
]
54+
},
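The hyperparameter and `num_workers` settings described in this cell can be sketched as follows. The hyperparameter values (`temperature=0.7`, `top_p=1`, minimal `max_tokens`) are the ones the notebook states; `classify_relation` is a placeholder standing in for the real per-model LLM call, and `run_benchmark` is a hypothetical name, not the framework's actual function.

```python
from concurrent.futures import ThreadPoolExecutor

# Hyperparameters following Gorur et al. (2024); max_tokens is kept minimal
# so the model emits only one of the expected class labels. The exact
# minimum depends on each model's tokenizer.
GENERATION_CONFIG = {"temperature": 0.7, "top_p": 1, "max_tokens": 5}

def classify_relation(pair):
    # Placeholder for the real platform-hosted LLM request, which would
    # forward GENERATION_CONFIG to the provider's completion endpoint.
    arg_a, arg_b = pair
    return "support"

def run_benchmark(pairs, num_workers=4):
    # Platform-hosted models handle concurrent requests well, so samples
    # are dispatched across a thread pool sized by num_workers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(classify_relation, pairs))
```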
2255
{
2356
"cell_type": "code",
2457
"execution_count": null,
@@ -325,6 +358,17 @@
325358
" plt.show()"
326359
]
327360
},
361+
{
362+
"cell_type": "markdown",
363+
"metadata": {},
364+
"source": [
365+
"## Prompting Techniques\n",
366+
"\n",
367+
"This benchmark builds upon previous research by Gorur et al. (2024) utilizing \"few-shot\" prompting. This technique involves providing X examples of desired behavior before presenting the actual prompt. \n",
368+
"\n",
369+
"However, a significant discrepancy emerged between prior results and our findings. To address this gap, we also explored \"augmented few-shot\" prompting, which incorporates an additional instructional line within the prompt."
370+
]
371+
},
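The two prompting styles compared in this cell can be sketched as a prompt builder. The wording of the examples and of the extra instructional line is illustrative only; the notebook's actual prompt templates may differ.

```python
# Few-shot examples of the desired behavior (illustrative wording)
FEW_SHOT_EXAMPLES = [
    ("Exercise improves health", "People should exercise daily", "support"),
    ("ASEAN has subscribed to the notion of democratic peace",
     "This house would disband ASEAN", "attack"),
]

def build_prompt(arg_a, arg_b, augmented=False):
    lines = []
    if augmented:
        # "Augmented few-shot": one additional instructional line
        # prepended to the standard few-shot prompt.
        lines.append("Answer with exactly one word: support, attack or unrelated.")
    # Standard few-shot: demonstrations of the expected behavior...
    for a, b, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Argument A: {a}\nArgument B: {b}\nRelation: {label}")
    # ...followed by the actual pair to classify.
    lines.append(f"Argument A: {arg_a}\nArgument B: {arg_b}\nRelation:")
    return "\n\n".join(lines)
```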
328372
{
329373
"cell_type": "code",
330374
"execution_count": null,
@@ -508,6 +552,15 @@
508552
"\n",
509553
"plot_ternary_results(augmented_few_shot_ternary_results, title=\"Large Language Models for ternary argumentative relation prediction over the IBM Debater preprocessed dataset sample using augmented few shot prompting\")"
510554
]
555+
},
556+
{
557+
"cell_type": "markdown",
558+
"metadata": {},
559+
"source": [
560+
"## Bibliography\n",
561+
"- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance classification of context dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.\n",
562+
"- Gorur, D., Rago, A. and Toni, F. (2024). Can Large Language Models perform Relation-based Argument Mining? [online] arXiv.org. doi:https://doi.org/10.48550/arXiv.2402.11243."
563+
]
511564
}
512565
],
513566
"metadata": {
