This repository maintains the code and data relevant to the following LREC-COLING 2024 paper:
- Yoshihiko Hayashi, "Reassessing Semantic Knowledge Encoded in Large Language Models through the Word-in-Context Task," LREC-COLING 2024.
- To conduct classification experiments, follow these steps (a command-level walkthrough is given after this list):
- Fetch the original WiC dataset.
- Use repair_tword_index.py to revise the tokenization of contextual sentences in the original data files.
- Use get_descriptions.py to collect descriptions for WiC data instances. Make sure a valid OPENAI_API_KEY environment variable is set.
- Merge the relevant files into "ready-for-experiment" files by invoking merge_wic_data.py.
- Conduct experiments using classify_control.py. Modify this script to suit your experimental requirements.
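A minimal command-level walkthrough of these steps (a sketch: the repair and description steps are shown for the train split only and should be repeated for dev and test; the prompt type and model name are illustrative choices):

    $ export OPENAI_API_KEY=<your key>
    $ python repair_tword_index.py train
    $ python get_descriptions.py --pr_type direct --dataset train --llm gpt-4-0613
    $ python merge_wic_data.py data_tsv
    $ python classify_control.py cl_conf1.py gpt4 > ./logs/cl_conf1_gpt4.log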
-
To conduct a zero-shot run, simply use zero_shot.py after setting your own OPENAI_API_KEY as an environment variable.
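For example (the key value is a placeholder; the model name follows the usage example given later in this README):

    $ export OPENAI_API_KEY=<your key>
    $ python zero_shot.py --llm gpt-4-0613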
-
Tested environment
- Python 3.8.10
- PyTorch 2.0.1+cu11
- transformers 4.31.0
- openai 0.27.7
- ray 2.6.2
- Other standard libraries such as numpy, pandas, and scikit-learn.
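A pip-based setup along the following lines should reproduce this environment (a sketch: the exact PyTorch build string depends on your CUDA installation):

    $ pip install torch==2.0.1 transformers==4.31.0 openai==0.27.7 ray==2.6.2
    $ pip install numpy pandas scikit-learn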
-
Contact: [email protected]
-
WiC_dataset: This directory maintains the original Word-in-Context dataset. Please refer to the original WiC paper for details.
- M.T. Pilehvar and J. Camacho-Collados, "WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations," NAACL 2019.
- https://pilehvar.github.io/wic/
-
WiC_dataset_repaired: This directory contains three files: train_new.data.txt, dev_new.data.txt, and test_new.data.txt. The contextual sentences in these files were slightly modified from the original files stored in the WiC_dataset directory for better alignment with the tokenization process used.
-
descriptions: This directory is intended to maintain the LLM-generated descriptions described in section 4 of the paper. Currently, it contains only README.txt and a descriptions.enc file. The descriptions.enc file can be expanded into a set of LLM-generated description files by decoding it with a password, which will be provided upon individual request. For more details, please refer to the README.txt file in this directory.
-
data_tsv: This directory should maintain the ready-for-experiment files train.tsv, dev.tsv, and test.tsv. These files are created by the Python script merge_wic_data.py. Similar to the descriptions directory, the files are encoded and can be extracted by decoding the descriptions.enc file. For more details, please refer to the README.txt file in this directory.
-
Wic_results_db: This directory is intended for storing experiment result files. Currently, the following files exist:
- cl_config_1.json, cl_config_2.json: These files maintain experimental results with Configuration#1 and Configuration#2, respectively. Please refer to section 5 of the paper for details.
- zs_gpt-3.5-turbo-0613_test_2.json, zs_gpt-4-0613_test_2.json: These files exemplify the experimental results of the zero-shot baselines for GPT-3.5 and GPT-4, respectively. They can be converted to associated TSV files using zs_json2tsv.py, which is also located in this directory.
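zs_json2tsv.py is the authoritative converter; conceptually, assuming each results file holds a list of per-instance JSON records (the file layout here is an assumption), the conversion reduces to a flatten-and-dump along these lines:

    import json
    import pandas as pd

    # Assumption: the results file is a JSON list of per-instance records.
    with open("zs_gpt-4-0613_test_2.json", encoding="utf-8") as f:
        records = json.load(f)
    pd.DataFrame(records).to_csv("zs_gpt-4-0613_test_2.tsv", sep="\t", index=False)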
-
logs: This directory stores log files of the text classification experiments. It currently contains four files for two configurations (#1 and #2) and two LLMs (GPT-3.5 and GPT-4).
-
cl_conf1, cl_conf2: These directories are created for storing model checkpoints generated by the PyTorch Trainer. Due to their large size, they are left empty in this repository.
-
repair_tword_index.py: This script repairs the tokenization of contextual sentences in the original WiC files. A usage example is "$ python repair_tword_index.py train", which will produce a train_new.data.txt file in the WiC_dataset_repaired directory from the original train.data.txt file.
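For reference, each line of a WiC *.data.txt file is tab-separated into five fields: the target word, its PoS, a pair of target-word indices "i-j", and the two context sentences. A minimal sketch of the kind of alignment check that motivates the repair (the actual repair logic lives in the script; the path assumes the standard WiC download layout, and the prefix test is only a crude heuristic):

    # Sketch: check that the recorded indices point at the target word in the
    # whitespace-tokenized contexts (fields: target, PoS, "i-j", ctx1, ctx2).
    with open("WiC_dataset/train/train.data.txt", encoding="utf-8") as f:
        for line in f:
            target, pos, idx, ctx1, ctx2 = line.rstrip("\n").split("\t")
            i1, i2 = (int(i) for i in idx.split("-"))
            for j, (toks, k) in enumerate([(ctx1.split(" "), i1),
                                           (ctx2.split(" "), i2)]):
                # The indexed token should be an (inflected) form of the target.
                if not toks[k].lower().startswith(target[:3].lower()):
                    print(f"possible misalignment in context {j + 1}:",
                          target, toks[k])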
-
get_descriptions.py: This script collects the semantic description for a WiC data instance by consulting an LLM via OpenAI's API. A usage example is "$ python get_descriptions.py --pr_type direct --dataset train --llm gpt-4-0613", which will acquire semantic descriptions for the training dataset by using the direct-type prompt (refer to the paper) sent to GPT-4. Please note that you have to define your own OPENAI_API_KEY environment variable to use this script.
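For reference, with openai 0.27.7 a single description request looks roughly like the following (a sketch only: the actual prompt templates are defined in get_descriptions.py, and the wording here is illustrative):

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]  # must be defined

    # Illustrative prompt; the real direct-type prompt is coded in the script.
    prompt = ('Describe the meaning of the word "bank" in the sentence: '
              '"He sat on the bank of the river."')
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    description = response["choices"][0]["message"]["content"]
    print(description)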
-
merge_wic_data.py: This script creates ready-for-experiment files by merging the original WiC data, the repaired WiC data, and the LLM-generated descriptions. A usage example is "$ python merge_wic_data.py data_tsv", which creates ready-for-experiment files for the train, dev, and test splits and stores them in the ./data_tsv directory.
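Conceptually, the merge aligns, row by row, the repaired WiC instances with their gold labels and the LLM-generated descriptions. A rough sketch under assumed file names and column labels (merge_wic_data.py is authoritative; in particular, the descriptions file name is hypothetical):

    import csv
    import pandas as pd

    cols = ["target", "pos", "indices", "context1", "context2"]  # WiC data fields
    data = pd.read_csv("WiC_dataset_repaired/train_new.data.txt",
                       sep="\t", names=cols, quoting=csv.QUOTE_NONE)
    gold = pd.read_csv("WiC_dataset/train/train.gold.txt",
                       names=["label"], quoting=csv.QUOTE_NONE)
    desc = pd.read_csv("descriptions/train_desc.tsv",            # hypothetical name
                       sep="\t", names=["desc1", "desc2"], quoting=csv.QUOTE_NONE)
    merged = pd.concat([data, gold, desc], axis=1)               # shared row order
    merged.to_csv("data_tsv/train.tsv", sep="\t", index=False)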
-
classify_control.py: This script controls a series of classification experiments by invoking cl_conf1.py or cl_conf2.py. A usage example is "$ python classify_control.py cl_conf1.py gpt4 > ./logs/cl_conf1_gpt4.log", which executes a full combination of conditions (coded in the script) using gpt-4 and stores the standard output content to the designated log file.
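The control pattern is simply an exhaustive loop over the coded conditions, launching one classification run per combination; a minimal sketch of the idea (the condition names and the child script's argument interface are hypothetical, see the actual script):

    import itertools
    import subprocess
    import sys

    conf_script, llm = sys.argv[1], sys.argv[2]   # e.g., cl_conf1.py gpt4
    pr_types = ["direct", "indirect"]             # hypothetical condition values
    seeds = [0, 1, 2]
    for pr_type, seed in itertools.product(pr_types, seeds):
        subprocess.run(["python", conf_script, "--llm", llm,
                        "--pr_type", pr_type, "--seed", str(seed)], check=True)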
-
cl_conf1.py, cl_conf2.py: These files, tied to configurations #1 and #2, are set up for running experiments with specified conditions passed as arguments. Check the main function in the source code for specifics.
-
zero_shot.py: This script is prepared for obtaining zero-shot baseline results. A usage example is "$ python zero_shot.py --llm gpt-4-0613", which executes a zero-shot run on the test dataset using GPT-4. The prompt template is coded in this file, and the adjective that dictates the required degree of semantic sameness (default: "identical") can be altered via the "adj" argument.
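For illustration, the prompt the script assembles is along these lines (a hypothetical reconstruction; the exact template is coded in zero_shot.py):

    def build_prompt(target, ctx1, ctx2, adj="identical"):
        # Hypothetical reconstruction of the zero-shot WiC prompt.
        return (f'Sentence 1: {ctx1}\n'
                f'Sentence 2: {ctx2}\n'
                f'Question: Does the word "{target}" have {adj} meanings '
                f'in the two sentences above? Answer with T or F.')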
-
Other files: anal_desc.py and anaysis_tool.py can be used to replicate the additional results reported in the paper.
We would like to make some revisions to Table 7 as follows. These revisions are prompted by minor mistakes that occurred in the reported experiments; we believe the changes do not affect the insights described in the paper.
Specifically, the development split of the data had not been properly processed, which affected the training process in some cases, as the development data was used to decide when to stop training.
The code and data in this repository are provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.