[doc] debug an evaluation run
huyiwen committed Jun 15, 2024
1 parent 004ae3e commit 0ff9488
Showing 7 changed files with 252 additions and 18 deletions.
19 changes: 10 additions & 9 deletions docs/README.md
@@ -7,24 +7,25 @@ Tutorial: [Training](https://github.com/RUCAIBox/LLMBox/tree/main/training)
## Utilization

CLI Usage: [Utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization)
Reproduction: [test.sh](https://github.com/RUCAIBox/LLMBox/tree/main/test.sh)
Reproduction: [test.sh](https://github.com/RUCAIBox/LLMBox/blob/main/test.sh)

### Datasets

- [Supported datasets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md)
- [How to load datasets with subsets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-with-subsets.md)
- [How to load datasets from HuggingFace](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-from-huggingface.md)
- [How to customize dataset](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md)
- [Supported datasets](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/supported-datasets.md)
- [How to load datasets with subsets](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/how-to-load-datasets-with-subsets.md)
- [How to load datasets from HuggingFace](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/how-to-load-datasets-from-huggingface.md)
- [How to customize dataset](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/how-to-customize-dataset.md)

### Models

- [How to customize model](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-model.md)
- [How to customize model](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/how-to-customize-model.md)

## Examples

- [Customize dataset](https://github.com/RUCAIBox/LLMBox/tree/main/docs/examples/customize_dataset.py)
- [Customize HuggingFace model](https://github.com/RUCAIBox/LLMBox/tree/main/docs/examples/customize_huggingface_model.py)
- [Customize dataset](https://github.com/RUCAIBox/LLMBox/blob/main/docs/examples/customize_dataset.py)
- [Customize HuggingFace model](https://github.com/RUCAIBox/LLMBox/blob/main/docs/examples/customize_huggingface_model.py)

## Trouble Shooting

- [vLLM no module name packaging](https://github.com/RUCAIBox/LLMBox/tree/main/docs/trouble_shooting/vllm_no_module_name_packaging.md)
- [Debug an evaluation run](https://github.com/RUCAIBox/LLMBox/blob/main/docs/trouble_shooting/debug_evaluation_run.md)
- [vLLM no module name packaging](https://github.com/RUCAIBox/LLMBox/blob/main/docs/trouble_shooting/vllm_no_module_name_packaging.md)
232 changes: 232 additions & 0 deletions docs/trouble_shooting/debug_evaluation_run.md
@@ -0,0 +1,232 @@
# [Trouble Shooting] How to Debug an Evaluation Run

This tutorial focuses on debugging an evaluation run that does not reproduce the expected results. Such failures can be caused by the model not generating any predictions, the dataset not being formatted correctly, or the metrics not being calculated correctly. The sections below guide you through locating the problem and fixing it.

## Locating the Problem

Every run of the model should produce an evaluation results file, which contains the input data and the model's predictions. You can find this file in the `evaluation_results` folder.

If the process ended normally, the file should be a valid JSON file with metrics:

```json
[
{
"index":0,
"source":"<|start_header_id|>user<|end_header_id|>\n\nAnswer the following question.\n\nQuestion: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nAnswer: <|eot_id|> <|start_header_id|>assistant<|end_header_id|>",
"raw_prediction":[
"\n\nLet's break this down step by step!\n\nJanet's ducks lay 16 eggs per day. She eats 3 for breakfast, so that leaves:\n\n16 - 3 = 13 eggs\n\nShe bakes muffins with 4 eggs, so that leaves:\n\n13 - 4 = 9 eggs\n\nShe sells the remaining 9 eggs at the farmers' market. Each egg sells for $2, so she makes:\n\n9 eggs x $2 per egg = $18\n\nJanet makes $18 every day at the farmers' market."
],
"processed_prediction":[
"18"
],
"reference":"18",
"metric":{
"Accuracy":true
},
"subset":null
},
...
]
```

Alternatively, if the process ended prematurely, the file will be a valid JSON Lines file, with one record per line:

```json
{"index": 0, "source": ["(\"<|start_header_id|>user<|end_header_id|>\\n\\nAnswer the following question.\\n\\nQuestion: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\\nAnswer: <|eot_id|> <|start_header_id|>assistant<|end_header_id|>\",)"], "raw_prediction": "\n\nLet's break this down step by step!\n\nJanet's ducks lay 16 eggs per day.\nShe eats 3 eggs for breakfast, so that leaves 16 - 3 = 13 eggs.\nShe bakes muffins with 4 eggs, so that leaves 13 - 4 = 9 eggs.\nShe sells the remaining 9 eggs at the farmers' market.\n\nEach egg sells for $2, so she makes:\n9 eggs x $2 per egg = $18\n\nJanet makes $18 every day at the farmers' market.", "reference": "18"}
...
```

You can look into the evaluation results file to see whether the model is generating normally (a small inspection script is sketched after the checklist below).

1. If the `raw_prediction` field is empty, the model is not generating any predictions. This might be because the model encountered a stop sequence in its output. You can check the `stop` field in the generation arguments and `default_stops` in the chat_templates configuration.

2. If the `raw_prediction` field seems to be normal, you can check the `processed_prediction` field to see if the answer is being extracted correctly in the `post_processing` step.

3. If the `raw_prediction` field keeps generating text past the end of the expected answer, the stop sequence may not be correctly configured. You can check the `stop` field in the generation arguments and the chat_templates configuration.

4. If the `reference` field is not formatted as expected, the dataset may not be formatted correctly. You can check that the `references` property of the dataset class returns correctly formatted values.

5. If everything seems normal, you can check the `metric` field to see whether the metrics are being calculated correctly, especially if the metric is complex.
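
If you prefer to inspect the file programmatically, here is a minimal sketch that prints the instances whose processed prediction does not match the reference. It assumes the JSON layout shown above; the file name is only an example, so adjust the path to your own run.

```python
import json
from pathlib import Path

# Path to the evaluation results file produced by your run (hypothetical name).
results_file = Path("evaluation_results/Meta-Llama3-8B-Instruct-gsm8k.json")

text = results_file.read_text()

try:
    # A normally finished run writes a single JSON array with metrics.
    instances = json.loads(text)
except json.JSONDecodeError:
    # A prematurely ended run leaves a JSON Lines file: one object per line.
    instances = [json.loads(line) for line in text.splitlines() if line.strip()]

for inst in instances:
    # Only completed runs carry the "processed_prediction" and "metric" fields.
    processed = inst.get("processed_prediction")
    reference = inst.get("reference")
    if processed is not None and reference not in processed:
        print(inst["index"], processed, "!=", reference)
```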

## Fixing the Problem

If you have located the problem, you can try to fix it by following the steps below.

### Checking the `stop` Generation Argument

The `stop` argument is a list of strings; the model stops generating as soon as it encounters any of them. You can check the `stop` field in the log to see whether the model is correctly configured.

**HuggingFace Models:**

```text
2024-06-15 19:30:19 INFO batch_sampler.py:38 Evaluating generation on mt_bench (model_attr={'model_type': 'chat', 'model_backend': 'huggingface', 'model_max_input': 8192, 'model_max_input_and_output': 8192, 'multi_turn': True}, generation_kwargs={'max_new_tokens': 1024, 'stopping_criteria': [KeyWordsCriteria(stop_sequences=[[128009]])], 'pad_token_id': 128000, 'eos_token_id': 128001}, num_shots=0, len=1, num_instances=1, use_cache=False)
```

We convert each `stop` string to a list of token ids in the `stopping_criteria` field. In the above example, the stop sequence is `[128009]`, which corresponds to the `<|eot_id|>` token.
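
You can reproduce this conversion yourself to verify that your stop string maps to the token id you expect. A minimal sketch with HuggingFace `transformers` (the model name is only an example and may require access to the gated repository):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Encode the stop string without adding BOS/EOS so we see only its own token ids.
stop_ids = tokenizer.encode("<|eot_id|>", add_special_tokens=False)
print(stop_ids)  # expected: [128009] for the LLaMA-3 tokenizer
```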

**vLLM Models:**

```text
2024-06-15 20:10:33 INFO batch_sampler.py:38 Evaluating generation on mt_bench (model_attr={'model_type': 'chat', 'model_backend': 'vllm'}, generation_kwargs=SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), num_shots=0, len=80, num_instances=80, use_cache=False)
```

LLaMA-3's default stop sequences are `['<|eot_id|>']`.


**API Models:**

The following model does not use `stop`:

```text
2024-06-15 20:35:50 INFO batch_sampler.py:38 Evaluating generation on mt_bench (model_attr={'model_type': 'chat', 'model_backend': 'openai', 'multi_turn': True}, generation_kwargs={'max_tokens': 4096, 'seed': 2023}, num_shots=0, len=1, num_instances=1, use_cache=False)
```

While the following one uses `stop`:

```text
2024-06-15 20:39:37 INFO batch_sampler.py:38 Evaluating generation on drop (model_attr={'model_type': 'chat', 'model_backend': 'openai'}, generation_kwargs={'max_tokens': 64, 'seed': 2023, 'stop': ['\n'], 'temperature': 0}, num_shots=0, len=1, num_instances=1, use_cache=False)
```

**`stop` might be set in the following places:**

1. In the `_init_arguments` method or a class variable of the dataset class (see the sketch after this list)

2. In the command line arguments `stop`

3. In the chat template `default_stop`

4. In the `transform` validation of generation arguments (Anthropic models do not support a whitespace stop)
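
For the first case, here is a hedged sketch of what declaring generation arguments on a dataset class might look like. The attribute name `extra_model_args` follows the `_init_arguments` code in this repository, and the values mirror the drop log above; treat it as an illustration rather than the definitive API.

```python
class Drop(GenerationDataset):

    # Illustrative only: generation arguments declared on the dataset class.
    # The drop log above shows max_tokens=64, temperature=0 and stop=['\n'].
    extra_model_args = dict(max_tokens=64, temperature=0, stop=["\n"])

    ...
```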

### Checking the Chat Template Configuration

If you are using an instruct-tuned model, you need a chat template to correctly prompt the model. Different models may require different chat templates.

Currently, we support 7 chat templates: `base` (default), `llama3`, `chatml`, `llama2`, `zephyr`, `phi3`, and `alpaca`. This offers fine-grained control over the chat format.

```python
"llama3": {
    "system_start": "<|start_header_id|>system<|end_header_id|>\n\n",
    "system_end": "<|eot_id|>",
    "user_start": "<|start_header_id|>user<|end_header_id|>\n\n",
    "user_end": "<|eot_id|>",
    "assistant_start": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "assistant_end": "<|eot_id|>",
    "auto_leading_space": True,
    "default_stops": ["<|eot_id|>"],
}
```
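
As a rough illustration of how these fields are used, the sketch below wraps a single user turn and opens the assistant segment. This is simplified: the dict holds only a subset of the `llama3` config above, and the real logic, including `auto_leading_space` and system messages, lives in `ConversationFormatter`.

```python
# Hypothetical dict holding part of the "llama3" config shown above.
llama3_config = {
    "user_start": "<|start_header_id|>user<|end_header_id|>\n\n",
    "user_end": "<|eot_id|>",
    "assistant_start": "<|start_header_id|>assistant<|end_header_id|>\n\n",
}

def format_single_turn(config: dict, user_message: str) -> str:
    # Wrap the user message and open the assistant segment so the model
    # generates the assistant's reply next.
    return config["user_start"] + user_message + config["user_end"] + config["assistant_start"]

print(format_single_turn(llama3_config, "Answer the following question. ..."))
```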

When loading a chat-based model, i.e. setting `--model_type chat`, we try to match the model with the chat template by the model's name. For example, the `Meta-Llama3-8B-Instruct` model will be matched with the `llama3` chat template.

You can check that the chat template is correctly loaded in the log:

```text
2024-06-15 20:39:37 INFO Automatically set chat_template to llama3.
```

If the chat template is not correctly loaded, you can manually set the chat template by adding the `--chat_template` argument to the command line.

```bash
python inference.py -m internlm/internlm2-chat-7b -d gsm8k --chat_template chatml
```

If the chat format is not supported by LLMBox, you can create a new chat template by extending the [`chat_templates.py`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/chat_templates.py) file.

Alternatively, you can pass in a Jinja2 template string, which is also compatible with the chat template format used by HuggingFace tokenizers.
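
For instance, a minimal Jinja2 template string in the llama3 style might look like the sketch below (illustrative only, written in the same format that HuggingFace tokenizers use for chat templates):

```python
# Hypothetical Jinja2 chat template string, in the style of HuggingFace chat templates.
chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "<|start_header_id|>user<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>"
    "{% elif message['role'] == 'assistant' %}"
    "<|start_header_id|>assistant<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>"
    "{% endif %}"
    "{% endfor %}"
)
```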

### Checking the `default_stops` in the Chat Template

In rare cases, you may want to modify the `default_stops` field in the chat template configuration.

If the `default_stops` field prevents the model from generating output, you can try overriding the `default_stops` argument with an empty string:

```bash
python inference.py -m Meta-Llama3-8B-Instruct -d gsm8k --default_stops ""
```

If you need to extend the `default_stops` field in the chat template configuration, pass multiple stop sequences:

```bash
python inference.py -m Meta-Llama3-8B-Instruct -d gsm8k --default_stops "<|eot_id|>" "<|start_header_id|>"
```

### Checking the `post_processing` Step

The `post_processing` step is used to extract the answer from the model's output. If this step is not correctly configured, the answer will not be extracted correctly.

You can first locate the dataset class in the [`utilization/dataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset) folder and check the `post_processing` method.

```python
import re

class Drop(GenerationDataset):

    ...

    @staticmethod
    def post_processing(predictions):
        new_predictions = []
        # Truncate each prediction at the first period, exclamation mark,
        # parenthesis, or newline, keeping only the leading answer span.
        pattern = r"[.!(\n)]"
        for pred in predictions:
            match = re.search(pattern, pred)
            if match:
                index = match.start()
                pred = pred[:index]
            new_predictions.append(pred)
        return new_predictions
```
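
If this extraction logic does not fit your dataset, you can adjust it. For example, here is a hedged sketch of a `post_processing` method that keeps only the last number in each output, which suits chain-of-thought answers like the GSM8K example above (the class and method body are illustrative, not code that ships with LLMBox):

```python
import re

class MyGsm8k(GenerationDataset):  # hypothetical dataset class for illustration

    @staticmethod
    def post_processing(predictions):
        # Keep only the last number in each raw prediction,
        # e.g. "... makes $18 every day ..." -> "18".
        new_predictions = []
        for pred in predictions:
            numbers = re.findall(r"-?\d+(?:\.\d+)?", pred.replace(",", ""))
            new_predictions.append(numbers[-1] if numbers else pred.strip())
        return new_predictions
```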

### Checking the `references` Property

The `references` property of the dataset class provides the reference answers that the model output is checked against. If it is not formatted correctly, the metrics cannot be calculated correctly.


```python
class Drop(GenerationDataset):

    ...

    @cached_property
    def references(self):
        # One list of acceptable answer spans per evaluation instance.
        return [instance["answers_spans"]["spans"] for instance in self.evaluation_data]
```
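
For DROP, each entry is the list of acceptable answer spans for one instance, so the metric can compare the prediction against every span. A hypothetical excerpt of the returned value (the spans are made up for illustration):

```python
# Hypothetical excerpt of what `references` should return for DROP:
# one list of acceptable answer spans per evaluation instance.
[
    ["18"],
    ["Toronto", "Toronto, Ontario"],
]
```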

### Checking the Metric Calculation

```python
class Drop(GenerationDataset):

    metrics = [F1(force_number_match=True, word_tokenize="regex", align_bag="counter"), Em()]

    ...
```

If you find that the `processed_prediction` matches the `reference` field but the metric is still not calculated correctly, you can check the metric's implementation and the arguments it is constructed with in the dataset class.


```python
class F1(Metric):

    def __init__(
        self,
        *,
        dataset: Literal["independent"] = "independent",
        multiref_strategy: Literal["max", "leave_one_out"] = "max",
        word_tokenize: Literal["nltk", "split", "regex"] = "nltk",
        normalize_level: Literal["token", "text", "both"] = "both",
        align_bag: Literal["counter", "set"] = "counter",
        force_number_match=False,
    ):
        self.dataset = dataset
        self.word_tokenize = _TOKENIZER_DICT[word_tokenize]
        self.normalize_level = normalize_level
        self.multiref_strategy = multiref_strategy
        self.align_bag = align_bag
        self.force_number_match = force_number_match
        ...
```
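
To sanity-check a single instance by hand, you can compute a rough bag-of-words F1 yourself and compare it with the reported score. This is a simplified sketch (plain whitespace tokenization, no normalization or multi-reference handling, unlike the full `F1` metric above):

```python
from collections import Counter

def simple_f1(prediction: str, reference: str) -> float:
    # Token-level F1: overlap between the prediction and reference bags of words.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(simple_f1("18", "18"))                        # 1.0
print(simple_f1("about 18 dollars", "18 dollars"))  # 0.8
```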

## In Closing

If you still have any problems replicating an evaluation run, please feel free to reach out to us by [creating an issue](https://github.com/RUCAIBox/LLMBox/issues).

You can attach the log file and evaluation results file to the issue, and we will help you locate the problem.
6 changes: 3 additions & 3 deletions tests/utilization/model/test_apply_prompt_template.py
@@ -39,7 +39,7 @@ def test_no_smart_space(conversation: Conversation):
"assistant_start": "",
"assistant_end": "",
"auto_leading_space": False,
"default_stops": [],
"default_stop": [],
}
formatter = ConversationFormatter(prompt_config, DEFAULT_CHAT_TEMPLATE)
conversation.set_formatter(formatter)
@@ -58,7 +58,7 @@ def test_smart_space(conversation: Conversation):
"assistant_start": "",
"assistant_end": "",
"auto_leading_space": True,
"default_stops": [],
"default_stop": [],
}
formatter = ConversationFormatter(prompt_config, DEFAULT_CHAT_TEMPLATE)
conversation[2]["content"] = " This is an assistant message." # extra leading space
@@ -80,7 +80,7 @@ def test_final_strip(conversation: Conversation):
"auto_leading_space": True,
"final_lstrip": False,
"final_rstrip": False,
"default_stops": [],
"default_stop": [],
}
formatter = ConversationFormatter(prompt_config, DEFAULT_CHAT_TEMPLATE)
conversation.set_formatter(formatter)
2 changes: 1 addition & 1 deletion utilization/chat_templates.py
@@ -74,7 +74,7 @@ def smart_space(parts: List[str], auto_leading_space: bool, remove_space_between
# - assistant_end: The string to append to the assistant message.
# - auto_leading_space: Whether to add a leading space when concatenating two
# strings if the first string does not end with a whitespace.
# - default_stops: A list of strings that indicate the end of a message.
# - default_stop: A list of strings that indicate the end of a message.
#
DEFAULT_CHAT_CONFIGS: Dict[str, Union[Dict[str, Any], str]] = {
"base": {
6 changes: 3 additions & 3 deletions utilization/dataset/dataset.py
@@ -353,11 +353,11 @@ def _init_arguments(self):
self._extra_model_args = deepcopy(self.extra_model_args)

# apply chat template
if self.conversation_formatter.default_stops:
if self.conversation_formatter.default_stop:
if "stop" not in self._extra_model_args:
self._extra_model_args["stop"] = []
self._extra_model_args["stop"].extend(self.conversation_formatter.default_stops)
logger.debug(f"Chat template stops: {self.conversation_formatter.default_stops}")
self._extra_model_args["stop"].extend(self.conversation_formatter.default_stop)
logger.debug(f"Chat template stops: {self.conversation_formatter.default_stop}")

# temperature
if self.sample_num > 1 and self._extra_model_args.get("temperature", 0) == 0:
3 changes: 2 additions & 1 deletion utilization/metric/em_f1.py
@@ -105,10 +105,11 @@ class F1(Metric):
Args:
`multiref_strategy`: Strategy to aggregate F1 scores for multiple references.
`force_number_match`: If reference contains numbers, prediction must matches all the numbers in the reference.
`word_tokenize`: Tokenizer functions for different tokenization methods. Default: nltk.word_tokenize.
DROP: https://github.com/EleutherAI/lm-evaluation-harness/blob/3196e907fa195b684470a913c7235ed7f08a4383/lm_eval/tasks/drop/utils.py#L193
SQuAD: https://github.com/huggingface/datasets/blob/f96e74d5c633cd5435dd526adb4a74631eb05c43/metrics/squad_v2/evaluate.py#L80
`normalize_level`: Where to normalize the text. Default: both.
`align_bag`: How to align the bag of words. Default: counter.
Return:
"F1": float
2 changes: 1 addition & 1 deletion utilization/model/model_utils/conversation.py
@@ -49,7 +49,7 @@ def __init__(
chat_config: Dict[str, str],
chat_template: str,
):
self.default_stops = chat_config.pop("default_stops", [])
self.default_stop = chat_config.pop("default_stop", [])
self.auto_leading_space = chat_config.pop("auto_leading_space", True)
self.final_lstrip = chat_config.pop("final_lstrip", True)
self.final_rstrip = chat_config.pop("final_rstrip", True)
