### APPS
[APPS](https://huggingface.co/datasets/codeparrot/apps) is a challenging benchmark for code generation with 10,000 Python problems,
5,000 for training and 5,000 for evaluation. It has three difficulty levels: introductory, interview, and competition.
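
The splits and difficulty labels can be inspected directly from the Hub; a minimal sketch, assuming the `datasets` library and the config and field names from the dataset card:

```python
# Minimal sketch: load APPS from the Hugging Face Hub and inspect the splits.
# The config name ("all") and the "difficulty" field follow the dataset card
# and are assumptions here; adjust if the dataset layout changes.
from datasets import load_dataset

apps = load_dataset("codeparrot/apps", "all")
print(apps)                           # DatasetDict with train/test splits, 5,000 problems each
print(apps["test"][0]["difficulty"])  # "introductory", "interview" or "competition"
```
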
Most papers fine-tune models on the training split before evaluation, since the problems are often challenging and the problem descriptions are long.
However, Chen et al. evaluated Codex-12B in a one-shot setting. Although the details of the prompt format aren't given, we propose two evaluation modes: with fine-tuning and in a one-shot setting:
* Prompts & generation
* Evaluation: we have two types of evaluation for this benchmark:
    * the original Hendrycks et al. evaluation, where we do single generations (`n_samples=1`) and compute the average accuracy, i.e. the fraction of tests that pass for each problem, averaged over all problems, and the strict accuracy, where a problem counts as solved only if all of its tests pass and we average over all problems (a sketch of both metrics follows after this list). These metrics are fast to compute, since we do single generations, and they capture incremental improvements, especially for small models. However, strict accuracy is often very low, and average accuracy may not be very representative since the number of tests is not consistent across problems. Recent papers evaluate this benchmark using pass@k.
    * we compute pass@1, pass@10 and pass@100, generating 200 candidate solutions per problem (`n_samples=200`). Note that this takes a lot of time, since there are 5,000 evaluation samples and there are no Python stop words to end the generation, so small models that struggle to answer may keep generating until `max_length` or the EOS token (a sketch of the pass@k estimator follows below).
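
As a reference for the first evaluation type, here is a minimal sketch of the two Hendrycks-style metrics; the per-problem lists of test outcomes and the helper names are illustrative, not the harness API:

```python
# Minimal sketch of the Hendrycks et al. metrics for single generations.
# `test_results[i]` holds one boolean per unit test of problem i, indicating
# whether the single generated solution passed that test (illustrative names).
from typing import List


def average_accuracy(test_results: List[List[bool]]) -> float:
    """Fraction of unit tests passed per problem, averaged over problems."""
    return sum(sum(r) / len(r) for r in test_results) / len(test_results)


def strict_accuracy(test_results: List[List[bool]]) -> float:
    """Fraction of problems whose generation passes every unit test."""
    return sum(all(r) for r in test_results) / len(test_results)


# Example with 3 problems and one generation each:
results = [[True, True, False], [True, True, True], [False, False]]
print(average_accuracy(results))  # ~0.56
print(strict_accuracy(results))   # ~0.33
```
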
In the case of single generations (`n_samples=1`), the first metric is used; when multiple generations are made, the pass@k metric is used.
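
For the pass@k numbers, the unbiased estimator from Chen et al. can be computed from the number of generations per problem `n` and the number of generations that pass all tests `c`; a sketch (the function name is illustrative):

```python
# Unbiased pass@k estimator from Chen et al.: the probability that at least one
# of k solutions sampled (without replacement) from n generations is correct,
# given that c of the n generations pass all unit tests.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 200 generations for a problem, 15 of which pass all tests.
print(pass_at_k(200, 15, 1))    # 0.075
print(pass_at_k(200, 15, 10))   # ~0.55
print(pass_at_k(200, 15, 100))  # ~1.0
```
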