[ENH] Length-controlled AlpacaEval #248

145 changes: 99 additions & 46 deletions README.md
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
[![discord](https://img.shields.io/badge/discord-server-blue?logo=discord&logoColor=white)](https://discord.gg/GJMxJSVZZM)


**AlpacaEval 2.0 with length-controlled win-rates** has a Spearman correlation of **0.98** with [ChatBot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) while costing less than **$10** in OpenAI credits to run and taking less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:

![chat_correlations.png](notebooks/chat_correlations.png)

---

Updates:

:tada: **Length-controlled Win Rates** are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details [here](#length-controlled-win-rates).

:tada: **AlpacaEval 2.0** is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 turbo as baseline. More details [here](#alpacaeval-20). For the old version, set your environment variable `IS_ALPACA_EVAL_2=False`.

---

<details open>
<summary><b>Table of Contents</b></summary>

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Leaderboards and how to interpret them](#leaderboards-and-how-to-interpret-them)
   - [Models](#models)
   - [Evaluators](#evaluators)
4. [Use-cases](#use-cases)
   - [Evaluating a model](#evaluating-a-model)
   - [Making a new leaderboard](#making-a-new-leaderboard)
   - [Making a new evaluator](#making-a-new-evaluator)
5. [Contributing](#contributing)
   - [Contributing a model](#contributing-a-model)
   - [Contributing an evaluator](#contributing-an-evaluator)
   - [Contributing an eval set](#contributing-an-eval-set)
   - [Contributing a completion function](#contributing-a-completion-function)
6. [Limitations](#limitations)
7. [Analysis](#additional-analysis-and-plots)
   - [Analyzing an evaluator](#analyzing-an-evaluator)
   - [Analyzing an eval set](#analyzing-an-eval-set)
8. [Citation](#citation)
9. [Additional information](#additional-information)
   - [Length-controlled win rates](#length-controlled-win-rates)
   - [AlpacaEval 2.0](#alpacaeval-20)
   - [Data Release](#data-release)
   - [Differences with AlpacaFarm](#differences-with-alpacafarm)
   - [Related work](#related-work)
   - [Interpreting annotations](#interpreting-annotations)
   - [Major updates](#major-updates)

</details>

# Overview


Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is
time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap,
replicable, and validated against 20K human annotations.
AlpacaEval provides the following:
- [**AlpacaEval dataset**](https://huggingface.co/datasets/tatsu-lab/alpaca_eval/blob/main/alpaca_eval.json): a simplification
of [AlpacaFarm's](https://github.com/tatsu-lab/alpaca_farm/tree/main) evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. [Details here](#data-release).




<details>
<summary><b>When to use and not use AlpacaEval?</b></summary>

Details in [limitations](#limitations).

</details>


# Quick Start


</details>



<details>
<summary><h2 tabindex="-1" dir="auto">Making a new leaderboard</h2></summary>

<details>
<summary><code>>>> alpaca_eval make_leaderboard -- --help</code></summary>
where (a hypothetical invocation is sketched after this list):
- `reference_outputs`: the outputs of the reference (baseline) model. By default, the reference outputs are the 003 outputs on the AlpacaEval set.
- `annotators_config`: The path to the annotator's config file. Defaults to `alpaca_eval_gpt4`.
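
For concreteness, here is a hypothetical sketch of a full invocation. The file names are placeholders, and the Python call assumes that the `alpaca_eval make_leaderboard` subcommand corresponds to a function of the same name importable from the package (the CLI and the Python API may differ); `alpaca_eval make_leaderboard -- --help` above is the authoritative reference.

```python
# Hypothetical sketch: file paths are placeholders, and the import path / keyword
# names (other than `reference_outputs` and `annotators_config`, documented above)
# are assumptions about how the CLI maps onto the Python API.
from alpaca_eval import main

main.make_leaderboard(
    leaderboard_path="my_leaderboard.csv",       # where the resulting leaderboard is saved (assumed name)
    all_model_outputs="models_outputs.json",     # outputs of the models to rank (assumed name)
    reference_outputs="reference_outputs.json",  # defaults to the 003 outputs on the AlpacaEval set
    annotators_config="alpaca_eval_gpt4",        # the default annotator config
)
```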

</details>

<details>
<summary><h2 tabindex="-1" dir="auto">Making a new evaluator</h2></summary>

<details>
<summary><code>>>> alpaca_eval analyze_evaluators -- --help</code></summary>
If you want a cheaper evaluation you can use a single seed using `--is_single_annotator True` which will skip the
estimation of bias and variance.
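
As a rough illustration, a single-annotator run might look like the hypothetical sketch below; the import path and the `annotators_config` keyword are assumptions, and only `--is_single_annotator` is documented above.

```python
# Hypothetical sketch: assumes `alpaca_eval analyze_evaluators` maps onto a Python
# function of the same name; only `is_single_annotator` is documented above.
from alpaca_eval import main

main.analyze_evaluators(
    annotators_config="alpaca_eval_gpt4",  # evaluator to analyze (assumed keyword name)
    is_single_annotator=True,              # single seed: cheaper, skips the bias/variance estimate
)
```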

</details>

# Contributing

We are accepting PRs for new models, evaluators, and eval sets, in addition to bug fixes.

To get started, please first fork the repo, and install the package from source `pip install -e .`

## Contributing a model

First, you'll need to add a model config definition in the [models_configs](src/alpaca_eval/models_configs/) folder. As
an example, you can look at one of the existing configs in that folder.
A verified result in AlpacaEval means that a core maintainer has decoded the model's outputs and run the evaluation themselves.
Note that we will not re-evaluate the same model. Due to sampling variance, the results might slightly differ from your initial ones. We will replace your previous community results with the verified ones.



</details>

<details>
Those can broadly be clustered into 3 categories:
gap between the open models and OpenAI models than other leaderboards (
e.g. [lmsys](https://lmsys.org/blog/2023-03-30-vicuna/)).

2. **Biases of automatic annotators**: the raw automatic annotators seem to have implicit biases. In particular, we found
that they tend to prefer longer outputs and outputs that contain lists (e.g. 0.68 / 0.69 for `alpaca_eval_gpt4`
and 0.62 / 0.58 for `claude`).
Although we found that humans have similar biases (0.64 / 0.61), we believe that this could be more of a limitation
of the output than its content (e.g. factuality).
Finally, we found that automatic evaluators tend to prefer outputs from models that are similar to themselves (likely trained on
the same data), as suggested by the big difference between ChatGPT/GPT4 on `claude`'s and `alpaca_eval_gpt4`'s
leaderboard. Note that the length bias is partially mitigated in our length-controlled win-rates.
3. **Lack of safety evaluation**: importantly, AlpacaEval only evaluates the instruction-following capabilities of
models rather than the harm that they could cause (e.g. toxic behavior or bias). As a result, the small gap between
current ChatGPT and the best open source models **should not** be interpreted as meaning that the latter are ready to be deployed.

# Citation

Please consider citing the following depending on what you are using and referring to:
- **Code, results, and general benchmark**: `alpaca_eval` (this repo). Specify whether you are using AlpacaEval or AlpacaEval 2.0. For length-controlled win-rates see below.
- **Length-controlled (LC) win rates**: `alpaca_eval_length`.
- **Human annotations**: `dubois2023alpacafarm` ([AlpacaFarm](https://arxiv.org/abs/2305.14387))
- **AlpacaEval evaluation set**: `alpaca_eval` and [self-instruct](https://github.com/yizhongw/self-instruct),
[open-assistant](https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/OpenAssistant--oasst1/validation), [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [koala](https://github.com/arnav-gudibande/koala-test-set), [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/Anthropic--hh-rlhf/test).

Here are the bibtex entries:

```
@misc{alpaca_eval,
}
```

```
@misc{alpaca_eval_length,
  author = {Yann Dubois and Tatsunori B. Hashimoto},
title = {Length-Corrected AlpacaEval: A Simple Debiasing of Automatic Evaluators},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tatsu-lab/alpaca_eval}}
}
```

```
@misc{dubois2023alpacafarm,
}
```


# Additional information

<details>
<summary><h2 tabindex="-1" dir="auto">Length-Controlled Win Rates</h2></summary>

Length controlled (LC) win-rates are a debiased version of the win-rates that control for the length of the outputs.

The main idea is that, for each model, we fit a logistic regression to predict the preference of the auto-annotator given: (1) the instruction, (2) the model, and (3) the difference in length between the baseline and the model output.
Given such a logistic regression, we can then predict the counterfactual "what would the preference be if the model's output had the same length as the baseline's?" by setting the length difference to 0.
By averaging this length-controlled preference over the evaluation set, we obtain the length-controlled win-rate.
The exact form of the logistic regression is chosen so that LC win rates can be interpreted like raw win rates: for example, for any models `m1` and `m2` we have `win_rate(m1, m2) = 100 - win_rate(m2, m1)`, which lies in `[0, 100]`, and `win_rate(m1, m1) = 50`.
Length-controlled win-rates increase the correlation between AlpacaEval's leaderboard and Chat Arena from **0.93 to 0.98** Spearman correlation, while significantly decreasing the length gameability of the annotator.
For more information and results about length controlled win-rates see [this notebook](https://github.com/tatsu-lab/alpaca_eval/blob/main/notebooks/length_correction.ipynb).

This idea of estimating the controlled direct effect, by predicting the outcome while conditioning on the mediator (the length difference), is common in statistical inference.
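
To make the procedure concrete, here is a minimal, self-contained sketch on synthetic data. It is not the repository's implementation (the actual regression also encodes the instruction and model identity and uses the specific functional form described in the notebook above); every variable name and coefficient below is an illustrative assumption.

```python
# A minimal sketch of the length-controlled idea on synthetic data.
# NOT the repository's implementation; the feature set, functional form,
# and all numbers below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000

# Toy annotations of one model against the baseline:
# y = 1 if the auto-annotator preferred the model's output over the baseline's.
length_diff = rng.normal(0, 300, size=n)      # len(model output) - len(baseline output)
instruction_term = rng.normal(0, 1, size=n)   # stand-in for a per-instruction difficulty term
true_advantage = 0.4                          # the model's "real" edge, unknown in practice
logits = true_advantage + 0.002 * length_diff + 0.3 * instruction_term
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# (1) Fit a logistic regression predicting the annotator's preference.
X = np.column_stack([instruction_term, length_diff])
model = LogisticRegression().fit(X, y)

# (2) Counterfactual: what would the preference be if the lengths matched?
X_same_length = X.copy()
X_same_length[:, 1] = 0.0                     # set the length difference to 0
lc_preference = model.predict_proba(X_same_length)[:, 1]

# (3) Average the counterfactual preferences to get the LC win rate.
print(f"raw win rate: {100 * y.mean():.1f}%")
print(f"length-controlled win rate: {100 * lc_preference.mean():.1f}%")
```

Zeroing out the length-difference feature before predicting is what turns the fitted preference model into the "equal lengths" counterfactual that gets averaged into the LC win rate.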

</details>


<details>
<summary><h2 tabindex="-1" dir="auto">AlpacaEval 2.0</h2></summary>

<details>
<summary><h2 tabindex="-1" dir="auto">Major updates</h2></summary>

- 12th March 2024: updated to use length-controlled (LC) win rates. This is a debiased version of the win-rates that control for the length of the outputs.
- 3rd January 2024: updated to AlpacaEval 2.0, which uses GPT4-turbo as baseline and annotator.
- 2nd January 2024: added Azure API and more general way of setting client configs. See [here](https://github.com/tatsu-lab/alpaca_eval/tree/main/client_configs/README.md)
- 19th June 2023: add leaderboard `chatgpt_fn` that anyone can use (no waiting lists).