diff --git a/authors.yaml b/authors.yaml index 9d40ce1d33..97f5a1dd16 100644 --- a/authors.yaml +++ b/authors.yaml @@ -217,3 +217,8 @@ MW-OAI: name: "Mitch Welzen" website: "https://www.linkedin.com/in/mitchwelzen/" avatar: "https://media.licdn.com/dms/image/v2/C5603AQHC8-1q4MwH1A/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1592824550774?e=1742428800&v=beta&t=3mudgDyuzNU2a4gx1gue4DPyhaui7kbB7e7U8vyOo-g" + +ashishsardana: + name: "Ashish Sardana" + website: "https://www.linkedin.com/in/ashishsardana/" + avatar: "https://avatars.githubusercontent.com/ashishsardana" diff --git a/examples/evaluation/Advanced_LLM_Evals_With_Multiple_Evaluators.ipynb b/examples/evaluation/Advanced_LLM_Evals_With_Multiple_Evaluators.ipynb new file mode 100644 index 0000000000..1291bae23c --- /dev/null +++ b/examples/evaluation/Advanced_LLM_Evals_With_Multiple_Evaluators.ipynb @@ -0,0 +1,1323 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LLM Evals: Optimally Combining Multiple Evaluators with Token Probabilities, Structured Outputs, and the CROWDLAB Algorithm\n", + "\n", + "In this notebook we delve into the problem of measuring the performance of multiple evaluators (Whether human or LLM-as-Judge) in LLM Evaluations.\n", + "\n", + "No labeling strategy is perfect. The quality of LLM-as-Judge varies depending on problem context ([Bavaresco et al., 2024](https://arxiv.org/abs/2406.18403v1)) while using expert human annotators to provide ground-truth labels is expensive and time-consuming. In addition, human annotators are fallible and may provide annotations at a lower quality than state-of-the-art LLMs like GPT-4.\n", + "\n", + "In this notebook, we replicate a popular academic paper on LLM-As-Judge, and in the process, showcase two methods, simple consensus, and a more advanced multiannotator consensus algorithm (CROWDLAB, [Goh et al., 2022](https://arxiv.org/abs/2210.06812)) implemented in [cleanlab](https://github.com/cleanlab/cleanlab), a popular open-source package for data and ML/AI.\n", + "\n", + "\n", + "### Installing requirements" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Installing the necessary packages for the evaluation\n", + "# cleanlab: Provides an implementation of CROWDLAB algorithm\n", + "# datasets: for importing the reference datasets\n", + "# openai: To interact with OpenAI's API\n", + "# pandas: For data manipulation\n", + "# numpy: For numerical computations\n", + "\n", + "!pip install cleanlab --quiet\n", + "!pip install datasets --quiet\n", + "!pip install openai --quiet --upgrade\n", + "!pip install pandas numpy --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example task: Evaluating LLM Responses in MT-Bench\n", + "\n", + "As an example task for this notebook, we'll use MT-Bench, a collection of pairwise comparison tasks used to benchmark LLM-as-Judge ([Zheng et al., 2024](https://arxiv.org/abs/2306.05685)). 
The MT-Bench dataset consists of 80 unique multi-step writing tasks executed by LLMs, with multiple humans as well as an LLM-as-judge (specifically, GPT-4) evaluating the performance of the tasks using pair-wise comparisons between two executions.\n", + "\n", + "Here is an example task from the MT-Bench dataset, answered by two different models:\n", + "\n", + "| Task | Model A Response | Model B Response |\n", + "| --- | --- | --- |\n", + "| \"Compose an engaging travel blog post about a recent trip to Hawaii\" | \"I recently had the pleasure of visiting Hawaii and it quickly and it quickly became one of my favorite places...\" | \"Aloha! I recently had the pleasure of embarking on a trip...\" |\n", + "\n", + "Then, an evaluator (either human or the LLM-as-Judge) is asked to pick the better response between Model A and Model B.\n", + "\n", + "To replicate this paper, we'll load the dataset, and then use both the simple and advanced multiannotator consensus algorithms to evaluate the performance of the LLM-as-Judge on the MT-Bench dataset.\n", + "\n", + "\n", + "### Preparing the MT-Bench dataset\n", + "\n", + "Now, we'll load up the MT-Bench dataset and transform it into a format that can be used for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/nelson/tech/openai-cookbook/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "from datasets import load_dataset\n", + "import pandas as pd\n", + "\n", + "dataset = load_dataset(\"lmsys/mt_bench_human_judgments\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict_keys(['gpt4_pair', 'human'])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# dataset has both \"human\" and \"gpt4\"-graded entries, which we can combine\n", + "dataset.keys()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "gpt4_graded_df = dataset[\"gpt4_pair\"].to_pandas()\n", + "human_graded_df = dataset[\"human\"].to_pandas()\n", + "combined_df = pd.concat([gpt4_graded_df, human_graded_df])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The original MT-Bench problems are \"multi-turn\" (That is, they involve multiple turns of interaction between the model and the evaluator). For simplicity, we will consider a \"single-turn\" version of the task, and use the evaluator ratings for the first turn.\n", + "\n", + "Here's an example task:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.', 'role': 'user'},\n", + " {'content': 'I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. 
Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'role': 'assistant'},\n", + " {'content': 'Rewrite your previous response. Start every sentence with the letter A.', 'role': 'user'},\n", + " {'content': 'Aloha! I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'role': 'assistant'}],\n", + " dtype=object)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "combined_df['conversation_a'].iloc[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Truncate to single-turn by taking initial ask and answer\n", + "combined_df['conversation_a'] = combined_df['conversation_a'].apply(lambda array: array[:2])\n", + "combined_df['conversation_b'] = combined_df['conversation_b'].apply(lambda array: array[:2])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# Limit rows to those judging the first turn of conversation:\n", + "combined_df = combined_df[combined_df.turn == 1]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "#integer-ize winner labels for the cleanlab algorithm\n", + "mapping_dict = dict(model_a=0, model_b=1)\n", + "reverse_mapping = {v: k for k, v in mapping_dict.items()} # useful for interpreting results later\n", + "combined_df.loc[:, 'winner_binary'] = combined_df['winner'].apply(lambda s: mapping_dict.get(s))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analyzing the MT-Bench LLM-as-Judge dataset\n", + "\n", + "We see that GPT4 has the most judgements, followed by human annotators. \n", + "Even within human annotators, there is a large variance in how many examples each judge has graded. \n", + "\n", + "In addition, some examples have more judgements than others. \n", + "\n", + "This is a common problem in real-world datasets and a great example to tackle!" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "judge\n", + "gpt4_pair 1200\n", + "expert_24 103\n", + "author_4 102\n", + "author_0 92\n", + "expert_0 74\n", + " ... 
\n", + "expert_18 5\n", + "expert_54 5\n", + "expert_30 3\n", + "author_1 3\n", + "expert_52 2\n", + "Name: count, Length: 66, dtype: int64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Large variance in how many times each judge has graded a conversation\n", + "combined_df.judge.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we examine the distribution of judges-per-example in the MT-Bench dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "combined_df_wide = combined_df[combined_df.turn==1].pivot_table(\n", + " index=['question_id', 'model_a', 'model_b'],\n", + " columns='judge',\n", + " values=['winner_binary'],\n", + " aggfunc='first'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1 882\n", + "2 411\n", + "3 124\n", + "4 17\n", + "5 2\n", + "6 2\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "combined_df_wide.count(axis=1).value_counts().sort_index()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that each evaluation has between one and six evaluators" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Approach 1: Simple calculation of consensus results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A simple way to aggregate multiple reviewers is to take consensus votes - this produces an answer quickly but does not take into account the quality of the reviewers, or utilize the number of reviewers in determining confidence\n", + "\n", + "Here's how we can quickly generate consensus labels:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "consensus = combined_df_wide.mode(axis=1)\n", + "consensus_labels = consensus.iloc[:, 0]\n", + "\n", + "results_df = pd.DataFrame({\n", + " 'winner': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_b'), \n", + " combined_df_wide.index.get_level_values('model_a')),\n", + " 'loser': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_a'), \n", + " combined_df_wide.index.get_level_values('model_b'))\n", + "})\n", + "\n", + "\n", + "wins = results_df['winner'].value_counts()\n", + "appearances = pd.concat([results_df['winner'], results_df['loser']]).value_counts()\n", + "win_rates = (wins / appearances).sort_values(ascending=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using consensus labels as our ground truth, we can complete our evaluation by calculating ranked win rates:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. gpt-4: Win Rate = 0.84 (408 wins out of 486 appearances)\n", + "2. claude-v1: Win Rate = 0.72 (319 wins out of 443 appearances)\n", + "3. gpt-3.5-turbo: Win Rate = 0.66 (325 wins out of 496 appearances)\n", + "4. vicuna-13b-v1.2: Win Rate = 0.52 (240 wins out of 458 appearances)\n", + "5. alpaca-13b: Win Rate = 0.20 (98 wins out of 493 appearances)\n", + "6. 
llama-13b: Win Rate = 0.10 (48 wins out of 500 appearances)\n" + ] + } + ], + "source": [ + "for rank, (model, win_rate) in enumerate(win_rates.items(), 1):\n", + " print(f\"{rank}. {model}: Win Rate = {win_rate:.2f} ({wins[model]} wins out of {appearances[model]} appearances)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also measure judges by their level of agreement with the consensus. Understanding consensus is useful for understanding the quality of the judges, but high consensus doesn't necessarily indicate high quality evaluations. (For example, if all judges are low quality, they may all agree on the wrong answer!)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "winner_binary_df = combined_df_wide['winner_binary']\n", + "vote_counts_row = winner_binary_df.notna().sum(axis=1)\n", + "vote_counts_judge = winner_binary_df.notna().sum()\n", + "majority_vote = winner_binary_df[vote_counts_row > 1].mode(axis=1).iloc[:, 0]\n", + "\n", + "judge_agreement = {judge: {'agree': 0, 'total': 0} for judge in winner_binary_df.columns}\n", + "for judge in winner_binary_df.columns:\n", + " judge_votes = winner_binary_df[judge]\n", + " valid_votes = judge_votes[vote_counts_row > 1]\n", + " agree_counts = (valid_votes == majority_vote[valid_votes.index]).sum()\n", + " total_counts = valid_votes.notna().sum()\n", + " judge_agreement[judge]['agree'] = agree_counts\n", + " judge_agreement[judge]['total'] = total_counts\n", + "\n", + "agreement_percentages = {judge: data['agree'] / data['total'] if data['total'] > 0 else 0 \n", + " for judge, data in judge_agreement.items()}\n", + "judge_metrics = pd.DataFrame({\n", + " 'Evaluations': vote_counts_judge,\n", + " 'Agreement': agreement_percentages\n", + "})\n", + "\n", + "ranked_judges = judge_metrics[judge_metrics['Evaluations'] >= 10].sort_values('Evaluations', ascending=False)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Judge Summary for 10 most active judges:\n", + "gpt4_pair: 882 evaluations, 88.15% agreement\n", + "author_4: 71 evaluations, 94.34% agreement\n", + "author_0: 65 evaluations, 100.00% agreement\n", + "expert_0: 58 evaluations, 92.31% agreement\n", + "expert_24: 58 evaluations, 97.50% agreement\n", + "author_3: 36 evaluations, 95.83% agreement\n", + "author_2: 33 evaluations, 96.00% agreement\n", + "expert_9: 30 evaluations, 100.00% agreement\n", + "expert_50: 24 evaluations, 80.00% agreement\n", + "expert_51: 22 evaluations, 100.00% agreement\n" + ] + } + ], + "source": [ + "print(\"\\nJudge Summary for 10 most active judges:\")\n", + "for judge, row in ranked_judges[:10].iterrows():\n", + " print(f\"{judge}: {int(row['Evaluations'])} evaluations, {row['Agreement']*100:.2f}% agreement\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've now replicated the main finding of the paper, which is that GPT4 reached about a high (above 80%) consensus with human evaluators!\n", + "\n", + "If you are short on time, this agreement percentage calculation can help surface evaluators who tend to disagree with the consensus. 
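For example, a quick way to surface such evaluators from the `judge_metrics` table computed above is to filter on an agreement threshold. The sketch below is purely illustrative: the 75% cutoff and the minimum of 10 evaluations are arbitrary choices, not values taken from the MT-Bench paper.

```python
# Flag judges whose agreement with the majority vote falls below a chosen threshold.
# AGREEMENT_THRESHOLD and MIN_EVALUATIONS are illustrative values, not recommendations.
AGREEMENT_THRESHOLD = 0.75
MIN_EVALUATIONS = 10

low_agreement_judges = judge_metrics[
    (judge_metrics["Evaluations"] >= MIN_EVALUATIONS)
    & (judge_metrics["Agreement"] < AGREEMENT_THRESHOLD)
].sort_values("Agreement")

print(low_agreement_judges)
```

Judges flagged this way are good candidates for a manual review of their labels before the consensus is trusted downstream.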
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Approach 2: Advanced multiannotator algorithm utilizing GPT-4o-mini token probabilities, structured outputs, and the CROWDLAB algorithm \n", + "\n", + "The simple consensus method does not attempt to estimate the quality of the judges, nor does it place any confidence weighting on labels based on the quantity of judges involved. To improve our labeling, we can utilize a more advanced consensus algorithm. In this notebook, we'll use CROWDLAB, a consensus algorithm shown to outperform many other consensus models in a variety of settings ([Goh et al., 2022](https://arxiv.org/abs/2210.06812)) and implemented in the open-source package [cleanlab](http://github.com/cleanlab/cleanlab)\n", + "\n", + "The algorithm requires two inputs:\n", + "1. Judgements from Human or AI evaluators, which we already have. \n", + "2. A quantitative model score. The algorithm then combines the model score, which can be from any ML or AI-based model, with the evaluators. We'll use GPT-4o-mini to construct that now!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Constructing a probabilistic model with GPT logprobs, structured outputs, and a multiannotator algorithm\n", + "\n", + "The multiannotator algorithm in cleanlab combines the underlying probabilities of the model with the evaluators. To create probabilities, we'll extract token probabilities from GPT-4o-mini!\n", + "\n", + "We'll start by creating a prompt that compares the two responses in MT-Bench. (This is also a setup you can use for general LLM-as-Judge tasks)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "from textwrap import dedent\n", + "\n", + "def conversation_to_text(conversation_obj_list, assistant_label):\n", + " result_txt = \"\"\n", + " for conv_obj in conversation_obj_list:\n", + " result_txt += f\"{conv_obj['role'].upper()} {assistant_label.upper() if conv_obj['role'] == 'assistant' else ''}: {conv_obj['content']} \\n\"\n", + " return result_txt\n", + "\n", + "\n", + "def produce_prompt_for_llm_evaluation(conversation_a, conversation_b):\n", + " prompt_preamble = f\"\"\"\n", + " You are a logical and accurate conversation-reading and grading AI system.\n", + " You will be shown instructions from USER and response by ASSISTANT A and ASSISTANT B.\n", + " Read each conversation carefully and decide whether ASSISTANT B or ASSISTANT A better complies with the USER's instructions\n", + " Please output ONLY A if ASSISTANT A better complies with the USER's demands, and output ONLY B if ASSISTANT B complies better with the USER's demands.\n", + " \n", + " USER instructions:\n", + " {conversation_a[0]['content']}\n", + "\n", + " ASSISTANT A response:\n", + " {conversation_a[1]['content']}\n", + "\n", + " END ASSISTANT A response\n", + "\n", + " ASSISTANT B response: \n", + " {conversation_b[1]['content']}\n", + "\n", + " END ASSISTANT B response\n", + "\n", + " Respond ONLY with the single letter, A or B, no punctuation, spacing, or additional characters.\n", + " \"\"\"\n", + " return dedent(prompt_preamble)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "example_prompt = produce_prompt_for_llm_evaluation(combined_df['conversation_a'].iloc[0], combined_df['conversation_b'].iloc[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "\n", + " You are a logical and accurate converation reading and grading AI system.\n", + " You will be shown instructions from USER and response by ASSISTANT A and ASSISTANT B.\n", + " Read each conversation carefully and decide whether ASSISTANT B or ASSISTANT A better complies with the USER's instructions\n", + " Please output ONLY A if ASSISTANT A better complies with the USER's demands, and output ONLY B if ASSISTANT B complies better with the USER's demands.\n", + "\n", + " USER instructions:\n", + " Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.\n", + "\n", + " ASSISTANT A response:\n", + " I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.\n", + "\n", + " END ASSISTANT A response\n", + "\n", + " ASSISTANT B response: \n", + " Here is a draft travel blog post about a recent trip to Hawaii:\n", + "\n", + "Aloha from the Hawaiian Islands! I recently had the opportunity to spend 10 days exploring the beautiful Hawaiian archipelago, soaking in the culture, natural scenery, and must-see attractions. Here are some of the highlights from my Hawaiian adventure:\n", + "\n", + "Experience Traditional Hawaiian Culture on Oahu\n", + "\n", + "No trip to Hawaii is complete without learning about the rich traditional Hawaiian culture. On Oahu, I visited the Polynesian Cultural Center, where I learned to hula dance, watched dramatic fire knife dancing performances, and took a canoe ride along a tropical river. I also toured ‘Iolani Palace, the only official royal residence in the U.S. and a National Historic Landmark. The palace provides insight into Hawaii's monarchy era in the late 19th century.\n", + "\n", + "Marvel at Massive Volcanoes on the Big Island \n", + "\n", + "The Big Island of Hawaii is a geologic wonderland, home to massive active volcanoes. I toured Hawaii Volcanoes National Park, where I hiked to the rim of the Kilauea caldera and saw the dramatic cliffs and craters of this active volcano. At night, the glow from the caldera illuminated the sky. I also visited massive Mauna Kea, a dormant volcano, where stargazing at the summit visitor center is spectacular due to the clear, high-altitude air. \n", + "\n", + "Relax on Pristine Beaches on Maui\n", + "\n", + "The Hawaiian islands are home to some of the most stunning beaches in the world, and Maui has a beach for every taste. I enjoyed the black sand beaches and massive sea cliffs of Waianapanapa State Park. For a quintessential Hawaiian beach scene, head to Makena Beach, also known as “Big Beach.” The wide golden sand beach is flanked by swaying palm trees and the turquoise waters of the Pacific. For a quieter beach, check out the red sand beach at Kaihalulu. 
The dramatic cliffs and sand provide a perfect backdrop for a peaceful beach day.\n", + "\n", + "There's so much natural beauty, culture, adventure, and relaxation to experience in Hawaii. I can't wait to return to the islands again! Aloha and mahalo!\n", + "\n", + " END ASSISTANT B response\n", + "\n", + " Respond ONLY with the single letter, A or B, no punctuation, spacing, or additional characters.\n", + "\n" + ] + } + ], + "source": [ + "print(example_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "openai_client = OpenAI()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "from enum import Enum\n", + "from pydantic import BaseModel, Field\n", + "from typing import Literal, get_args\n", + "\n", + "class ABChoice(BaseModel):\n", + " choice: Literal[\"A\", \"B\"] = Field(..., description=\"Choose either A or B\")\n", + "\n", + "def get_completion_with_probs(client: OpenAI, prompt: str, model_name: str, choice_schema: BaseModel, prob_rounding: int=4, **kwargs):\n", + " choices = get_args(choice_schema.model_fields.get('choice').annotation)\n", + " if not choices:\n", + " raise ValueError(\"choice_schema must have a field named 'choice' with a Literal type\")\n", + "\n", + " completion = client.beta.chat.completions.parse(\n", + " model=model_name,\n", + " messages=[{\"role\": \"user\", \"content\": prompt}],\n", + " response_format=choice_schema,\n", + " logprobs=True,\n", + " **kwargs\n", + " )\n", + " \n", + " probs = {k: 0.0 for k in choices}\n", + " for token_info in completion.choices[0].logprobs.content:\n", + " if token_info.token in choices:\n", + " probs[token_info.token] = np.exp(token_info.logprob)\n", + " for tlp in token_info.top_logprobs:\n", + " if tlp.token in choices:\n", + " probs[tlp.token] = max(probs[tlp.token], np.exp(tlp.logprob))\n", + " \n", + " total = sum(probs.values())\n", + " return {k: round(v / total, prob_rounding) for k, v in probs.items()}\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'A': 0.5622, 'B': 0.4378}" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt=\"RESPOND RANDOMLY with EITHER the letter A or B, NO OTHER WORDS\"\n", + "\n", + "get_completion_with_probs(client=openai_client,\n", + " prompt=prompt,\n", + " choice_schema=ABChoice,\n", + " model_name=\"gpt-4o-mini\",\n", + " top_logprobs=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interestingly, these probabilities vary by model:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'A': 0.5, 'B': 0.5}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_completion_with_probs(client=openai_client,\n", + " prompt=prompt,\n", + " choice_schema=ABChoice,\n", + " model_name=\"gpt-4o-2024-08-06\",\n", + " top_logprobs=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In MT-Bench, many of the examples are judged multiple times, but we only need to score each conversation once, so we'll key by question and the two answerers, and then drop duplicates before creating the prompt for LLM-As-Judge for each example" + ] + }, + { + "cell_type": "code", + "execution_count": 24, 
+ "metadata": {}, + "outputs": [], + "source": [ + "for_llm_df = combined_df.drop_duplicates(subset=['question_id', 'model_a', 'model_b']).copy()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "for_llm_df.loc[:, 'conversation_prompt_text'] = for_llm_df.loc[:, ['conversation_a', 'conversation_b']].apply(\n", + " lambda s: produce_prompt_for_llm_evaluation(s['conversation_a'], s['conversation_b']),\n", + " axis=1\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "for_llm_df.loc[:,'score_results'] = for_llm_df.loc[:, 'conversation_prompt_text'].apply(\n", + " lambda s: get_completion_with_probs(\n", + " prompt=s,\n", + " client=openai_client,\n", + " model_name=\"gpt-4o-mini\",\n", + " max_tokens=10,\n", + " top_logprobs=10,\n", + " choice_schema=ABChoice,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now extract the model results for each conversation:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " score_results A \\\n", + "question_id model_a model_b \n", + "81 alpaca-13b claude-v1 {'A': 0.0, 'B': 1.0} 0.0000 \n", + " gpt-3.5-turbo {'A': 0.0, 'B': 1.0} 0.0000 \n", + " gpt-4 {'A': 0.0, 'B': 1.0} 0.0000 \n", + " vicuna-13b-v1.2 {'A': 0.0, 'B': 1.0} 0.0000 \n", + " gpt-3.5-turbo claude-v1 {'A': 0.7982, 'B': 0.2018} 0.7982 \n", + " gpt-4 {'A': 0.0534, 'B': 0.9466} 0.0534 \n", + " gpt-4 claude-v1 {'A': 1.0, 'B': 0.0} 1.0000 \n", + " llama-13b alpaca-13b {'A': 0.0004, 'B': 0.9996} 0.0004 \n", + " claude-v1 {'A': 0.0, 'B': 1.0} 0.0000 \n", + " gpt-3.5-turbo {'A': 0.0, 'B': 1.0} 0.0000 \n", + "\n", + " B \n", + "question_id model_a model_b \n", + "81 alpaca-13b claude-v1 1.0000 \n", + " gpt-3.5-turbo 1.0000 \n", + " gpt-4 1.0000 \n", + " vicuna-13b-v1.2 1.0000 \n", + " gpt-3.5-turbo claude-v1 0.2018 \n", + " gpt-4 0.9466 \n", + " gpt-4 claude-v1 0.0000 \n", + " llama-13b alpaca-13b 0.9996 \n", + " claude-v1 1.0000 \n", + " gpt-3.5-turbo 1.0000 " + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "score_results_only = for_llm_df.set_index(['question_id', 'model_a', 'model_b'])[['score_results']]\n", + "score_results_only[['A', 'B']] = score_results_only['score_results'].apply(lambda d: pd.Series([d.get('A', 0), d.get('B', 0)]))\n", + "score_results_only.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "# Drop any answers not in the original dataset\n", + "score_results_only = score_results_only[score_results_only.index.isin(combined_df_wide.index)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And can now feed the results into cleanlab:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "from cleanlab.multiannotator import get_label_quality_multiannotator\n", + "\n", + "results = get_label_quality_multiannotator(combined_df_wide, score_results_only[['A', 'B']].to_numpy(), verbose=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "consensus_results = results[\"label_quality\"]\n", + "consensus_results[\"consensus_label\"] = consensus_results[\"consensus_label\"].apply(lambda i: {0:\"A\",1:\"B\"}.get(i))" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " consensus_label \\\n", + "question_id model_a model_b \n", + "160 gpt-4 gpt-3.5-turbo A \n", + " llama-13b A \n", + " llama-13b alpaca-13b A \n", + " claude-v1 B \n", + " gpt-3.5-turbo B \n", + " gpt-4 B \n", + " vicuna-13b-v1.2 B \n", + " vicuna-13b-v1.2 claude-v1 B \n", + " gpt-3.5-turbo A \n", + " llama-13b A \n", + "\n", + " consensus_quality_score \\\n", + "question_id model_a model_b \n", + "160 gpt-4 gpt-3.5-turbo 0.916097 \n", + " llama-13b 0.916097 \n", + " llama-13b alpaca-13b 0.916097 \n", + " claude-v1 0.916095 \n", + " gpt-3.5-turbo 0.570378 \n", + " gpt-4 0.916095 \n", + " vicuna-13b-v1.2 0.916096 \n", + " vicuna-13b-v1.2 claude-v1 0.916092 \n", + " gpt-3.5-turbo 0.907919 \n", + " llama-13b 0.916097 \n", + "\n", + " annotator_agreement \\\n", + "question_id model_a model_b \n", + "160 gpt-4 gpt-3.5-turbo 1.0 \n", + " llama-13b 1.0 \n", + " llama-13b alpaca-13b 1.0 \n", + " claude-v1 1.0 \n", + " gpt-3.5-turbo 0.5 \n", + " gpt-4 1.0 \n", + " vicuna-13b-v1.2 1.0 \n", + " vicuna-13b-v1.2 claude-v1 1.0 \n", + " gpt-3.5-turbo 0.5 \n", + " llama-13b 1.0 \n", + "\n", + " num_annotations \n", + "question_id model_a model_b \n", + "160 gpt-4 gpt-3.5-turbo 1 \n", + " llama-13b 1 \n", + " llama-13b alpaca-13b 3 \n", + " claude-v1 1 \n", + " gpt-3.5-turbo 2 \n", + " gpt-4 1 \n", + " vicuna-13b-v1.2 2 \n", + " vicuna-13b-v1.2 claude-v1 1 \n", + " gpt-3.5-turbo 2 \n", + " llama-13b 1 " + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "consensus_results.tail(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The produced consensus label here comes with a confidence score, which can be used to understand the reliability of the label.\n", + "\n", + "In this example, we see that the `llama-13b` vs. `gpt-3.5-turbo` comparison has a low consensus quality score, while the `vicuna-13b-v1.2` vs. `gpt-3.5-turbo` has a high consensus quality score despite both having two disagreeing annotators! This is because the more advanced algorithm takes into account the quality of the annotators and the confidence of the provided model in its predictions.\n", + "\n", + "Next, we look at per-annotator quality scores:" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " annotator_quality agreement_with_consensus \\\n", + " judge \n", + "winner_binary gpt4_pair 0.962963 0.982993 \n", + " author_4 0.943396 0.957746 \n", + " author_0 1.000000 1.000000 \n", + " expert_24 0.975000 0.982759 \n", + " expert_0 1.000000 1.000000 \n", + "... ... ... \n", + " expert_30 1.000000 1.000000 \n", + " expert_54 1.000000 1.000000 \n", + " author_1 0.500000 0.500000 \n", + " expert_18 1.000000 1.000000 \n", + " expert_52 1.000000 1.000000 \n", + "\n", + " worst_class num_examples_labeled \n", + " judge \n", + "winner_binary gpt4_pair A 882 \n", + " author_4 B 71 \n", + " author_0 A 65 \n", + " expert_24 B 58 \n", + " expert_0 B 58 \n", + "... ... ... \n", + " expert_30 A 3 \n", + " expert_54 A 3 \n", + " author_1 A 2 \n", + " expert_18 A 2 \n", + " expert_52 A 1 \n", + "\n", + "[66 rows x 4 columns]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results[\"annotator_stats\"][\"worst_class\"] = results[\"annotator_stats\"][\"worst_class\"].apply(lambda i: {0:\"A\",1:\"B\"}.get(i))\n", + "results[\"annotator_stats\"].sort_values(\"num_examples_labeled\", ascending=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With this analysis, we find that LLM-As-Judge with GPT-4 has an even higher accuracy than computed by the simple consensus method." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Limitations\n", + "\n", + "The traditional consensus score is a simple and easy-to-understand method for combining multiple evaluators. However, it does not take into account the quality of the judges.\n", + "\n", + "Conversely, the CROWDLAB multiannotator algorithm estimates annotator quality but is dependant on the quality of the provided model scores. If the model scores are not directionally accurate, or are predisposed towards a certain reviewer, the algorithm will not be able to accurately estimate the quality of the judges and the true labels.\n", + "\n", + "\n", + "## Conclusion\n", + "\n", + "In this notebook, we demonstrated two methods for combining multiple evaluators (human or LLM-as-Judge) utilizing GPT token logprobs and structured outputs capabilities. We showed a simple method for computing consensus agreement, and then also demonstrated an advanced multiannotator algorithm that can be used to estimate the quality of the judges and true labels. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "- [LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks\n", + "](https://arxiv.org/abs/2406.18403v1) - Bavaresco et al. Published June 2024\n", + "- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) - Zheng, Lianmin, et al. Published December 2024\n", + "- [CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators](https://arxiv.org/abs/2210.06812) - Goh et al. 
Published January 2023\n", + "- [Estimate Consensus and Annotator Quality for Data Labeled by Multiple Annotators](https://docs.cleanlab.ai/stable/tutorials/multiannotator.html)\n", + "- [OpenAI Structured Outputs Guide](https://platform.openai.com/docs/guides/structured-outputs/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index d6530c174c..cf911fbcb5 100644 --- a/registry.yaml +++ b/registry.yaml @@ -1778,3 +1778,12 @@ tags: - usage-api - cost-api + +- title: Advanced Evals - Combining LLM-As-Judge and Human Evaluators + path: examples/evaluation/Advanced_LLM_Evals_With_Multiple_Evaluators.ipynb + date: 2025-01-09 + authors: + - ashishsardana + tags: + - completions + - functions