diff --git a/posts/benchmarking/_final_benchmarking.md b/posts/benchmarking/_final_benchmarking.md
new file mode 100644
index 0000000..1e6bd59
--- /dev/null
+++ b/posts/benchmarking/_final_benchmarking.md
@@ -0,0 +1,1007 @@
+# Benchmarking Gemini Models Using Ragas
+
+In this tutorial, we'll benchmark Gemini models on the AllenAI QASPER dataset using Ragas metrics for the academic question-answering task.
+
+## About the Dataset
+
+QASPER (Question Answering over Scientific Papers) is a dataset consisting of 5,049 questions based on 1,585 NLP research papers. Annotators created these questions from titles and abstracts, with answers extracted from the full paper texts. It is designed to challenge document-level reasoning and support research in academic question answering.
+
+Data Collection Process:
+1. Paper Selection: NLP domain papers from arXiv (LaTeX format) were selected from the S2ORC corpus.
+2. Question Writing: Annotators wrote realistic, information-seeking questions based only on paper titles and abstracts.
+3. Answer Annotation: Different annotators reviewed the entire paper to identify answers, selecting minimal relevant evidence (texts, tables, figures).
+
+![Data collection Process of QASPER Dataset](qasper_data_collection.png)
+
+
+The dataset and further details about QASPER can be found [here](https://huggingface.co/datasets/allenai/qasper).
+
+
+## Loading Dataset
+
+For demonstration purposes, we'll use a subset of 10 examples from the validation split:
+
+
+```python
+from datasets import load_dataset
+import pandas as pd
+import numpy as np
+from tqdm.auto import tqdm
+
+dataset = load_dataset("allenai/qasper", split="validation[:10]")
+dataset
+```
+Output
+```
+Dataset({
+    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],
+    num_rows: 10
+})
+```
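+
+Before processing, it helps to see how a single record is nested. Below is a small optional check that prints the fields we rely on later (key names taken from the preprocessing code that follows):
+
+```python
+sample = dataset[0]
+
+# "full_text" holds the paper body as parallel lists of section names and paragraphs
+print(sample["full_text"].keys())
+
+# "qas" holds the questions and their answer annotations
+print(sample["qas"].keys())
+print(sample["qas"]["question"][:2])  # first two questions asked about this paper
+```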
+
+
+## Processing Dataset
+
+Since our goal is to benchmark the models’ performance on the academic question-answering task, we need responses generated by the LLMs from the full text of each research paper. We extract the full text from the dataset’s "full_text" column and format it as markdown, clearly organizing sections and paragraphs for readability and context.
+
+To create question-answer pairs for evaluation, we use the dataset’s "qas" column, which provides questions and corresponding answers in one of three formats: extractive spans, yes/no responses, or free-form answers. We then combine these answer components into a single "golden response" column, which serves as the ground truth for assessing model performance.
+
+
+```python
+def convert_full_text_to_markdown(full_text_dict):
+    """
+    Converts a full_text dictionary into a markdown-formatted string.
+
+    Expected keys:
+      - "section_name": list of section titles.
+      - "paragraphs": list of lists of paragraphs corresponding to each section.
+
+    Each section becomes a markdown header (##) followed by its paragraphs.
+    """
+    sections = full_text_dict.get("section_name", [])
+    paragraphs = full_text_dict.get("paragraphs", [])
+
+    markdown_lines = []
+    for section, paragraph in zip(sections, paragraphs):
+        markdown_lines.append(f"## {section}")
+        markdown_lines.append("")  # Blank line
+        markdown_lines.append("\n".join(map(str, paragraph)))
+        markdown_lines.append("")  # End of section
+        markdown_lines.append("")  # Extra blank line for separation
+    return "\n".join(markdown_lines)
+```
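+
+As a quick usage example, you can convert the first paper and preview the start of the generated markdown (optional; assumes `dataset` from the loading step above):
+
+```python
+markdown_text = convert_full_text_to_markdown(dataset[0]["full_text"])
+print(markdown_text[:300])  # preview the first few hundred characters
+```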
+
+
+```python
+def combine_responses(row):
+    """
+    Combines 'extractive_spans', 'yes_no', and 'free_form_answer'
+    into one single string. Skips components that are missing.
+    """
+    responses = []
+    if pd.notna(row.get("extractive_spans")):
+        if isinstance(row["extractive_spans"], list):
+            responses.append(" ".join(map(str, row["extractive_spans"])))
+        else:
+            responses.append(str(row["extractive_spans"]))
+    if pd.notna(row.get("yes_no")):
+        responses.append(str(row["yes_no"]))
+    if pd.notna(row.get("free_form_answer")):
+        responses.append(str(row["free_form_answer"]))
+    return "\n".join(responses) if responses else np.nan
+```
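+
+Here is a quick toy example of how the combination behaves. The row values are hypothetical and already cleaned (lists joined, empty values replaced with NaN), mirroring what the preprocessing function below produces:
+
+```python
+toy_row = {"extractive_spans": "pivoting", "yes_no": np.nan, "free_form_answer": np.nan}
+combine_responses(toy_row)  # -> "pivoting"
+```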
+
+
+```python
+def preprocess_hf_dataset(hf_ds):
+    """
+    Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.
+
+    Steps:
+      1. For each sample, convert 'full_text' to a markdown string.
+      2. For every QA pair in the sample, extract the question and first answer.
+      3. Build lists for answers, questions, and full_text (duplicated per question).
+      4. Create a DataFrame from the collected data.
+      5. Clean columns by replacing empty lists/strings with NaN and joining lists.
+      6. Combine the answer components into a single 'golden response'.
+
+    The function uses nested tqdm progress bars for real-time feedback.
+
+    Returns:
+        pd.DataFrame: The preprocessed DataFrame.
+    """
+    answers_list = []  # Stores the first answer for each question
+    questions_list = []  # Stores each question text
+    full_text_list = []  # Stores the formatted full text per QA pair
+
+    # Outer loop: iterate over samples with progress bar
+    for sample in tqdm(hf_ds, desc="Processing samples", unit="sample"):
+        # Convert full text once per sample
+        formatted_text = convert_full_text_to_markdown(sample["full_text"])
+        # Create a list of QA pairs
+        qa_pairs = list(zip(sample["qas"]["question"], sample["qas"]["answers"]))
+
+        # Inner loop: iterate over each QA pair with its own progress bar
+        for question, answer_set in tqdm(
+            qa_pairs, desc="Processing QAs", total=len(qa_pairs), leave=False, unit="qa"
+        ):
+            answers_list.append(answer_set["answer"][0])
+            questions_list.append(question)
+            full_text_list.append(formatted_text)
+
+    # Create DataFrame from the collected data
+    df = pd.DataFrame(answers_list)
+    df["question"] = questions_list
+    df["full_text"] = full_text_list
+
+    # Data Cleaning: Replace empty lists/strings with NaN and join lists if needed
+    df["extractive_spans"] = df["extractive_spans"].apply(
+        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x
+    )
+    df["free_form_answer"] = df["free_form_answer"].apply(
+        lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
+    )
+    df["yes_no"] = df["yes_no"].apply(lambda x: np.nan if x is None else x)
+    df["extractive_spans"] = df["extractive_spans"].apply(
+        lambda x: "\n".join(x) if isinstance(x, list) else x
+    )
+
+    # Combine the answer components into a single 'golden response'
+    df["golden response"] = df.apply(lambda row: combine_responses(row), axis=1)
+
+    return df
+```
+
+
+```python
+processed_dataset = preprocess_hf_dataset(dataset)
+processed_dataset.head()
+```
+```
+Processing samples: 100%|██████████| 10/10 [00:00<00:00, 208.37sample/s]
+```
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>unanswerable</th>
+      <th>extractive_spans</th>
+      <th>yes_no</th>
+      <th>free_form_answer</th>
+      <th>evidence</th>
+      <th>highlighted_evidence</th>
+      <th>question</th>
+      <th>full_text</th>
+      <th>golden response</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>False</td>
+      <td>BIBREF19\nBIBREF20</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>
+      <td>[We compare our approaches with related approa...</td>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>BIBREF19\nBIBREF20</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>False</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>
+      <td>[We compare our approaches with related approa...</td>
+      <td>what are the pivot-based baselines?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>False</td>
+      <td>Europarl\nMultiUN</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[We evaluate our cross-lingual pre-training ba...</td>
+      <td>[We evaluate our cross-lingual pre-training ba...</td>
+      <td>which datasets did they experiment with?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>Europarl\nMultiUN</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>False</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+      <td>[For MultiUN corpus, we use four languages: En...</td>
+      <td>[For MultiUN corpus, we use four languages: En...</td>
+      <td>what language pairs are explored?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>False</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[In this section we describe a number of exper...</td>
+      <td>[In this section we describe a number of exper...</td>
+      <td>what ner models were evaluated?</td>
+      <td>## Introduction\n\nNamed entity recognition is...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+## Generating Responses from Gemini Models
+
+To generate responses using the Gemini models, we’ll first instantiate the Google GenAI client and define the prompt template that will be used when generating responses.
+
+
+```python
+import os
+from google import genai
+from dotenv import load_dotenv
+
+load_dotenv()
+
+client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
+
+qa_prompt = (
+    f"Context information is below.\n"
+    "---------------------\n"
+    "{context_str}\n"
+    "---------------------\n"
+    "Given the context information and not prior knowledge, "
+    "answer the query.\n"
+    "If you cannot find answer to the query, just say that it cannot be answered.\n"
+    "Query: {query_str}\n"
+    "Answer: "
+)
+```
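+
+Before batching requests through the asynchronous executor, you can optionally sanity-check the client and the prompt with a single synchronous call (this assumes `GOOGLE_API_KEY` is set in your environment):
+
+```python
+row = processed_dataset.iloc[0]
+formatted_prompt = qa_prompt.format(context_str=row["full_text"], query_str=row["question"])
+
+response = client.models.generate_content(model="gemini-2.0-flash", contents=formatted_prompt)
+print(response.text)
+```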
+
+### Gemini 2.0 Flash
+
+
+```python
+from async_executor import AsyncExecutor
+
+async def query_gemini_2(query_str: str, context_str: str):
+    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
+    response = await client.aio.models.generate_content(
+        model="gemini-2.0-flash", contents=formatted_prompt
+    )
+    return response.text
+
+# Create an instance of the asynchronous executor
+executor = AsyncExecutor(
+    desc="LLM Processing",
+    show_progress=True,
+    raise_exceptions=False,
+)
+
+for idx in range(processed_dataset.shape[0]):
+    query = processed_dataset.iloc[idx]["question"]
+    context = processed_dataset.iloc[idx]["full_text"]
+    executor.submit(query_gemini_2, query, context)
+
+processed_dataset["gemini_2_flash_responses"] = executor.results()
+```
+```
+LLM Processing: 100%|██████████| 30/30 [00:04<00:00,  7.20it/s]
+```
+
+### Gemini 1.5 Flash
+
+
+```python
+from async_executor import AsyncExecutor
+
+async def query_gemini_1_5(query_str: str, context_str: str):
+    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
+    response = await client.aio.models.generate_content(
+        model="gemini-1.5-flash", contents=formatted_prompt
+    )
+    return response.text
+
+# Create a new instance of the asynchronous executor
+executor = AsyncExecutor(
+    desc="LLM Processing",
+    show_progress=True,
+    raise_exceptions=False,
+)
+
+for idx in range(processed_dataset.shape[0]):
+    query = processed_dataset.iloc[idx]["question"]
+    context = processed_dataset.iloc[idx]["full_text"]
+    executor.submit(query_gemini_1_5, query, context)
+
+processed_dataset["gemini_1_5_flash_responses"] = executor.results()
+```
+```
+LLM Processing: 100%|██████████| 30/30 [00:05<00:00,  5.94it/s]
+```
+
+
+```python
+processed_dataset.head()
+```
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>unanswerable</th>
+      <th>extractive_spans</th>
+      <th>yes_no</th>
+      <th>free_form_answer</th>
+      <th>evidence</th>
+      <th>highlighted_evidence</th>
+      <th>question</th>
+      <th>full_text</th>
+      <th>golden response</th>
+      <th>gemini_2_flash_responses</th>
+      <th>gemini_1_5_flash_responses</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>False</td>
+      <td>BIBREF19\nBIBREF20</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>
+      <td>[We compare our approaches with related approa...</td>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>BIBREF19\nBIBREF20</td>
+      <td>The text mentions comparison with Multilingual...</td>
+      <td>The paper compares its approach with multiling...</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>False</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>
+      <td>[We compare our approaches with related approa...</td>
+      <td>what are the pivot-based baselines?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+      <td>The pivot-based baselines are pivoting and piv...</td>
+      <td>The provided text mentions two types of pivot-...</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>False</td>
+      <td>Europarl\nMultiUN</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[We evaluate our cross-lingual pre-training ba...</td>
+      <td>[We evaluate our cross-lingual pre-training ba...</td>
+      <td>which datasets did they experiment with?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>Europarl\nMultiUN</td>
+      <td>They experimented with the Europarl and MultiU...</td>
+      <td>The experiments used two public datasets: Euro...</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>False</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+      <td>[For MultiUN corpus, we use four languages: En...</td>
+      <td>[For MultiUN corpus, we use four languages: En...</td>
+      <td>what language pairs are explored?</td>
+      <td>## Introduction\n\nAlthough Neural Machine Tra...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+      <td>The language pairs explored in this paper are:...</td>
+      <td>The paper explores the following language pair...</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>False</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+      <td>NaN</td>
+      <td>NaN</td>
+      <td>[In this section we describe a number of exper...</td>
+      <td>[In this section we describe a number of exper...</td>
+      <td>what ner models were evaluated?</td>
+      <td>## Introduction\n\nNamed entity recognition is...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+      <td>Based on the provided text, the following NER ...</td>
+      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+## Defining Metrics For Evaluation
+
+We are benchmarking a question-answering task, so we want to ensure that each question is answered properly and accurately. To achieve this, we use the following metrics from Ragas; you can find the complete list of available metrics [here]().
+
+- [Answer Accuracy](): Measures how closely a response matches the reference answer.
+- [Answer Correctness](): Assesses the alignment between the generated answer and the reference answer.
+- [Factual Correctness](): Checks if all statements in a response are supported by the reference answer.
+
+For each question, we know whether it can be answered from the provided context, and we want to verify whether the model correctly identifies when it cannot. For this purpose, we define a custom binary metric using [AspectCritic]().
+
+
+```python
+from ragas.metrics import AnswerAccuracy, AnswerCorrectness, FactualCorrectness, AspectCritic
+import getpass
+import os
+
+from ragas.llms import LangchainLLMWrapper
+from langchain_openai import ChatOpenAI
+
+if "OPENAI_API_KEY" not in os.environ:
+    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
+
+evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+
+aspect_critic = AspectCritic(
+    name="unanswerable",
+    definition="Return 1 if the query cannot be answered by the provided context, otherwise return 0.",
+    llm=evaluator_llm,
+)
+
+metrics = [
+    AnswerAccuracy(llm=evaluator_llm),
+    AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),
+    aspect_critic,
+    FactualCorrectness(llm=evaluator_llm),
+]
+```
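+
+If you want to see how one of these metrics behaves on a single example before running the full benchmark, recent Ragas versions let you score individual samples. A minimal sketch (the inputs here are made up, and the `await` assumes an async context such as a notebook cell):
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+
+sample = SingleTurnSample(
+    user_input="Which datasets did they experiment with?",
+    response="It cannot be answered from the provided context.",
+    reference="Europarl\nMultiUN",
+)
+
+# Score the custom binary "unanswerable" critic on this single sample
+await aspect_critic.single_turn_ascore(sample)
+```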
+
+## Benchmarking on Ragas Metrics
+
+We format the processed data into a Ragas-compatible EvaluationDataset and then apply the metrics to evaluate model performance; more information on EvaluationDataset can be found [here](). We’ll construct the EvaluationDataset from the questions, the golden responses, and the responses generated by the Gemini models in our processed dataset.
+
+### Gemini 2.0 Flash
+
+We'll create an EvaluationDataset for Gemini 2.0 Flash.
+
+
+```python
+from ragas.dataset_schema import EvaluationDataset
+
+dataset_list = []
+
+for i in range(processed_dataset.shape[0]):
+    sample = {
+        "user_input": (
+            "" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
+        ),
+        "reference": (
+            ""
+            if pd.isna(processed_dataset.iloc[i].get("golden response"))
+            else processed_dataset.iloc[i].get("golden response")
+        ),
+        "response": (
+            ""
+            if pd.isna(processed_dataset["gemini_2_flash_responses"].iloc[i])
+            else processed_dataset["gemini_2_flash_responses"].iloc[i]
+        ),
+    }
+    dataset_list.append(sample)
+
+gemini_2_dataset = EvaluationDataset.from_list(dataset_list)
+gemini_2_dataset.to_pandas().head()
+```
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>user_input</th>
+      <th>response</th>
+      <th>reference</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>The text mentions comparison with Multilingual...</td>
+      <td>BIBREF19\nBIBREF20</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>what are the pivot-based baselines?</td>
+      <td>The pivot-based baselines are pivoting and piv...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>which datasets did they experiment with?</td>
+      <td>They experimented with the Europarl and MultiU...</td>
+      <td>Europarl\nMultiUN</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>what language pairs are explored?</td>
+      <td>The language pairs explored in this paper are:...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>what ner models were evaluated?</td>
+      <td>Based on the provided text, the following NER ...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+Now, let’s evaluate the responses of Gemini 2.0 Flash.
+
+
+```python
+from ragas import evaluate
+
+gemini_2_flash_score = evaluate(dataset=gemini_2_dataset, metrics=metrics)
+gemini_2_flash_score.to_pandas().head()
+```
+```
+Evaluating: 100%|██████████| 120/120 [00:49<00:00,  2.44it/s]
+```
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>user_input</th>
+      <th>response</th>
+      <th>reference</th>
+      <th>nv_accuracy</th>
+      <th>answer_correctness</th>
+      <th>unanswerable</th>
+      <th>factual_correctness(mode=f1)</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>The text mentions comparison with Multilingual...</td>
+      <td>BIBREF19\nBIBREF20</td>
+      <td>0.25</td>
+      <td>0.400000</td>
+      <td>0</td>
+      <td>0.5</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>what are the pivot-based baselines?</td>
+      <td>The pivot-based baselines are pivoting and piv...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+      <td>0.25</td>
+      <td>0.000000</td>
+      <td>0</td>
+      <td>0.0</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>which datasets did they experiment with?</td>
+      <td>They experimented with the Europarl and MultiU...</td>
+      <td>Europarl\nMultiUN</td>
+      <td>1.00</td>
+      <td>1.000000</td>
+      <td>0</td>
+      <td>0.0</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>what language pairs are explored?</td>
+      <td>The language pairs explored in this paper are:...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+      <td>0.25</td>
+      <td>0.545455</td>
+      <td>0</td>
+      <td>0.0</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>what ner models were evaluated?</td>
+      <td>Based on the provided text, the following NER ...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+      <td>0.50</td>
+      <td>0.600000</td>
+      <td>0</td>
+      <td>0.0</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+This step is completely optional: if you want to upload the evaluation results to your Ragas app, you can run the command below. You can learn more about the Ragas app here.
+
+
+```python
+gemini_2_flash_score.upload()
+```
+
+### Gemini 1.5 Flash
+
+Next, we’ll follow the same steps for Gemini 1.5 Flash.
+
+We’ll build the evaluation dataset from the Gemini 1.5 Flash responses and then run the same evaluation on it.
+
+
+```python
+from ragas.dataset_schema import EvaluationDataset
+
+dataset_list = []
+
+for i in range(processed_dataset.shape[0]):
+    sample = {
+        "user_input": (
+            "" if pd.isna(processed_dataset.iloc[i].get("question")) else processed_dataset.iloc[i].get("question")
+        ),
+        "reference": (
+            ""
+            if pd.isna(processed_dataset.iloc[i].get("golden response"))
+            else processed_dataset.iloc[i].get("golden response")
+        ),
+        "response": (
+            ""
+            if pd.isna(processed_dataset["gemini_1_5_flash_responses"].iloc[i])
+            else processed_dataset["gemini_1_5_flash_responses"].iloc[i]
+        ),
+    }
+    dataset_list.append(sample)
+
+gemini_1_5_dataset = EvaluationDataset.from_list(dataset_list)
+gemini_1_5_dataset.to_pandas().head()
+```
+
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>user_input</th>
+      <th>response</th>
+      <th>reference</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>The paper compares its approach with multiling...</td>
+      <td>BIBREF19\nBIBREF20</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>what are the pivot-based baselines?</td>
+      <td>The provided text mentions two types of pivot-...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>which datasets did they experiment with?</td>
+      <td>The experiments used two public datasets: Euro...</td>
+      <td>Europarl\nMultiUN</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>what language pairs are explored?</td>
+      <td>The paper explores the following language pair...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>what ner models were evaluated?</td>
+      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+
+```python
+from ragas import evaluate
+
+gemini_1_5_flash_score = evaluate(dataset=gemini_1_5_dataset, metrics=metrics)
+gemini_1_5_flash_score.to_pandas().head()
+```
+```
+Evaluating: 100%|██████████| 120/120 [01:02<00:00,  1.93it/s]
+```
+
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+</style>
+<table border="1">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>user_input</th>
+      <th>response</th>
+      <th>reference</th>
+      <th>nv_accuracy</th>
+      <th>answer_correctness</th>
+      <th>unanswerable</th>
+      <th>factual_correctness(mode=f1)</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>which multilingual approaches do they compare ...</td>
+      <td>The paper compares its approach with multiling...</td>
+      <td>BIBREF19\nBIBREF20</td>
+      <td>0.25</td>
+      <td>0.400000</td>
+      <td>0</td>
+      <td>0.00</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>what are the pivot-based baselines?</td>
+      <td>The provided text mentions two types of pivot-...</td>
+      <td>pivoting\npivoting$_{\rm m}$</td>
+      <td>0.25</td>
+      <td>0.181818</td>
+      <td>0</td>
+      <td>0.18</td>
+    </tr>
+    <tr>
+      <th>2</th>
+      <td>which datasets did they experiment with?</td>
+      <td>The experiments used two public datasets: Euro...</td>
+      <td>Europarl\nMultiUN</td>
+      <td>1.00</td>
+      <td>0.800000</td>
+      <td>0</td>
+      <td>0.00</td>
+    </tr>
+    <tr>
+      <th>3</th>
+      <td>what language pairs are explored?</td>
+      <td>The paper explores the following language pair...</td>
+      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>
+      <td>0.00</td>
+      <td>0.533333</td>
+      <td>0</td>
+      <td>0.00</td>
+    </tr>
+    <tr>
+      <th>4</th>
+      <td>what ner models were evaluated?</td>
+      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>
+      <td>Stanford NER\nspaCy 2.0 \nrecurrent model with...</td>
+      <td>0.50</td>
+      <td>0.571429</td>
+      <td>0</td>
+      <td>0.00</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+
+## Comparing the Results
+
+Now that we have completed our evaluations, let’s compare how both models performed on academic question answering.
+
+
+```python
+def print_results(result):
+    result = result._repr_dict
+    print("Response Accuracy:", result.get("nv_accuracy"))
+    print("Answer Correctness:", result.get("answer_correctness"))
+    print("Factual Correctness:", result.get("factual_correctness(mode=f1)"))
+
+print_results(gemini_1_5_flash_score)
+```
+Output
+```
+Response Accuracy: 0.5416666666666666
+Answer Correctness: 0.47723550201811066
+Factual Correctness: 0.2533333333333333
+```
+
+
+```python
+print_results(gemini_2_flash_score)
+```
+Output
+```
+Response Accuracy: 0.5666666666666667
+Answer Correctness: 0.48055486996663466
+Factual Correctness: 0.23633333333333334
+```
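+
+For a side-by-side view, you can also assemble the aggregate scores of both models into a single DataFrame (a small optional sketch that reuses the `_repr_dict` attribute used above):
+
+```python
+comparison = pd.DataFrame(
+    {
+        "Gemini 2.0 Flash": gemini_2_flash_score._repr_dict,
+        "Gemini 1.5 Flash": gemini_1_5_flash_score._repr_dict,
+    }
+)
+comparison
+```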
+
+Gemini 2.0 Flash performs slightly better overall, scoring higher on response accuracy and answer correctness, while Gemini 1.5 Flash edges it out slightly on factual correctness.
+
+Let’s see how well the models performed at classifying whether a given question can be answered from the provided text.
+
+For this, we’ll use the result from the “unanswerable” metric and compare it with the original ground truth from the “unanswerable” column in our pre-processed dataset.
+
+
+```python
+from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
+
+
+def print_metrics(actuals, preds, model_name="Model", zero_division_value=0):
+    """
+    Prints common classification metrics for a given set of actual and predicted values.
+
+    Parameters:
+        actuals (array-like): Ground truth labels.
+        preds (array-like): Predicted labels.
+        model_name (str): Name of the model for display purposes.
+        zero_division_value (int or str): Sets the value to return when there is a zero division.
+                                          Options: 0, 1, or "warn" (default is 0 here).
+    """
+    print(f"Metrics for {model_name}:")
+    print("Accuracy:", accuracy_score(actuals, preds))
+    print(
+        "Precision:", precision_score(actuals, preds, zero_division=zero_division_value)
+    )
+    print("Recall:", recall_score(actuals, preds, zero_division=zero_division_value))
+    print("F1 Score:", f1_score(actuals, preds, zero_division=zero_division_value))
+    print("\nClassification Report:")
+    print(classification_report(actuals, preds, zero_division=zero_division_value))
+    
+gemini_1_5_flash_prediction = gemini_1_5_flash_score["unanswerable"]
+gemini_2_flash_prediction = gemini_2_flash_score["unanswerable"]
+groundtruth = processed_dataset["unanswerable"].astype(int)
+
+print_metrics(groundtruth, gemini_2_flash_prediction, model_name="Gemini 2 Flash")
+```
+
+Output
+```
+Metrics for Gemini 2 Flash:
+Accuracy: 0.9333333333333333
+Precision: 0.5
+Recall: 1.0
+F1 Score: 0.6666666666666666
+
+Classification Report:
+              precision    recall  f1-score   support
+
+           0       1.00      0.93      0.96        28
+           1       0.50      1.00      0.67         2
+
+    accuracy                           0.93        30
+   macro avg       0.75      0.96      0.81        30
+weighted avg       0.97      0.93      0.94        30
+```    
+
+```python
+print_metrics(groundtruth, gemini_1_5_flash_prediction, model_name="Gemini 1.5 Flash")
+```
+Output
+```
+Metrics for Gemini 1.5 Flash:
+Accuracy: 0.9
+Precision: 0.3333333333333333
+Recall: 0.5
+F1 Score: 0.4
+    
+Classification Report:
+              precision    recall  f1-score   support
+
+           0       0.96      0.93      0.95        28
+           1       0.33      0.50      0.40         2
+
+    accuracy                           0.90        30
+   macro avg       0.65      0.71      0.67        30
+weighted avg       0.92      0.90      0.91        30
+```    
+
+
+Gemini 2.0 Flash also outperforms Gemini 1.5 Flash in identifying unanswerable questions.
+
+## What's Next
+
+You can benchmark your model on any dataset using Ragas metrics, as long as the dataset is formatted as a Ragas EvaluationDataset (see the sketch at the end of this section). Try benchmarking your models on a variety of established benchmarking datasets:
+- [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+- [MultiHopRAG](https://huggingface.co/datasets/yixuantt/MultiHopRAG)
+- [ms_marco](https://huggingface.co/datasets/microsoft/ms_marco)
+
+And many more.
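+
+As long as you can shape your data into the fields Ragas expects, the rest of the workflow stays the same. A minimal, generic sketch of that shape (the record values below are placeholders):
+
+```python
+from ragas.dataset_schema import EvaluationDataset
+
+records = [
+    {
+        "user_input": "the question posed to the model",
+        "response": "the model's generated answer",
+        "reference": "the ground-truth answer",
+    },
+    # ... one dict per example
+]
+
+eval_dataset = EvaluationDataset.from_list(records)
+```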
diff --git a/posts/benchmarking/async_executor.py b/posts/benchmarking/async_executor.py
new file mode 100644
index 0000000..3f540fb
--- /dev/null
+++ b/posts/benchmarking/async_executor.py
@@ -0,0 +1,137 @@
+# custom_async_executor.py
+from __future__ import annotations
+import asyncio
+import time
+import logging
+from typing import Callable, Any, List, Tuple
+from dataclasses import dataclass, field
+import nest_asyncio
+from tqdm import tqdm
+
+# Apply nest_asyncio to allow nested event loops (e.g., in Jupyter)
+nest_asyncio.apply()
+
+logger = logging.getLogger(__name__)
+
+
+def is_event_loop_running() -> bool:
+    try:
+        loop = asyncio.get_running_loop()
+    except RuntimeError:
+        return False
+    else:
+        return loop.is_running()
+
+
+class RateLimiter:
+    """
+    An asynchronous rate limiter that enforces a minimum interval between calls.
+    For example, with max_calls_per_minute=1250, it ensures that calls are spaced by ~0.048 seconds.
+    """
+
+    def __init__(self, max_calls_per_minute: int):
+        self.interval = 60.0 / max_calls_per_minute
+        self.last_call = 0.0
+        self.lock = asyncio.Lock()
+
+    async def acquire(self):
+        async with self.lock:
+            now = time.monotonic()
+            elapsed = now - self.last_call
+            wait_time = self.interval - elapsed
+            if wait_time > 0:
+                await asyncio.sleep(wait_time)
+            self.last_call = time.monotonic()
+
+
+@dataclass
+class AsyncExecutor:
+    """
+    An asynchronous executor similar in usage to the one in the evaluate function.
+
+    Attributes:
+        desc: Description for the progress bar.
+        show_progress: Whether to display a progress bar.
+        raise_exceptions: Whether to propagate exceptions.
+        max_calls_per_minute: API rate limit to enforce.
+    """
+
+    desc: str = "Evaluating"
+    show_progress: bool = True
+    raise_exceptions: bool = False
+    max_calls_per_minute: int = 1250
+    jobs: List[Tuple[Callable[..., Any], tuple, dict, int]] = field(
+        default_factory=list, repr=False
+    )
+    job_counter: int = 0
+    rate_limiter: RateLimiter = field(init=False)
+
+    def __post_init__(self):
+        self.rate_limiter = RateLimiter(self.max_calls_per_minute)
+
+    def wrap_callable_with_index(
+        self, func: Callable[..., Any], index: int
+    ) -> Callable[..., Any]:
+        """
+        Wraps an asynchronous callable so that it enforces the rate limit,
+        and if an error occurs, it waits for an increasing delay (fallback)
+        before retrying the function call indefinitely.
+        """
+        async def wrapped(*args, **kwargs) -> Tuple[int, Any]:
+            retry_delay = 10  # initial delay in seconds
+            while True:
+                try:
+                    # Enforce the API rate limit before executing the function
+                    await self.rate_limiter.acquire()
+                    result = await func(*args, **kwargs)
+                    return index, result
+                except Exception as e:
+                    if self.raise_exceptions:
+                        raise e
+                    else:
+                        logger.error(
+                            "Error in job %d: %s. Retrying in %d seconds...",
+                            index, e, retry_delay
+                        )
+                        # Wait asynchronously before retrying
+                        await asyncio.sleep(retry_delay)
+                        retry_delay += 5  # Increase delay for subsequent retries
+        return wrapped
+
+    def submit(self, func: Callable[..., Any], *args, **kwargs):
+        """
+        Submit an asynchronous job to the executor.
+        """
+        wrapped_func = self.wrap_callable_with_index(func, self.job_counter)
+        self.jobs.append((wrapped_func, args, kwargs, self.job_counter))
+        self.job_counter += 1
+
+    async def _run_jobs(self) -> List[Any]:
+        tasks = []
+        # Create asyncio tasks for each job
+        for wrapped_func, args, kwargs, index in self.jobs:
+            tasks.append(asyncio.create_task(wrapped_func(*args, **kwargs)))
+
+        results = [None] * len(tasks)
+        if self.show_progress:
+            pbar = tqdm(total=len(tasks), desc=self.desc)
+            for completed in asyncio.as_completed(tasks):
+                index, result = await completed
+                results[index] = result
+                pbar.update(1)
+            pbar.close()
+        else:
+            for completed in asyncio.as_completed(tasks):
+                index, result = await completed
+                results[index] = result
+        return results
+
+    def results(self) -> List[Any]:
+        """
+        Execute all submitted asynchronous jobs and return their results
+        in the order they were submitted.
+
+        Thanks to nest_asyncio, this method can be used inside a Jupyter Notebook.
+        """
+        # If an event loop is already running, nest_asyncio allows asyncio.run() to work.
+        return asyncio.run(self._run_jobs())
diff --git a/posts/benchmarking/benchmarking.png b/posts/benchmarking/benchmarking.png
new file mode 100644
index 0000000..cfa8282
Binary files /dev/null and b/posts/benchmarking/benchmarking.png differ
diff --git a/posts/benchmarking/benchmarking_notebook.ipynb b/posts/benchmarking/benchmarking_notebook.ipynb
new file mode 100644
index 0000000..0329bbd
--- /dev/null
+++ b/posts/benchmarking/benchmarking_notebook.ipynb
@@ -0,0 +1,1639 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dataset Link: https://huggingface.co/datasets/allenai/qasper\n",
+    "\n",
+    "While leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI evaluation service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install google-genai -q"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Benchmarking Gemini Models on using Ragas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this tutorial, we will see how we can benchmark the Gemini models on the AllenAI's QASPER dataset using the RAGAS metrics on Question Answering Task. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Data collection Process of QASPER Dataset](qasper_data_collection.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## \n",
+    "\n",
+    "For the sake of demonstration, we will use only a subset of the whole dataset. You can perform benchmarking using the complete dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Dataset({\n",
+       "    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],\n",
+       "    num_rows: 100\n",
+       "})"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from datasets import load_dataset\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "dataset = load_dataset(\"allenai/qasper\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![](benchmarking.png)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def convert_full_text_to_markdown(full_text_dict):\n",
+    "    \"\"\"\n",
+    "    Converts a full_text dictionary into a markdown-formatted string.\n",
+    "\n",
+    "    Expected keys:\n",
+    "      - \"section_name\": list of section titles.\n",
+    "      - \"paragraphs\": list of lists of paragraphs corresponding to each section.\n",
+    "\n",
+    "    Each section becomes a markdown header (##) followed by its paragraphs.\n",
+    "    \"\"\"\n",
+    "    sections = full_text_dict.get(\"section_name\", [])\n",
+    "    paragraphs = full_text_dict.get(\"paragraphs\", [])\n",
+    "\n",
+    "    markdown_lines = []\n",
+    "    for section, paragraph in zip(sections, paragraphs):\n",
+    "        markdown_lines.append(f\"## {section}\")\n",
+    "        markdown_lines.append(\"\")  # Blank line\n",
+    "        markdown_lines.append(\"\\n\".join(map(str, paragraph)))\n",
+    "        markdown_lines.append(\"\")  # End of section\n",
+    "        markdown_lines.append(\"\")  # Extra blank line for separation\n",
+    "    return \"\\n\".join(markdown_lines)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def combine_responses(row):\n",
+    "    \"\"\"\n",
+    "    Combines 'extractive_spans', 'yes_no', and 'free_form_answer'\n",
+    "    into one single string. Skips components that are missing.\n",
+    "    \"\"\"\n",
+    "    responses = []\n",
+    "    if pd.notna(row.get(\"extractive_spans\")):\n",
+    "        if isinstance(row[\"extractive_spans\"], list):\n",
+    "            responses.append(\" \".join(map(str, row[\"extractive_spans\"])))\n",
+    "        else:\n",
+    "            responses.append(str(row[\"extractive_spans\"]))\n",
+    "    if pd.notna(row.get(\"yes_no\")):\n",
+    "        responses.append(str(row[\"yes_no\"]))\n",
+    "    if pd.notna(row.get(\"free_form_answer\")):\n",
+    "        responses.append(str(row[\"free_form_answer\"]))\n",
+    "    return \"\\n\".join(responses) if responses else np.nan"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def preprocess_hf_dataset(hf_ds):\n",
+    "    \"\"\"\n",
+    "    Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.\n",
+    "\n",
+    "    Steps:\n",
+    "      1. For each sample, convert 'full_text' to a markdown string.\n",
+    "      2. For every QA pair in the sample, extract the question and first answer.\n",
+    "      3. Build lists for answers, questions, and full_text (duplicated per question).\n",
+    "      4. Create a DataFrame from the collected data.\n",
+    "      5. Clean columns by replacing empty lists/strings with NaN and joining lists.\n",
+    "      6. Combine the answer components into a single 'golden response'.\n",
+    "\n",
+    "    The function uses nested tqdm progress bars for real-time feedback.\n",
+    "\n",
+    "    Returns:\n",
+    "        pd.DataFrame: The preprocessed DataFrame.\n",
+    "    \"\"\"\n",
+    "    answers_list = []  # Stores the first answer for each question\n",
+    "    questions_list = []  # Stores each question text\n",
+    "    full_text_list = []  # Stores the formatted full text per QA pair\n",
+    "\n",
+    "    # Outer loop: iterate over samples with progress bar\n",
+    "    for sample in tqdm(hf_ds, desc=\"Processing samples\", unit=\"sample\"):\n",
+    "        # Convert full text once per sample\n",
+    "        formatted_text = convert_full_text_to_markdown(sample[\"full_text\"])\n",
+    "        # Create a list of QA pairs\n",
+    "        qa_pairs = list(zip(sample[\"qas\"][\"question\"], sample[\"qas\"][\"answers\"]))\n",
+    "\n",
+    "        # Inner loop: iterate over each QA pair with its own progress bar\n",
+    "        for question, answer_set in tqdm(\n",
+    "            qa_pairs, desc=\"Processing QAs\", total=len(qa_pairs), leave=False, unit=\"qa\"\n",
+    "        ):\n",
+    "            answers_list.append(answer_set[\"answer\"][0])\n",
+    "            questions_list.append(question)\n",
+    "            full_text_list.append(formatted_text)\n",
+    "\n",
+    "    # Create DataFrame from the collected data\n",
+    "    df = pd.DataFrame(answers_list)\n",
+    "    df[\"question\"] = questions_list\n",
+    "    df[\"full_text\"] = full_text_list\n",
+    "\n",
+    "    # Data Cleaning: Replace empty lists/strings with NaN and join lists if needed\n",
+    "    df[\"extractive_spans\"] = df[\"extractive_spans\"].apply(\n",
+    "        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x\n",
+    "    )\n",
+    "    df[\"free_form_answer\"] = df[\"free_form_answer\"].apply(\n",
+    "        lambda x: np.nan if isinstance(x, str) and x.strip() == \"\" else x\n",
+    "    )\n",
+    "    df[\"yes_no\"] = df[\"yes_no\"].apply(lambda x: np.nan if x is None else x)\n",
+    "    df[\"extractive_spans\"] = df[\"extractive_spans\"].apply(\n",
+    "        lambda x: \"\\n\".join(x) if isinstance(x, list) else x\n",
+    "    )\n",
+    "\n",
+    "    # Combine the answer components into a single 'golden response'\n",
+    "    df[\"golden response\"] = df.apply(lambda row: combine_responses(row), axis=1)\n",
+    "\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_ds = dataset[\"train\"]\n",
+    "validation_ds = dataset[\"validation\"]\n",
+    "test_ds = dataset[\"test\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processing samples: 100%|██████████| 888/888 [00:04<00:00, 211.20sample/s]\n",
+      "Processing samples: 100%|██████████| 281/281 [00:01<00:00, 199.77sample/s]\n",
+      "Processing samples: 100%|██████████| 416/416 [00:02<00:00, 198.84sample/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "train_df = preprocess_hf_dataset(train_ds)\n",
+    "validation_df = preprocess_hf_dataset(validation_ds)\n",
+    "test_df = preprocess_hf_dataset(test_ds)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.llms.google_genai import GoogleGenAI\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()\n",
+    "\n",
+    "gemini_2 = GoogleGenAI(\n",
+    "    model=\"gemini-2.0-flash\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "AI learns patterns from data to make predictions or decisions.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "from google import genai\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()\n",
+    "\n",
+    "client = genai.Client(api_key=os.getenv(\"GOOGLE_API_KEY\"))\n",
+    "\n",
+    "response = client.models.generate_content(\n",
+    "    model=\"gemini-2.0-flash\", contents=\"Explain how AI works in a few words\"\n",
+    ")\n",
+    "print(response.text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'which multilingual approaches do they compare with?'"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "idx = 0\n",
+    "query = validation_df.iloc[idx][\"question\"]\n",
+    "context = validation_df.iloc[idx][\"full_text\"]\n",
+    "query"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'The paper compares its approach with multilingual NMT (MNMT) from  BIBREF19.  Another comparison is made against a pivoting method that uses MNMT (pivoting<sub>m</sub>), which uses MNMT to translate source to pivot and then to target.\\n'"
+      ]
+     },
+     "execution_count": 59,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "context_str = context\n",
+    "query_str = query\n",
+    "\n",
+    "\n",
+    "qa_prompt = (\n",
+    "    f\"Context information is below.\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"{context_str}\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"Given the context information and not prior knowledge, \"\n",
+    "    \"answer the query.\\n\"\n",
+    "    \"If you cannot answer the query, just say that it cannot be answered.\\n\"\n",
+    "    \"Query: {query_str}\\n\"\n",
+    "    \"Answer: \"\n",
+    ")\n",
+    "\n",
+    "formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "response = gemini_2.complete(formatted_prompt)\n",
+    "response.text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "They compare their approaches with Multilingual NMT (MNMT) described in BIBREF19 and BIBREF22.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from google import genai\n",
+    "\n",
+    "client = genai.Client()\n",
+    "\n",
+    "context_str = context\n",
+    "query_str = query\n",
+    "\n",
+    "qa_prompt = (\n",
+    "    f\"Context information is below.\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"{context_str}\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"Given the context information and not prior knowledge, \"\n",
+    "    \"answer the query.\\n\"\n",
+    "    \"If you cannot answer the query, just say that it cannot be answered.\\n\"\n",
+    "    \"Query: {query_str}\\n\"\n",
+    "    \"Answer: \"\n",
+    ")\n",
+    "\n",
+    "formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "\n",
+    "\n",
+    "response = await client.aio.models.generate_content(\n",
+    "    model='gemini-2.0-flash', contents=formatted_prompt\n",
+    ")\n",
+    "print(response.text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'BIBREF19\\nBIBREF20'"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "validation_df.iloc[idx][\"golden response\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from async_executor import AsyncExecutor\n",
+    "\n",
+    "gemini_2 = GoogleGenAI(\n",
+    "    model=\"gemini-2.0-flash\",\n",
+    ")\n",
+    "\n",
+    "async def query_llm(query_str: str, context_str: str):\n",
+    "    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "    response = await gemini_2.acomplete(formatted_prompt)\n",
+    "    return response\n",
+    "\n",
+    "\n",
+    "# Create an instance of the asynchronous executor\n",
+    "executor = AsyncExecutor(\n",
+    "    desc=\"LLM Processing\",\n",
+    "    show_progress=True,\n",
+    "    raise_exceptions=False,\n",
+    "    max_calls_per_minute=1250,\n",
+    ")\n",
+    "\n",
+    "df = validation_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "LLM Processing: 100%|██████████| 1005/1005 [00:52<00:00, 19.13it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for idx in range(df.shape[0]):\n",
+    "    query = df.iloc[idx][\"question\"]\n",
+    "    context = df.iloc[idx][\"full_text\"]\n",
+    "    executor.submit(query_llm, query, context)\n",
+    "\n",
+    "# Execute the jobs and get the results in order\n",
+    "validation_responses = executor.results()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>They compare their approaches with Multilingua...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The pivot-based method is used as a baseline. ...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>They experimented with two public datasets: Eu...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The language pairs explored in this paper are:...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1000</th>\n",
+       "      <td>What approaches do they use towards text analy...</td>\n",
+       "      <td>Based on the provided text, the approaches use...</td>\n",
+       "      <td>Domain experts and fellow researchers can prov...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1001</th>\n",
+       "      <td>What dataset do they use for analysis?</td>\n",
+       "      <td>The context information mentions using data fr...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1002</th>\n",
+       "      <td>Do they demonstrate why interdisciplinary insi...</td>\n",
+       "      <td>Yes, the text explicitly states that interdisc...</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1003</th>\n",
+       "      <td>What background do they have?</td>\n",
+       "      <td>The authors are scholars from very different d...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1004</th>\n",
+       "      <td>What kind of issues (that are not on the foref...</td>\n",
+       "      <td>The article aims to shed light on thorny issue...</td>\n",
+       "      <td>identifying the questions we wish to explore\\n...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>1005 rows × 3 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                             user_input  \\\n",
+       "0     which multilingual approaches do they compare ...   \n",
+       "1                   what are the pivot-based baselines?   \n",
+       "2              which datasets did they experiment with?   \n",
+       "3                     what language pairs are explored?   \n",
+       "4                       what ner models were evaluated?   \n",
+       "...                                                 ...   \n",
+       "1000  What approaches do they use towards text analy...   \n",
+       "1001             What dataset do they use for analysis?   \n",
+       "1002  Do they demonstrate why interdisciplinary insi...   \n",
+       "1003                      What background do they have?   \n",
+       "1004  What kind of issues (that are not on the foref...   \n",
+       "\n",
+       "                                               response  \\\n",
+       "0     They compare their approaches with Multilingua...   \n",
+       "1     The pivot-based method is used as a baseline. ...   \n",
+       "2     They experimented with two public datasets: Eu...   \n",
+       "3     The language pairs explored in this paper are:...   \n",
+       "4     Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "...                                                 ...   \n",
+       "1000  Based on the provided text, the approaches use...   \n",
+       "1001  The context information mentions using data fr...   \n",
+       "1002  Yes, the text explicitly states that interdisc...   \n",
+       "1003  The authors are scholars from very different d...   \n",
+       "1004  The article aims to shed light on thorny issue...   \n",
+       "\n",
+       "                                              reference  \n",
+       "0                                    BIBREF19\\nBIBREF20  \n",
+       "1                          pivoting\\npivoting$_{\\rm m}$  \n",
+       "2                                     Europarl\\nMultiUN  \n",
+       "3     De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...  \n",
+       "4     Stanford NER\\nspaCy 2.0 \\nrecurrent model with...  \n",
+       "...                                                 ...  \n",
+       "1000  Domain experts and fellow researchers can prov...  \n",
+       "1001                                                     \n",
+       "1002                                              False  \n",
+       "1003                                                     \n",
+       "1004  identifying the questions we wish to explore\\n...  \n",
+       "\n",
+       "[1005 rows x 3 columns]"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas.dataset_schema import EvaluationDataset\n",
+    "\n",
+    "dataset_list = []\n",
+    "\n",
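+    "# Build Ragas samples, replacing any missing question, golden response, or model response with an empty string\n",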
+    "for i in range(df.shape[0]):\n",
+    "    sample = {\n",
+    "        \"user_input\": (\n",
+    "            \"\" if pd.isna(df.iloc[i].get(\"question\")) else df.iloc[i].get(\"question\")\n",
+    "        ),\n",
+    "        \"reference\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(df.iloc[i].get(\"golden response\"))\n",
+    "            else df.iloc[i].get(\"golden response\")\n",
+    "        ),\n",
+    "        \"response\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(validation_responses[i].text)\n",
+    "            else validation_responses[i].text\n",
+    "        ),\n",
+    "    }\n",
+    "    dataset_list.append(sample)\n",
+    "\n",
+    "dataset = EvaluationDataset.from_list(dataset_list)\n",
+    "dataset.to_pandas()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.metrics import (\n",
+    "    AnswerAccuracy,\n",
+    "    AnswerCorrectness,\n",
+    "    FactualCorrectness,\n",
+    "    AspectCritic,\n",
+    ")\n",
+    "import getpass\n",
+    "import os\n",
+    "\n",
+    "from ragas.llms import LangchainLLMWrapper\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "if \"OPENAI_API_KEY\" not in os.environ:\n",
+    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")\n",
+    "\n",
+    "evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o-mini\"))\n",
+    "\n",
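+    "# Custom binary aspect: the judge LLM returns 1 if it deems the query unanswerable, otherwise 0\n",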
+    "aspect_critic = AspectCritic(\n",
+    "    name=\"unanswerable\",\n",
+    "    definition=\"Return 1 if the query cannot be answered by the provided context, otherwise return 0.\",\n",
+    "    llm=evaluator_llm,\n",
+    ")\n",
+    "\n",
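+    "# Metrics used for benchmarking; AnswerCorrectness with weights=[1, 0] relies on the factuality component only\n",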
+    "metrics = [\n",
+    "    AnswerAccuracy(llm=evaluator_llm),\n",
+    "    AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),\n",
+    "    aspect_critic,\n",
+    "    FactualCorrectness(llm=evaluator_llm),\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'They compare their approaches with Multilingual NMT (MNMT) from BIBREF19 and BIBREF22.\\n'"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "validation_responses[0].text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Evaluating: 100%|██████████| 4020/4020 [25:38<00:00,  2.61it/s] \n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "      <th>nv_accuracy</th>\n",
+       "      <th>answer_correctness</th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>factual_correctness(mode=f1)</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>They compare their approaches with Multilingua...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.67</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The pivot-based method is used as a baseline. ...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>They experimented with two public datasets: Eu...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>1.00</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.40</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The language pairs explored in this paper are:...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.8</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1000</th>\n",
+       "      <td>What approaches do they use towards text analy...</td>\n",
+       "      <td>Based on the provided text, the approaches use...</td>\n",
+       "      <td>Domain experts and fellow researchers can prov...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1001</th>\n",
+       "      <td>What dataset do they use for analysis?</td>\n",
+       "      <td>The context information mentions using data fr...</td>\n",
+       "      <td></td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1002</th>\n",
+       "      <td>Do they demonstrate why interdisciplinary insi...</td>\n",
+       "      <td>Yes, the text explicitly states that interdisc...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1003</th>\n",
+       "      <td>What background do they have?</td>\n",
+       "      <td>The authors are scholars from very different d...</td>\n",
+       "      <td></td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1004</th>\n",
+       "      <td>What kind of issues (that are not on the foref...</td>\n",
+       "      <td>The article aims to shed light on thorny issue...</td>\n",
+       "      <td>identifying the questions we wish to explore\\n...</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>1005 rows × 7 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                             user_input  \\\n",
+       "0     which multilingual approaches do they compare ...   \n",
+       "1                   what are the pivot-based baselines?   \n",
+       "2              which datasets did they experiment with?   \n",
+       "3                     what language pairs are explored?   \n",
+       "4                       what ner models were evaluated?   \n",
+       "...                                                 ...   \n",
+       "1000  What approaches do they use towards text analy...   \n",
+       "1001             What dataset do they use for analysis?   \n",
+       "1002  Do they demonstrate why interdisciplinary insi...   \n",
+       "1003                      What background do they have?   \n",
+       "1004  What kind of issues (that are not on the foref...   \n",
+       "\n",
+       "                                               response  \\\n",
+       "0     They compare their approaches with Multilingua...   \n",
+       "1     The pivot-based method is used as a baseline. ...   \n",
+       "2     They experimented with two public datasets: Eu...   \n",
+       "3     The language pairs explored in this paper are:...   \n",
+       "4     Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "...                                                 ...   \n",
+       "1000  Based on the provided text, the approaches use...   \n",
+       "1001  The context information mentions using data fr...   \n",
+       "1002  Yes, the text explicitly states that interdisc...   \n",
+       "1003  The authors are scholars from very different d...   \n",
+       "1004  The article aims to shed light on thorny issue...   \n",
+       "\n",
+       "                                              reference  nv_accuracy  \\\n",
+       "0                                    BIBREF19\\nBIBREF20         0.25   \n",
+       "1                          pivoting\\npivoting$_{\\rm m}$         0.50   \n",
+       "2                                     Europarl\\nMultiUN         1.00   \n",
+       "3     De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...         0.00   \n",
+       "4     Stanford NER\\nspaCy 2.0 \\nrecurrent model with...         0.50   \n",
+       "...                                                 ...          ...   \n",
+       "1000  Domain experts and fellow researchers can prov...         0.00   \n",
+       "1001                                                            0.50   \n",
+       "1002                                              False         0.00   \n",
+       "1003                                                            0.50   \n",
+       "1004  identifying the questions we wish to explore\\n...         0.25   \n",
+       "\n",
+       "      answer_correctness  unanswerable  factual_correctness(mode=f1)  \n",
+       "0                    0.5             0                          0.67  \n",
+       "1                    0.8             0                          0.00  \n",
+       "2                    0.8             0                          0.40  \n",
+       "3                    1.0             0                          0.00  \n",
+       "4                    0.8             0                          0.00  \n",
+       "...                  ...           ...                           ...  \n",
+       "1000                 0.0             0                          0.00  \n",
+       "1001                 0.0             0                          0.00  \n",
+       "1002                 0.0             0                          0.00  \n",
+       "1003                 0.0             1                          0.00  \n",
+       "1004                 0.0             0                          0.00  \n",
+       "\n",
+       "[1005 rows x 7 columns]"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas import evaluate\n",
+    "\n",
+    "gemini_2_score = evaluate(dataset=dataset, metrics=metrics)\n",
+    "gemini_2_score.to_pandas()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gemini_2_score"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This step is completely optional: if you want to upload the evaluation results to the Ragas app, you can run the command below. You can learn more about the Ragas app here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/evaluation/908c34a5-3996-4703-8eae-a7daf210c6d7\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'https://app.ragas.io/dashboard/alignment/evaluation/908c34a5-3996-4703-8eae-a7daf210c6d7'"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gemini_2_score.upload()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preds = gemini_2_score[\"unanswerable\"]\n",
+    "actuals = validation_df[\"unanswerable\"].astype(int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Accuracy: 0.844776119402985\n",
+      "Precision: 0.31736526946107785\n",
+      "Recall: 0.5578947368421052\n",
+      "F1 Score: 0.40458015267175573\n",
+      "\n",
+      "Classification Report:\n",
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       0.95      0.87      0.91       910\n",
+      "           1       0.32      0.56      0.40        95\n",
+      "\n",
+      "    accuracy                           0.84      1005\n",
+      "   macro avg       0.63      0.72      0.66      1005\n",
+      "weighted avg       0.89      0.84      0.86      1005\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.metrics import (\n",
+    "    classification_report,\n",
+    "    accuracy_score,\n",
+    "    precision_score,\n",
+    "    recall_score,\n",
+    "    f1_score,\n",
+    ")\n",
+    "\n",
+    "# Calculate and print basic metrics\n",
+    "print(\"Accuracy:\", accuracy_score(actuals, preds))\n",
+    "print(\"Precision:\", precision_score(actuals, preds))\n",
+    "print(\"Recall:\", recall_score(actuals, preds))\n",
+    "print(\"F1 Score:\", f1_score(actuals, preds))\n",
+    "\n",
+    "# Generate and print the classification report\n",
+    "print(\"\\nClassification Report:\")\n",
+    "print(classification_report(actuals, preds))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Benchmarking Gemini 1.5 Flash"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gemini_1_5 = GoogleGenAI(\n",
+    "    model=\"gemini-1.5-flash\",\n",
+    ")\n",
+    "\n",
+    "\n",
+    "async def query_llm(query_str: str, context_str: str):\n",
+    "    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "    response = await gemini_1_5.acomplete(formatted_prompt)\n",
+    "    return response\n",
+    "\n",
+    "\n",
+    "# Create an instance of the asynchronous executor\n",
+    "executor = AsyncExecutor(\n",
+    "    desc=\"Querying LLM\",\n",
+    "    show_progress=True,\n",
+    "    raise_exceptions=False,\n",
+    "    max_calls_per_minute=1250,\n",
+    ")\n",
+    "\n",
+    "df = validation_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for idx in range(df.shape[0]):\n",
+    "    query = df.iloc[idx][\"question\"]\n",
+    "    context = df.iloc[idx][\"full_text\"]\n",
+    "    executor.submit(query_llm, query, context)\n",
+    "\n",
+    "# Execute the jobs and get the results in order\n",
+    "validation_responses = executor.results()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 65,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The paper compares its approach with multiling...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The provided text mentions two types of pivot-...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>The experiments were conducted on two public d...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The paper explores the following language pair...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1000</th>\n",
+       "      <td>What approaches do they use towards text analy...</td>\n",
+       "      <td>The authors utilize several approaches to text...</td>\n",
+       "      <td>Domain experts and fellow researchers can prov...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1001</th>\n",
+       "      <td>What dataset do they use for analysis?</td>\n",
+       "      <td>The primary dataset used for analysis in the p...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1002</th>\n",
+       "      <td>Do they demonstrate why interdisciplinary insi...</td>\n",
+       "      <td>Yes, the authors demonstrate the importance of...</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1003</th>\n",
+       "      <td>What background do they have?</td>\n",
+       "      <td>The authors have diverse disciplinary backgrou...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1004</th>\n",
+       "      <td>What kind of issues (that are not on the foref...</td>\n",
+       "      <td>The authors tackle thorny issues related to th...</td>\n",
+       "      <td>identifying the questions we wish to explore\\n...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>1005 rows × 3 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                             user_input  \\\n",
+       "0     which multilingual approaches do they compare ...   \n",
+       "1                   what are the pivot-based baselines?   \n",
+       "2              which datasets did they experiment with?   \n",
+       "3                     what language pairs are explored?   \n",
+       "4                       what ner models were evaluated?   \n",
+       "...                                                 ...   \n",
+       "1000  What approaches do they use towards text analy...   \n",
+       "1001             What dataset do they use for analysis?   \n",
+       "1002  Do they demonstrate why interdisciplinary insi...   \n",
+       "1003                      What background do they have?   \n",
+       "1004  What kind of issues (that are not on the foref...   \n",
+       "\n",
+       "                                               response  \\\n",
+       "0     The paper compares its approach with multiling...   \n",
+       "1     The provided text mentions two types of pivot-...   \n",
+       "2     The experiments were conducted on two public d...   \n",
+       "3     The paper explores the following language pair...   \n",
+       "4     Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "...                                                 ...   \n",
+       "1000  The authors utilize several approaches to text...   \n",
+       "1001  The primary dataset used for analysis in the p...   \n",
+       "1002  Yes, the authors demonstrate the importance of...   \n",
+       "1003  The authors have diverse disciplinary backgrou...   \n",
+       "1004  The authors tackle thorny issues related to th...   \n",
+       "\n",
+       "                                              reference  \n",
+       "0                                    BIBREF19\\nBIBREF20  \n",
+       "1                          pivoting\\npivoting$_{\\rm m}$  \n",
+       "2                                     Europarl\\nMultiUN  \n",
+       "3     De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...  \n",
+       "4     Stanford NER\\nspaCy 2.0 \\nrecurrent model with...  \n",
+       "...                                                 ...  \n",
+       "1000  Domain experts and fellow researchers can prov...  \n",
+       "1001                                                     \n",
+       "1002                                              False  \n",
+       "1003                                                     \n",
+       "1004  identifying the questions we wish to explore\\n...  \n",
+       "\n",
+       "[1005 rows x 3 columns]"
+      ]
+     },
+     "execution_count": 65,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas.dataset_schema import EvaluationDataset\n",
+    "\n",
+    "dataset_list = []\n",
+    "\n",
+    "for i in range(df.shape[0]):\n",
+    "    sample = {\n",
+    "        \"user_input\": (\n",
+    "            \"\" if pd.isna(df.iloc[i].get(\"question\")) else df.iloc[i].get(\"question\")\n",
+    "        ),\n",
+    "        \"reference\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(df.iloc[i].get(\"golden response\"))\n",
+    "            else df.iloc[i].get(\"golden response\")\n",
+    "        ),\n",
+    "        \"response\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(validation_responses[i].text)\n",
+    "            else validation_responses[i].text\n",
+    "        ),\n",
+    "    }\n",
+    "    dataset_list.append(sample)\n",
+    "\n",
+    "dataset = EvaluationDataset.from_list(dataset_list)\n",
+    "dataset.to_pandas()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 67,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Evaluating: 100%|██████████| 4020/4020 [27:40<00:00,  2.42it/s] \n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "      <th>nv_accuracy</th>\n",
+       "      <th>answer_correctness</th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>factual_correctness(mode=f1)</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The paper compares its approach with multiling...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The provided text mentions two types of pivot-...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.500000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>The experiments were conducted on two public d...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>1.00</td>\n",
+       "      <td>1.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The paper explores the following language pair...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.250000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.571429</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1000</th>\n",
+       "      <td>What approaches do they use towards text analy...</td>\n",
+       "      <td>The authors utilize several approaches to text...</td>\n",
+       "      <td>Domain experts and fellow researchers can prov...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1001</th>\n",
+       "      <td>What dataset do they use for analysis?</td>\n",
+       "      <td>The primary dataset used for analysis in the p...</td>\n",
+       "      <td></td>\n",
+       "      <td>1.00</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1002</th>\n",
+       "      <td>Do they demonstrate why interdisciplinary insi...</td>\n",
+       "      <td>Yes, the authors demonstrate the importance of...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1003</th>\n",
+       "      <td>What background do they have?</td>\n",
+       "      <td>The authors have diverse disciplinary backgrou...</td>\n",
+       "      <td></td>\n",
+       "      <td>0.75</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1004</th>\n",
+       "      <td>What kind of issues (that are not on the foref...</td>\n",
+       "      <td>The authors tackle thorny issues related to th...</td>\n",
+       "      <td>identifying the questions we wish to explore\\n...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>1005 rows × 7 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                             user_input  \\\n",
+       "0     which multilingual approaches do they compare ...   \n",
+       "1                   what are the pivot-based baselines?   \n",
+       "2              which datasets did they experiment with?   \n",
+       "3                     what language pairs are explored?   \n",
+       "4                       what ner models were evaluated?   \n",
+       "...                                                 ...   \n",
+       "1000  What approaches do they use towards text analy...   \n",
+       "1001             What dataset do they use for analysis?   \n",
+       "1002  Do they demonstrate why interdisciplinary insi...   \n",
+       "1003                      What background do they have?   \n",
+       "1004  What kind of issues (that are not on the foref...   \n",
+       "\n",
+       "                                               response  \\\n",
+       "0     The paper compares its approach with multiling...   \n",
+       "1     The provided text mentions two types of pivot-...   \n",
+       "2     The experiments were conducted on two public d...   \n",
+       "3     The paper explores the following language pair...   \n",
+       "4     Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "...                                                 ...   \n",
+       "1000  The authors utilize several approaches to text...   \n",
+       "1001  The primary dataset used for analysis in the p...   \n",
+       "1002  Yes, the authors demonstrate the importance of...   \n",
+       "1003  The authors have diverse disciplinary backgrou...   \n",
+       "1004  The authors tackle thorny issues related to th...   \n",
+       "\n",
+       "                                              reference  nv_accuracy  \\\n",
+       "0                                    BIBREF19\\nBIBREF20         0.25   \n",
+       "1                          pivoting\\npivoting$_{\\rm m}$         0.25   \n",
+       "2                                     Europarl\\nMultiUN         1.00   \n",
+       "3     De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...         0.00   \n",
+       "4     Stanford NER\\nspaCy 2.0 \\nrecurrent model with...         0.50   \n",
+       "...                                                 ...          ...   \n",
+       "1000  Domain experts and fellow researchers can prov...         0.00   \n",
+       "1001                                                            1.00   \n",
+       "1002                                              False         0.00   \n",
+       "1003                                                            0.75   \n",
+       "1004  identifying the questions we wish to explore\\n...         0.00   \n",
+       "\n",
+       "      answer_correctness  unanswerable  factual_correctness(mode=f1)  \n",
+       "0               0.000000             0                           0.0  \n",
+       "1               0.500000             0                           0.0  \n",
+       "2               1.000000             0                           0.0  \n",
+       "3               0.250000             0                           0.0  \n",
+       "4               0.571429             0                           0.0  \n",
+       "...                  ...           ...                           ...  \n",
+       "1000            0.000000             0                           0.0  \n",
+       "1001            0.000000             0                           0.0  \n",
+       "1002            0.000000             0                           0.0  \n",
+       "1003            0.000000             0                           0.0  \n",
+       "1004            0.000000             0                           0.0  \n",
+       "\n",
+       "[1005 rows x 7 columns]"
+      ]
+     },
+     "execution_count": 67,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas import evaluate\n",
+    "\n",
+    "gemini_1_5_score = evaluate(dataset=dataset, metrics=metrics)\n",
+    "gemini_1_5_score.to_pandas()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 72,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'nv_accuracy': 0.4724, 'answer_correctness': 0.3366, 'unanswerable': 0.1841, 'factual_correctness(mode=f1)': 0.2269}"
+      ]
+     },
+     "execution_count": 72,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gemini_1_5_score"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preds = gemini_1_5_score[\"unanswerable\"]\n",
+    "actuals = validation_df[\"unanswerable\"].astype(int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Accuracy: 0.83681592039801\n",
+      "Precision: 0.31351351351351353\n",
+      "Recall: 0.6105263157894737\n",
+      "F1 Score: 0.4142857142857143\n",
+      "\n",
+      "Classification Report:\n",
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       0.95      0.86      0.91       910\n",
+      "           1       0.31      0.61      0.41        95\n",
+      "\n",
+      "    accuracy                           0.84      1005\n",
+      "   macro avg       0.63      0.74      0.66      1005\n",
+      "weighted avg       0.89      0.84      0.86      1005\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.metrics import (\n",
+    "    classification_report,\n",
+    "    accuracy_score,\n",
+    "    precision_score,\n",
+    "    recall_score,\n",
+    "    f1_score,\n",
+    ")\n",
+    "\n",
+    "# Calculate and print basic metrics\n",
+    "print(\"Accuracy:\", accuracy_score(actuals, preds))\n",
+    "print(\"Precision:\", precision_score(actuals, preds))\n",
+    "print(\"Recall:\", recall_score(actuals, preds))\n",
+    "print(\"F1 Score:\", f1_score(actuals, preds))\n",
+    "\n",
+    "# Generate and print the classification report\n",
+    "print(\"\\nClassification Report:\")\n",
+    "print(classification_report(actuals, preds))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 71,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/evaluation/2a3849ff-b142-4440-9c13-42f5fda332c9\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'https://app.ragas.io/dashboard/alignment/evaluation/2a3849ff-b142-4440-9c13-42f5fda332c9'"
+      ]
+     },
+     "execution_count": 71,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gemini_1_5_score.upload()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Comparing the Results"
+   ]
+  },
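+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To compare the two models, we can aggregate the per-sample scores from each evaluation run. The cell below is a minimal sketch (assuming `gemini_2_score` and `gemini_1_5_score` are still in memory): it converts each result to a pandas DataFrame and averages the metric columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal comparison sketch: average the per-sample metric scores from both runs.\n",
+    "# The column names are taken from the evaluation outputs above.\n",
+    "gemini_2_df = gemini_2_score.to_pandas()\n",
+    "gemini_1_5_df = gemini_1_5_score.to_pandas()\n",
+    "\n",
+    "metric_cols = [\n",
+    "    \"nv_accuracy\",\n",
+    "    \"answer_correctness\",\n",
+    "    \"unanswerable\",\n",
+    "    \"factual_correctness(mode=f1)\",\n",
+    "]\n",
+    "\n",
+    "comparison_df = pd.DataFrame(\n",
+    "    {\n",
+    "        \"Gemini 2.0 Flash\": gemini_2_df[metric_cols].mean(),\n",
+    "        \"Gemini 1.5 Flash\": gemini_1_5_df[metric_cols].mean(),\n",
+    "    }\n",
+    ")\n",
+    "comparison_df"
+   ]
+  },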
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next Steps"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By following the steps above, you can benchmark any model with Ragas metrics: convert your benchmark dataset into a Ragas EvaluationDataset, select the metrics of your choice, and run the evaluate function. The cell below sketches this general recipe (the sample data and variable names in it are placeholders)."
+   ]
+  },
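+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generic recipe (sketch): benchmark any model with Ragas.\n",
+    "# `my_benchmark_rows` is a hypothetical placeholder for your own\n",
+    "# (question, model answer, gold answer) triples.\n",
+    "from ragas import evaluate\n",
+    "from ragas.dataset_schema import EvaluationDataset\n",
+    "\n",
+    "my_benchmark_rows = [\n",
+    "    (\"What is 2 + 2?\", \"4\", \"4\"),\n",
+    "]\n",
+    "my_samples = [\n",
+    "    {\"user_input\": q, \"response\": answer, \"reference\": gold}\n",
+    "    for q, answer, gold in my_benchmark_rows\n",
+    "]\n",
+    "my_dataset = EvaluationDataset.from_list(my_samples)\n",
+    "\n",
+    "# Reuses the `metrics` list defined earlier in this notebook\n",
+    "my_result = evaluate(dataset=my_dataset, metrics=metrics)\n",
+    "my_result.to_pandas()"
+   ]
+  }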
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "fixci",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/posts/benchmarking/final_benchmarking.ipynb b/posts/benchmarking/final_benchmarking.ipynb
new file mode 100644
index 0000000..014dd4c
--- /dev/null
+++ b/posts/benchmarking/final_benchmarking.ipynb
@@ -0,0 +1,1582 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Benchmarking Gemini Models using Ragas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this tutorial, we'll benchmark Gemini models on the AllenAI QASPER dataset using Ragas metrics for the Academic Question Answering task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### About the Dataset\n",
+    "\n",
+    "QASPER (Question Answering over Scientific Papers) is a dataset consisting of 5,049 questions based on 1,585 NLP research papers. Annotators created these questions from titles and abstracts, with answers extracted from the full paper texts. It is designed to challenge document-level reasoning and support research in academic question answering.\n",
+    "\n",
+    "Data Collection Process:\n",
+    "1. Paper Selection: NLP domain papers from arXiv (LaTeX format) were selected from the S2ORC corpus.\n",
+    "2. Question Writing: Annotators wrote realistic, information-seeking questions based only on paper titles and abstracts.\n",
+    "3. Answer Annotation: Different annotators reviewed the entire paper to identify answers, selecting minimal relevant evidence (texts, tables, figures).\n",
+    "\n",
+    "![Data collection Process of QASPER Dataset](qasper_data_collection.png)\n",
+    "\n",
+    "\n",
+    "Link to the [Dataset](https://huggingface.co/datasets/allenai/qasper) and further details about QASPER can be found [here](https://huggingface.co/datasets/allenai/qasper). \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading Dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For demonstration purposes, we'll use a subset of 10 examples from the validation split:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Dataset({\n",
+       "    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],\n",
+       "    num_rows: 10\n",
+       "})"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from datasets import load_dataset\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "dataset = load_dataset(\"allenai/qasper\", split=\"validation[:10]\")\n",
+    "dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Processing Dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since our goal is to benchmark the model’s performance on academic question-answering tasks, we need the responses generated by LLMs based on the full text of each research paper. We extract the full text from the dataset’s \"full_text\" column and format it into markdown, clearly organizing sections and paragraphs for readability and context.\n",
+    "\n",
+    "To create question-answer pairs for evaluation, we use the dataset’s \"qas\" column, which provides questions and corresponding answers in three formats: extractive spans, yes/no responses, or free-form answers. We then combine these formats into a single \"golden answer\" column, which serves as the ground truth for assessing model performance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def convert_full_text_to_markdown(full_text_dict):\n",
+    "    \"\"\"\n",
+    "    Converts a full_text dictionary into a markdown-formatted string.\n",
+    "\n",
+    "    Expected keys:\n",
+    "      - \"section_name\": list of section titles.\n",
+    "      - \"paragraphs\": list of lists of paragraphs corresponding to each section.\n",
+    "\n",
+    "    Each section becomes a markdown header (##) followed by its paragraphs.\n",
+    "    \"\"\"\n",
+    "    sections = full_text_dict.get(\"section_name\", [])\n",
+    "    paragraphs = full_text_dict.get(\"paragraphs\", [])\n",
+    "\n",
+    "    markdown_lines = []\n",
+    "    for section, paragraph in zip(sections, paragraphs):\n",
+    "        markdown_lines.append(f\"## {section}\")\n",
+    "        markdown_lines.append(\"\")  # Blank line\n",
+    "        markdown_lines.append(\"\\n\".join(map(str, paragraph)))\n",
+    "        markdown_lines.append(\"\")  # End of section\n",
+    "        markdown_lines.append(\"\")  # Extra blank line for separation\n",
+    "    return \"\\n\".join(markdown_lines)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def combine_responses(row):\n",
+    "    \"\"\"\n",
+    "    Combines 'extractive_spans', 'yes_no', and 'free_form_answer'\n",
+    "    into one single string. Skips components that are missing.\n",
+    "    \"\"\"\n",
+    "    responses = []\n",
+    "    if pd.notna(row.get(\"extractive_spans\")):\n",
+    "        if isinstance(row[\"extractive_spans\"], list):\n",
+    "            responses.append(\" \".join(map(str, row[\"extractive_spans\"])))\n",
+    "        else:\n",
+    "            responses.append(str(row[\"extractive_spans\"]))\n",
+    "    if pd.notna(row.get(\"yes_no\")):\n",
+    "        responses.append(str(row[\"yes_no\"]))\n",
+    "    if pd.notna(row.get(\"free_form_answer\")):\n",
+    "        responses.append(str(row[\"free_form_answer\"]))\n",
+    "    return \"\\n\".join(responses) if responses else np.nan"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def preprocess_hf_dataset(hf_ds):\n",
+    "    \"\"\"\n",
+    "    Processes a HuggingFace dataset split into a cleaned Pandas DataFrame.\n",
+    "\n",
+    "    Steps:\n",
+    "      1. For each sample, convert 'full_text' to a markdown string.\n",
+    "      2. For every QA pair in the sample, extract the question and first answer.\n",
+    "      3. Build lists for answers, questions, and full_text (duplicated per question).\n",
+    "      4. Create a DataFrame from the collected data.\n",
+    "      5. Clean columns by replacing empty lists/strings with NaN and joining lists.\n",
+    "      6. Combine the answer components into a single 'golden response'.\n",
+    "\n",
+    "    The function uses nested tqdm progress bars for real-time feedback.\n",
+    "\n",
+    "    Returns:\n",
+    "        pd.DataFrame: The preprocessed DataFrame.\n",
+    "    \"\"\"\n",
+    "    answers_list = []  # Stores the first answer for each question\n",
+    "    questions_list = []  # Stores each question text\n",
+    "    full_text_list = []  # Stores the formatted full text per QA pair\n",
+    "\n",
+    "    # Outer loop: iterate over samples with progress bar\n",
+    "    for sample in tqdm(hf_ds, desc=\"Processing samples\", unit=\"sample\"):\n",
+    "        # Convert full text once per sample\n",
+    "        formatted_text = convert_full_text_to_markdown(sample[\"full_text\"])\n",
+    "        # Create a list of QA pairs\n",
+    "        qa_pairs = list(zip(sample[\"qas\"][\"question\"], sample[\"qas\"][\"answers\"]))\n",
+    "\n",
+    "        # Inner loop: iterate over each QA pair with its own progress bar\n",
+    "        for question, answer_set in tqdm(\n",
+    "            qa_pairs, desc=\"Processing QAs\", total=len(qa_pairs), leave=False, unit=\"qa\"\n",
+    "        ):\n",
+    "            answers_list.append(answer_set[\"answer\"][0])\n",
+    "            questions_list.append(question)\n",
+    "            full_text_list.append(formatted_text)\n",
+    "\n",
+    "    # Create DataFrame from the collected data\n",
+    "    df = pd.DataFrame(answers_list)\n",
+    "    df[\"question\"] = questions_list\n",
+    "    df[\"full_text\"] = full_text_list\n",
+    "\n",
+    "    # Data Cleaning: Replace empty lists/strings with NaN and join lists if needed\n",
+    "    df[\"extractive_spans\"] = df[\"extractive_spans\"].apply(\n",
+    "        lambda x: np.nan if isinstance(x, list) and len(x) == 0 else x\n",
+    "    )\n",
+    "    df[\"free_form_answer\"] = df[\"free_form_answer\"].apply(\n",
+    "        lambda x: np.nan if isinstance(x, str) and x.strip() == \"\" else x\n",
+    "    )\n",
+    "    df[\"yes_no\"] = df[\"yes_no\"].apply(lambda x: np.nan if x is None else x)\n",
+    "    df[\"extractive_spans\"] = df[\"extractive_spans\"].apply(\n",
+    "        lambda x: \"\\n\".join(x) if isinstance(x, list) else x\n",
+    "    )\n",
+    "\n",
+    "    # Combine the answer components into a single 'golden response'\n",
+    "    df[\"golden response\"] = df.apply(lambda row: combine_responses(row), axis=1)\n",
+    "\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processing samples: 100%|██████████| 10/10 [00:00<00:00, 208.37sample/s]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>extractive_spans</th>\n",
+       "      <th>yes_no</th>\n",
+       "      <th>free_form_answer</th>\n",
+       "      <th>evidence</th>\n",
+       "      <th>highlighted_evidence</th>\n",
+       "      <th>question</th>\n",
+       "      <th>full_text</th>\n",
+       "      <th>golden response</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>False</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>\n",
+       "      <td>[We compare our approaches with related approa...</td>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>False</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>\n",
+       "      <td>[We compare our approaches with related approa...</td>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>False</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[We evaluate our cross-lingual pre-training ba...</td>\n",
+       "      <td>[We evaluate our cross-lingual pre-training ba...</td>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>[For MultiUN corpus, we use four languages: En...</td>\n",
+       "      <td>[For MultiUN corpus, we use four languages: En...</td>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>False</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[In this section we describe a number of exper...</td>\n",
+       "      <td>[In this section we describe a number of exper...</td>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>## Introduction\\n\\nNamed entity recognition is...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   unanswerable                                   extractive_spans yes_no  \\\n",
+       "0         False                                 BIBREF19\\nBIBREF20    NaN   \n",
+       "1         False                       pivoting\\npivoting$_{\\rm m}$    NaN   \n",
+       "2         False                                  Europarl\\nMultiUN    NaN   \n",
+       "3         False                                                NaN    NaN   \n",
+       "4         False  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...    NaN   \n",
+       "\n",
+       "                                    free_form_answer  \\\n",
+       "0                                                NaN   \n",
+       "1                                                NaN   \n",
+       "2                                                NaN   \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...   \n",
+       "4                                                NaN   \n",
+       "\n",
+       "                                            evidence  \\\n",
+       "0  [Table TABREF19 and TABREF26 report zero-shot ...   \n",
+       "1  [Table TABREF19 and TABREF26 report zero-shot ...   \n",
+       "2  [We evaluate our cross-lingual pre-training ba...   \n",
+       "3  [For MultiUN corpus, we use four languages: En...   \n",
+       "4  [In this section we describe a number of exper...   \n",
+       "\n",
+       "                                highlighted_evidence  \\\n",
+       "0  [We compare our approaches with related approa...   \n",
+       "1  [We compare our approaches with related approa...   \n",
+       "2  [We evaluate our cross-lingual pre-training ba...   \n",
+       "3  [For MultiUN corpus, we use four languages: En...   \n",
+       "4  [In this section we describe a number of exper...   \n",
+       "\n",
+       "                                            question  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                           full_text  \\\n",
+       "0  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "1  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "2  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "3  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "4  ## Introduction\\n\\nNamed entity recognition is...   \n",
+       "\n",
+       "                                     golden response  \n",
+       "0                                 BIBREF19\\nBIBREF20  \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$  \n",
+       "2                                  Europarl\\nMultiUN  \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...  \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...  "
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "processed_dataset = preprocess_hf_dataset(dataset)\n",
+    "processed_dataset.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Generating Responses from Gemini Models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To generate responses using the Gemini model, we’ll first need to instantiate the Google GenAI client. We will define a prompt template that will be used when generating responses."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from google import genai\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()\n",
+    "\n",
+    "client = genai.Client(api_key=os.getenv(\"GOOGLE_API_KEY\"))\n",
+    "\n",
+    "qa_prompt = (\n",
+    "    f\"Context information is below.\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"{context_str}\\n\"\n",
+    "    \"---------------------\\n\"\n",
+    "    \"Given the context information and not prior knowledge, \"\n",
+    "    \"answer the query.\\n\"\n",
+    "    \"If you cannot find answer to the query, just say that it cannot be answered.\\n\"\n",
+    "    \"Query: {query_str}\\n\"\n",
+    "    \"Answer: \"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Gemini 2.0 Falsh"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "LLM Processing: 100%|██████████| 30/30 [00:04<00:00,  7.20it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from async_executor import AsyncExecutor\n",
+    "\n",
+    "async def query_gemini_2(query_str: str, context_str: str):\n",
+    "    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "    response = await client.aio.models.generate_content(\n",
+    "        model=\"gemini-2.0-flash\", contents=formatted_prompt\n",
+    "    )\n",
+    "    return response.text\n",
+    "\n",
+    "# Create an instance of the asynchronous executor\n",
+    "executor = AsyncExecutor(\n",
+    "    desc=\"LLM Processing\",\n",
+    "    show_progress=True,\n",
+    "    raise_exceptions=False,\n",
+    ")\n",
+    "\n",
+    "for idx in range(processed_dataset.shape[0]):\n",
+    "    query = processed_dataset.iloc[idx][\"question\"]\n",
+    "    context = processed_dataset.iloc[idx][\"full_text\"]\n",
+    "    executor.submit(query_gemini_2, query, context)\n",
+    "\n",
+    "processed_dataset[\"gemini_2_flash_responses\"] = executor.results()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Gemini 1.5 Falsh"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "LLM Processing: 100%|██████████| 30/30 [00:05<00:00,  5.94it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from async_executor import AsyncExecutor\n",
+    "\n",
+    "async def query_gemini_1_5(query_str: str, context_str: str):\n",
+    "    formatted_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)\n",
+    "    response = await client.aio.models.generate_content(\n",
+    "        model=\"gemini-1.5-flash\", contents=formatted_prompt\n",
+    "    )\n",
+    "    return response.text\n",
+    "\n",
+    "# Create a new instance of the asynchronous executor\n",
+    "executor = AsyncExecutor(\n",
+    "    desc=\"LLM Processing\",\n",
+    "    show_progress=True,\n",
+    "    raise_exceptions=False,\n",
+    ")\n",
+    "\n",
+    "for idx in range(processed_dataset.shape[0]):\n",
+    "    query = processed_dataset.iloc[idx][\"question\"]\n",
+    "    context = processed_dataset.iloc[idx][\"full_text\"]\n",
+    "    executor.submit(query_gemini_1_5, query, context)\n",
+    "\n",
+    "processed_dataset[\"gemini_1_5_flash_responses\"] = executor.results()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>extractive_spans</th>\n",
+       "      <th>yes_no</th>\n",
+       "      <th>free_form_answer</th>\n",
+       "      <th>evidence</th>\n",
+       "      <th>highlighted_evidence</th>\n",
+       "      <th>question</th>\n",
+       "      <th>full_text</th>\n",
+       "      <th>golden response</th>\n",
+       "      <th>gemini_2_flash_responses</th>\n",
+       "      <th>gemini_1_5_flash_responses</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>False</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>\n",
+       "      <td>[We compare our approaches with related approa...</td>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>The text mentions comparison with Multilingual...</td>\n",
+       "      <td>The paper compares its approach with multiling...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>False</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[Table TABREF19 and TABREF26 report zero-shot ...</td>\n",
+       "      <td>[We compare our approaches with related approa...</td>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>The pivot-based baselines are pivoting and piv...</td>\n",
+       "      <td>The provided text mentions two types of pivot-...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>False</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[We evaluate our cross-lingual pre-training ba...</td>\n",
+       "      <td>[We evaluate our cross-lingual pre-training ba...</td>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>They experimented with the Europarl and MultiU...</td>\n",
+       "      <td>The experiments used two public datasets: Euro...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>False</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>[For MultiUN corpus, we use four languages: En...</td>\n",
+       "      <td>[For MultiUN corpus, we use four languages: En...</td>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>## Introduction\\n\\nAlthough Neural Machine Tra...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>The language pairs explored in this paper are:...</td>\n",
+       "      <td>The paper explores the following language pair...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>False</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>[In this section we describe a number of exper...</td>\n",
+       "      <td>[In this section we describe a number of exper...</td>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>## Introduction\\n\\nNamed entity recognition is...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>Based on the provided text, the following NER ...</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   unanswerable                                   extractive_spans yes_no  \\\n",
+       "0         False                                 BIBREF19\\nBIBREF20    NaN   \n",
+       "1         False                       pivoting\\npivoting$_{\\rm m}$    NaN   \n",
+       "2         False                                  Europarl\\nMultiUN    NaN   \n",
+       "3         False                                                NaN    NaN   \n",
+       "4         False  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...    NaN   \n",
+       "\n",
+       "                                    free_form_answer  \\\n",
+       "0                                                NaN   \n",
+       "1                                                NaN   \n",
+       "2                                                NaN   \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...   \n",
+       "4                                                NaN   \n",
+       "\n",
+       "                                            evidence  \\\n",
+       "0  [Table TABREF19 and TABREF26 report zero-shot ...   \n",
+       "1  [Table TABREF19 and TABREF26 report zero-shot ...   \n",
+       "2  [We evaluate our cross-lingual pre-training ba...   \n",
+       "3  [For MultiUN corpus, we use four languages: En...   \n",
+       "4  [In this section we describe a number of exper...   \n",
+       "\n",
+       "                                highlighted_evidence  \\\n",
+       "0  [We compare our approaches with related approa...   \n",
+       "1  [We compare our approaches with related approa...   \n",
+       "2  [We evaluate our cross-lingual pre-training ba...   \n",
+       "3  [For MultiUN corpus, we use four languages: En...   \n",
+       "4  [In this section we describe a number of exper...   \n",
+       "\n",
+       "                                            question  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                           full_text  \\\n",
+       "0  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "1  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "2  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "3  ## Introduction\\n\\nAlthough Neural Machine Tra...   \n",
+       "4  ## Introduction\\n\\nNamed entity recognition is...   \n",
+       "\n",
+       "                                     golden response  \\\n",
+       "0                                 BIBREF19\\nBIBREF20   \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$   \n",
+       "2                                  Europarl\\nMultiUN   \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...   \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...   \n",
+       "\n",
+       "                            gemini_2_flash_responses  \\\n",
+       "0  The text mentions comparison with Multilingual...   \n",
+       "1  The pivot-based baselines are pivoting and piv...   \n",
+       "2  They experimented with the Europarl and MultiU...   \n",
+       "3  The language pairs explored in this paper are:...   \n",
+       "4  Based on the provided text, the following NER ...   \n",
+       "\n",
+       "                          gemini_1_5_flash_responses  \n",
+       "0  The paper compares its approach with multiling...  \n",
+       "1  The provided text mentions two types of pivot-...  \n",
+       "2  The experiments used two public datasets: Euro...  \n",
+       "3  The paper explores the following language pair...  \n",
+       "4  Stanford NER, spaCy 2.0, and a recurrent model...  "
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "processed_dataset.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Defining Metrics For Evaluation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We are benchmarking a question-answering task and we want to ensure that each question is answered properly and accurately. To achieve this, we use the following metrics from Ragas you find the complete list of metrics [here]()\n",
+    "\n",
+    "- [Answer Accuracy](): Measures how closely a response matches the reference answer.\n",
+    "- [Answer Correctness](): Assesses the alignment between the generated answer and the reference answer.\n",
+    "- [Factual Correctness]():Checks if all statements in a response are supported by the reference answer.\n",
+    "\n",
+    "For each question, we know whether it can be answered from the provided context, and we want to verify if the model correctly identifies when it cannot. For this purpose, we define a custom binary metric using [AspectCritique]()."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.metrics import AnswerAccuracy, AnswerCorrectness, FactualCorrectness, AspectCritic\n",
+    "import getpass\n",
+    "import os\n",
+    "\n",
+    "from ragas.llms import LangchainLLMWrapper\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "if \"OPENAI_API_KEY\" not in os.environ:\n",
+    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")\n",
+    "\n",
+    "evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o-mini\"))\n",
+    "\n",
+    "aspect_critic = AspectCritic(\n",
+    "    name=\"unanswerable\",\n",
+    "    definition=\"Return 1 if the query cannot be answered by the provided context, otherwise return 0.\",\n",
+    "    llm=evaluator_llm,\n",
+    ")\n",
+    "\n",
+    "metrics = [\n",
+    "    AnswerAccuracy(llm=evaluator_llm),\n",
+    "    AnswerCorrectness(llm=evaluator_llm, weights=[1, 0]),\n",
+    "    aspect_critic,\n",
+    "    FactualCorrectness(llm=evaluator_llm),\n",
+    "]"
+   ]
+  },
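+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before benchmarking the full dataset, we can optionally sanity-check the custom `unanswerable` metric on a single made-up example. The sketch below assumes the `SingleTurnSample` class and the `single_turn_ascore` method of recent Ragas versions; the question and response are invented purely for illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.dataset_schema import SingleTurnSample\n",
+    "\n",
+    "# A made-up example where the response states that the context lacks the answer\n",
+    "sample = SingleTurnSample(\n",
+    "    user_input=\"What optimizer was used to train the model?\",\n",
+    "    response=\"The provided context does not mention the optimizer, so it cannot be answered.\",\n",
+    ")\n",
+    "\n",
+    "# Score the custom binary metric on this single sample (1 = judged unanswerable)\n",
+    "await aspect_critic.single_turn_ascore(sample)"
+   ]
+  },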
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Benchmarking on Ragas Metrics"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We format the processed data into a Ragas-compatible EvaluationDataset, then apply the metrics to evaluate model performance, more information on it can be found [here](). We’ll construct the EvaluationDataset using the questions and the golden answer responses generated by the Gemini models from our processed dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Gemini 2.0 Falsh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll create EvaluationDataset for the Gemini 2.0 Flash."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The text mentions comparison with Multilingual...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The pivot-based baselines are pivoting and piv...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>They experimented with the Europarl and MultiU...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The language pairs explored in this paper are:...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Based on the provided text, the following NER ...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                          user_input  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                            response  \\\n",
+       "0  The text mentions comparison with Multilingual...   \n",
+       "1  The pivot-based baselines are pivoting and piv...   \n",
+       "2  They experimented with the Europarl and MultiU...   \n",
+       "3  The language pairs explored in this paper are:...   \n",
+       "4  Based on the provided text, the following NER ...   \n",
+       "\n",
+       "                                           reference  \n",
+       "0                                 BIBREF19\\nBIBREF20  \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$  \n",
+       "2                                  Europarl\\nMultiUN  \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...  \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...  "
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas.dataset_schema import EvaluationDataset\n",
+    "\n",
+    "dataset_list = []\n",
+    "\n",
+    "for i in range(processed_dataset.shape[0]):\n",
+    "    sample = {\n",
+    "        \"user_input\": (\n",
+    "            \"\" if pd.isna(processed_dataset.iloc[i].get(\"question\")) else processed_dataset.iloc[i].get(\"question\")\n",
+    "        ),\n",
+    "        \"reference\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(processed_dataset.iloc[i].get(\"golden response\"))\n",
+    "            else processed_dataset.iloc[i].get(\"golden response\")\n",
+    "        ),\n",
+    "        \"response\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(processed_dataset[\"gemini_2_flash_responses\"].iloc[i])\n",
+    "            else processed_dataset[\"gemini_2_flash_responses\"].iloc[i]\n",
+    "        ),\n",
+    "    }\n",
+    "    dataset_list.append(sample)\n",
+    "\n",
+    "gemini_2_dataset = EvaluationDataset.from_list(dataset_list)\n",
+    "gemini_2_dataset.to_pandas().head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let’s evaluate the responses of Gemini 2.0 Falsh."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Evaluating: 100%|██████████| 120/120 [00:49<00:00,  2.44it/s]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "      <th>nv_accuracy</th>\n",
+       "      <th>answer_correctness</th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>factual_correctness(mode=f1)</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The text mentions comparison with Multilingual...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.400000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The pivot-based baselines are pivoting and piv...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>They experimented with the Europarl and MultiU...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>1.00</td>\n",
+       "      <td>1.000000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The language pairs explored in this paper are:...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.545455</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Based on the provided text, the following NER ...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.600000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                          user_input  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                            response  \\\n",
+       "0  The text mentions comparison with Multilingual...   \n",
+       "1  The pivot-based baselines are pivoting and piv...   \n",
+       "2  They experimented with the Europarl and MultiU...   \n",
+       "3  The language pairs explored in this paper are:...   \n",
+       "4  Based on the provided text, the following NER ...   \n",
+       "\n",
+       "                                           reference  nv_accuracy  \\\n",
+       "0                                 BIBREF19\\nBIBREF20         0.25   \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$         0.25   \n",
+       "2                                  Europarl\\nMultiUN         1.00   \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...         0.25   \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...         0.50   \n",
+       "\n",
+       "   answer_correctness  unanswerable  factual_correctness(mode=f1)  \n",
+       "0            0.400000             0                           0.5  \n",
+       "1            0.000000             0                           0.0  \n",
+       "2            1.000000             0                           0.0  \n",
+       "3            0.545455             0                           0.0  \n",
+       "4            0.600000             0                           0.0  "
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas import evaluate\n",
+    "\n",
+    "gemini_2_flash_score = evaluate(dataset=gemini_2_dataset, metrics=metrics)\n",
+    "gemini_2_flash_score.to_pandas().head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A completely optional step, if you want to upload the evaluation results to your Ragas app, you can run the command below.You can learn more about Ragas app here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/evaluation/0e83b23d-ceb6-49cf-b8b2-eec951b43417\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'https://app.ragas.io/dashboard/alignment/evaluation/0e83b23d-ceb6-49cf-b8b2-eec951b43417'"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "gemini_2_flash_score.upload()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Gemini 1.5 Flash"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we’ll follow similar steps for Gemini 1.5 Flash as well.\n",
+    "\n",
+    "We’ll generate the evaluation dataset for the Gemini 1.5 Flash responses and then perform the same evaluation on it's responses."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The paper compares its approach with multiling...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The provided text mentions two types of pivot-...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>The experiments used two public datasets: Euro...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The paper explores the following language pair...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                          user_input  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                            response  \\\n",
+       "0  The paper compares its approach with multiling...   \n",
+       "1  The provided text mentions two types of pivot-...   \n",
+       "2  The experiments used two public datasets: Euro...   \n",
+       "3  The paper explores the following language pair...   \n",
+       "4  Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "\n",
+       "                                           reference  \n",
+       "0                                 BIBREF19\\nBIBREF20  \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$  \n",
+       "2                                  Europarl\\nMultiUN  \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...  \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...  "
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas.dataset_schema import EvaluationDataset\n",
+    "\n",
+    "dataset_list = []\n",
+    "\n",
+    "for i in range(processed_dataset.shape[0]):\n",
+    "    sample = {\n",
+    "        \"user_input\": (\n",
+    "            \"\" if pd.isna(processed_dataset.iloc[i].get(\"question\")) else processed_dataset.iloc[i].get(\"question\")\n",
+    "        ),\n",
+    "        \"reference\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(processed_dataset.iloc[i].get(\"golden response\"))\n",
+    "            else processed_dataset.iloc[i].get(\"golden response\")\n",
+    "        ),\n",
+    "        \"response\": (\n",
+    "            \"\"\n",
+    "            if pd.isna(processed_dataset[\"gemini_1_5_flash_responses\"].iloc[i])\n",
+    "            else processed_dataset[\"gemini_1_5_flash_responses\"].iloc[i]\n",
+    "        ),\n",
+    "    }\n",
+    "    dataset_list.append(sample)\n",
+    "\n",
+    "gemini_1_5_dataset = EvaluationDataset.from_list(dataset_list)\n",
+    "gemini_1_5_dataset.to_pandas().head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Evaluating: 100%|██████████| 120/120 [01:02<00:00,  1.93it/s]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>response</th>\n",
+       "      <th>reference</th>\n",
+       "      <th>nv_accuracy</th>\n",
+       "      <th>answer_correctness</th>\n",
+       "      <th>unanswerable</th>\n",
+       "      <th>factual_correctness(mode=f1)</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>which multilingual approaches do they compare ...</td>\n",
+       "      <td>The paper compares its approach with multiling...</td>\n",
+       "      <td>BIBREF19\\nBIBREF20</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.400000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>what are the pivot-based baselines?</td>\n",
+       "      <td>The provided text mentions two types of pivot-...</td>\n",
+       "      <td>pivoting\\npivoting$_{\\rm m}$</td>\n",
+       "      <td>0.25</td>\n",
+       "      <td>0.181818</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.18</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>which datasets did they experiment with?</td>\n",
+       "      <td>The experiments used two public datasets: Euro...</td>\n",
+       "      <td>Europarl\\nMultiUN</td>\n",
+       "      <td>1.00</td>\n",
+       "      <td>0.800000</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>what language pairs are explored?</td>\n",
+       "      <td>The paper explores the following language pair...</td>\n",
+       "      <td>De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>0.533333</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>what ner models were evaluated?</td>\n",
+       "      <td>Stanford NER, spaCy 2.0, and a recurrent model...</td>\n",
+       "      <td>Stanford NER\\nspaCy 2.0 \\nrecurrent model with...</td>\n",
+       "      <td>0.50</td>\n",
+       "      <td>0.571429</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.00</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                          user_input  \\\n",
+       "0  which multilingual approaches do they compare ...   \n",
+       "1                what are the pivot-based baselines?   \n",
+       "2           which datasets did they experiment with?   \n",
+       "3                  what language pairs are explored?   \n",
+       "4                    what ner models were evaluated?   \n",
+       "\n",
+       "                                            response  \\\n",
+       "0  The paper compares its approach with multiling...   \n",
+       "1  The provided text mentions two types of pivot-...   \n",
+       "2  The experiments used two public datasets: Euro...   \n",
+       "3  The paper explores the following language pair...   \n",
+       "4  Stanford NER, spaCy 2.0, and a recurrent model...   \n",
+       "\n",
+       "                                           reference  nv_accuracy  \\\n",
+       "0                                 BIBREF19\\nBIBREF20         0.25   \n",
+       "1                       pivoting\\npivoting$_{\\rm m}$         0.25   \n",
+       "2                                  Europarl\\nMultiUN         1.00   \n",
+       "3  De-En, En-Fr, Fr-En, En-Es, Ro-En, En-De, Ar-E...         0.00   \n",
+       "4  Stanford NER\\nspaCy 2.0 \\nrecurrent model with...         0.50   \n",
+       "\n",
+       "   answer_correctness  unanswerable  factual_correctness(mode=f1)  \n",
+       "0            0.400000             0                          0.00  \n",
+       "1            0.181818             0                          0.18  \n",
+       "2            0.800000             0                          0.00  \n",
+       "3            0.533333             0                          0.00  \n",
+       "4            0.571429             0                          0.00  "
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas import evaluate\n",
+    "\n",
+    "gemini_1_5_flash_score = evaluate(dataset=gemini_1_5_dataset, metrics=metrics)\n",
+    "gemini_1_5_flash_score.to_pandas().head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Comparing the Results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we have completed our evaluations, let’s compare how both models performed on acadmic question answering."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def print__results(result):\n",
+    "    result = result._repr_dict\n",
+    "    print(\"Response Accuracy:\", result.get(\"nv_accuracy\"))\n",
+    "    print(\"Answer Correctness:\", result.get(\"answer_correctness\"))\n",
+    "    print(\"Factual Correctness:\", result.get(\"factual_correctness(mode=f1)\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Response Accuracy: 0.5416666666666666\n",
+      "Answer Correctness: 0.47723550201811066\n",
+      "Factual Correctness: 0.2533333333333333\n"
+     ]
+    }
+   ],
+   "source": [
+    "print__results(gemini_1_5_flash_score)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Response Accuracy: 0.5666666666666667\n",
+      "Answer Correctness: 0.48055486996663466\n",
+      "Factual Correctness: 0.23633333333333334\n"
+     ]
+    }
+   ],
+   "source": [
+    "print__results(gemini_2_flash_score)"
+   ]
+  },
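+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional convenience, we can also place the aggregate scores of both models side by side in a small table, reusing the same `_repr_dict` attribute that the helper above reads from:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Rows are metric names, columns are models\n",
+    "comparison_df = pd.DataFrame(\n",
+    "    {\n",
+    "        \"Gemini 2.0 Flash\": gemini_2_flash_score._repr_dict,\n",
+    "        \"Gemini 1.5 Flash\": gemini_1_5_flash_score._repr_dict,\n",
+    "    }\n",
+    ")\n",
+    "comparison_df"
+   ]
+  },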
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Gemini 2.0 Flash performs slightly better overall."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let’s now see how well the models performed on classifying if a given question can be answered with the provided text. \n",
+    "\n",
+    "For this, we’ll use the result from the “unanswerable” metric and compare it with the original ground truth from the “unanswerable” column in our pre-processed dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score\n",
+    "\n",
+    "\n",
+    "def print_metrics(actuals, preds, model_name=\"Model\", zero_division_value=0):\n",
+    "    \"\"\"\n",
+    "    Prints common classification metrics for a given set of actual and predicted values.\n",
+    "\n",
+    "    Parameters:\n",
+    "        actuals (array-like): Ground truth labels.\n",
+    "        preds (array-like): Predicted labels.\n",
+    "        model_name (str): Name of the model for display purposes.\n",
+    "        zero_division_value (int or str): Sets the value to return when there is a zero division.\n",
+    "                                          Options: 0, 1, or \"warn\" (default is 0 here).\n",
+    "    \"\"\"\n",
+    "    print(f\"Metrics for {model_name}:\")\n",
+    "    print(\"Accuracy:\", accuracy_score(actuals, preds))\n",
+    "    print(\n",
+    "        \"Precision:\", precision_score(actuals, preds, zero_division=zero_division_value)\n",
+    "    )\n",
+    "    print(\"Recall:\", recall_score(actuals, preds, zero_division=zero_division_value))\n",
+    "    print(\"F1 Score:\", f1_score(actuals, preds, zero_division=zero_division_value))\n",
+    "    print(\"\\nClassification Report:\")\n",
+    "    print(classification_report(actuals, preds, zero_division=zero_division_value))\n",
+    "    \n",
+    "gemini_1_5_flash_prediction = gemini_1_5_flash_score[\"unanswerable\"]\n",
+    "gemini_2_flash_prediction = gemini_2_flash_score[\"unanswerable\"]\n",
+    "groundtruth = processed_dataset[\"unanswerable\"].astype(int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Metrics for Gemini 2 Flash:\n",
+      "Accuracy: 0.9333333333333333\n",
+      "Precision: 0.5\n",
+      "Recall: 1.0\n",
+      "F1 Score: 0.6666666666666666\n",
+      "\n",
+      "Classification Report:\n",
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       1.00      0.93      0.96        28\n",
+      "           1       0.50      1.00      0.67         2\n",
+      "\n",
+      "    accuracy                           0.93        30\n",
+      "   macro avg       0.75      0.96      0.81        30\n",
+      "weighted avg       0.97      0.93      0.94        30\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print_metrics(groundtruth, gemini_2_flash_prediction, model_name=\"Gemini 2 Flash\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Metrics for Gemini 1.5 Flash:\n",
+      "Accuracy: 0.9\n",
+      "Precision: 0.3333333333333333\n",
+      "Recall: 0.5\n",
+      "F1 Score: 0.4\n",
+      "\n",
+      "Classification Report:\n",
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       0.96      0.93      0.95        28\n",
+      "           1       0.33      0.50      0.40         2\n",
+      "\n",
+      "    accuracy                           0.90        30\n",
+      "   macro avg       0.65      0.71      0.67        30\n",
+      "weighted avg       0.92      0.90      0.91        30\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print_metrics(groundtruth, gemini_1_5_flash_prediction, model_name=\"Gemini 1.5 Flash\")"
+   ]
+  },
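+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, a confusion matrix gives a compact view of the same comparison. The sketch below uses scikit-learn's `confusion_matrix`; rows are the actual labels and columns are the predictions (0 = answerable, 1 = unanswerable):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import confusion_matrix\n",
+    "\n",
+    "# Rows: actual labels, columns: predicted labels (0 = answerable, 1 = unanswerable)\n",
+    "print(\"Gemini 2.0 Flash:\\n\", confusion_matrix(groundtruth, gemini_2_flash_prediction))\n",
+    "print(\"Gemini 1.5 Flash:\\n\", confusion_matrix(groundtruth, gemini_1_5_flash_prediction))"
+   ]
+  },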
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Gemini 2.0 Flash also outperforms Gemini 1.5 Flash in identifying unanswerable questions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What's Next\n",
+    "\n",
+    "You can benchmark your model on any dataset using Ragas metrics as long as the dataset is formatted according to Ragas EvaluationDatase. You can try benchmarking your models on a variety of established benchmarking datasets.\n",
+    "- [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA)\n",
+    "- [MultiHopRAG](https://huggingface.co/datasets/yixuantt/MultiHopRAG)\n",
+    "- [ms_marco](https://huggingface.co/datasets/microsoft/ms_marco)\n",
+    "\n",
+    "And many more."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "fixci",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/posts/benchmarking/qasper_data_collection.png b/posts/benchmarking/qasper_data_collection.png
new file mode 100644
index 0000000..cfa8282
Binary files /dev/null and b/posts/benchmarking/qasper_data_collection.png differ