Run Longevity Benchmark #268

Open
wants to merge 1 commit into base: longevity-benchmark

Conversation

HansJarchow (Collaborator) commented Jan 28, 2025

Updates

(27.01.25)

I’ve made a few additional changes.

conftest.py:

Here, I added support for Groq and LM Studio models.
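
As a rough illustration of what that support can look like (not the actual conftest.py code; the groq/ prefix, the helper names, and the default LM Studio port are assumptions), a model name can be routed either to the Groq SDK or to LM Studio's OpenAI-compatible local server:

```python
# Sketch only: route benchmark model names to a Groq or LM Studio backend.
# The prefix convention, helper names, and LM Studio URL are illustrative.
import os

from groq import Groq      # official Groq SDK (OpenAI-compatible interface)
from openai import OpenAI  # LM Studio serves an OpenAI-compatible API locally


def get_client(model_name: str):
    """Return a chat-completions client for the given benchmark model name."""
    if model_name.startswith("groq/"):
        # Groq-hosted models, authenticated via the GROQ_API_KEY environment variable
        return Groq(api_key=os.environ["GROQ_API_KEY"])
    # Local LM Studio server (default port 1234); the API key is a placeholder
    return OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def query_model(model_name: str, prompt: str, system_prompt: str) -> str:
    """Send one prompt to the selected backend and return the response text."""
    client = get_client(model_name)
    response = client.chat.completions.create(
        model=model_name.removeprefix("groq/"),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```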

benchmark_utils.py:

Only minor changes to load_judgement_dataset.

load_dataset.py:

I integrated the handling of the different system prompts into _expand_longevity_test_cases.
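
To make the intent clearer, here is a minimal sketch of what that integration could look like (the YAML keys and field names are assumptions, not the actual implementation):

```python
# Sketch only: expand each longevity test case once per system prompt.
# The keys "longevity", "system_prompts", etc. are assumed names.
def _expand_longevity_test_cases(data: dict) -> list[dict]:
    expanded = []
    for case in data["longevity"]:
        for prompt_name, system_prompt in data["system_prompts"].items():
            expanded_case = dict(case)  # copy so the original case stays untouched
            expanded_case["system_prompt_name"] = prompt_name
            expanded_case["system_prompt"] = system_prompt
            expanded.append(expanded_case)
    return expanded
```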

test_longevity_judge_responses_simultan.py:

The test method test_generate_rag_responses has been added here.
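
Roughly, the test generates a RAG-augmented answer per model and test case and writes it to a CSV. The following is a hedged sketch only: the fixture names, the rag_context field, and the file naming mirror the CSVs listed further down rather than the actual test code, and query_model is the helper sketched for conftest.py above.

```python
# Sketch only: generate a RAG response and persist it for later judging.
import csv


def test_generate_rag_responses(model_name, longevity_test_case, tmp_path):
    question = longevity_test_case["question"]
    # retrieved passages for the question ("rag_context" is an assumed field name)
    context = "\n".join(longevity_test_case["rag_context"])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    answer = query_model(model_name, prompt, longevity_test_case["system_prompt"])

    out_file = tmp_path / f"generate_rag_responses_{model_name}_response.csv"
    with out_file.open("a", newline="") as fh:
        csv.writer(fh).writerow([question, answer])

    assert answer.strip(), "model returned an empty response"
```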

benchmark_longevity_data.yaml:

I added the system prompts and copied the previously separate RAG test cases into this file; all test cases are now in one YAML file.

prompts.yaml:

I added all additional requirements and updated the prompts to the latest version.

Additionally, I attempted to create an evaluation notebook, Longevity_Data_Analysis.ipynb, based on the generated example data:

  • generate_rag_responses_llama-3.2-3b-instruct_response.csv
  • generate_responses_llama-3.2-3b-instruct_response.csv
  • generate_responses_qwen2.5-14b-instruct_response.csv
  • judge_responses.csv
  • judge_behavior_response.csv

I also tried to determine how often the models are cautious, i.e., suggest consulting a doctor, and how often they are definitive, i.e., give a recommendation without deferring to healthcare professionals.
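
A minimal sketch of that count, using keyword matching over one of the response CSVs (the column name and marker phrases are assumptions, not necessarily what the notebook does):

```python
# Sketch only: count cautious vs. definitive responses via keyword matching.
import pandas as pd

CAUTIOUS_MARKERS = ("consult a doctor", "healthcare professional", "medical advice")

df = pd.read_csv("generate_responses_llama-3.2-3b-instruct_response.csv")
responses = df["response"].astype(str).str.lower()  # "response" is an assumed column name

is_cautious = responses.apply(lambda text: any(m in text for m in CAUTIOUS_MARKERS))
print(f"cautious:   {is_cautious.sum()} / {len(df)}")
print(f"definitive: {(~is_cautious).sum()} / {len(df)}")
```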

Updates

(28.01.25)

I’ve made a few more minor changes to Longevity_Data_Analysis.ipynb.

For this PR:

Since this is primarily about running the benchmark followed by evaluation, here are the tasks ahead:

  • Model selection
  • Preparation and execution of the benchmark/judgement (e.g., how many responses per question, how often should each response be evaluated by the judge?)
  • More or less pleasant: statistical evaluation (the raw data are binary, but if each response is judged multiple times, the averaged scores lie between 0 and 1, i.e., continuous; thresholds would then be needed before applying tests such as McNemar's; see the sketch after this list)
  • Visualization
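
For the statistical point, here is a hedged sketch of how thresholded judge scores could feed into McNemar's test for a paired comparison of two models (column names and the 0.5 threshold are assumptions):

```python
# Sketch only: average repeated judgements, binarise at a threshold, and compare
# two models on the same questions with McNemar's test.
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

judged = pd.read_csv("judge_responses.csv")  # assumed columns: model, question, judgement

# Mean over repeated judgements per (model, question), pivoted to one column per model,
# then thresholded to binary correct/incorrect.
scores = judged.groupby(["model", "question"])["judgement"].mean().unstack(0) >= 0.5

model_a, model_b = scores.columns[:2]
# 2x2 paired contingency table; assumes both outcomes occur for both models
table = pd.crosstab(scores[model_a], scores[model_b])
result = mcnemar(table, exact=True)  # exact binomial test, suitable for small counts
print(f"{model_a} vs {model_b}: statistic={result.statistic}, p-value={result.pvalue}")
```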
