Run Longevity Benchmark #268

Open
wants to merge 1 commit into base: longevity-benchmark

Conversation

HansJarchow (Collaborator) commented Jan 28, 2025

Updates

(27.01.25)

I’ve made a few additional changes.

conftest.py:

Here, I added support for Groq and LM Studio models.
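
As a rough illustration of what that support can look like (not the actual conftest.py code; the groq/ prefix, the helper names, and the default LM Studio port are assumptions), a model name can be routed either to the Groq SDK or to LM Studio's OpenAI-compatible local server:

```python
# Sketch only: route benchmark model names to a Groq or LM Studio backend.
# The prefix convention, helper names, and LM Studio URL are illustrative.
import os

from groq import Groq      # official Groq SDK (OpenAI-compatible interface)
from openai import OpenAI  # LM Studio serves an OpenAI-compatible API locally


def get_client(model_name: str):
    """Return a chat-completions client for the given benchmark model name."""
    if model_name.startswith("groq/"):
        # Groq-hosted models, authenticated via the GROQ_API_KEY environment variable
        return Groq(api_key=os.environ["GROQ_API_KEY"])
    # Local LM Studio server (default port 1234); the API key is a placeholder
    return OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def query_model(model_name: str, prompt: str, system_prompt: str) -> str:
    """Send one prompt to the selected backend and return the response text."""
    client = get_client(model_name)
    response = client.chat.completions.create(
        model=model_name.removeprefix("groq/"),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```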

benchmark_utils.py:

Only minor changes to load_judgement_dataset.

load_dataset.py:

I integrated the handling of the different system prompts into _expand_longevity_test_cases.
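
To make the intent clearer, here is a minimal sketch of what that integration could look like (the YAML keys and field names are assumptions, not the actual implementation):

```python
# Sketch only: expand each longevity test case once per system prompt.
# The keys "longevity", "system_prompts", etc. are assumed names.
def _expand_longevity_test_cases(data: dict) -> list[dict]:
    expanded = []
    for case in data["longevity"]:
        for prompt_name, system_prompt in data["system_prompts"].items():
            expanded_case = dict(case)  # copy so the original case stays untouched
            expanded_case["system_prompt_name"] = prompt_name
            expanded_case["system_prompt"] = system_prompt
            expanded.append(expanded_case)
    return expanded
```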

test_longevity_judge_responses_simultan.py:

The test method test_generate_rag_responses has been added here.
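
Roughly, the test generates a RAG-augmented answer per model and test case and writes it to a CSV. The following is a hedged sketch only: the fixture names, the rag_context field, and the file naming mirror the CSVs listed further down rather than the actual test code, and query_model is the helper sketched for conftest.py above.

```python
# Sketch only: generate a RAG response and persist it for later judging.
import csv


def test_generate_rag_responses(model_name, longevity_test_case, tmp_path):
    question = longevity_test_case["question"]
    # retrieved passages for the question ("rag_context" is an assumed field name)
    context = "\n".join(longevity_test_case["rag_context"])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    answer = query_model(model_name, prompt, longevity_test_case["system_prompt"])

    out_file = tmp_path / f"generate_rag_responses_{model_name}_response.csv"
    with out_file.open("a", newline="") as fh:
        csv.writer(fh).writerow([question, answer])

    assert answer.strip(), "model returned an empty response"
```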

benchmark_longevity_data.yaml:

I added the system prompts and copied the previously separate RAG test cases into this file; all test cases are now in one YAML file.

prompts.yaml:

I added all additional requirements and updated the prompts to the latest version.

Additionally, I attempted to create an evaluation notebook, Longevity_Data_Analysis.ipynb, based on the generated example data:

  • generate_rag_responses_llama-3.2-3b-instruct_response.csv
  • generate_responses_llama-3.2-3b-instruct_response.csv
  • generate_responses_qwen2.5-14b-instruct_response.csv
  • judge_responses.csv
  • judge_behavior_response.csv

I also tried to determine how often the models are cautious, i.e., suggest consulting a doctor, and how often they are definitive, i.e., give a recommendation without deferring to healthcare professionals.
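
A minimal sketch of that count, using keyword matching over one of the response CSVs (the column name and marker phrases are assumptions, not necessarily what the notebook does):

```python
# Sketch only: count cautious vs. definitive responses via keyword matching.
import pandas as pd

CAUTIOUS_MARKERS = ("consult a doctor", "healthcare professional", "medical advice")

df = pd.read_csv("generate_responses_llama-3.2-3b-instruct_response.csv")
responses = df["response"].astype(str).str.lower()  # "response" is an assumed column name

is_cautious = responses.apply(lambda text: any(m in text for m in CAUTIOUS_MARKERS))
print(f"cautious:   {is_cautious.sum()} / {len(df)}")
print(f"definitive: {(~is_cautious).sum()} / {len(df)}")
```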

Updates

(28.01.25)

I’ve made a few more minor changes to Longevity_Data_Analysis.ipynb.

For this PR:

Since this is primarily about running the benchmark followed by evaluation, here are the tasks ahead:

  • Model selection
  • Preparation and execution of the benchmark/judgement (e.g., how many responses per question, how often should each response be evaluated by the judge?)
  • More or less pleasant: statistical evaluation (the raw data are binary, but if each response is judged multiple times, the averaged scores lie between 0 and 1, i.e., continuous; thresholds would then be needed before applying tests such as McNemar's; see the sketch after this list)
  • Visualization
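
For the statistical point, here is a hedged sketch of how thresholded judge scores could feed into McNemar's test for a paired comparison of two models (column names and the 0.5 threshold are assumptions):

```python
# Sketch only: average repeated judgements, binarise at a threshold, and compare
# two models on the same questions with McNemar's test.
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

judged = pd.read_csv("judge_responses.csv")  # assumed columns: model, question, judgement

# Mean over repeated judgements per (model, question), pivoted to one column per model,
# then thresholded to binary correct/incorrect.
scores = judged.groupby(["model", "question"])["judgement"].mean().unstack(0) >= 0.5

model_a, model_b = scores.columns[:2]
# 2x2 paired contingency table; assumes both outcomes occur for both models
table = pd.crosstab(scores[model_a], scores[model_b])
result = mcnemar(table, exact=True)  # exact binomial test, suitable for small counts
print(f"{model_a} vs {model_b}: statistic={result.statistic}, p-value={result.pvalue}")
```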
