Updates
(27.01.24)
I’ve made a few additional changes.
conftest.py: Here, I implemented model support for Groq models and LM Studio models.
benchmark_utils.py: Only minor changes to load_judgement_dataset.
load_dataset.py: I integrated the handling of the different system prompts into _expand_longevity_test_cases.
test_longevity_judge_responses_simultan.py: The test method test_generate_rag_responses has been added here.
benchmark_longevity_data.yaml: I added the system prompts and copied the previously separated RAG test cases into this file. All test cases are now in one YAML file.
prompts.yaml: I added all additional requirements and updated the prompts to the latest version.
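To illustrate the idea of expanding test cases across system prompts, here is a minimal sketch. It is not the code of _expand_longevity_test_cases, which is not shown in this description. The helper name expand_test_cases, the dictionary keys, and the sample prompts are all assumptions made for illustration; the real YAML layout may differ.

```python
from itertools import product


def expand_test_cases(test_cases, system_prompts):
    """Pair every test case with every system prompt (hypothetical helper)."""
    expanded = []
    for (prompt_name, prompt_text), case in product(system_prompts.items(), test_cases):
        # Copy the case and attach the prompt, so each combination
        # becomes its own independent test case.
        expanded.append({
            **case,
            "system_prompt_name": prompt_name,
            "system_prompt": prompt_text,
        })
    return expanded


# Illustrative data only; names and keys are assumptions, not the real YAML content.
system_prompts = {
    "minimal": "You are a helpful assistant.",
    "specific": "You are a longevity research assistant.",
}
test_cases = [
    {"id": "case_1", "input": "Example question 1"},
    {"id": "case_2", "input": "Example question 2"},
]

expanded = expand_test_cases(test_cases, system_prompts)
print(len(expanded))  # 2 prompts x 2 cases = 4
```

Keeping the expansion in one place means that adding a system prompt to the YAML file would multiply the test cases without touching the test methods themselves.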
Additionally, I attempted to create an evaluation script, Longevity_Data_Analysis.ipynb, based on the generated example data:
generate_rag_responses_llama-3.2-3b-instruct_response.csv
generate_responses_llama-3.2-3b-instruct_response.csv
generate_responses_qwen2.5-14b-instruct_response.csv
judge_responses.csv
judge_behavior_response.csv
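One way a notebook could tally responses that defer to a doctor versus responses that do not is sketched below. This is only a keyword heuristic under stated assumptions, not the method used in Longevity_Data_Analysis.ipynb. The column name response, the phrase list, and the sample rows are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical phrases that signal deferral to a healthcare professional.
CAUTIOUS_PATTERN = re.compile(
    r"consult (?:a|your) (?:doctor|physician)|healthcare professional|medical advice",
    re.IGNORECASE,
)


def classify(response):
    """Label a response 'cautious' if it defers to a professional, else 'definitive'."""
    return "cautious" if CAUTIOUS_PATTERN.search(response) else "definitive"


def tally(csv_file, column="response"):
    """Count labels over one CSV file; the column name is an assumption."""
    reader = csv.DictReader(csv_file)
    return Counter(classify(row[column]) for row in reader)


# Made-up rows standing in for one of the generated CSV files.
sample = io.StringIO(
    "model,response\n"
    'm1,"Please consult a doctor before changing anything."\n'
    'm1,"Take supplement X daily."\n'
    'm2,"Speak with a healthcare professional first."\n'
)
counts = tally(sample)
print(counts)
```

A keyword match is a crude proxy and will miss paraphrases, so a judge model or manual spot checks would be a sensible cross-check.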
I also tried to determine how often the models are cautious, i.e., how often they suggest consulting a doctor, and how often they are definitive, i.e., how often they provide a recommendation without deferring to healthcare professionals.
Updates
(28.01.24)
I’ve made a few more minor changes to Longevity_Data_Analysis.ipynb.
For this PR:
Since this is primarily about running the benchmark followed by evaluation, here are the tasks ahead: