Numeric Extraction Benchmarking Framework #70
Conversation
changed the merge base to the existing
@nilskre I guess we need to go over this case and see how it compares to the automated/encrypted workflow (after the hackathon)
Yes, it would be nice to evaluate whether we can bring these two developments together.
Looks good. Due to a 503 on http://llm.biocypher.org/ I was not able to test it practically.
I think that there is still some work left in making it more generally applicable. At the moment it is tailored towards this one dataset.
But I think coming up with a concept that is generally applicable (and adapting this case to it) should probably be its own follow-up task (and PR).
json_data = [json.loads(line) for line in response.iter_lines(decode_unicode=True) if line]

else:
    print(f"Failed to retrieve data: Status code {response.status_code}")
Maybe raise an exception instead of printing.
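For illustration, a minimal sketch of that change, assuming a requests-style response inside a hypothetical fetch_benchmark_data helper (the function name and wrapper are not part of the diff):

import json

import requests


def fetch_benchmark_data(url):
    """Download the benchmark dataset as a list of JSON records."""
    response = requests.get(url, stream=True)
    if response.status_code != 200:
        # Fail loudly instead of printing, so callers (and pytest) notice the problem.
        raise RuntimeError(f"Failed to retrieve data: Status code {response.status_code}")
    return [json.loads(line) for line in response.iter_lines(decode_unicode=True) if line]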
def test_preprocess_gender_representation_df():
    # Test the preprocessing function
    processed_df = preprocess_gender_represenation_df(SAMPLE_DF)
Why is the SAMPLE_DF used and not the real dataset file? This test fails for me with Failed to retrieve data: Status code 404. When calling this with the original dataset file (like in numericalQA) this works as expected.
return mean_accuracy, mean_precision, mean_recall, mean_f1, percentage_retrieved, results_df


def run_test(bechmark_df, model_uid, results_dictionary, main_url):
I guess parts of this should go into a benchmark test case in test_numericQA. This is what you meant with your comment about Pytest, right?
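For reference, a hedged sketch of what such a pytest case in test_numericQA could look like; the module path, the extract_numeric_values helper, and the example cases are illustrative placeholders, not part of this PR:

import pytest

# Hypothetical import: wherever the extraction helper ends up living.
from benchmark.numeric_extraction import extract_numeric_values

# Illustrative, hand-picked representative cases rather than the full 650-text dataset.
REPRESENTATIVE_CASES = [
    ("The trial enrolled 120 women and 80 men.", {"female": 120, "male": 80}),
    ("Of the 50 participants, 25 were female and 25 male.", {"female": 25, "male": 25}),
]


@pytest.mark.parametrize("text,expected", REPRESENTATIVE_CASES)
def test_numeric_extraction_representative(text, expected):
    # Edge cases and single representative examples keep the test suite cheap to run.
    assert extract_numeric_values(text) == expected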
Agree with this needing more integration into biochatter to be part of the ecosystem; as is, this really is just a script tailored to the dataset, without biochatter involvement. I also feel that testing the edge cases and single representative examples may be more suitable for integration into the current pytest framework, particularly because running all cases is quite resource-intensive for testing only one particular kind of question. For now I am making this a draft to be addressed/integrated later.
Closing this as stale.
I've initiated this pull request for benchmarking LLMs on numerical data extraction/reasoning tasks to invite early feedback and foster improvement. Please review and share your thoughts or modifications, as this is an ongoing process.
Key Developments and Issues:
Prompt Template & Output Formatting Across LLMs: Currently, all large language models (LLMs) use the same prompt template and output formatting. This uniformity is crucial for comparative analysis, but it also introduces some challenges.
Output Formatting: I've observed occasional failures in output formatting. Interestingly, after multiple attempts the system does correctly retrieve the numbers and place them into a Python dictionary. This inconsistency needs to be addressed for reliable performance, e.g. by running each query up to 5 times.
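A minimal sketch of such a retry loop, assuming query_fn stands in for whatever function sends the prompt to the model and returns raw text (both the name and the retry budget are illustrative):

import json


def query_with_retries(query_fn, prompt, max_attempts=5):
    """Retry an LLM call until its output parses as a JSON dictionary."""
    for _ in range(max_attempts):
        raw_output = query_fn(prompt)
        try:
            parsed = json.loads(raw_output)
        except json.JSONDecodeError:
            continue  # formatting failure: try again
        if isinstance(parsed, dict):
            return parsed
    return None  # formatting never succeeded within the retry budget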
For evaluating the numeric extraction capability, we have a dataset of 650 texts, each with known real answers.
Areas Needing Assistance:
Separation of JSON: I propose a clear distinction in our evaluation between the LLMs' ability to format output correctly (as JSON) and their proficiency in accurately retrieving numeric information. This separation will help pinpoint where each model excels and where it needs improvement.
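A minimal sketch of how the two aspects could be scored separately, assuming raw_outputs are the raw model responses and expected the known answers per text (names are illustrative, not the PR's scoring code):

import json


def score_formatting_and_retrieval(raw_outputs, expected):
    """Report JSON formatting and numeric retrieval as two separate metrics."""
    n_valid_json = 0
    n_correct_values = 0
    for raw, truth in zip(raw_outputs, expected):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against formatting only, not retrieval
        n_valid_json += 1
        if parsed == truth:
            n_correct_values += 1
    return {
        "json_format_rate": n_valid_json / len(raw_outputs),
        # retrieval accuracy conditioned on the output being parseable at all
        "numeric_accuracy_given_valid_json": n_correct_values / n_valid_json if n_valid_json else 0.0,
    }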
Pytest: I am not well-versed in pytest for implementing tests in this context. Your expertise or guidance in this area would be highly beneficial.
New Branch for Implementation: I plan to initiate a new branch where these aspects can be implemented and tested more effectively. This branch will focus on refining the output formatting process and enhancing the accuracy of numeric extraction.
Specific Scoring for This Dataset: It’s worth noting that the scoring mechanism we're using is tailored for this particular dataset. While it's effective in this context, we might need to adapt or expand it for broader applications.
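As one possible direction for broader applicability, a field-by-field comparison with an optional relative tolerance would reduce to the current exact matching at rel_tol=0.0 while remaining reusable for other numeric tasks (a sketch with illustrative names, not the PR's scoring code):

import math


def numeric_fields_match(predicted, truth, rel_tol=0.0):
    """Compare predicted and true numeric fields key by key."""
    if set(predicted) != set(truth):
        return False
    return all(
        math.isclose(float(predicted[key]), float(truth[key]), rel_tol=rel_tol)
        for key in truth
    )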