Numeric Extraction Benchmarking Framework #70
Conversation
changed the merge base to the existing
@nilskre I guess we need to go over this case and see how it compares to the automated/encrypted workflow (after the hackathon)
Yes, it would be nice to evaluate whether we can bring these two developments together.
Looks good. Due to a 503 on http://llm.biocypher.org/ I was not able to test it practically.
I think that there is still some work left in making it more generally applicable. At the moment it is tailored towards this one dataset.
But I think coming up with a concept that is generally applicable (and adapting this case to it) should probably be its own follow-up task (and PR).
json_data = [json.loads(line) for line in response.iter_lines(decode_unicode=True) if line]

else:
    print(f"Failed to retrieve data: Status code {response.status_code}")
Maybe raise an exception instead of printing.
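For illustration, a minimal sketch of that change, assuming a requests-style response inside a hypothetical fetch_benchmark_data helper (the function name and wrapper are not part of the diff):

import json

import requests


def fetch_benchmark_data(url):
    """Download the benchmark dataset as a list of JSON records."""
    response = requests.get(url, stream=True)
    if response.status_code != 200:
        # Fail loudly instead of printing, so callers (and pytest) notice the problem.
        raise RuntimeError(f"Failed to retrieve data: Status code {response.status_code}")
    return [json.loads(line) for line in response.iter_lines(decode_unicode=True) if line]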
def test_preprocess_gender_representation_df():
    # Test the preprocessing function
    processed_df = preprocess_gender_represenation_df(SAMPLE_DF)
Why is the SAMPLE_DF used and not the real dataset file? This test fails for me with Failed to retrieve data: Status code 404. When calling this with the original dataset file (like in numericalQA) this works as expected.
return mean_accuracy, mean_precision, mean_recall, mean_f1, percentage_retrieved, results_df


def run_test(bechmark_df, model_uid, results_dictionary, main_url):
I guess parts of this should go into a benchmark test case in test_numericQA. This is what you meant with your comment about Pytest, right?
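For reference, a hedged sketch of what such a pytest case in test_numericQA could look like; the module path, the extract_numeric_values helper, and the example cases are illustrative placeholders, not part of this PR:

import pytest

# Hypothetical import: wherever the extraction helper ends up living.
from benchmark.numeric_extraction import extract_numeric_values

# Illustrative, hand-picked representative cases rather than the full 650-text dataset.
REPRESENTATIVE_CASES = [
    ("The trial enrolled 120 women and 80 men.", {"female": 120, "male": 80}),
    ("Of the 50 participants, 25 were female and 25 male.", {"female": 25, "male": 25}),
]


@pytest.mark.parametrize("text,expected", REPRESENTATIVE_CASES)
def test_numeric_extraction_representative(text, expected):
    # Edge cases and single representative examples keep the test suite cheap to run.
    assert extract_numeric_values(text) == expected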
Agree with this needing more integration into biochatter to be part of the ecosystem; as is, this really is just a script tailored to the dataset, without biochatter involvement. I also feel that testing the edge cases and single representative examples may be more suitable for integration into the current pytest framework, particularly because running all cases is quite resource-intensive for testing only one particular kind of question. For now I am making this a draft to be addressed/integrated later.
Closing this as stale.
I've initiated this pull request for benchmarking LLMs on numerical data extraction/reasoning tasks to invite early feedback and foster improvement. Please review and share your thoughts or modifications, as this is an ongoing process.
Key Developments and Issues:
Prompt Template & Output Formatting Across LLMs: Currently, all large language models (LLMs) use the same prompt template and output formatting. This uniformity is crucial for comparative analysis, but it also introduces some challenges.
Output Formatting: I've observed occasional failures in output formatting. Interestingly, after multiple attempts the system does correctly retrieve the numbers and place them into a Python dictionary. This inconsistency needs to be addressed for reliable performance, e.g. by running each query up to 5 times.
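A minimal sketch of such a retry loop, assuming query_fn stands in for whatever function sends the prompt to the model and returns raw text (both the name and the retry budget are illustrative):

import json


def query_with_retries(query_fn, prompt, max_attempts=5):
    """Retry an LLM call until its output parses as a JSON dictionary."""
    for _ in range(max_attempts):
        raw_output = query_fn(prompt)
        try:
            parsed = json.loads(raw_output)
        except json.JSONDecodeError:
            continue  # formatting failure: try again
        if isinstance(parsed, dict):
            return parsed
    return None  # formatting never succeeded within the retry budget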
For evaluating the numeric extraction capability, we have a dataset of 650 texts, each with known real answers.
Areas Needing Assistance:
Separation of JSON: I propose a clear distinction in our evaluation between the LLMs' ability to format output correctly (as JSON) and their proficiency in accurately retrieving numeric information. This separation will help pinpoint where each model excels and where it needs improvement.
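A minimal sketch of how the two aspects could be scored separately, assuming raw_outputs are the raw model responses and expected the known answers per text (names are illustrative, not the PR's scoring code):

import json


def score_formatting_and_retrieval(raw_outputs, expected):
    """Report JSON formatting and numeric retrieval as two separate metrics."""
    n_valid_json = 0
    n_correct_values = 0
    for raw, truth in zip(raw_outputs, expected):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against formatting only, not retrieval
        n_valid_json += 1
        if parsed == truth:
            n_correct_values += 1
    return {
        "json_format_rate": n_valid_json / len(raw_outputs),
        # retrieval accuracy conditioned on the output being parseable at all
        "numeric_accuracy_given_valid_json": n_correct_values / n_valid_json if n_valid_json else 0.0,
    }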
Pytest: I am not well-versed in pytest for implementing tests in this context. Your expertise or guidance in this area would be highly beneficial.
New Branch for Implementation: I plan to initiate a new branch where these aspects can be implemented and tested more effectively. This branch will focus on refining the output formatting process and enhancing the accuracy of numeric extraction.
Specific Scoring for This Dataset: It’s worth noting that the scoring mechanism we're using is tailored for this particular dataset. While it's effective in this context, we might need to adapt or expand it for broader applications.
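As one possible direction for broader applicability, a field-by-field comparison with an optional relative tolerance would reduce to the current exact matching at rel_tol=0.0 while remaining reusable for other numeric tasks (a sketch with illustrative names, not the PR's scoring code):

import math


def numeric_fields_match(predicted, truth, rel_tol=0.0):
    """Compare predicted and true numeric fields key by key."""
    if set(predicted) != set(truth):
        return False
    return all(
        math.isclose(float(predicted[key]), float(truth[key]), rel_tol=rel_tol)
        for key in truth
    )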