[LLM] Basic evaluations of LLM outputs #1878

Open
jucor opened this issue Jan 21, 2025 · 0 comments
Labels
feature-request For new feature suggestions

jucor commented Jan 21, 2025

Why

Polis is widely used by governments. It is part of the national infrastructure of three countries—Taiwan, the U.K., and Finland—and is used ad hoc by many others, including government-adjacent organizations like the UNDP. Statistical guarantees will be invaluable in such cases, where the results of Polis are used to craft legislation, policies, and programs—giving Polis+SenseMaker a significant edge over generalized LLMs that broadly perform similar functions.

Since formal consistency theorems do not exist for LLMs, the next best thing is rigorous and thorough empirical evaluation on a wide set of test cases, real and synthetic.

The evaluation needs to run across multiple LLMs: Gemini of course, but other LLMs too, since part of the appeal of Pol.is is the ability to run even on local nodes when required.

Our 2023 article with Anthropic (Small et al. 2023) came with eval code. If we look at the current gold standard, the Habermas Machine team (Tessler et al. 2024) even systematically pre-registered their evaluation protocols (Tessler 2023b; 2023f; 2023e; 2023d; 2023c; 2023a). Pre-registration is probably too high a bar for our current experimentation, but it shows the level of rigour already expected of evaluations in the community applying machine learning to democracy.

This will truly bring the reliability and transparency needed for government and other official uses.

Two key things we need to ensure:

  • That no areas of the conversation were totally ignored
  • That the numbers were not hallucinated

What

As we evaluate LLMs for report generation, here is a non-exhaustive brainstorm from December 2024 of the types of evaluations we would want:

  • Evaluating the algorithm, not just the summaries
    • Check summaries done for multiple conversations
    • Check multiple summaries generated for a given conversation
  • Check basic requirements
    • Verify requirements stated in the prompt, such as language guidelines and format (see the first sketch after this list).
  • Check summary precision:
    • For each clause generated, do its citations support it?
  • Check statistical grounding (see the second sketch after this list)
    • Are the numbers correctly reported, without misrepresentation?
    • Are all the reported numbers actually present in the data, rather than hallucinated?
    • Are the “top” comments present?
  • Check summary recall:
    • Check percentage of comments represented by each subtask
  • Check topic precision/recall
    • Check percentage of comments represented by list of topics
  • Measure likeability by humans
    • Human evaluators: which summary do they prefer, across prompt variants or between human- and AI-written summaries?
  • Check narrative
    • Does the summary weave together narrative and statistics, like a data journalist would? Facilitators are not necessarily numerate.
  • Check subtlety
    • E.g. distinction between uniform agreement and uniform disagreement
  • Check bias against certain types of dialect/language
  • Check stability with respect to translation
    • BG will have multiple languages
    • Are there biases when translating? This can be tested, e.g., by evaluating the summary after iterated auto-translation.
  • [Optional] Stability over incremental data
    • Simulate the user experience of checking in periodically as the conversation evolves, not just the final report.
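
To make the basic-requirements check concrete, here is a minimal deterministic sketch. The required section headings and the word limit are illustrative assumptions about what a report-generation prompt might specify, not anything currently in Polis:

```python
# Minimal sketch of a deterministic "basic requirements" check.
# REQUIRED_SECTIONS and MAX_WORDS are illustrative assumptions about what a
# report-generation prompt might specify; adjust to the actual prompt.
REQUIRED_SECTIONS = ["## Overview", "## Areas of agreement", "## Areas of disagreement"]
MAX_WORDS = 800

def check_basic_requirements(summary: str) -> list[str]:
    """Return human-readable violations; an empty list means all checks pass."""
    issues = []
    for heading in REQUIRED_SECTIONS:
        if heading not in summary:
            issues.append(f"missing required section: {heading!r}")
    n_words = len(summary.split())
    if n_words > MAX_WORDS:
        issues.append(f"summary is {n_words} words, over the {MAX_WORDS}-word limit")
    return issues
```

Checks like this are cheap enough to run on every generated report and every candidate model, so they can act as a gate before the more expensive LLM-based evaluations described under "How".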

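For the statistical-grounding check, a minimal sketch, assuming purely for illustration that generated clauses cite comments as `(c42)` next to each percentage, and that the true agreement rates can be recomputed from the raw vote matrix; the `vote_stats` structure and the regex are assumptions, not an existing Polis format:

```python
import re

# Illustrative ground truth, recomputed directly from the raw vote matrix:
# comment id -> agreement rate and number of votes. Not an existing Polis API.
vote_stats = {
    "c42": {"agree_pct": 78.0, "n_votes": 311},
    "c57": {"agree_pct": 12.5, "n_votes": 288},
}

# Assumed citation convention: a percentage followed by a comment id, e.g.
# "78% of participants ... (c42)".
CLAIM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%.*?\((c\d+)\)")

def check_statistical_grounding(summary: str, stats: dict, tolerance: float = 1.0) -> list[str]:
    """Flag every percentage in the summary that cites a comment absent from
    the data, or that differs from the recomputed statistic by more than
    `tolerance` percentage points."""
    issues = []
    for match in CLAIM_RE.finditer(summary):
        claimed_pct, comment_id = float(match.group(1)), match.group(2)
        if comment_id not in stats:
            issues.append(f"{comment_id}: cited but not present in the data")
        elif abs(stats[comment_id]["agree_pct"] - claimed_pct) > tolerance:
            issues.append(
                f"{comment_id}: summary claims {claimed_pct}%, "
                f"data says {stats[comment_id]['agree_pct']}%"
            )
    return issues

print(check_statistical_grounding(
    "78% of participants agreed on better bus routes (c42), "
    "and 45% supported congestion pricing (c57).", vote_stats))
# -> ['c57: summary claims 45.0%, data says 12.5%']
```
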
How

There are various ways the above could be done, but:

  • Human evaluators wouldn’t scale
  • So we could use an evaluator LLM
    • Leverage the fact that verification is easier than generation
    • For each check, give the “evaluator LLM” the generated summary and tell it what to check for, or break the summary down into smaller clauses (see the first sketch after this list).
  • Safeguard against “Turtles all the way down”
    • “Who’s watching the watchmen?” : need to evaluate the automated evaluator!
    • I.e. humans check that the evaluator LLM performs correctly on the evaluation task
    • Can be done on a small subset
  • Complementary option
    • Distance metrics on embeddings (cf. Habermas supplementary material; see the second sketch after this list)
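
A minimal sketch of the evaluator-LLM idea, framing verification as one narrow SUPPORTED/UNSUPPORTED question per clause. `call_evaluator_llm` is a hypothetical callable wrapping whichever model is configured (Gemini, a local model, ...), and the prompt wording is illustrative only:

```python
# Minimal sketch of the evaluator-LLM idea: verification framed as a narrow
# yes/no question per clause, which is easier than generation.
# `call_evaluator_llm` is a hypothetical callable (prompt -> reply string)
# wrapping whichever model is configured; it is not an existing Polis API.

EVAL_PROMPT = """You are checking one clause of a machine-generated report
against the participant comments it cites.

Clause: {clause}
Cited comments:
{cited_comments}

Answer with exactly one word, SUPPORTED or UNSUPPORTED, depending on whether
the cited comments actually support the clause."""

def evaluate_clause(clause: str, cited_comments: list[str], call_evaluator_llm) -> bool:
    prompt = EVAL_PROMPT.format(
        clause=clause,
        cited_comments="\n".join(f"- {c}" for c in cited_comments),
    )
    verdict = call_evaluator_llm(prompt).strip().upper()
    return verdict.startswith("SUPPORTED")

def summary_precision(clauses_with_citations, call_evaluator_llm) -> float:
    """Fraction of (clause, cited_comments) pairs judged SUPPORTED."""
    verdicts = [evaluate_clause(c, cits, call_evaluator_llm)
                for c, cits in clauses_with_citations]
    return sum(verdicts) / len(verdicts)
```

Hand-labelling a small subset of the same clauses and measuring agreement with the evaluator's verdicts (even a plain agreement rate, or Cohen's kappa) is one way to address the "who's watching the watchmen" point above.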

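And a sketch of the complementary embedding-based option, aimed at the requirement that no area of the conversation be totally ignored: flag comments whose best-matching summary sentence is too dissimilar. `embed` is assumed to be any function mapping a list of strings to a matrix of embeddings (sentence-transformers, an embeddings API, etc.), and the threshold would need calibration against human judgement:

```python
import numpy as np

def coverage_report(comment_texts, summary_sentences, embed, threshold=0.5):
    """Flag comments whose best-matching summary sentence falls below
    `threshold` cosine similarity -- a rough proxy for an area of the
    conversation being ignored. `embed` is assumed to map a list of strings
    to an (n, d) array of embeddings; `threshold` needs calibration."""
    C = np.asarray(embed(comment_texts), dtype=float)
    S = np.asarray(embed(summary_sentences), dtype=float)
    # Normalise rows so the dot product below is cosine similarity.
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    best = (C @ S.T).max(axis=1)  # best-matching summary sentence per comment
    ignored = [t for t, s in zip(comment_texts, best) if s < threshold]
    return {"mean_best_similarity": float(best.mean()), "ignored_comments": ignored}
```
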
Data

Those evaluations could be run on the following data:

References

  • Small, Christopher T, Ivan Vendrov, Esin Durmus, Hadjar Homaei, Elizabeth Barry, Julien Cornebise, Ted Suzman, Deep Ganguli, and Colin Megill. 2023. ‘Opportunities and Risks of LLMs for Scalable Deliberation with Polis’. arXiv Preprint arXiv:2306.11932.
  • Tessler, Michael Henry. 2023a. ‘Fine-Tuning LMs for Consensus (Critique Exclusion)’. OSF Registries. https://doi.org/10.17605/OSF.IO/4ZJEU.
  • ———. 2023b. ‘Fine-Tuning LMs for Consensus: Evaluation of Fine-Tuning and Iteration # 1’. OSF Registries. https://doi.org/10.17605/OSF.IO/QUSHN.
  • ———. 2023c. ‘Fine-Tuning LMs for Consensus (Human Consensus Writer Eval)’. OSF Registries. https://doi.org/10.17605/OSF.IO/9UYZ8.
  • ———. 2023d. ‘Fine-Tuning LMs for Consensus (Human Opinion Exposure Baseline)’. OSF Registries. https://doi.org/10.17605/OSF.IO/KC5BR.
  • ———. 2023e. ‘Fine-Tuning LMs for Consensus (Model Comparison - OOD Generalization)’. OSF Registries. https://doi.org/10.17605/OSF.IO/YQABX.
  • ———. 2023f. ‘Fine-Tuning LMs for Consensus (Model Comparison 2)’. OSF Registries. https://doi.org/10.17605/OSF.IO/QH6YR.
  • Tessler, Michael Henry, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, et al. 2024. ‘AI Can Help Humans Find Common Ground in Democratic Deliberation’. Science 386 (6719): eadq2852. https://doi.org/10.1126/science.adq2852.