[LLM] Basic evaluations of LLM outputs #1878

Open
jucor opened this issue Jan 21, 2025 · 0 comments
Labels
feature-request For new feature suggestions

jucor commented Jan 21, 2025

Why

Polis is widely used by governments. It is part of the national infrastructure of three countries—Taiwan, the U.K., and Finland—and is used ad hoc by many others, including government-adjacent organizations like the UNDP. Statistical guarantees will be invaluable in such cases, where the results of Polis are used to craft legislation, policies, and programs—giving Polis+SenseMaker a significant edge over generalized LLMs that broadly perform similar functions.

Since formal consistency theorems do not exist for LLMs, the next best thing is rigorous and thorough empirical evaluation on a wide set of test cases, real and synthetic.

The evaluation needs to run across multiple LLMs: Gemini of course, but other LLMs too, since part of the appeal of Pol.is is the ability to run even on local nodes when required.

Our 2023 article with Anthropic (Small et al. 2023) came with eval code. If we look at the current gold standard, the Habermas Machine team (Tessler et al. 2024) even systematically pre-registered their evaluation protocols (Tessler 2023b; 2023f; 2023e; 2023d; 2023c; 2023a). Pre-registration is probably too high a bar for our current experimentation, but it shows the level of rigour already expected of evaluations in the community applying machine learning to democracy.

This will truly bring the reliability and transparency needed for government and other official uses.

Two key things we need to ensure:

  • That no areas of the conversation were totally ignored
  • That the numbers were not hallucinated

What

As we evaluate LLMs for report generation, here is a non-exhaustive brainstorm from December 2024 of the types of evaluations we would want:

  • Evaluating the algorithm, not just the summaries
    • Check summaries done for multiple conversations
    • Check multiple summaries generated for a given conversation
  • Check basic requirements
    • Verify requirements stated in the prompt, such as language guidelines and format (see the first sketch after this list).
  • Check summary precision:
    • For each clause generated, do its citations support it?
  • Check statistical grounding (see the second sketch after this list)
    • Are the numbers correctly reported, without misrepresentation?
    • Are all the reported numbers actually present in the data, rather than hallucinated?
    • Are the “top” comments present?
  • Check summary recall:
    • Check percentage of comments represented by each subtask
  • Check topic precision/recall
    • Check percentage of comments represented by list of topics
  • Measure likeability by humans
    • Human evaluators: which summary do they prefer, across prompt variants or between human- and AI-written summaries?
  • Check narrative
    • Does the summary weave together narrative and statistics, like a data journalist would? Facilitators are not necessarily numerate.
  • Check subtlety
    • E.g. distinction between uniform agreement and uniform disagreement
  • Check bias against certain types of dialect/language
  • Check stability with respect to translation
    • BG will have multiple languages
    • Are there biases when translating? This can be tested, e.g., by evaluating the summary after iterated auto-translation.
  • [Optional] Stability over incremental data
    • Simulate the user experience of checking in periodically as the conversation evolves, not just the final report.
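
To make the basic-requirements check concrete, here is a minimal deterministic sketch. The required section headings and the word limit are illustrative assumptions about what a report-generation prompt might specify, not anything currently in Polis:

```python
# Minimal sketch of a deterministic "basic requirements" check.
# REQUIRED_SECTIONS and MAX_WORDS are illustrative assumptions about what a
# report-generation prompt might specify; adjust to the actual prompt.
REQUIRED_SECTIONS = ["## Overview", "## Areas of agreement", "## Areas of disagreement"]
MAX_WORDS = 800

def check_basic_requirements(summary: str) -> list[str]:
    """Return human-readable violations; an empty list means all checks pass."""
    issues = []
    for heading in REQUIRED_SECTIONS:
        if heading not in summary:
            issues.append(f"missing required section: {heading!r}")
    n_words = len(summary.split())
    if n_words > MAX_WORDS:
        issues.append(f"summary is {n_words} words, over the {MAX_WORDS}-word limit")
    return issues
```

Checks like this are cheap enough to run on every generated report and every candidate model, so they can act as a gate before the more expensive LLM-based evaluations described under "How".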

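For the statistical-grounding check, a minimal sketch, assuming purely for illustration that generated clauses cite comments as `(c42)` next to each percentage, and that the true agreement rates can be recomputed from the raw vote matrix; the `vote_stats` structure and the regex are assumptions, not an existing Polis format:

```python
import re

# Illustrative ground truth, recomputed directly from the raw vote matrix:
# comment id -> agreement rate and number of votes. Not an existing Polis API.
vote_stats = {
    "c42": {"agree_pct": 78.0, "n_votes": 311},
    "c57": {"agree_pct": 12.5, "n_votes": 288},
}

# Assumed citation convention: a percentage followed by a comment id, e.g.
# "78% of participants ... (c42)".
CLAIM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%.*?\((c\d+)\)")

def check_statistical_grounding(summary: str, stats: dict, tolerance: float = 1.0) -> list[str]:
    """Flag every percentage in the summary that cites a comment absent from
    the data, or that differs from the recomputed statistic by more than
    `tolerance` percentage points."""
    issues = []
    for match in CLAIM_RE.finditer(summary):
        claimed_pct, comment_id = float(match.group(1)), match.group(2)
        if comment_id not in stats:
            issues.append(f"{comment_id}: cited but not present in the data")
        elif abs(stats[comment_id]["agree_pct"] - claimed_pct) > tolerance:
            issues.append(
                f"{comment_id}: summary claims {claimed_pct}%, "
                f"data says {stats[comment_id]['agree_pct']}%"
            )
    return issues

print(check_statistical_grounding(
    "78% of participants agreed on better bus routes (c42), "
    "and 45% supported congestion pricing (c57).", vote_stats))
# -> ['c57: summary claims 45.0%, data says 12.5%']
```
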
How

There are various ways the above could be done, but:

  • Human evaluators wouldn’t scale
  • So we could use an evaluator LLM
    • Leverage the fact that verification is easier than generation
    • For each check, give the “evaluator LLM” the generated summary and tell it what to check for, or break the summary down into smaller clauses (see the first sketch after this list).
  • Safeguard against “Turtles all the way down”
    • “Who’s watching the watchmen?” : need to evaluate the automated evaluator!
    • I.e. humans check that the evaluator LLM performs correctly on the evaluation task
    • Can be done on a small subset
  • Complementary option
    • Distance metrics on embeddings (cf. Habermas supplementary material; see the second sketch after this list)
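
A minimal sketch of the evaluator-LLM idea, framing verification as one narrow SUPPORTED/UNSUPPORTED question per clause. `call_evaluator_llm` is a hypothetical callable wrapping whichever model is configured (Gemini, a local model, ...), and the prompt wording is illustrative only:

```python
# Minimal sketch of the evaluator-LLM idea: verification framed as a narrow
# yes/no question per clause, which is easier than generation.
# `call_evaluator_llm` is a hypothetical callable (prompt -> reply string)
# wrapping whichever model is configured; it is not an existing Polis API.

EVAL_PROMPT = """You are checking one clause of a machine-generated report
against the participant comments it cites.

Clause: {clause}
Cited comments:
{cited_comments}

Answer with exactly one word, SUPPORTED or UNSUPPORTED, depending on whether
the cited comments actually support the clause."""

def evaluate_clause(clause: str, cited_comments: list[str], call_evaluator_llm) -> bool:
    prompt = EVAL_PROMPT.format(
        clause=clause,
        cited_comments="\n".join(f"- {c}" for c in cited_comments),
    )
    verdict = call_evaluator_llm(prompt).strip().upper()
    return verdict.startswith("SUPPORTED")

def summary_precision(clauses_with_citations, call_evaluator_llm) -> float:
    """Fraction of (clause, cited_comments) pairs judged SUPPORTED."""
    verdicts = [evaluate_clause(c, cits, call_evaluator_llm)
                for c, cits in clauses_with_citations]
    return sum(verdicts) / len(verdicts)
```

Hand-labelling a small subset of the same clauses and measuring agreement with the evaluator's verdicts (even a plain agreement rate, or Cohen's kappa) is one way to address the "who's watching the watchmen" point above.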

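And a sketch of the complementary embedding-based option, aimed at the requirement that no area of the conversation be totally ignored: flag comments whose best-matching summary sentence is too dissimilar. `embed` is assumed to be any function mapping a list of strings to a matrix of embeddings (sentence-transformers, an embeddings API, etc.), and the threshold would need calibration against human judgement:

```python
import numpy as np

def coverage_report(comment_texts, summary_sentences, embed, threshold=0.5):
    """Flag comments whose best-matching summary sentence falls below
    `threshold` cosine similarity -- a rough proxy for an area of the
    conversation being ignored. `embed` is assumed to map a list of strings
    to an (n, d) array of embeddings; `threshold` needs calibration."""
    C = np.asarray(embed(comment_texts), dtype=float)
    S = np.asarray(embed(summary_sentences), dtype=float)
    # Normalise rows so the dot product below is cosine similarity.
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    best = (C @ S.T).max(axis=1)  # best-matching summary sentence per comment
    ignored = [t for t, s in zip(comment_texts, best) if s < threshold]
    return {"mean_best_similarity": float(best.mean()), "ignored_comments": ignored}
```
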
Data

Those evaluations could be run on the following data:

References

  • Small, Christopher T, Ivan Vendrov, Esin Durmus, Hadjar Homaei, Elizabeth Barry, Julien Cornebise, Ted Suzman, Deep Ganguli, and Colin Megill. 2023. ‘Opportunities and Risks of LLMs for Scalable Deliberation with Polis’. arXiv Preprint arXiv:2306.11932.
  • Tessler, Michael Henry. 2023a. ‘Fine-Tuning LMs for Consensus (Critique Exclusion)’. OSF Registries. https://doi.org/10.17605/OSF.IO/4ZJEU.
  • ———. 2023b. ‘Fine-Tuning LMs for Consensus: Evaluation of Fine-Tuning and Iteration # 1’. OSF Registries. https://doi.org/10.17605/OSF.IO/QUSHN.
  • ———. 2023c. ‘Fine-Tuning LMs for Consensus (Human Consensus Writer Eval)’. OSF Registries. https://doi.org/10.17605/OSF.IO/9UYZ8.
  • ———. 2023d. ‘Fine-Tuning LMs for Consensus (Human Opinion Exposure Baseline)’. OSF Registries. https://doi.org/10.17605/OSF.IO/KC5BR.
  • ———. 2023e. ‘Fine-Tuning LMs for Consensus (Model Comparison - OOD Generalization)’. OSF Registries. https://doi.org/10.17605/OSF.IO/YQABX.
  • ———. 2023f. ‘Fine-Tuning LMs for Consensus (Model Comparison 2)’. OSF Registries. https://doi.org/10.17605/OSF.IO/QH6YR.
  • Tessler, Michael Henry, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, et al. 2024. ‘AI Can Help Humans Find Common Ground in Democratic Deliberation’. Science 386 (6719): eadq2852. https://doi.org/10.1126/science.adq2852.