Why
Polis is widely used by governments. It is part of the national infrastructure of three countries—Taiwan, the U.K., and Finland—and is used ad hoc by many others, including government-adjacent organizations like the UNDP. Statistical guarantees will be invaluable in such cases, where the results of Polis are used to craft legislation, policies, and programs—giving Polis+SenseMaker a significant edge over generalized LLMs that broadly perform similar functions.
Since formal consistency theorems do not exist for LLMs, the next best thing is rigorous and thorough empirical evaluation on a wide set of test cases, real and synthetic.
The evaluation needs to be run for multiple LLMs: Gemini of course, but other LLMs too, since one appeal of Pol.is is the ability to run even on local nodes when required.
Our 2023 article with Anthropic (Small et al. 2023) came with eval code. If we look at the current gold standard, the Habermas Machine team (Tessler et al. 2024) have even systematically pre-registered their evaluation protocols (Tessler 2023b; 2023f; 2023e; 2023d; 2023c; 2023a). Pre-registration is probably too high a bar for our current experimentation, but it shows the level of rigour already established in the community applying machine learning to democracy.
This will truly bring the reliability and transparency needed for government and other official uses.
Two key things we need to ensure:
- That no areas of the conversation were totally ignored
- That the numbers were not hallucinated
What
As we evaluate LLMs for report generation, here is a non-exhaustive brainstorm, done in December 2024, of the types of evaluations we would want:
- Evaluating the algorithm, not just the summaries:
  - Check summaries produced for multiple conversations
  - Check multiple summaries generated for a given conversation
- Check basic requirements:
  - Verify requirements stated in the prompt, such as language guidelines and format.
- Check summary precision:
  - For each clause generated, do its citations match it?
  - Check statistical grounding (a minimal grounding sketch follows this list):
    - Are the numbers correctly reported, without misrepresentation?
    - Are all the reported numbers real, i.e. actually present in the data?
    - Are the “top” comments present?
- Check summary recall (a coverage sketch follows this list):
  - Check the percentage of comments represented by each subtask
  - Check topic precision/recall
  - Check the percentage of comments represented by the list of topics
- Measure likeability by humans:
  - Human evaluators: which summary do they prefer, between various prompts or between human and AI?
- Check narrative:
  - Does the summary weave together the narrative and the statistics, like a data journalist would? Facilitators are not necessarily numerate.
- Check subtlety:
  - E.g. the distinction between uniform agreement and uniform disagreement
- Check bias against certain types of dialect/language
- Check stability with respect to translation:
  - BG will have multiple languages
  - Are there biases when translating? This can be tested e.g. by evaluating the summary under iterated auto-translation (a round-trip sketch follows this list)
- [Optional] Stability over incremental data:
  - Simulate the user experience of periodic looks as the conversation evolves, not just the final state
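As a concrete instance of the statistical-grounding check above, here is a minimal sketch in Python. It assumes a hypothetical data layout (a `comments` dict of vote tallies keyed by comment ID, and summary citations already extracted as pairs of a claimed agreement percentage and the cited comment IDs); the function, field names, and tolerance are illustrative, not existing Polis code.

```python
# Minimal sketch of a statistical-grounding check (hypothetical data layout):
# `comments` maps comment_id -> {"agree": int, "disagree": int, "pass": int};
# `citations` is a list of (claimed_agree_pct, [cited comment ids]) pairs,
# extracted from the generated summary by a separate parsing step.

def check_grounding(citations, comments, tolerance_pct=2.0):
    """Return a list of problems: hallucinated IDs or misreported percentages."""
    problems = []
    for claimed_pct, cited_ids in citations:
        for cid in cited_ids:
            if cid not in comments:
                problems.append(f"cited comment {cid} does not exist")
                continue
            votes = comments[cid]
            total = votes["agree"] + votes["disagree"] + votes["pass"]
            if total == 0:
                problems.append(f"comment {cid} has no votes to support a statistic")
                continue
            actual_pct = 100.0 * votes["agree"] / total
            if abs(actual_pct - claimed_pct) > tolerance_pct:
                problems.append(
                    f"comment {cid}: summary claims {claimed_pct:.0f}% agreement, "
                    f"data shows {actual_pct:.0f}%"
                )
    return problems


# Toy usage: one misreported number, one hallucinated comment ID.
comments = {"c1": {"agree": 80, "disagree": 15, "pass": 5}}
citations = [(95.0, ["c1"]), (60.0, ["c999"])]
print(check_grounding(citations, comments))
```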
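For the recall checks, a simple first-pass metric is the fraction of comments cited at all, overall and per topic. The sketch below assumes topics are given as a mapping from topic name to a set of comment IDs, however that grouping is produced upstream.

```python
# Sketch of a coverage/recall metric: what fraction of comments does the
# summary cite, overall and per topic? Assumes `cited_ids` was extracted
# from the summary and `topics` maps topic name -> set of comment ids.

def coverage(cited_ids, topics):
    cited = set(cited_ids)
    all_ids = set().union(*topics.values()) if topics else set()
    report = {"overall": len(cited & all_ids) / max(len(all_ids), 1)}
    for name, ids in topics.items():
        report[name] = len(cited & ids) / max(len(ids), 1)
    return report


topics = {"transport": {"c1", "c2", "c3"}, "housing": {"c4", "c5"}}
print(coverage(["c1", "c2", "c4"], topics))
# {'overall': 0.6, 'transport': 0.666..., 'housing': 0.5}
```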
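The translation-stability item could be probed by iterated auto-translation: round-trip the text through a pivot language several times and compare what comes back. In this sketch, `translate(text, source, target)` is a placeholder for whatever machine-translation system is used, and the token-overlap score is a deliberately crude stand-in for the embedding similarity discussed in the "How" section.

```python
# Sketch of an iterated auto-translation stability check.
# `translate(text, source, target)` is a placeholder, NOT an existing API.

def round_trip(text, translate, pivot="fr", n_iterations=3):
    """Translate text to a pivot language and back, n times."""
    for _ in range(n_iterations):
        text = translate(translate(text, "en", pivot), pivot, "en")
    return text


def token_overlap(a, b):
    """Crude stability score: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


# Hypothetical usage: compare the summary of the original comments with the
# summary of round-tripped comments, or round-trip the summary itself:
# degraded = round_trip(summary_en, translate)
# print(token_overlap(summary_en, degraded))  # should stay high
```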
How
There are various ways the above could be done, but:
- Human evaluators wouldn’t scale.
- So we could use an evaluator LLM:
  - Leverage the fact that verification is easier than generation.
  - For each check, give the “evaluator LLM” the generated summary and tell it what to check for, or break the check down into smaller clauses (a sketch follows this list).
- Safeguard against “turtles all the way down”:
  - “Who’s watching the watchmen?”: we need to evaluate the automated evaluator.
  - I.e. a human checks that the evaluator LLM is performing correctly on the evaluation task.
  - This can be done on a small subset.
- Complementary option:
  - Distance metrics on embeddings (cf. the Habermas Machine supplementary material); an embedding sketch follows this list.
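Below is a minimal sketch of the evaluator-LLM loop, written against a generic `call_llm(prompt) -> str` placeholder rather than any specific provider's API; the prompt wording, the example checks, and the YES/NO output format are all assumptions. The point is that each check becomes one narrow verification question, which is a much easier task than generating the summary.

```python
# Sketch of an LLM-as-evaluator loop. `call_llm(prompt) -> str` is a
# placeholder for whichever model/API is in use; prompts are illustrative.

CHECKS = [
    "Does every numeric claim in the summary cite at least one comment ID?",
    "Is any area of the conversation completely absent from the summary?",
    "Does the summary distinguish uniform agreement from uniform disagreement?",
]

def evaluate_summary(summary, conversation_digest, call_llm):
    """Run each check as its own narrow question to an evaluator LLM."""
    results = {}
    for check in CHECKS:
        prompt = (
            "You are verifying a summary of a Polis conversation.\n"
            f"Conversation (comments with vote counts):\n{conversation_digest}\n\n"
            f"Summary:\n{summary}\n\n"
            f"Question: {check}\n"
            "Answer YES or NO, then give one sentence of justification."
        )
        results[check] = call_llm(prompt)
    return results
```

Per the "watchmen" point above, a human would then spot-check a random subset of these verdicts.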
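For the embedding option, a sketch using the sentence-transformers library (the model name is a common default, not a prescription): embed every comment and every summary sentence, and report each comment's cosine distance to its nearest summary sentence. Comments far from every summary sentence are candidates for the "ignored areas" failure mode.

```python
# Sketch of an embedding-distance coverage metric, using sentence-transformers.
# The model choice is illustrative; any sentence-embedding model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def nearest_distance(comments, summary_sentences):
    """For each comment, cosine distance to the closest summary sentence."""
    c_emb = model.encode(comments, convert_to_tensor=True)
    s_emb = model.encode(summary_sentences, convert_to_tensor=True)
    sims = util.cos_sim(c_emb, s_emb)             # (n_comments, n_sentences)
    return (1 - sims.max(dim=1).values).tolist()  # distance to nearest sentence

# Comments with a large nearest-distance were likely not represented
# in the summary and should be flagged for the recall check.
```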
Data
Those evaluations could be run on the following data:
- The closed data in the pol.is instance, which provides an entirely held-out dataset: never published, and thus not included in LLM training data.
References
Small, Christopher T, Ivan Vendrov, Esin Durmus, Hadjar Homaei, Elizabeth Barry, Julien Cornebise, Ted Suzman, Deep Ganguli, and Colin Megill. 2023. ‘Opportunities and Risks of LLMs for Scalable Deliberation with Polis’. arXiv Preprint arXiv:2306.11932.
Tessler, Michael Henry, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, et al. 2024. ‘AI Can Help Humans Find Common Ground in Democratic Deliberation’. Science 386 (6719): eadq2852. https://doi.org/10.1126/science.adq2852.