Repository for analyzing how expert personas behave on ForecastBench
If you have brew:

```shell
brew install uv
uv venv                    # once
uv pip install -e .        # once
source .venv/bin/activate  # activate the environment
deactivate                 # when you are done
```
- Experiment 1: select the top-k or bottom-k forecasts and have the LLM generate from them
- Experiment 2: human evaluation of forecasts, cross-checked against an LLM judge
- Experiment 3: trying different system prompts
- Experiment 4: expert elicitation via few-shot prompting (examples chosen by an LLM judge or selected manually)
- Experiment 5: sample one question per topic and have all 7 topic experts forecast on it, to see how experts perform on topics outside their expertise
- Experiment 6: sample one question per topic and randomly select X filtered forecasts for the few-shot prompt (based on the suggestion: "Could you try random selection of filtered forecasts, instead of topic-relevant? There might be a difference")
- Experiment 7: comparison of reasonings: define a rubric, rate reasoning similarity on a 1-5 scale, and then test whether an LLM judge can apply the rubric reliably
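The per-topic sampling used in Experiments 5 and 6 can be sketched as below. The data layout (dicts with `topic` keys) and the helper names are assumptions for illustration, not the repo's actual code.

```python
import random
from collections import defaultdict

def sample_one_per_topic(questions, seed=0):
    """Experiment 5/6 setup: pick one question per topic.

    `questions` is assumed to be a list of dicts with a "topic" key.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for q in questions:
        by_topic[q["topic"]].append(q)
    return {topic: rng.choice(qs) for topic, qs in by_topic.items()}

def random_few_shot(filtered_forecasts, k, seed=0):
    """Experiment 6: randomly select k filtered forecasts for the
    few-shot prompt, instead of topic-relevant ones."""
    rng = random.Random(seed)
    return rng.sample(filtered_forecasts, min(k, len(filtered_forecasts)))
```

Seeding the RNG keeps the question/forecast selection reproducible across experiment runs.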
Analysis:
- Add mean Brier score and relative Brier score
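The two metrics above can be computed as in this sketch. Defining the relative Brier score as the difference from a reference forecaster's mean Brier score on the same questions is an assumption about the intended definition.

```python
def brier(prob, outcome):
    """Brier score for one binary forecast: (p - o)^2, lower is better."""
    return (prob - outcome) ** 2

def mean_brier(probs, outcomes):
    """Mean Brier score over a set of resolved questions."""
    return sum(brier(p, o) for p, o in zip(probs, outcomes)) / len(probs)

def relative_brier(probs, outcomes, ref_probs):
    """Mean Brier minus a reference forecaster's mean Brier on the same
    questions (assumed definition; negative means better than reference)."""
    return mean_brier(probs, outcomes) - mean_brier(ref_probs, outcomes)
```

For example, a forecaster who assigns probability 1.0 to events that happen scores 0.0, while an uninformative 0.5-everywhere reference scores 0.25, giving a relative Brier of -0.25.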
Supporting documentation for this project:
- Rationale & Methodology - Detailed explanation of experimental design and approach
- Feedback Variants - All feedback variants for all feedback types
- LLM as a Judge Framework - Rubric and evaluation methodology
- Additional Results & Analysis - Supplementary findings and visualizations