Popular repositories
- confabulations (Public): Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
- elimination_game (Public): A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.
- nyt-connections (Public): Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words.
- generalization (Public): Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which ite…