Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
Stars: 93
Forks: 1
Watchers: 93
Open Issues: 0
155 commits
a239bd1 Remove point advantage in favor of abs mean CIs; Add p-value printing method to `analyze` and `cli`
d436681 Add option to print p-values to analyze() and the CLI. Update index of website
5db02eb Finish removing point_advantage across the repository, in favor of abs point estimates.
c8d126c Add discussion of 'Statistical Comparisons of Classifiers over Multiple Data Sets' to the 'Statistical Debates in LLM Evals' section
4b42147 Improve critical difference diagram plots. Use E[rank] for CD diagram numeric ranks---an estimation-based analogue to the Friedman test's scores.
4800236 Update analyze on cli to be consistent with the new analyze
e5114bd Update package version and add section on max-T to which method page.
8678dcc Switch to regular bootstrap for non-binary inputs, when N>=200
60310c1 Remove p-value correction printing in headers when no p-values asked for
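
The commit log above mentions bootstrap resampling and confidence intervals for absolute mean scores. As a rough illustration of that general idea only, and not the repository's implementation, here is a minimal percentile-bootstrap sketch using the Python standard library. The function name `bootstrap_mean_ci` and its parameters (`n_resamples`, `alpha`, `seed`) are hypothetical and chosen for this example.

```python
import random
from statistics import mean


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Hypothetical helper for illustration; not part of any library's API.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)
    # Resample the examples with replacement and record each resample's mean.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles, clamped to range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(scores), resampled_means[lo_idx], resampled_means[hi_idx]


# Made-up per-example scores for two prompts evaluated on the same 8 inputs.
scores_a = [1, 0, 1, 1, 0, 1, 1, 0]
scores_b = [0, 0, 1, 0, 0, 1, 1, 0]

# Interval for one prompt's absolute mean score.
print(bootstrap_mean_ci(scores_a))

# Paired comparison: bootstrap the per-example differences.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
print(bootstrap_mean_ci(diffs))
```

Bootstrapping the per-example differences keeps the pairing between the two prompts intact, which usually gives a tighter interval than resampling each prompt's scores independently. With only a handful of examples, as in this toy data, a percentile interval can be unreliable, so treat the numbers as illustrative.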