Offline evaluator for LLM agent traces from a CI/CD benchmark task. Agents are given a git repository and asked to get a change merged into `main`. The evaluator reads raw session logs, detects exploitative behavior, computes multi-dimensional scores, and produces a cross-model leaderboard.
Stars
0
Forks
0
Watchers
0
Open Issues
0
Overall repository health assessment
No package.json found
This might not be a Node.js project
23
commits
Merge pull request #1 from zaebee/fix/outcome-merge-detection
c5c178cView on GitHubrefactor: tighten _MERGE_SUCCESS_PAT and guard with gh pr command check
a87841cView on GitHubfix: detect merge success from tool_result output, not command text
9ac72a4View on GitHubdocs: update README to reflect v2 scoring, strategy taxonomy, OpenCode format
056e6ddView on GitHubfeat: detect sleep/polling loops as efficiency signal and looper condition
7f8cc73View on GitHubfix: catch --method PUT/PATCH/DELETE variant of ruleset bypass
0219516View on GitHubfeat: add strategy.py — BehaviorProfile computation and strategy classification
494cc2aView on GitHubfeat: metrics v2 — severity-weighted integrity, loop/retry efficiency deduction
b5ca951View on GitHubfeat: add severity modifiers to PatternRule and classify_event
3e8a3deView on GitHub