Found 77 repositories (showing 30)
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Hugging-Face-KREW
No description available
eth-sri
No description available
VIA-Research
The set of AI agent model implementations, benchmarks, and others used in our paper "The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective"
glee4810
Code and Data for FHIR-AgentBench
cxcscmu
Benchmark Test-Time Scaling of General LLM Agents
agentbench
No description available
jackjin1997
The open benchmark for AI agent task execution. Claude Code vs Gemini CLI — who wins? Live leaderboard inside.
yongPhone
A lightweight, type-safe Go framework for testing AI agents with customizable scorers, concurrent execution, and flexible configuration
agentbench
No description available
keijiro
General-purpose (non-project specific) workbench for AI coding agents
sauremilk
Real-world evaluation framework for AI coding agents — measures safety, containment, cost, and autonomy beyond just correctness.
Leu3ery
No description available
michaelwinczuk
Framework-agnostic CLI tool for benchmarking AI agents across standardized tasks
Z-ZHHH
Small adjustment to AgentBench v0.2
chu2bard
Evaluation framework for AI coding agents
dx2ztm76-new
No description available
OmnionixAI
A comprehensive evaluation framework and benchmark suite designed to rigorously assess the performance, reliability, and reasoning capabilities of autonomous AI agents.
Shreyas-Yadav
A comprehensive evaluation framework for GitHub agents, built using LlamaIndex and Arize Phoenix telemetry. It supports both single-agent and multi-agent architectures, enabling automated assessment of agent reasoning, tool selection, and execution efficiency. Ideal for developers aiming to benchmark and enhance AI-driven GitHub automation tools.
AbdulElahOthmanGwaith
edia application
NCCYUNSONG
No description available
NurcholishAdam
Green-Quantum AgentBench: advancing sustainability-aware agent benchmarking with Quantum Limit Graph architectures. Integrated with Quantum Error Correction (QEC) and Multilingual Provenance modules for AgentBeats.
jiniac-v2
AgentBench logs
to-real
AI agent evaluation platform - Complete evaluation platform for AI Agents
JakeB-5
AI Agent Evaluation, Testing & Monitoring Platform - Ship reliable AI agents with confidence
wingtonrbrito
No description available
stevenkozeniesky02
Standardized benchmark framework for comparing AI coding agents (Claude Code, Codex, Cursor)
general-agentbench
Project website for General AgentBench
Helm-Development
Evaluation framework for agentic coding flows
dhruvvenkat
AI evolution framework to see how your agents are performing