Found 2,274 repositories (showing 30)
trycua
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
xlang-ai
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
openai
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
Barca0412
A curated collection of introductory materials: 1. an open-source tutorial on a multi-factor stock quantitative trading framework; 2. a collection of classic resources from academia and industry; 3. related work on AI + finance, including LLM, Agent, benchmark (evaluation), etc.
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
sierra-research
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
microsoft
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
google-research
AndroidWorld is an environment and benchmark for autonomous agents
SanMuzZzZz
LuaN1aoAgent is a cognition-driven AI hacker: a fully autonomous AI penetration-testing agent powered by DeepSeek V3.2. Using dual-graph reasoning, LuaN1ao achieves a success rate of over 90% on the XBOW Benchmark, with a median exploit cost of just $0.09.
TheAgentCompany
An agent benchmark with tasks in a simulated software company.
Ayanami0730
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An Agent Skill that helps you optimize Xcode incremental and clean builds by running benchmarks and tuning build settings.
facebookresearch
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). It lets you quickly compare different MARL algorithms, tasks, and models while staying systematically grounded in its two core tenets: reproducibility and standardization.
facebookresearch
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
SalesforceAIResearch
MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and developing AI agents for general tool use.
ServiceNow
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
proroklab
VMAS is a vectorized differentiable simulator designed for efficient Multi-Agent Reinforcement Learning benchmarking. It comprises a vectorized 2D physics engine written in PyTorch and a set of challenging multi-robot scenarios. Additional scenarios can be implemented through a simple and modular interface.
danijar
Benchmarking the Spectrum of Agent Capabilities
OSU-NLP-Group
[ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents"
Accenture
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
facebookresearch
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.
openai
Basic constrained RL agents used in experiments for the "Benchmarking Safe Exploration in Deep Reinforcement Learning" paper.
web-arena-x
VisualWebArena is a benchmark for multimodal agents.
Alibaba-NLP
Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
camel-ai
🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/
OpenMOSS
Official repo of VLABench, a large-scale benchmark designed for fair evaluation of VLAs, embodied agents, and VLMs.
ByteDance-Seed
[ICLR 2026] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle