Found 2,274 repositories (showing 30)
trycua
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
xlang-ai
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
openai
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
Barca0412
A curated collection of introductory materials: 1. an open-source tutorial on a multi-factor stock quantitative trading framework; 2. a collection of classic resources from academia and industry; 3. related work on AI + finance, including LLM, Agent, benchmark (evaluation), etc.
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
sierra-research
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
microsoft
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
google-research
AndroidWorld is an environment and benchmark for autonomous agents
SanMuzZzZz
LuaN1aoAgent is a cognition-driven AI hacker: a fully autonomous AI penetration-testing agent powered by DeepSeek V3.2. Using dual-graph reasoning, LuaN1ao achieves a success rate of over 90% on the XBOW Benchmark, with a median exploit cost of just $0.09.
TheAgentCompany
An agent benchmark with tasks in a simulated software company.
Ayanami0730
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An Agent Skill that helps you optimize Xcode incremental and clean builds by running benchmarks and tuning build settings.
facebookresearch
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). It lets you quickly compare different MARL algorithms, tasks, and models while staying systematically grounded in its two core tenets: reproducibility and standardization.
facebookresearch
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
SalesforceAIResearch
MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and developing AI agents for general tool use.
ServiceNow
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
proroklab
VMAS is a vectorized differentiable simulator designed for efficient Multi-Agent Reinforcement Learning benchmarking. It comprises a vectorized 2D physics engine written in PyTorch and a set of challenging multi-robot scenarios. Additional scenarios can be implemented through a simple and modular interface.
danijar
Benchmarking the Spectrum of Agent Capabilities
OSU-NLP-Group
[ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents"
Accenture
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
facebookresearch
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.
openai
Basic constrained RL agents used in experiments for the "Benchmarking Safe Exploration in Deep Reinforcement Learning" paper.
web-arena-x
VisualWebArena is a benchmark for multimodal agents.
Alibaba-NLP
Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
camel-ai
🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/
OpenMOSS
Official repo of VLABench, a large-scale benchmark designed for fair evaluation of VLAs, embodied agents, and VLMs.
ByteDance-Seed
[ICLR 2026] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle