Found 66 repositories (showing 30)
claw-eval
Claw-Eval is an evaluation harness for LLMs acting as agents. All tasks are verified by humans.
evalops
Minimal agent runtime built with DSPy modules and a thin Python loop. Includes CLI, FastAPI server, and eval harness with OpenAI/Ollama support.
10xChengTu
Set up and improve harness engineering (AGENTS.md, docs/, lint rules, eval systems, project-level prompt engineering) for AI-agent-friendly codebases. Triggers on: new/empty project setup for AI agents, AGENTS.md or CLAUDE.md creation, harness engineering questions, making agents work better on a codebase.
Siddharth-1001
An open-source evaluation framework specifically for agentic systems — not just LLM outputs, but full agent behavior.
najeed
The open-source MultiAgentOps evaluation and verification harness for business workflows in any industry.
tensor-goat
Linux sandboxing for AI runtimes, code agents, tool executors, and eval harnesses - in one Python file.
plaited
Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
01clauding
Harness engineering for coding agents: smaller root routers, hard verification gates, migration governance, and repo-specific evals.
hzhang092
PQC Standards Navigator — Agentic RAG over NIST PDFs with citations + eval harness.
speed785
Agent Evaluation Harness — write repeatable, measurable evals for AI agents. Python + TypeScript.
mariuscwium
Agentic Development Harness - TypeScript. Quality harnesses for AI-integrated web apps: digital twins, prompt evals, code quality gates, accessibility checks.
Deep-De-coder
Adversarial eval harness for any LLM agent pipeline — Claude, OpenAI, or your own. CLI + REST API + MCP server for Cursor/Antigravity.
alffei
An agent-native harness engineering starter with structured docs, layered architecture, executable guardrails, evals, and CI.
ahmedmusawir
Intended as the first complete agent in a harness, with MCPs, RAG, session file memory, skills, a usage meter, evals, etc. FIRST AGENT: GEMINI ARCHITECT
snehbhaidasna
End-to-end refund agent: deterministic policy engine, RAG with FAISS + Pinecone, NVIDIA Nemotron LLM explanations, LangSmith tracing, eval harnesses, and MCP tool interface. Hands-on architecture exploration using AI-assisted development.
ben-scire
A tiny production-ready agent that fetches web pages, extracts and cleans content, and normalizes it to a simple schema. Ships with a FastAPI endpoint, a CLI, a Docker image, and a toy eval harness.
redinside-dev
AI agent utility: agent-eval-harness
karthikabinav
Agent ecosystem scaffold + eval harness v0
ashstep2
Cross-provider dual-judge scoring across 6 dimensions that predict whether developers will delegate to a coding agent.
opendatahub-io
No description available
bharatkhanna-dev
Lightweight evaluation harness for LLM-based agents — scoring, trajectory analysis, and regression gating with pytest.
SainathPattipati
Framework to benchmark and evaluate multi-agent system performance, accuracy, and efficiency
artemisveizi
No description available
anuragg-saxenaa
No description available
adysinghh
No description available
dragonstyle
No description available
cmangun
Standardized evaluation harness for agentic systems that must be verifiable
opslane
Vue 3 eval fixture for Defender agent harness
opslane
React 18 eval fixture for Defender agent harness
marbatis
No description available