Found 66 repositories (showing 30)
claw-eval
Claw-Eval is an evaluation harness for LLMs acting as agents. All tasks are verified by humans.
evalops
Minimal agent runtime built with DSPy modules and a thin Python loop. Includes CLI, FastAPI server, and eval harness with OpenAI/Ollama support.
10xChengTu
Set up and improve harness engineering (AGENTS.md, docs/, lint rules, eval systems, project-level prompt engineering) for AI-agent-friendly codebases. Triggers on: new/empty project setup for AI agents, AGENTS.md or CLAUDE.md creation, harness engineering questions, making agents work better on a codebase.
Siddharth-1001
An open-source evaluation framework specifically for agentic systems — not just LLM outputs, but full agent behavior.
najeed
The open-source MultiAgentOps evaluation and verification harness for business workflows in any industry.
tensor-goat
Linux sandboxing for AI runtimes, code agents, tool executors, and eval harnesses - in one Python file.
plaited
Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
01clauding
Harness engineering for coding agents: smaller root routers, hard verification gates, migration governance, and repo-specific evals.
hzhang092
PQC Standards Navigator — Agentic RAG over NIST PDFs with citations + eval harness.
speed785
Agent Evaluation Harness — write repeatable, measurable evals for AI agents. Python + TypeScript.
mariuscwium
Agentic Development Harness - TypeScript. Quality harnesses for AI-integrated web apps: digital twins, prompt evals, code quality gates, accessibility checks.
Deep-De-coder
Adversarial eval harness for any LLM agent pipeline — Claude, OpenAI, or your own. CLI + REST API + MCP server for Cursor/Antigravity.
alffei
An agent-native harness engineering starter with structured docs, layered architecture, executable guardrails, evals, and CI.
ahmedmusawir
Intended as the first complete agent in a harness, with MCPs, RAG, session file memory, skills, a usage meter, evals, etc. FIRST AGENT: GEMINI ARCHITECT
snehbhaidasna
End-to-end refund agent: deterministic policy engine, RAG with FAISS + Pinecone, NVIDIA Nemotron LLM explanations, LangSmith tracing, eval harnesses, and MCP tool interface. Hands-on architecture exploration using AI-assisted development.
ben-scire
A tiny production-ready agent that fetches web pages, extracts and cleans content, and normalizes it to a simple schema. Ships with a FastAPI endpoint, a CLI, a Docker image, and a toy eval harness.
redinside-dev
AI agent utility: agent-eval-harness
karthikabinav
Agent ecosystem scaffold + eval harness v0
ashstep2
Cross-provider dual-judge scoring across 6 dimensions that predict whether developers will delegate to a coding agent.
opendatahub-io
No description available
bharatkhanna-dev
Lightweight evaluation harness for LLM-based agents — scoring, trajectory analysis, and regression gating with pytest.
SainathPattipati
Framework to benchmark and evaluate multi-agent system performance, accuracy, and efficiency
artemisveizi
No description available
anuragg-saxenaa
No description available
adysinghh
No description available
dragonstyle
No description available
cmangun
Standardized evaluation harness for agentic systems that must be verifiable
opslane
Vue 3 eval fixture for Defender agent harness
opslane
React 18 eval fixture for Defender agent harness
marbatis
No description available