Found 1,165 repositories (showing 30)
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for an agent AI observability, monitoring, and evaluation framework. Features include agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SFT, Dataset Management, Enterprise-level System Management, Observability, and more.
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
agiresearch
OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems
sgl-project
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
kolenaIO
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
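Head-to-head results like the ones this entry describes are commonly turned into a ranking with an Elo-style update; here is a minimal sketch of that general idea (the match log, model names, and K-factor are illustrative, not kolenaIO's actual method):

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    # Expected score of A against B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    # Shift both ratings by the surprise of the observed win.
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)
    ratings[loser] -= k * (1.0 - e_win)

# Hypothetical match log: (winner, loser) pairs from pairwise judgments.
matches = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)
for winner, loser in matches:
    update_elo(ratings, winner, loser)

# Highest rating first: the head-to-head ranking.
print(sorted(ratings.items(), key=lambda item: -item[1]))
```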
flowaicom
Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafted for accuracy, speed, and customization.
Anni-Zou
DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems
whitecircle-ai
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
OpenDCAI
Automated system for LLM evaluation via agents.
Alqemist-labs
LLM evaluation framework for Ruby, powered by RubyLLM. Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, measuring response quality, and ensuring safety. Perfect for RAG systems, chatbots, and any LLM-powered application.
xyh4ck
SmartSafe LLM Evaluation System
benitomartin
LLM Evaluation and Observability System for Football Content
meshkovQA
Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
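The Generalized Power Mean named above has the standard closed form M_p(x) = ((1/n) Σᵢ xᵢᵖ)^(1/p); below is a minimal sketch of temperature-controlled verdict aggregation under an assumed temperature-to-exponent mapping (not this framework's actual scheme):

```python
import math

def power_mean(scores, p):
    # Generalized power mean M_p: p=1 is the arithmetic mean, p->0 the
    # geometric mean; large positive p approaches max, large negative p min.
    if p == 0:
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def aggregate_verdicts(scores, temperature=1.0):
    # Assumed mapping for illustration: lower temperature -> larger exponent,
    # i.e., aggregation that leans toward the strongest verdict.
    return power_mean(scores, p=1.0 / temperature)

verdicts = [0.2, 0.8, 0.9]  # per-judge scores in (0, 1]
print(aggregate_verdicts(verdicts, temperature=1.0))   # plain arithmetic mean
print(aggregate_verdicts(verdicts, temperature=0.25))  # leans toward the max
```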
LEXam-Benchmark
[ICLR 2026] This repo provides code for evaluating LLMs on LEXam, a comprehensive benchmark of AI systems' legal reasoning ability on law exam questions, using both open-ended and multiple-choice questions.
jinsong8
[AAAI 2026] The official implementation of the paper "FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation"
abman23
On-device AI/LLM communication system | Pre-trained LLM integrated with 5G-NR PHY | Fine-tuned BART on noisy 3GPP CDL channels | 50% compression via quantization | Evaluated with NVIDIA Sionna LLS
ngtranminhtuan
NLP/LLM MLOps pipeline for development, training, and evaluation, with scalable deployment and monitoring.
DALYBIGAS
GAMMA-v2: An end-to-end co-design simulation framework integrating gem5 and MLIR, enabling LLM and operator-level workload modeling, configurable accelerator generation, and system-level evaluation for mapping and architecture exploration.
stratosphereips
Multi-turn Injection Planning System for LLM Evaluation
JayJhaveri1906
Automated systems are crucial for summarizing medical information. Large Language Models (LLMs) show promise in healthcare, specifically for closed-book generative Q&A. This study compares general and medical-specific LMs, evaluates their performance on medical Q&A, and provides insights into their suitability for medical applications.
MichaelYang-lyx
A benchmark and corresponding evaluation system for LLMs.
Siddharth-1001
An open-source evaluation framework specifically for agentic systems — not just LLM outputs, but full agent behavior.
Ian-Tharp
C.O.R.E. is an all-encompassing cognitive architecture I designed as a system for enabling AI technologies to interact fully as a personalized assistant. Autonomous agentic building, workflows, memory, and evolution are just the beginning of what CORE (Comprehension, Orchestration, Reasoning, Evaluation) can enable with LLM technologies. Vibe coded!
lishangyu-hkust
(AAAI 2026) OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks.
ayulockin
A simple repository showcasing a few LLM evaluation strategies and leveraging W&B Sweeps to optimize the LLM system.
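For context on that last pattern, a minimal W&B Sweeps loop over LLM sampling settings might look like the following (the swept parameters and the stubbed scoring function are hypothetical, not taken from this repo):

```python
import random
import wandb

def run_eval(temperature, top_p):
    # Stand-in for a real evaluation harness; returns a mock quality score.
    return random.random()

def evaluate():
    wandb.init()
    cfg = wandb.config
    score = run_eval(cfg.temperature, cfg.top_p)
    wandb.log({"eval_score": score})

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval_score", "goal": "maximize"},
    "parameters": {
        "temperature": {"min": 0.0, "max": 1.5},
        "top_p": {"values": [0.8, 0.9, 1.0]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llm-eval-sweeps")
wandb.agent(sweep_id, function=evaluate, count=20)
```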