Found 1,165 repositories (showing 30)
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for an agent AI observability, monitoring, and evaluation framework. Features include agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SFT, Dataset Management, Enterprise-level System Management, Observability, and more.
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
agiresearch
OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems
sgl-project
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
kolenaIO
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
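Head-to-head results like the ones this entry describes are commonly turned into a ranking with an Elo-style update; here is a minimal sketch of that general idea (the match log, model names, and K-factor are illustrative, not kolenaIO's actual method):

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    # Expected score of A against B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    # Shift both ratings by the surprise of the observed win.
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)
    ratings[loser] -= k * (1.0 - e_win)

# Hypothetical match log: (winner, loser) pairs from pairwise judgments.
matches = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)
for winner, loser in matches:
    update_elo(ratings, winner, loser)

# Highest rating first: the head-to-head ranking.
print(sorted(ratings.items(), key=lambda item: -item[1]))
```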
flowaicom
Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafted for accuracy, speed, and customization.
Anni-Zou
DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems
whitecircle-ai
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
OpenDCAI
Automated system for LLM evaluation via agents.
Alqemist-labs
LLM evaluation framework for Ruby, powered by RubyLLM. Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, measuring response quality, and ensuring safety. Perfect for RAG systems, chatbots, and any LLM-powered application.
xyh4ck
SmartSafe LLM Evaluation System
benitomartin
LLM Evaluation and Observability System for Football Content
meshkovQA
Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
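The Generalized Power Mean named above has the standard closed form M_p(x) = ((1/n) Σᵢ xᵢᵖ)^(1/p); below is a minimal sketch of temperature-controlled verdict aggregation under an assumed temperature-to-exponent mapping (not this framework's actual scheme):

```python
import math

def power_mean(scores, p):
    # Generalized power mean M_p: p=1 is the arithmetic mean, p->0 the
    # geometric mean; large positive p approaches max, large negative p min.
    if p == 0:
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def aggregate_verdicts(scores, temperature=1.0):
    # Assumed mapping for illustration: lower temperature -> larger exponent,
    # i.e., aggregation that leans toward the strongest verdict.
    return power_mean(scores, p=1.0 / temperature)

verdicts = [0.2, 0.8, 0.9]  # per-judge scores in (0, 1]
print(aggregate_verdicts(verdicts, temperature=1.0))   # plain arithmetic mean
print(aggregate_verdicts(verdicts, temperature=0.25))  # leans toward the max
```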
LEXam-Benchmark
[ICLR 2026] This repo provides code for evaluating LLMs on LEXam, a comprehensive benchmark of AI systems' legal reasoning ability on law exam questions, using both open-ended and multiple-choice questions.
jinsong8
[AAAI 2026] The official implementation of the paper "FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation"
abman23
On-device AI/LLM communication system | Pre-trained LLM integrated with 5G-NR PHY | Fine-tuned BART on noisy 3GPP CDL channels | 50% compression via quantization | Evaluated with NVIDIA Sionna LLS
ngtranminhtuan
NLP/LLM MLOps pipeline for development, training, and evaluation, with scalable deployment and monitoring.
DALYBIGAS
GAMMA-v2: An end-to-end co-design simulation framework integrating gem5 and MLIR, enabling LLM and operator-level workload modeling, configurable accelerator generation, and system-level evaluation for mapping and architecture exploration.
stratosphereips
Multi-turn Injection Planning System for LLM Evaluation
JayJhaveri1906
Automated systems are crucial for summarizing medical information. Large Language Models (LLMs) show promise in healthcare, specifically for closed-book generative Q&A. This study compares general and medical-specific LMs, evaluates their performance on medical Q&A, and provides insights into their suitability for medical applications.
MichaelYang-lyx
A benchmark and corresponding evaluation system for LLMs.
Siddharth-1001
An open-source evaluation framework specifically for agentic systems — not just LLM outputs, but full agent behavior.
Ian-Tharp
C.O.R.E. is an all-encompassing cognitive architecture I designed as a system for enabling AI technologies to interact fully as a personalized assistant. Autonomous agentic building, workflows, memory, and evolution are just the beginning of what CORE (Comprehension, Orchestration, Reasoning, Evaluation) can enable with LLM technologies. Vibe coded!
lishangyu-hkust
(AAAI 2026) OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks.
ayulockin
A simple repository showcasing a few LLM evaluation strategies and leveraging W&B Sweeps to optimize the LLM system.
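For context on that last pattern, a minimal W&B Sweeps loop over LLM sampling settings might look like the following (the swept parameters and the stubbed scoring function are hypothetical, not taken from this repo):

```python
import random
import wandb

def run_eval(temperature, top_p):
    # Stand-in for a real evaluation harness; returns a mock quality score.
    return random.random()

def evaluate():
    wandb.init()
    cfg = wandb.config
    score = run_eval(cfg.temperature, cfg.top_p)
    wandb.log({"eval_score": score})

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval_score", "goal": "maximize"},
    "parameters": {
        "temperature": {"min": 0.0, "max": 1.5},
        "top_p": {"values": [0.8, 0.9, 1.0]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llm-eval-sweeps")
wandb.agent(sweep_id, function=evaluate, count=20)
```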