Found 10,359 repositories (showing 30)
mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
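For a sense of the workflow this tagline describes, here is a minimal, hypothetical tracking sketch; the experiment name, parameter, and metric values are illustrative, while set_experiment, start_run, log_param, and log_metric are MLflow's core tracking calls.

```python
# Minimal MLflow tracking sketch; names and values are illustrative.
import mlflow

mlflow.set_experiment("llm-eval-demo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")     # hypothetical model under test
    mlflow.log_metric("answer_relevancy", 0.91)  # hypothetical eval score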
langfuse
Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
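As a rough illustration of the tracing integration, a minimal sketch using the SDK's @observe decorator (module layout varies across SDK versions; this follows the v2-style decorators module, and the function body is a stand-in for a real LLM call):

```python
# Minimal Langfuse tracing sketch; assumes the v2-style decorators module.
from langfuse.decorators import observe

@observe()  # records this function call as a trace in Langfuse
def answer(question: str) -> str:
    return f"Echo: {question}"  # stand-in for an actual LLM call

answer("What does observability mean?")
```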
promptfoo
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration. Used by OpenAI and Anthropic.
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for agent AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; debugging for multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai
The LLM Evaluation Framework
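A minimal sketch of the pytest-style usage this framework (DeepEval) documents; the test-case contents and threshold are illustrative, and the relevancy metric calls out to an LLM judge at runtime:

```python
# Minimal DeepEval sketch; test-case contents and threshold are illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```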
vibrantlabsai
Supercharge Your LLM Application Evaluations
ShishirPatil
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its comprehensive features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
Arize-ai
AI Observability & Evaluation
oumi-ai
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
NVIDIA
The LLM vulnerability scanner
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
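A minimal sketch of Evidently's Report API (layout as of the pre-1.0 releases; the data frames below are synthetic):

```python
# Minimal Evidently drift-report sketch; API layout per the 0.4.x releases.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"score": [0.8, 0.7, 0.9, 0.85]})  # synthetic baseline
current = pd.DataFrame({"score": [0.4, 0.5, 0.3, 0.45]})    # synthetic live data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # interactive HTML summary
```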
open-compass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
jeinlee1991
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond a leaderboard, it also provides a defect database of over 2 million entries for these models, to help the community analyze and improve them.
Helicone
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
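The "one line of code" claim refers to a proxy-style integration: pointing an OpenAI-compatible client at Helicone's gateway. A minimal sketch, with placeholder keys:

```python
# Minimal Helicone proxy-integration sketch; keys are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<OPENAI_API_KEY>",
    base_url="https://oai.helicone.ai/v1",  # route requests through Helicone
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)
# Every request made through `client` is now logged to the Helicone dashboard.
```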
Giskard-AI
Open-Source Evaluation & Testing library for LLM Agents
PacktPublishing
A practical LLM guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
lm-sys
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
Marker-Inc-Korea
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Agenta-AI
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
EvolvingLMMs-Lab
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Tencent
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP Scan, AI Infra Scan, and LLM jailbreak evaluation.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
truera
Evaluation and Tracking for LLM Experiments and AI Agents
langwatch
The platform for LLM evaluations and AI agent testing
FreedomIntelligence
LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.
lmnr-ai
Laminar - open-source observability platform purpose-built for AI agents. YC S24.