Found 222,827 repositories (showing 30)
lm-sys
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for an Agent AI observability, monitoring, and evaluation framework. Includes features like agent, LLM, and tool tracing; debugging of multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai
The LLM Evaluation Framework
trycua
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
vibrantlabsai
Supercharge Your LLM Application Evaluations
ShishirPatil
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
EleutherAI
A framework for few-shot evaluation of language models.
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its comprehensive features include: GenAI workflow, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
facebookresearch
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Theano
Theano was a Python library that allowed you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as PyTensor: www.github.com/pymc-devs/pytensor
Arize-ai
AI Observability & Evaluation
oumi-ai
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
expr-lang
Expression language and expression evaluation for Go
An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
open-compass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
flutter
Flutter Gallery was a resource to help developers evaluate and use Flutter
tensortrade-org
An open source reinforcement learning framework for training, evaluating, and deploying robust trading agents.
GoogleCloudPlatform
Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/CD, evaluation, and observability.
OpenBMB
[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
Helicone
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
coze-dev
Next-generation AI agent optimization platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management, from development, debugging, and evaluation to monitoring.
Giskard-AI
Open-Source Evaluation & Testing library for LLM Agents
rafaelpadilla
Most popular metrics used to evaluate object detection algorithms.
transformerlab
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.