Found 222,827 repositories (showing 30)
lm-sys
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for an Agent AI observability, monitoring, and evaluation framework. Includes features like agent, LLM, and tool tracing; debugging of multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai
The LLM Evaluation Framework
trycua
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
vibrantlabsai
Supercharge Your LLM Application Evaluations
ShishirPatil
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
EleutherAI
A framework for few-shot evaluation of language models.
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its comprehensive features include: GenAI workflow, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
facebookresearch
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Theano
Theano was a Python library that allowed you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as PyTensor: www.github.com/pymc-devs/pytensor
Arize-ai
AI Observability & Evaluation
oumi-ai
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
expr-lang
Expression language and expression evaluation for Go
An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
open-compass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
flutter
Flutter Gallery was a resource to help developers evaluate and use Flutter
tensortrade-org
An open source reinforcement learning framework for training, evaluating, and deploying robust trading agents.
GoogleCloudPlatform
Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/CD, evaluation, and observability.
OpenBMB
[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
Helicone
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
coze-dev
Next-generation AI agent optimization platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management, from development, debugging, and evaluation to monitoring.
Giskard-AI
Open-Source Evaluation & Testing library for LLM Agents
rafaelpadilla
Most popular metrics used to evaluate object detection algorithms.
transformerlab
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.