Found 10,359 repositories (showing 30)
mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
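For a sense of the workflow this tagline describes, here is a minimal, hypothetical tracking sketch; the experiment name, parameter, and metric values are illustrative, while set_experiment, start_run, log_param, and log_metric are MLflow's core tracking calls.

```python
# Minimal MLflow tracking sketch; names and values are illustrative.
import mlflow

mlflow.set_experiment("llm-eval-demo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")     # hypothetical model under test
    mlflow.log_metric("answer_relevancy", 0.91)  # hypothetical eval score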
langfuse
Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
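As a rough illustration of the tracing integration, a minimal sketch using the SDK's @observe decorator (module layout varies across SDK versions; this follows the v2-style decorators module, and the function body is a stand-in for a real LLM call):

```python
# Minimal Langfuse tracing sketch; assumes the v2-style decorators module.
from langfuse.decorators import observe

@observe()  # records this function call as a trace in Langfuse
def answer(question: str) -> str:
    return f"Echo: {question}"  # stand-in for an actual LLM call

answer("What does observability mean?")
```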
promptfoo
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration. Used by OpenAI and Anthropic.
comet-ml
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub
Python SDK for agent AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; debugging for multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai
The LLM Evaluation Framework
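A minimal sketch of the pytest-style usage this framework (DeepEval) documents; the test-case contents and threshold are illustrative, and the relevancy metric calls out to an LLM judge at runtime:

```python
# Minimal DeepEval sketch; test-case contents and threshold are illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```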
vibrantlabsai
Supercharge Your LLM Application Evaluations
ShishirPatil
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
dataelement
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its comprehensive features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
Arize-ai
AI Observability & Evaluation
oumi-ai
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
NVIDIA
The LLM vulnerability scanner
evidentlyai
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
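A minimal sketch of Evidently's Report API (layout as of the pre-1.0 releases; the data frames below are synthetic):

```python
# Minimal Evidently drift-report sketch; API layout per the 0.4.x releases.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"score": [0.8, 0.7, 0.9, 0.85]})  # synthetic baseline
current = pd.DataFrame({"score": [0.4, 0.5, 0.3, 0.45]})    # synthetic live data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # interactive HTML summary
```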
open-compass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
jeinlee1991
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond a leaderboard, it also provides a defect database of over 2 million entries for these models, to help the community analyze and improve them.
Helicone
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
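The "one line of code" claim refers to a proxy-style integration: pointing an OpenAI-compatible client at Helicone's gateway. A minimal sketch, with placeholder keys:

```python
# Minimal Helicone proxy-integration sketch; keys are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<OPENAI_API_KEY>",
    base_url="https://oai.helicone.ai/v1",  # route requests through Helicone
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)
# Every request made through `client` is now logged to the Helicone dashboard.
```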
Giskard-AI
Open-Source Evaluation & Testing library for LLM Agents
PacktPublishing
A practical LLM guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
lm-sys
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
Marker-Inc-Korea
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Agenta-AI
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
EvolvingLMMs-Lab
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Tencent
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP Scan, AI Infra Scan, and LLM jailbreak evaluation.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
truera
Evaluation and Tracking for LLM Experiments and AI Agents
langwatch
The platform for LLM evaluations and AI agent testing
FreedomIntelligence
LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.
lmnr-ai
Laminar - open-source observability platform purpose-built for AI agents. YC S24.