Found 5,160 repositories (showing 30)
toon-format
🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
jeinlee1991
ReLE: a Chinese AI large-model capability benchmark (continuously updated). Currently covers 359 large models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFLYTEK Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond a leaderboard, it also provides a defect library of over 2 million LLM failure cases so the community can analyze and improve large models.
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
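As a rough sketch of how this kind of monitoring hooks in (based on the AgentOps quickstart; agentops.init, auto-instrumented OpenAI calls, and end_session are assumptions taken from its docs, and the key is a placeholder):

```python
# Minimal AgentOps instrumentation sketch -- not the definitive API,
# just the quickstart pattern: init a session, make LLM calls, end it.
import agentops
from openai import OpenAI

agentops.init(api_key="<your-agentops-key>")  # start a monitored session

client = OpenAI()  # LLM calls are auto-instrumented once init() has run
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)

agentops.end_session("Success")  # flush cost/latency/trace data
```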
LearningCircuit
Local Deep Research achieves ~95% on the SimpleQA benchmark (tested with GPT-4.1-mini). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything local and encrypted.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
FreedomIntelligence
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
modelscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
FreedomIntelligence
A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥
harbor-framework
A benchmark for LLMs on complicated tasks in the terminal
XiongjieDai
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
mbzuai-oryx
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous quantitative evaluation benchmark for video-based conversational models.
OpenGenerativeAI
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM.
Barca0412
A curated collection of introductory materials: 1. an open-source tutorial on a multi-factor equity quant framework; 2. classic references from academia and industry; 3. work on AI + finance, including LLMs, agents, benchmarks (evaluation), etc.
DEEP-PolyU
[TKDE2025] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL | A curated list of resources (surveys, papers, benchmarks, and open-source projects) on large language model-based text-to-SQL.
vava-nessa
Find, benchmark, and install 200+ free coding LLMs across 20+ providers in real time, from the CLI.
SakanaAI
Hypernetworks that adapt LLMs to specific benchmark tasks using only a textual task description as input.
LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ray-project
LLMPerf is a library for validating and benchmarking LLMs
carlini
A benchmark to evaluate language models on questions I've previously asked them to solve.
lmarena
Arena-Hard-Auto: An automatic LLM benchmark.
VILA-Lab
A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
ScalingIntelligence
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
llm2014
No description available
The-FinAI
This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
kagisearch
Minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, Groq, Reka, Together, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with a built-in model performance benchmark.
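A minimal sketch of that uniform interface, assuming the llms.init() / complete() / benchmark() entry points shown in the project README (model names are illustrative; provider keys are read from environment variables):

```python
# Sketch of the pyllms-style uniform interface (init / complete / benchmark
# are assumed from the README; installed as `pyllms`, imported as `llms`).
import llms

model = llms.init("gpt-4o-mini")  # key taken from OPENAI_API_KEY
result = model.complete("What are the three primary colors?")
print(result.text)   # completion text
print(result.meta)   # tokens, cost, latency

# Several providers at once, plus the built-in performance benchmark
models = llms.init(model=["gpt-4o-mini", "claude-3-haiku-20240307"])
models.benchmark()
```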
MME-Benchmarks
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
codefuse-ai
An industry-first evaluation benchmark for LLMs in the DevOps/AIOps domain.