Found 4,224 repositories (showing 30)
toon-format
🎒 Token-Oriented Object Notation (TOON) – a compact, human-readable, schema-aware encoding of JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
LearningCircuit
Local Deep Research achieves ~95% on SimpleQA benchmark (tested with GPT-4.1-mini). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
FreedomIntelligence
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
modelscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
FreedomIntelligence
A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥
harbor-framework
A benchmark for LLMs on complicated tasks in the terminal
mbzuai-oryx
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
OpenGenerativeAI
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
Barca0412
A collection of introductory materials: 1. an open-source tutorial for a multi-factor stock quantitative-trading framework; 2. classic references from academia and industry; 3. related work on AI + finance, including LLM, Agent, benchmark (evaluation), etc.
DEEP-PolyU
[TKDE2025] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL | A curated list of resources (surveys, papers, benchmarks, and opensource projects) on large language model-based text-to-SQL.
vava-nessa
Find, benchmark, and install 200+ FREE coding LLMs across 20+ providers from the CLI, in real time
SakanaAI
Hypernetworks that adapt LLMs to specific benchmark tasks using only a textual task description as input
SalesforceAIResearch
Salesforce Enterprise Deep Research
LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ray-project
LLMPerf is a library for validating and benchmarking LLMs
lmarena
Arena-Hard-Auto: An automatic LLM benchmark.
VILA-Lab
A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
ScalingIntelligence
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
The-FinAI
This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
kagisearch
Minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, Groq, Reka, Together, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with a built-in model performance benchmark. (A minimal usage sketch follows this list.)
MME-Benchmarks
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
codefuse-ai
Industry-first evaluation benchmark for LLMs in the DevOps/AIOps domain.
onejune2018
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of LLMs, aiming to explore the technical frontiers of generative AI.
abacusai
This repository contains code and tooling for the Abacus.AI LLM Context Expansion project. Also included are evaluation scripts and benchmark tasks that evaluate a model’s information retrieval capabilities with context expansion. We also include key experimental results and instructions for reproducing and building on them.
leobeeson
A collection of benchmarks and datasets for evaluating LLMs.
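
Several entries above are pip-installable libraries with one-call interfaces; the kagisearch entry is a compact example. Below is a minimal sketch of querying a model through it, assuming the pyllms-style llms.init / complete API described in that project's README and an OPENAI_API_KEY in the environment; treat the exact model name and the result.meta fields as assumptions to verify against the repo.

    # Minimal sketch of querying an LLM via kagisearch's library (pyllms).
    # Assumes: pip install pyllms, and OPENAI_API_KEY set in the environment.
    import llms

    model = llms.init("gpt-4o-mini")  # provider is inferred from the model name
    result = model.complete("Name three public LLM benchmarks, one line each.")

    print(result.text)   # the model's completion
    print(result.meta)   # latency/token/cost metadata (field per README; verify)

The same init call reportedly accepts a list of model names, which is how the library's built-in benchmark compares providers side by side; see the repo for the exact invocation.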