Found 5,160 repositories (showing 30)
toon-format
🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
jeinlee1991
ReLE: a Chinese AI large-model capability benchmark (continuously updated). Currently covers 359 large models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFLYTEK Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond a leaderboard, it also provides a defect library of over 2 million LLM failure cases so the community can analyze and improve large models.
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
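As a rough sketch of how this kind of monitoring hooks in (based on the AgentOps quickstart; agentops.init, auto-instrumented OpenAI calls, and end_session are assumptions taken from its docs, and the key is a placeholder):

```python
# Minimal AgentOps instrumentation sketch -- not the definitive API,
# just the quickstart pattern: init a session, make LLM calls, end it.
import agentops
from openai import OpenAI

agentops.init(api_key="<your-agentops-key>")  # start a monitored session

client = OpenAI()  # LLM calls are auto-instrumented once init() has run
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)

agentops.end_session("Success")  # flush cost/latency/trace data
```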
LearningCircuit
Local Deep Research achieves ~95% on the SimpleQA benchmark (tested with GPT-4.1-mini). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything local and encrypted.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
FreedomIntelligence
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
modelscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
FreedomIntelligence
A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥
harbor-framework
A benchmark for LLMs on complicated tasks in the terminal
XiongjieDai
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
mbzuai-oryx
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous quantitative evaluation benchmark for video-based conversational models.
OpenGenerativeAI
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM.
Barca0412
A curated collection of introductory materials: 1. an open-source tutorial on a multi-factor equity quant framework; 2. classic references from academia and industry; 3. work on AI + finance, including LLMs, agents, benchmarks (evaluation), etc.
DEEP-PolyU
[TKDE2025] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL | A curated list of resources (surveys, papers, benchmarks, and open-source projects) on large language model-based text-to-SQL.
vava-nessa
Find, benchmark, and install 200+ free coding LLMs across 20+ providers in real time, from the CLI.
SakanaAI
Hypernetworks that adapt LLMs to specific benchmark tasks using only a textual task description as input.
LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ray-project
LLMPerf is a library for validating and benchmarking LLMs
carlini
A benchmark to evaluate language models on questions I've previously asked them to solve.
lmarena
Arena-Hard-Auto: An automatic LLM benchmark.
VILA-Lab
A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
ScalingIntelligence
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
llm2014
No description available
The-FinAI
This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
kagisearch
Minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, Groq, Reka, Together, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with a built-in model performance benchmark.
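A minimal sketch of that uniform interface, assuming the llms.init() / complete() / benchmark() entry points shown in the project README (model names are illustrative; provider keys are read from environment variables):

```python
# Sketch of the pyllms-style uniform interface (init / complete / benchmark
# are assumed from the README; installed as `pyllms`, imported as `llms`).
import llms

model = llms.init("gpt-4o-mini")  # key taken from OPENAI_API_KEY
result = model.complete("What are the three primary colors?")
print(result.text)   # completion text
print(result.meta)   # tokens, cost, latency

# Several providers at once, plus the built-in performance benchmark
models = llms.init(model=["gpt-4o-mini", "claude-3-haiku-20240307"])
models.benchmark()
```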
MME-Benchmarks
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
codefuse-ai
An industry-first evaluation benchmark for LLMs in the DevOps/AIOps domain.