Found 4,224 repositories (showing 30)
toon-format
🎒 Token-Oriented Object Notation (TOON) – a compact, human-readable, schema-aware encoding of JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
openai
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
AgentOps-AI
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
LearningCircuit
Local Deep Research achieves ~95% on SimpleQA benchmark (tested with GPT-4.1-mini). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted.
THUDM
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
FreedomIntelligence
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
modelscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
FreedomIntelligence
A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥
harbor-framework
A benchmark for LLMs on complicated tasks in the terminal
mbzuai-oryx
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
OpenGenerativeAI
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
Barca0412
A collection of introductory materials: 1. an open-source tutorial for a multi-factor stock quantitative-trading framework; 2. classic references from academia and industry; 3. related work on AI + finance, including LLM, Agent, benchmark (evaluation), etc.
DEEP-PolyU
[TKDE2025] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL | A curated list of resources (surveys, papers, benchmarks, and opensource projects) on large language model-based text-to-SQL.
vava-nessa
Find, benchmark, and install 200+ FREE coding LLMs across 20+ providers from the CLI, in real time
SakanaAI
Hypernetworks that adapt LLMs to specific benchmark tasks using only a textual task description as input
SalesforceAIResearch
Salesforce Enterprise Deep Research
LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ray-project
LLMPerf is a library for validating and benchmarking LLMs
lmarena
Arena-Hard-Auto: An automatic LLM benchmark.
VILA-Lab
A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
pinchbench
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
OSU-NLP-Group
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist web agents
ScalingIntelligence
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
The-FinAI
This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
kagisearch
Minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, Groq, Reka, Together, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with a built-in model performance benchmark. (A minimal usage sketch follows this list.)
MME-Benchmarks
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
codefuse-ai
Industry-first evaluation benchmark for LLMs in the DevOps/AIOps domain.
onejune2018
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of LLMs, aiming to explore the technical frontiers of generative AI.
abacusai
This repository contains code and tooling for the Abacus.AI LLM Context Expansion project. Also included are evaluation scripts and benchmark tasks that evaluate a model’s information retrieval capabilities with context expansion. We also include key experimental results and instructions for reproducing and building on them.
leobeeson
A collection of benchmarks and datasets for evaluating LLMs.
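
Several entries above are pip-installable libraries with one-call interfaces; the kagisearch entry is a compact example. Below is a minimal sketch of querying a model through it, assuming the pyllms-style llms.init / complete API described in that project's README and an OPENAI_API_KEY in the environment; treat the exact model name and the result.meta fields as assumptions to verify against the repo.

    # Minimal sketch of querying an LLM via kagisearch's library (pyllms).
    # Assumes: pip install pyllms, and OPENAI_API_KEY set in the environment.
    import llms

    model = llms.init("gpt-4o-mini")  # provider is inferred from the model name
    result = model.complete("Name three public LLM benchmarks, one line each.")

    print(result.text)   # the model's completion
    print(result.meta)   # latency/token/cost metadata (field per README; verify)

The same init call reportedly accepts a list of model names, which is how the library's built-in benchmark compares providers side by side; see the repo for the exact invocation.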