Found 28 repositories (showing 28)
jeinlee1991
ReLE: a Chinese AI large-model capability benchmark (continuously updated). It currently covers 359 models, including commercial models such as ChatGPT, GPT-5.2, o4-mini, Google Gemini-3-Pro, Claude-4.6, Baidu ERNIE-X1.1 and ERNIE-5.0, Qwen3-Max, Qwen3.5-Plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as Step3.5-Flash, Kimi-K2.5, ERNIE-4.5, MiniMax-M2.5, DeepSeek-V3.2, Qwen3.5, Llama4, Zhipu GLM-5, GLM-4.7, LongCat, Gemma3, and Mistral. Beyond a leaderboard, it also provides a defect library of over 2 million model failure cases for the community to analyze and use to improve large models.
AI45Lab
Flames is a highly adversarial Chinese benchmark for evaluating the harmlessness of LLMs, developed by Shanghai AI Lab and the Fudan NLP Group.
opendatalab
[ACL 2024 Main Conference] Chinese commonsense benchmark for LLMs
AI-EDU-LAB
Official GitHub repo for E-Eval, a Chinese K12 education evaluation benchmark for LLMs.
Alibaba-AAIG
Strata-Sword is a hierarchical Chinese-English jailbreak safety benchmark developed in-house by Alibaba-AAIG. It treats reasoning complexity as a quantifiable safety dimension and introduces several Chinese-specific attack methods to systematically evaluate the safety boundaries of LLMs and LRMs at different reasoning-complexity levels, offering new directions for improving model safety.
aistairc
Trilingual (English, Japanese, Chinese) QA benchmark for medical LLMs
luciusssss
[ACL'25 Findings] MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
brucelyu17
[FAccT '25] Characterizing Bias: Benchmarking LLMs in Simplified versus Traditional Chinese
SCUNLP
We present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation
matchyc
[Lunar New Year is approaching!] Benchmark for evaluating LLMs on Chinese kinship term inference (中文亲属关系). Given a relation chain (e.g., "my father's elder brother"), models must output the correct address term (e.g., 伯父). Uses LLM-as-Judge scoring; supports SiliconFlow, OpenRouter, OpenAI, and Gemini.
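The task above is a simple map from a relation chain to an address term, scored by a judge. A minimal sketch of that scoring loop, assuming hypothetical names (`KINSHIP_CASES`, `judge`, `score`) and a plain string match standing in for the real LLM-as-Judge call:

```python
# Minimal sketch of the kinship-inference scoring loop described above.
# All names here are illustrative, not the repository's actual API; the
# real project calls an LLM as the judge rather than matching strings.

# Each case: a relation chain and its gold Chinese address term.
KINSHIP_CASES = [
    ("my father's elder brother", "伯父"),
    ("my mother's younger sister", "姨母"),
]

def judge(model_answer: str, gold: str) -> bool:
    """Stand-in for an LLM-as-Judge call: exact string match."""
    return model_answer.strip() == gold

def score(model, cases=KINSHIP_CASES) -> float:
    """Fraction of relation chains whose answer the judge accepts."""
    correct = sum(judge(model(chain), gold) for chain, gold in cases)
    return correct / len(cases)

# Example with a trivial lookup-table "model" that knows one answer:
lookup = {"my father's elder brother": "伯父"}.get
print(score(lambda q: lookup(q, "")))  # 0.5
```

In the actual benchmark the `judge` step would send the model answer and gold term to a judging LLM, which tolerates surface variation (e.g., 伯父 vs. 大伯) that exact matching cannot.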
Zhihong-Zhu
[EMNLP 2025] CMedCalc-Bench: A Fine-Grained Benchmark for Chinese Medical Calculations in LLM
luozhongze
Multi-Physics: A Comprehensive Benchmark for Multimodal LLM Reasoning on Chinese Multi-Subject Physics Problems
tjunlp-lab
FineMath is a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. It covers the key mathematical concepts taught in elementary-school math, further divided into 17 categories of math word problems, enabling in-depth analysis of LLMs' mathematical reasoning abilities.
xiaofuqing13
Comparative evaluation of domestic Chinese large models: a head-to-head movie-knowledge Q&A comparison across Volcano Engine, Qianfan, Hunyuan, Spark, and Kimi.
FreedomIntelligence
benchmarking medical LLMs in Chinese
iLovEing
Reference code for Chinese LLM inference and evaluation
zhangyitong2021
Paper showcase for Chinese mock politeness benchmarking in LLMs
sirichen2
Paper showcase for Chinese mock politeness benchmarking in LLMs
liudandd
mbtiBias: The First Benchmark for Rootless MBTI Stereotypes in Chinese Conversational LLMs
medicalcloud
a derivative of openai/healthbench for benchmarking LLMs on traditional Japanese/Chinese medicine
rexera
[ACL 2025] Pun2Pun: Benchmarking LLMs on Textual-Visual Chinese-English Pun Translation via Pragmatics Model and Linguistic Reasoning
Lkh97
This repo accompanies the paper "Made in China Thinking in America: U.S. Values Persist in Chinese LLMs". It benchmarks the values of LLMs developed in China and the US.
123adf-dev
This repository contains the code, benchmark, and evaluation framework for TCMI-F-6D (TCMI-Foundation-6Domain Benchmark), which measures the foundational interdisciplinary competency of LLMs in Traditional Chinese Medicine Informatics.
adamholter
ChinaBench — LLM censorship benchmark for China-sensitive prompts. Test how models handle Tibet, Tiananmen, Uyghurs, Taiwan & more.
levor-lawfrobert
BrainBench: A 100-question benchmark exposing commonsense reasoning gaps in LLMs across 20 failure categories. Includes English and Chinese datasets, evaluation code, and results for 8 frontier models.
pedja1904
Config and Profile for Predrag Aranđelović | Lead AI Automation Consultant @ Lus Digital Consulting. Specialized in OpenClaw orchestration, n8n automation, and benchmarking SOTA Chinese LLMs (Qwen/DeepSeek) for B2B SaaS revenue growth.
kekeaii
CLM-Bench: Cross-Lingual Misalignment Benchmark. A culture-aware benchmark for evaluating cross-lingual knowledge editing in Large Language Models (LLMs). CLM-Bench contains 1,010 high-quality Chinese-first CounterFact pairs spanning 24 domains, designed to reveal cross-lingual misalignment in knowledge-editing methods. 📄 [Paper](link-to-paper)
conghuiz
A high-quality benchmark to assess the medical capabilities of LLMs. PUTHMedQA comprises over 53,000 multiple-choice questions (MCQs) and 1,300 multi-answer questions (MAQs) collected from English and Chinese real-world medical examinations, covering 21 organ-specific specialties.