Found 28 repositories (showing 28)
jeinlee1991
ReLE: a Chinese AI large-model capability benchmark (continuously updated). It currently covers 359 models, including commercial models such as ChatGPT, GPT-5.2, o4-mini, Google Gemini-3-Pro, Claude-4.6, Baidu ERNIE-X1.1 and ERNIE-5.0, Qwen3-Max, Qwen3.5-Plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as Step3.5-Flash, Kimi-K2.5, ERNIE-4.5, MiniMax-M2.5, DeepSeek-V3.2, Qwen3.5, Llama4, Zhipu GLM-5, GLM-4.7, LongCat, Gemma3, and Mistral. Beyond a leaderboard, it also provides a defect library of over 2 million model failure cases for the community to analyze and use to improve large models.
AI45Lab
Flames is a highly adversarial Chinese benchmark for evaluating the harmlessness of LLMs, developed by Shanghai AI Lab and the Fudan NLP Group.
opendatalab
[ACL 2024 Main Conference] Chinese commonsense benchmark for LLMs
AI-EDU-LAB
Official GitHub repo for E-Eval, a Chinese K12 education evaluation benchmark for LLMs.
Alibaba-AAIG
Strata-Sword is a hierarchical Chinese-English jailbreak safety benchmark developed in-house by Alibaba-AAIG. It treats reasoning complexity as a quantifiable safety dimension and introduces several Chinese-specific attack methods to systematically evaluate the safety boundaries of LLMs and LRMs at different reasoning-complexity levels, offering new directions for improving model safety.
aistairc
Trilingual (English, Japanese, Chinese) QA benchmark for medical LLMs
luciusssss
[ACL'25 Findings] MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
brucelyu17
[FAccT '25] Characterizing Bias: Benchmarking LLMs in Simplified versus Traditional Chinese
SCUNLP
We present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation
matchyc
[Lunar New Year is approaching!] Benchmark for evaluating LLMs on Chinese kinship term inference (中文亲属关系). Given a relation chain (e.g., "my father's elder brother"), models must output the correct address term (e.g., 伯父). Uses LLM-as-Judge scoring; supports SiliconFlow, OpenRouter, OpenAI, and Gemini.
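The task above is a simple map from a relation chain to an address term, scored by a judge. A minimal sketch of that scoring loop, assuming hypothetical names (`KINSHIP_CASES`, `judge`, `score`) and a plain string match standing in for the real LLM-as-Judge call:

```python
# Minimal sketch of the kinship-inference scoring loop described above.
# All names here are illustrative, not the repository's actual API; the
# real project calls an LLM as the judge rather than matching strings.

# Each case: a relation chain and its gold Chinese address term.
KINSHIP_CASES = [
    ("my father's elder brother", "伯父"),
    ("my mother's younger sister", "姨母"),
]

def judge(model_answer: str, gold: str) -> bool:
    """Stand-in for an LLM-as-Judge call: exact string match."""
    return model_answer.strip() == gold

def score(model, cases=KINSHIP_CASES) -> float:
    """Fraction of relation chains whose answer the judge accepts."""
    correct = sum(judge(model(chain), gold) for chain, gold in cases)
    return correct / len(cases)

# Example with a trivial lookup-table "model" that knows one answer:
lookup = {"my father's elder brother": "伯父"}.get
print(score(lambda q: lookup(q, "")))  # 0.5
```

In the actual benchmark the `judge` step would send the model answer and gold term to a judging LLM, which tolerates surface variation (e.g., 伯父 vs. 大伯) that exact matching cannot.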
Zhihong-Zhu
[EMNLP 2025] CMedCalc-Bench: A Fine-Grained Benchmark for Chinese Medical Calculations in LLM
luozhongze
Multi-Physics: A Comprehensive Benchmark for Multimodal LLM Reasoning on Chinese Multi-Subject Physics Problems
tjunlp-lab
FineMath is a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. It covers the key mathematical concepts taught in elementary-school math, further divided into 17 categories of math word problems, enabling in-depth analysis of LLMs' mathematical reasoning abilities.
xiaofuqing13
Comparative evaluation of domestic Chinese large models: a head-to-head movie-knowledge Q&A comparison across Volcano Engine, Qianfan, Hunyuan, Spark, and Kimi.
FreedomIntelligence
benchmarking medical LLMs in Chinese
iLovEing
Reference code for Chinese LLM inference and evaluation
zhangyitong2021
Paper showcase for Chinese mock politeness benchmarking in LLMs
sirichen2
Paper showcase for Chinese mock politeness benchmarking in LLMs
liudandd
mbtiBias: The First Benchmark for Rootless MBTI Stereotypes in Chinese Conversational LLMs
medicalcloud
a derivative of openai/healthbench for benchmarking LLMs on traditional Japanese/Chinese medicine
rexera
[ACL 2025] Pun2Pun: Benchmarking LLMs on Textual-Visual Chinese-English Pun Translation via Pragmatics Model and Linguistic Reasoning
Lkh97
This repo accompanies the paper "Made in China Thinking in America: U.S. Values Persist in Chinese LLMs". It benchmarks the values of LLMs developed in China and the US.
123adf-dev
This repository contains the code, benchmark, and evaluation framework for TCMI-F-6D (TCMI-Foundation-6Domain Benchmark), which measures the foundational interdisciplinary competency of LLMs in Traditional Chinese Medicine Informatics.
adamholter
ChinaBench — LLM censorship benchmark for China-sensitive prompts. Test how models handle Tibet, Tiananmen, Uyghurs, Taiwan & more.
levor-lawfrobert
BrainBench: A 100-question benchmark exposing commonsense reasoning gaps in LLMs across 20 failure categories. Includes English and Chinese datasets, evaluation code, and results for 8 frontier models.
pedja1904
Config and Profile for Predrag Aranđelović | Lead AI Automation Consultant @ Lus Digital Consulting. Specialized in OpenClaw orchestration, n8n automation, and benchmarking SOTA Chinese LLMs (Qwen/DeepSeek) for B2B SaaS revenue growth.
kekeaii
CLM-Bench: Cross-Lingual Misalignment Benchmark. A culture-aware benchmark for evaluating cross-lingual knowledge editing in Large Language Models (LLMs). CLM-Bench contains 1,010 high-quality Chinese-first CounterFact pairs spanning 24 domains, designed to reveal cross-lingual misalignment in knowledge-editing methods. 📄 [Paper](link-to-paper)
conghuiz
A high-quality benchmark to assess the medical capabilities of LLMs. PUTHMedQA comprises over 53,000 multiple-choice questions (MCQs) and 1,300 multi-answer questions (MAQs) collected from English and Chinese real-world medical examinations, covering 21 organ-specific specialties.