Found 4,502 repositories (showing 30)
ConardLi
A powerful tool for creating datasets for LLM fine-tuning, RAG, and evaluation
MLGroupJLU
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
tjunlp-lab
The papers are organized according to our survey "Evaluating Large Language Models: A Comprehensive Survey".
onejune2018
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aimed at probing the technical boundaries of generative AI.
abacaj
Run evaluations of LLMs using the HumanEval benchmark
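For context on what HumanEval-style harnesses report: the standard metric is pass@k, estimated with the unbiased formula from the Codex paper, pass@k = E[1 - C(n-c, k)/C(n, k)], where n samples are generated per problem and c of them pass the tests. A minimal sketch (the function name is mine, not this repo's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 37 passing: estimate pass@10
print(pass_at_k(n=200, c=37, k=10))
```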
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
claw-eval
Claw-Eval is an evaluation harness for evaluating LLMs as agents. All tasks are verified by humans.
redotvideo
LLM fine-tuning and eval
rajshah4
Sample notebooks and prompts for LLM evaluation
llm-jp
No description available
swordlidev
A Survey on Benchmarks of Multimodal Large Language Models
Turing-Project
Scenario-based evaluation dataset for LLMs (beta)
jayminban
This project benchmarks 41 open-source large language models across 19 evaluation tasks using the lm-evaluation-harness library.
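lm-evaluation-harness (EleutherAI) is the library this project builds on; a minimal usage sketch of its Python entry point, assuming v0.4+ of the library. The model and tasks shown here are illustrative placeholders, not the project's actual 41-model, 19-task setup:

```python
import lm_eval

# Evaluate a Hugging Face model on two standard tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. accuracy
```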
gordicaleksa
Serbian LLM Eval.
Asaf-Yehudai
Top papers related to LLM-based agent evaluation
justplus
An evaluation platform for large language models, supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation based on custom datasets.
cyberark
Simple LLM evaluation using LLM-as-a-judge 👩‍⚖️
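The pattern behind judge-style tools like this is roughly: wrap the candidate answer in a rubric prompt, ask a (usually stronger) model for a score, and parse the reply. A minimal sketch, assuming a hypothetical text-in/text-out client `call_llm` (not cyberark's actual API):

```python
import re

JUDGE_PROMPT = """You are an impartial judge. Rate the answer below from 1 to 10
for correctness and helpfulness. Reply with only the number.

Question: {question}
Answer: {answer}
"""

def judge(question: str, answer: str, call_llm) -> int:
    """Score an answer with a judge model; call_llm is any prompt -> str client."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"judge returned no parsable score: {reply!r}")
    return int(match.group())
```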
grigio
llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection
tuhh-softsec
No description available
OSU-NLP-Group
[ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"
mkurman
Fine-tunes a student LLM using teacher feedback for improved reasoning and answer quality. Implements GRPO with teacher-provided evaluations.
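For reference, the step that distinguishes GRPO from PPO is that advantages are computed relative to a group of sampled completions for the same prompt rather than from a learned value network: each reward is z-scored against its group. A sketch of that normalization step only, under my own naming (not mkurman's code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each reward within its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# e.g. teacher-provided rewards for 4 completions of one prompt
print(group_relative_advantages([0.2, 0.9, 0.4, 0.9]))
```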
h2oai
Large language model evaluation framework with an Elo leaderboard and A/B testing
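Elo leaderboards of this kind typically update ratings pairwise after each A/B comparison using the standard update R' = R + K(S - E), with expected score E = 1/(1 + 10^((R_opp - R)/400)). A minimal sketch (the K-factor of 32 is a common default, not necessarily h2oai's choice):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise Elo update; score_a is 1.0 for an A win, 0.5 tie, 0.0 loss."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# model A (1500) beats model B (1520) in a head-to-head comparison
print(elo_update(1500, 1520, score_a=1.0))
```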
VectorBoard
Open-source embeddings optimisation and evaluation framework for RAG/LLM applications. Documentation at https://docs.vectorboard.ai/introduction
friendliai
No description available
aws-samples
No description available
daekeun-ml
Performs benchmarking on two Korean datasets with minimal time and effort.
flexpa
Benchmarking Large Language Models for FHIR
Code space for 'Evaluation of large language models for discovery of gene set function'
percent4
Using an LLM to evaluate the MMLU dataset.
llm-jp
A lightweight framework for evaluating vision-language models.