Found 60 repositories (showing 30)
khromov
An LLM benchmark for Svelte 5 based on the methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code".
aws-samples
No description available
tryDML
An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions.
CakeCrusher
Implements the methodology outlined in the paper *Cultural Evolution of Cooperation among LLM Agents*, which explores whether a society of large language model (LLM) agents can develop cooperative norms through cultural evolution, using the classic *Donor Game*. The goal is to evaluate multi-agent interaction dynamics.
avnlp
Training code for advanced RAG techniques - Adaptive-RAG, Corrective RAG, RQ-RAG, Self-RAG, Agentic RAG, and ReZero. Reproduces paper methodologies to fine-tune LLMs via SFT and GRPO for adaptive retrieval, corrective evaluation, query refinement, self-reflection, and agentic search behaviors.
sebuzdugan
FRAI Benchmark (Future Responsible AI Evaluation): a consensus-based safety and compliance benchmark for SOTA LLMs (DeepSeek, Grok, GPT-5). Uses a "Panel of Experts" methodology in which multiple frontier models judge response quality to produce unbiased, high-fidelity safety scores.
e-xperiments
This repository provides tools and methodologies to evaluate and curate datasets designed for Large Language Models (LLMs). Leveraging LLMs themselves alongside programmatic best practices, the toolkit offers a robust evaluation and refinement process for your datasets.
5krus
LLM-based Hypothesize-Test-Evaluate methodology.
nsatpute
This project creates a framework to evaluate and ensure fairness in Large Language Models (LLMs). It focuses on detecting and mitigating biases through model testing, red teaming with adversarial prompts, and robust scoring methodologies. The aim is to ensure AI-generated content is ethical, fair, and legally compliant across diverse use cases.
EladAriel
A comprehensive toolkit for evaluating and building LLM applications, combining systematic evaluation methodology with practical RAG implementation.
arthurcerveira
Methodology to pre-train and evaluate an LLM for the Portuguese language.
caspiankeyes
AART provides security researchers, AI labs, and red teams with a structured framework for conducting thorough adversarial evaluations of LLM systems. The framework implements a multi-dimensional assessment methodology that systematically probes model boundaries, quantifies security vulnerabilities, and benchmarks defensive robustness in frontier AI.
Palmerschallon
Polyglot ontological activations for LLM systems. 68 terms from 20+ traditions mapped to computational patterns, plus 10 algorithms native to the ontology with no equivalents in standard CS. Includes a benchmark suite and a documented evaluation-methodology confound finding.
maczg
Dataset containing 269 privacy policy evaluations by multiple LLMs (o1, o3, o4-mini, qwen3) using the PrivacySpy scoring methodology. Includes detailed rubric assessments, policy citations, and model performance metrics in JSON, CSV, and Parquet formats.
hitz-zentroa
Code-switching Generation using LLMs: Methodology and Evaluation
jeffersonmonkam
Audio-based evaluation of LLM function calling behavior with structured QA methodology
tetyana-s
A Promptfoo evaluation suite implementing a systematic evaluation methodology to ensure LLM reliability, safety, and performance across various scenarios.
studiawan
A standardized methodology and dataset for evaluating LLM-based digital forensic timeline analysis
mukulchhabra23
Case-aware evaluation framework for enterprise RAG systems using LLM-as-a-judge methodology. Includes reproducible pipeline, metrics, and benchmarking tools.
rasiulyte
Example of an LLM hallucination detection system with A/B testing, drift monitoring, and an interactive dashboard, demonstrating AI evaluation methodology and statistical testing.
closestfriend
Research repository for Brie: LLM-assisted data authoring methodology achieving 91% win rates. Training infrastructure, evaluation framework, and paper documenting small-data domain adaptation.
novaiteraresearch
Preprint on provenance-aware architectures for epistemic reliability in LLMs, by Ariel J. Furlow (Nova Itera Research Group LLC). Timestamped provenance anchor for conceptual frameworks and evaluation methodologies.
anacsmelo
Final Course Project in Biomedical Engineering, whose objective is to evaluate the methodological quality of evidence syntheses generated by LLMs, using the PICOS framework.
Armand394
Question answering on semi-structured tables for the Data Science Methodology project. Builds a pipeline to (i) process and clean semi-structured tables, (ii) use SOTA LLMs to convert questions to SQL queries, and (iii) define an extensive evaluation protocol to evaluate and test the method.
caspiankeyes
A comprehensive model evaluation infrastructure that extends existing adversarial testing paradigms by establishing a unified, recursive methodology for LLM security assessment. Unlike previous approaches that treat security as an add-on consideration, FRAME applies a pluralist lens to multidomain security.
nibzard
LLM evaluation pipeline implementing Eugene Yan's Product Evals methodology
yiqunchen
LLM Evaluation of Causal Claims and Methodological Assumptions
HamiltonMussi
LLM response correctness evaluation system using LlamaIndex methodology.
FAIR-IALAB-UBA
Experiments and evaluations on LLM + Regex methodology
aprilatkinson
LLM evaluation lab using LLM-as-Judge methodology, custom benchmarks, and a Python implementation (see the judge-scoring sketch after this list).
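
Several of the listed repositories (e.g., mukulchhabra23, aprilatkinson) rely on an LLM-as-Judge methodology. As a rough illustration only, the sketch below shows the general shape of such a scoring loop; the 1-5 rubric, the regex-based score extraction, and the `judge` callable interface are assumptions for illustration and are not taken from any repository above.

```python
# Minimal sketch of an LLM-as-a-judge scoring loop.
# Assumption: the judge is any callable mapping a prompt string to a response string.
import re
from typing import Callable

RUBRIC = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's correctness from 1 (wrong) to 5 (fully correct).\n"
    "Reply with only the number."
)

def judge_answer(question: str, answer: str, judge: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    reply = judge(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group())

if __name__ == "__main__":
    # Stand-in judge for demonstration; in practice this would call a real model.
    fake_judge = lambda prompt: "4"
    print(judge_answer("What is 2 + 2?", "4", fake_judge))
```

In practice the `judge` callable would wrap a call to whichever model provider a given repository uses; the parsing and rubric details vary by project.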