Found 60 repositories (showing 30)
khromov
An LLM benchmark for Svelte 5 based on the methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code".
aws-samples
No description available
tryDML
An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions.
CakeCrusher
Implements the methodology outlined in the paper *Cultural Evolution of Cooperation among LLM Agents*, which explores whether a society of large language model (LLM) agents can develop cooperative norms through cultural evolution, using the classic *Donor Game*. The goal is to evaluate multi-agent interaction dynamics.
avnlp
Training code for advanced RAG techniques - Adaptive-RAG, Corrective RAG, RQ-RAG, Self-RAG, Agentic RAG, and ReZero. Reproduces paper methodologies to fine-tune LLMs via SFT and GRPO for adaptive retrieval, corrective evaluation, query refinement, self-reflection, and agentic search behaviors.
sebuzdugan
FRAI Benchmark (Future Responsible AI Evaluation): a consensus-based safety and compliance benchmark for SOTA LLMs (DeepSeek, Grok, GPT-5). Uses a "Panel of Experts" methodology in which multiple frontier models judge response quality to produce unbiased, high-fidelity safety scores.
e-xperiments
This repository provides tools and methodologies to evaluate and curate datasets designed for Large Language Models (LLMs). Leveraging LLMs themselves alongside programmatic best practices, the toolkit offers a robust evaluation and refinement process for your datasets.
5krus
LLM-based Hypothesize-Test-Evaluate methodology.
nsatpute
This project creates a framework to evaluate and ensure fairness in Large Language Models (LLMs). It focuses on detecting and mitigating biases through model testing, red teaming with adversarial prompts, and robust scoring methodologies. The aim is to ensure AI-generated content is ethical, fair, and legally compliant across diverse use cases.
EladAriel
A comprehensive toolkit for evaluating and building LLM applications, combining systematic evaluation methodology with practical RAG implementation.
arthurcerveira
Methodology to pre-train and evaluate an LLM for the Portuguese language.
caspiankeyes
AART provides security researchers, AI labs, and red teams with a structured framework for conducting thorough adversarial evaluations of LLM systems. The framework implements a multi-dimensional assessment methodology that systematically probes model boundaries, quantifies security vulnerabilities, and benchmarks defensive robustness in frontier AI.
Palmerschallon
Polyglot ontological activations for LLM systems. 68 terms from 20+ traditions mapped to computational patterns, plus 10 algorithms native to the ontology with no equivalents in standard CS. Includes a benchmark suite and a documented evaluation-methodology confound finding.
maczg
Dataset containing 269 privacy policy evaluations by multiple LLMs (o1, o3, o4-mini, qwen3) using the PrivacySpy scoring methodology. Includes detailed rubric assessments, policy citations, and model performance metrics in JSON, CSV, and Parquet formats.
hitz-zentroa
Code-switching Generation using LLMs: Methodology and Evaluation
jeffersonmonkam
Audio-based evaluation of LLM function calling behavior with structured QA methodology
tetyana-s
A Promptfoo evaluation suite implementing a systematic evaluation methodology to ensure LLM reliability, safety, and performance across various scenarios.
studiawan
A standardized methodology and dataset for evaluating LLM-based digital forensic timeline analysis
mukulchhabra23
Case-aware evaluation framework for enterprise RAG systems using LLM-as-a-judge methodology. Includes reproducible pipeline, metrics, and benchmarking tools.
rasiulyte
Example of an LLM hallucination detection system with A/B testing, drift monitoring, and an interactive dashboard, demonstrating AI evaluation methodology and statistical testing.
closestfriend
Research repository for Brie: LLM-assisted data authoring methodology achieving 91% win rates. Training infrastructure, evaluation framework, and paper documenting small-data domain adaptation.
novaiteraresearch
Preprint on provenance-aware architectures for epistemic reliability in LLMs, by Ariel J. Furlow (Nova Itera Research Group LLC). Timestamped provenance anchor for conceptual frameworks and evaluation methodologies.
anacsmelo
Final Course Project in Biomedical Engineering, whose objective is to evaluate the methodological quality of evidence syntheses generated by LLMs, using the PICOS framework.
Armand394
Question answering on semi-structured tables for the Data Science Methodology project. Builds a pipeline to (i) process and clean semi-structured tables, (ii) use SOTA LLMs to convert questions to SQL queries, and (iii) define an extensive evaluation protocol to evaluate and test the method.
caspiankeyes
A comprehensive model evaluation infrastructure that extends existing adversarial testing paradigms by establishing a unified, recursive methodology for LLM security assessment. Unlike previous approaches that treat security as an add-on consideration, FRAME applies a pluralist lens to multidomain security.
nibzard
LLM evaluation pipeline implementing Eugene Yan's Product Evals methodology
yiqunchen
LLM Evaluation of Causal Claims and Methodological Assumptions
HamiltonMussi
LLM response correctness evaluation system using LlamaIndex methodology.
FAIR-IALAB-UBA
Experiments and evaluations on LLM + Regex methodology
aprilatkinson
LLM evaluation lab using LLM-as-Judge methodology, custom benchmarks, and a Python implementation (see the judge-scoring sketch after this list).
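
Several of the listed repositories (e.g., mukulchhabra23, aprilatkinson) rely on an LLM-as-Judge methodology. As a rough illustration only, the sketch below shows the general shape of such a scoring loop; the 1-5 rubric, the regex-based score extraction, and the `judge` callable interface are assumptions for illustration and are not taken from any repository above.

```python
# Minimal sketch of an LLM-as-a-judge scoring loop.
# Assumption: the judge is any callable mapping a prompt string to a response string.
import re
from typing import Callable

RUBRIC = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's correctness from 1 (wrong) to 5 (fully correct).\n"
    "Reply with only the number."
)

def judge_answer(question: str, answer: str, judge: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    reply = judge(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group())

if __name__ == "__main__":
    # Stand-in judge for demonstration; in practice this would call a real model.
    fake_judge = lambda prompt: "4"
    print(judge_answer("What is 2 + 2?", "4", fake_judge))
```

In practice the `judge` callable would wrap a call to whichever model provider a given repository uses; the parsing and rubric details vary by project.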