Found 40 repositories (showing 30)
closedloop-ai
Open-source Claude Code plugins for multi-agent software delivery. Plan-first SDLC workflow, code review, LLM quality judges, and self-learning — grounded in your codebase. Bootstrap, Plan, & Ship.
OPPO-Mente-Lab
Official repository for the paper “When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning.”
Contextualist
Self-hosted LLM chatbot arena, with yourself as the only judge
ngathan
LLM exercises to help people learn core concepts in the LLM space, such as prompt engineering, retrieval, agentic designs (e.g., self-reflection, best-of-n), and evaluation (e.g., LLM-as-judge)
skyhuang233
Offline LLM self-improvement pipeline — mine training candidates from conversations, judge quality, and export SFT / preference / pretraining datasets ready for fine-tuning.
stormhierta
Self-evolution pipeline for OpenClaw skills using genetic algorithms and LLM-as-judge fitness evaluation
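A genetic-algorithm loop with an LLM-as-judge fitness function, as described above, might look like the following minimal sketch. The judge is replaced by a hypothetical scoring stub (`llm_judge_score`); a real pipeline would prompt a judge model there.

```python
import random

def llm_judge_score(skill: str) -> float:
    """Hypothetical stand-in for an LLM-as-judge call; returns fitness in [0, 1)."""
    return sum(ord(c) for c in skill) % 100 / 100.0

def mutate(skill: str) -> str:
    """Randomly tweak one character of the skill definition."""
    i = random.randrange(len(skill))
    return skill[:i] + random.choice("abcdefghij ") + skill[i + 1:]

def evolve(population: list[str], generations: int = 5) -> str:
    """Score candidates with the judge, keep the top half, refill by mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=llm_judge_score, reverse=True)
        survivors = ranked[: len(ranked) // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(len(ranked) - len(survivors))]
    return max(population, key=llm_judge_score)

best = evolve(["summarize inputs", "extract entities", "rank documents"])
```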
Context-Management
Self-hosted LLM evaluation studio with blind A/B/C judging and stakeholder-ready exports
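Blind A/B/C judging, as in the entry above, generally anonymizes system outputs before the judge sees them. A minimal sketch (function name and data are hypothetical):

```python
import random

def blind_labels(outputs: dict[str, str], seed: int = 0):
    """Assign anonymous A/B/C labels to system outputs so the judge
    cannot tell which system produced which answer."""
    rng = random.Random(seed)
    systems = list(outputs)
    rng.shuffle(systems)                       # hide the original ordering
    labels = [chr(ord("A") + i) for i in range(len(systems))]
    blinded = {lab: outputs[sys] for lab, sys in zip(labels, systems)}
    key = dict(zip(labels, systems))           # kept aside to de-blind later
    return blinded, key

answers = {"gpt": "ans1", "llama": "ans2", "mistral": "ans3"}
blinded, key = blind_labels(answers)
```

After the judge ranks the anonymous labels, `key` maps each label back to its system for the stakeholder report.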
Komuccap1
Evaluator-Optimizer LLM orchestrator with tiered models, Judge quality control, and self-correction. Built with LangGraph.
Abhinaba925
An advanced RAG system that uses an LLM-as-a-Judge to evaluate and self-correct its answers, built with LangChain, LangGraph, and Google Gemini
nidhijain16
Automated hallucination detection pipeline for RAG systems. Uses Llama 3 as a "Judge LLM" to perform pairwise evaluations against synthetic ground truth data. Implements a self-reflective scoring mechanism to ensure factual accuracy and system reliability. Built with Python and NVIDIA API.
An LLM-based evaluation system that benchmarks graph vs. tabular data structures for healthcare unit transition analysis and clinical decision support.
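The pairwise Judge-LLM pattern with self-reflective scoring described in the hallucination-detection entry above can be sketched as follows; `call_judge` is a hypothetical stub standing in for a real Llama 3 API call:

```python
import json

def call_judge(prompt: str) -> str:
    # Hypothetical stub for a Llama 3 judge call; a real pipeline would send
    # this prompt to an inference endpoint. The stub marks the answer faithful
    # when the candidate line shares a long word with the ground-truth line.
    truth_line, cand_line = prompt.splitlines()[:2]
    faithful = any(w in cand_line for w in truth_line.split() if len(w) > 4)
    return json.dumps({"faithful": faithful, "reason": "stub verdict"})

def judge_pairwise(truth: str, answer: str) -> bool:
    prompt = (f"Ground truth: {truth}\nCandidate: {answer}\n"
              'Reply as JSON: {"faithful": true|false, "reason": "..."}')
    verdict = json.loads(call_judge(prompt))
    return bool(verdict["faithful"])

def self_reflective_score(truth: str, answer: str, passes: int = 2) -> bool:
    # Self-reflective check: only trust a verdict that is stable across passes.
    votes = [judge_pairwise(truth, answer) for _ in range(passes)]
    return all(votes)

ok = self_reflective_score("The capital of France is Paris.",
                           "Paris is the capital.")
```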
basusum
Investigating self-bias in LLM judges (CSE 515 project)
ich-mayday
Self-Learning AI Calendar Assistant prototype with LangGraph and LLM-as-a-Judge
mrseanryan
Evaluate translations with either a self-hosted embedder or ChatGPT as LLM-as-judge.
jamjahal
A portable two-layer LLM evaluation pipeline for AI agent outputs — heuristic guard + LLM-as-Judge with Self-Refine retry loop
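The two-layer design above, a cheap heuristic guard in front of an LLM judge with a Self-Refine retry, might look like this in outline. All three helpers are hypothetical stubs; a real pipeline would call models where noted:

```python
def heuristic_guard(output: str) -> bool:
    """Cheap first-layer checks before spending a judge call."""
    return bool(output.strip()) and len(output) < 2000

def llm_judge(output: str) -> tuple[bool, str]:
    """Hypothetical stub for the judge model: pass/fail plus a critique."""
    if "TODO" in output:
        return False, "remove the TODO placeholder"
    return True, ""

def refine(output: str, critique: str) -> str:
    """Hypothetical stub for a Self-Refine call that rewrites per the critique."""
    return output.replace("TODO", "done")

def evaluate(output: str, max_retries: int = 2):
    if not heuristic_guard(output):
        return None                          # rejected by the cheap layer
    for _ in range(max_retries + 1):
        ok, critique = llm_judge(output)
        if ok:
            return output
        output = refine(output, critique)    # Self-Refine retry loop
    return None

result = evaluate("Summary: TODO finish later")
```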
Ishanuj99
Automated evaluation pipeline for AI agents — LLM-as-Judge, tool call evaluation, coherence checks, self-updating suggestions
cgy11102
Multi-agent AI workflows with autonomous tool use, self-reflection loops, and LLM-as-a-judge evaluation
DIVYANI-DROID
Demonstration of multi-agent orchestration with LLMs — includes Critic, Reflection (self-refine), and Judge patterns for ranking responses.
turancannb02
A self-hosted toolbox for benchmarking and evaluating open-source LLMs across multiple inference backends with real-time metrics and LLM-as-judge scoring.
joaquinhuigomez
Detect position bias, verbosity bias, and self-preference in LLM judges. Position-swap evaluation + Cohen's Kappa + calibration report.
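The position-swap check with Cohen's Kappa mentioned above can be sketched in pure Python. The verdict lists are made-up illustration data: the same judge sees each answer pair in order A,B and again swapped, with the swapped verdicts re-mapped so "A" always denotes the same answer.

```python
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Agreement between two verdict lists, corrected for chance agreement."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # chance
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

original = ["A", "A", "B", "A", "B", "A"]   # verdicts with order A,B
swapped  = ["A", "B", "B", "A", "B", "B"]   # verdicts with order B,A (re-mapped)

kappa = cohens_kappa(original, swapped)
flip_rate = sum(a != b for a, b in zip(original, swapped)) / len(original)
```

A low kappa or a high flip rate between the two orderings indicates position bias: the verdict depends on where the answer appeared, not on its content.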
aprotiim
Self-Reflective RAG system where the LLM actively judges its own retrieval, evidence, and answers instead of blindly trusting retrieved documents
kdeng03
A general framework for self-rewarding LLMs: using DPO, the model judges its own generated answers to produce preference data, yielding a stronger LLM over successive training iterations.
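A worked sketch of the DPO preference loss at the core of such a self-rewarding loop. The log-probabilities below are made-up numbers; a real implementation would compute them from the policy and reference models and batch the loss in a tensor framework:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Self-rewarding step (sketch): the model's own judge score decides which
# sampled answer becomes "chosen" and which "rejected" for the next DPO round.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
```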
Aeryes
A privacy-first, offline RAG agent built with LangGraph and Docker. Features self-correction, hybrid search, and local LLM-as-a-judge testing.
broomva
Metacognitive evaluation — real-time quality scoring with inline heuristics and LLM-as-judge. EGRI loop for self-improvement. Part of the Life Agent OS.
alexfacehead
Self-hosted LLM evaluation platform. Multi-model benchmarking with real-time monitoring, pluggable scoring (exact match, code execution, LLM-as-judge), and interactive replay engine. Built with Python/FastAPI + React/TypeScript.
sujithpvarghese
An autonomous Agent-to-Agent (A2A) orchestration hub built with Strands SDK, Amazon Bedrock, and TypeScript. Features self-healing triage, episodic memory, and an LLM-as-Judge evaluation pipeline.
Harisankar005
Aegis composes specialist agents to carry out open missions, detects capability gaps, synthesizes new agents/tools on the fly, and self-improves via LLM judges, memory consolidation, and AgentOps.
R1pples
🧬 Self-Evolving AI Prompt Manager for VS Code — Save, organize, auto-optimize, and reuse your best prompts. Powered by open-source prompt libraries + LLM Judge evaluation. 202 tests passing.
bfalkowski
A self-contained A2A (Agent-to-Agent) LLM-as-a-Judge agent for connectivity demos. This agent can be deployed to Heroku and provides JSON-RPC endpoints for agent communication.