Found 40 repositories (showing 30)
closedloop-ai
Open-source Claude Code plugins for multi-agent software delivery. Plan-first SDLC workflow, code review, LLM quality judges, and self-learning — grounded in your codebase. Bootstrap, Plan, & Ship.
OPPO-Mente-Lab
Official repository for the paper “When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning.”
Contextualist
Self-hosted LLM chatbot arena, with yourself as the only judge
ngathan
LLM exercises to help people learn core concepts in the LLM space, such as prompt engineering, retrieval, agentic designs (e.g., self-reflection, best-of-n), and evaluation (e.g., LLM-as-judge)
skyhuang233
Offline LLM self-improvement pipeline — mine training candidates from conversations, judge quality, and export SFT / preference / pretraining datasets ready for fine-tuning.
stormhierta
Self-evolution pipeline for OpenClaw skills using genetic algorithms and LLM-as-judge fitness evaluation
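A genetic-algorithm loop with an LLM-as-judge fitness function, as described above, might look like the following minimal sketch. The judge is replaced by a hypothetical scoring stub (`llm_judge_score`); a real pipeline would prompt a judge model there.

```python
import random

def llm_judge_score(skill: str) -> float:
    """Hypothetical stand-in for an LLM-as-judge call; returns fitness in [0, 1)."""
    return sum(ord(c) for c in skill) % 100 / 100.0

def mutate(skill: str) -> str:
    """Randomly tweak one character of the skill definition."""
    i = random.randrange(len(skill))
    return skill[:i] + random.choice("abcdefghij ") + skill[i + 1:]

def evolve(population: list[str], generations: int = 5) -> str:
    """Score candidates with the judge, keep the top half, refill by mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=llm_judge_score, reverse=True)
        survivors = ranked[: len(ranked) // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(len(ranked) - len(survivors))]
    return max(population, key=llm_judge_score)

best = evolve(["summarize inputs", "extract entities", "rank documents"])
```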
Context-Management
Self-hosted LLM evaluation studio with blind A/B/C judging and stakeholder-ready exports
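Blind A/B/C judging, as in the entry above, generally anonymizes system outputs before the judge sees them. A minimal sketch (function name and data are hypothetical):

```python
import random

def blind_labels(outputs: dict[str, str], seed: int = 0):
    """Assign anonymous A/B/C labels to system outputs so the judge
    cannot tell which system produced which answer."""
    rng = random.Random(seed)
    systems = list(outputs)
    rng.shuffle(systems)                       # hide the original ordering
    labels = [chr(ord("A") + i) for i in range(len(systems))]
    blinded = {lab: outputs[sys] for lab, sys in zip(labels, systems)}
    key = dict(zip(labels, systems))           # kept aside to de-blind later
    return blinded, key

answers = {"gpt": "ans1", "llama": "ans2", "mistral": "ans3"}
blinded, key = blind_labels(answers)
```

After the judge ranks the anonymous labels, `key` maps each label back to its system for the stakeholder report.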
Komuccap1
Evaluator-Optimizer LLM orchestrator with tiered models, Judge quality control, and self-correction. Built with LangGraph.
Abhinaba925
An advanced RAG system that uses an LLM-as-a-Judge to evaluate and self-correct its answers, built with LangChain, LangGraph, and Google Gemini
nidhijain16
Automated hallucination detection pipeline for RAG systems. Uses Llama 3 as a "Judge LLM" to perform pairwise evaluations against synthetic ground truth data. Implements a self-reflective scoring mechanism to ensure factual accuracy and system reliability. Built with Python and NVIDIA API.
An LLM-based evaluation system that benchmarks graph vs. tabular data structures for healthcare unit transition analysis and clinical decision support.
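The pairwise Judge-LLM pattern with self-reflective scoring described in the hallucination-detection entry above can be sketched as follows; `call_judge` is a hypothetical stub standing in for a real Llama 3 API call:

```python
import json

def call_judge(prompt: str) -> str:
    # Hypothetical stub for a Llama 3 judge call; a real pipeline would send
    # this prompt to an inference endpoint. The stub marks the answer faithful
    # when the candidate line shares a long word with the ground-truth line.
    truth_line, cand_line = prompt.splitlines()[:2]
    faithful = any(w in cand_line for w in truth_line.split() if len(w) > 4)
    return json.dumps({"faithful": faithful, "reason": "stub verdict"})

def judge_pairwise(truth: str, answer: str) -> bool:
    prompt = (f"Ground truth: {truth}\nCandidate: {answer}\n"
              'Reply as JSON: {"faithful": true|false, "reason": "..."}')
    verdict = json.loads(call_judge(prompt))
    return bool(verdict["faithful"])

def self_reflective_score(truth: str, answer: str, passes: int = 2) -> bool:
    # Self-reflective check: only trust a verdict that is stable across passes.
    votes = [judge_pairwise(truth, answer) for _ in range(passes)]
    return all(votes)

ok = self_reflective_score("The capital of France is Paris.",
                           "Paris is the capital.")
```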
basusum
Investigating self-bias in LLM judges (CSE 515 project)
ich-mayday
Self-Learning AI Calendar Assistant prototype with LangGraph and LLM-as-a-Judge
mrseanryan
Evaluate translations with either a self-hosted embedder or ChatGPT as LLM-as-judge.
jamjahal
A portable two-layer LLM evaluation pipeline for AI agent outputs — heuristic guard + LLM-as-Judge with Self-Refine retry loop
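The two-layer design above, a cheap heuristic guard in front of an LLM judge with a Self-Refine retry, might look like this in outline. All three helpers are hypothetical stubs; a real pipeline would call models where noted:

```python
def heuristic_guard(output: str) -> bool:
    """Cheap first-layer checks before spending a judge call."""
    return bool(output.strip()) and len(output) < 2000

def llm_judge(output: str) -> tuple[bool, str]:
    """Hypothetical stub for the judge model: pass/fail plus a critique."""
    if "TODO" in output:
        return False, "remove the TODO placeholder"
    return True, ""

def refine(output: str, critique: str) -> str:
    """Hypothetical stub for a Self-Refine call that rewrites per the critique."""
    return output.replace("TODO", "done")

def evaluate(output: str, max_retries: int = 2):
    if not heuristic_guard(output):
        return None                          # rejected by the cheap layer
    for _ in range(max_retries + 1):
        ok, critique = llm_judge(output)
        if ok:
            return output
        output = refine(output, critique)    # Self-Refine retry loop
    return None

result = evaluate("Summary: TODO finish later")
```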
Ishanuj99
Automated evaluation pipeline for AI agents — LLM-as-Judge, tool call evaluation, coherence checks, self-updating suggestions
cgy11102
Multi-agent AI workflows with autonomous tool use, self-reflection loops, and LLM-as-a-judge evaluation
DIVYANI-DROID
Demonstration of multi-agent orchestration with LLMs — includes Critic, Reflection (self-refine), and Judge patterns for ranking responses.
turancannb02
A self-hosted toolbox for benchmarking and evaluating open-source LLMs across multiple inference backends with real-time metrics and LLM-as-judge scoring.
joaquinhuigomez
Detect position bias, verbosity bias, and self-preference in LLM judges. Position-swap evaluation + Cohen's Kappa + calibration report.
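The position-swap check with Cohen's Kappa mentioned above can be sketched in pure Python. The verdict lists are made-up illustration data: the same judge sees each answer pair in order A,B and again swapped, with the swapped verdicts re-mapped so "A" always denotes the same answer.

```python
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Agreement between two verdict lists, corrected for chance agreement."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # chance
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

original = ["A", "A", "B", "A", "B", "A"]   # verdicts with order A,B
swapped  = ["A", "B", "B", "A", "B", "B"]   # verdicts with order B,A (re-mapped)

kappa = cohens_kappa(original, swapped)
flip_rate = sum(a != b for a, b in zip(original, swapped)) / len(original)
```

A low kappa or a high flip rate between the two orderings indicates position bias: the verdict depends on where the answer appeared, not on its content.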
aprotiim
Self-Reflective RAG system where the LLM actively judges its own retrieval, evidence, and answers instead of blindly trusting retrieved documents
kdeng03
A general framework for self-rewarding LLMs: using DPO, the model judges its own generated answers to produce preference data, yielding a stronger LLM over successive training iterations.
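A worked sketch of the DPO preference loss at the core of such a self-rewarding loop. The log-probabilities below are made-up numbers; a real implementation would compute them from the policy and reference models and batch the loss in a tensor framework:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Self-rewarding step (sketch): the model's own judge score decides which
# sampled answer becomes "chosen" and which "rejected" for the next DPO round.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
```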
Aeryes
A privacy-first, offline RAG agent built with LangGraph and Docker. Features self-correction, hybrid search, and local LLM-as-a-judge testing.
broomva
Metacognitive evaluation — real-time quality scoring with inline heuristics and LLM-as-judge. EGRI loop for self-improvement. Part of the Life Agent OS.
alexfacehead
Self-hosted LLM evaluation platform. Multi-model benchmarking with real-time monitoring, pluggable scoring (exact match, code execution, LLM-as-judge), and interactive replay engine. Built with Python/FastAPI + React/TypeScript.
sujithpvarghese
An autonomous Agent-to-Agent (A2A) orchestration hub built with Strands SDK, Amazon Bedrock, and TypeScript. Features self-healing triage, episodic memory, and an LLM-as-Judge evaluation pipeline.
Harisankar005
Aegis composes specialist agents to carry out open missions, detects capability gaps, synthesizes new agents/tools on the fly, and self-improves via LLM judges, memory consolidation, and AgentOps.
R1pples
🧬 Self-Evolving AI Prompt Manager for VS Code — Save, organize, auto-optimize, and reuse your best prompts. Powered by open-source prompt libraries + LLM Judge evaluation. 202 tests passing.
bfalkowski
A self-contained A2A (Agent-to-Agent) LLM-as-a-Judge agent for connectivity demos. This agent can be deployed to Heroku and provides JSON-RPC endpoints for agent communication.