Search Results

Found 75 repositories(showing 30)

RepoToTextForLLMs

Doriandarko

🧡67

Automate the analysis of GitHub repositories for LLMs with RepoToTextForLLMs. Fetch READMEs, structure, and non-binary files efficiently. Outputs include analysis prompts to aid in comprehensive repo evaluation

787

102

Python

Updated 5 hours ago

ml-sobench

apple

🧡50

SO-Bench release for evaluating visual structured output capabilities of multimodal LLMs.

NOASSERTION

Python

Updated 3 days ago

eval-lens

SimonRendonA

🧡65

Evaluate structured LLM outputs with precision. Compare model outputs against expected schemas and values — row by row.

MIT

TypeScript

Updated 1 day ago

aideveloper-toolsevaluation+5

llm-eval-notes

maxpetrusenko

💛70

Public LLM evaluation artifacts: hallucination, brittleness, structured output, and tool-use tests

MIT

Python

Updated 1 day ago

ai-engineeringai-safetybenchmarks+10

Language-Model-Quality-Auditor

mohdibrahimaiml

❤️40

A comprehensive human-in-the-loop evaluation platform for Large Language Models, built for AI alignment and safety research. This Flask-based application enables human evaluators to provide structured feedback on LLM outputs across multiple quality dimensions.

MIT

HTML

Updated 8 months ago

autonomous-scholarship-sdr-agent

Sama-ndari

❤️45

An agentic AI system that automatically drafts, evaluates, formats, and sends professional Master’s scholarship application emails to universities using multiple LLMs, structured outputs, input guardrails, and SendGrid email delivery -- all orchestrated from a single Jupyter Notebook.

Jupyter Notebook

Updated 1 month ago

agentic-aiai-agentsautonomous-agents+8

llmpass

ikanam-ai

❤️20

LLM-Profiling is a service for analyzing semantic relationships between words using graph-based methods. It helps validate the quality of large language models (LLMs) by evaluating their ability to capture and reproduce meaningful structures. The tool enables comparison, visualization, and metric-based assessment of LLM outputs.

MIT

Python

Updated 5 months ago

llm-redteam-microfuzzer

Thibbeer

🧡55

A micro-fuzzer for red teaming LLMs. Tests structured-output bypasses, prompt injection, and canary leakage. 10/16 attacks leaked a system-secret during evaluation.

Python

Updated 2 weeks ago

llmtester

DennisGross

❤️35

A lightweight toolkit for generating, storing, and analyzing LLM outputs using Ollama. Supports structured response saving, thinking extraction, and customizable test and summary functions for evaluation.

Python

Updated 10 months ago

llm-eval-playground

epaunova

❤️35

This project demonstrates how to evaluate and compare outputs from different Large Language Models (LLMs) using structured scoring methods — including factuality, clarity, and verbosity.

Jupyter Notebook

Updated 10 months ago

universal_llm_batch_generation_framework

nachai-l

🧡55

Universal, schema-driven LLM batch generation and validation framework for structured JSON outputs from tabular data. Supports CSV/TSV/XLSX inputs, Pydantic schema enforcement, optional judge passes, grouping, retries, deterministic caching, and reusable outputs across tasks like extraction, alignment, and evaluation.

Python

Updated 3 weeks ago

Prompt_Engineering_Portfolio

Hermeticpoet

❤️40

A curated portfolio of practical prompt engineering experiments focused on LLM evaluation, security testing, and advanced prompting techniques. Includes examples of jailbreak resistance, few‑shot prompting, chain‑of‑thought reasoning, and structured output refinement, documented with goals, prompts, outputs, and analysis

MIT

Updated 3 months ago

llm-financial-regulatory-auditor

giuliano-t

❤️40

A structured evaluation pipeline for LLM-generated outputs in financial supervision contexts. Combines PRA-aligned prompts, thread-type detection, and metric-level meta-review to assess relevance, justification, and actionability across 50+ regulatory and conversational metrics.

MIT

Jupyter Notebook

Updated 7 months ago

financial-nlpllmnlp-evaluation+2

PROMPTING_APPLICATION

duarajper4

❤️35

A simple Hugging Face Space that demonstrates prompt engineering basics using a user-friendly interface. Built to test and explore prompt design strategies with a generative AI model.My projects focus on:✅Foundational LLMs, Prompt Engineering, NLP, Text Generation, Structured Output Evaluation, and Chain-of-Thought Reasoning

Python

Updated 11 months ago

LLM-Prompt-Evaluator

geegorbee

❤️35

A lightweight Python tool for testing and evaluating LLM responses. Includes prompt engineering samples, manual scoring criteria, and output auditing structure for tone, accuracy, and ethical safety. Built for aspiring AI content reviewers and prompt engineers.

Python

Updated 8 months ago

datasetbuilder

hazyy00

🧡50

A pipeline for generating LLM training datasets by combining semantic search with GPT-3.5/GPT-4. Given a set of questions and source documents, it retrieves relevant paragraphs, generates answers, evaluates answer quality, and optionally produces multi-turn conversations — outputting structured JSON datasets ready for fine-tuning.

Python

Updated 3 weeks ago

LLM-as-Judge

syed-waleed-ahmed

❤️35

A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.

Python

Updated 4 months ago

a-b-testingai-automationai-evaluation+12

llm-structured-output-evaluation

Genie-Experiments

❤️35

A comprehensive benchmark suite for evaluating language models' ability to generate and validate structured outputs (e.g., JSON) using Pydantic, instructor, and outlines library.

Python

Updated 9 months ago

structured_output_evaluation_of_LLMs

mhardik003

❤️35

Evaluation of structured outputs, focusing primarily on JSON, and proposes a set of techniques and metrics that better capture the fidelity, accuracy, and usefulness of generated structured data.

Python

Updated 6 months ago

LLM-Structured-Output-Reliability-Evaluation

gnandeep-2002

❤️45

Comparative evaluation of LLM APIs (Gemini vs DeepSeek) in a deterministic 3D maze environment. Measures structured JSON compliance, navigation accuracy, and execution reliability under zero-retry constraints. Demonstrates LLM formatting discipline, spatial reasoning, and real-world action consistency.

JavaScript

Updated 2 months ago

LLM-Eval

dvmukul

🧡55

A structured evaluation framework for LLM outputs.

Python

Updated 1 week ago

llm-evaluation-suite

unzer-0x

🧡55

Practical LLM evaluation toolkit for prompt testing, scoring, and structured AI output reviews.

Python

Updated 2 weeks ago

llm-evaluation-engine

solaicoffee

❤️45

Production-style Python engine for structured evaluation, scoring, and benchmarking of LLM outputs.

Python

Updated 1 month ago

llm-readiness-platform

themattnash

❤️35

Evaluation framework for LLM-powered features: structured testing, output scoring, and drift detection.

MIT

Python

Updated 1 month ago

genai-essay-evaluation-system

VISHWAS-dto

🧡65

LLM-powered essay evaluation system using LangGraph with structured outputs and multi-node analysis

Jupyter Notebook

Updated 13 hours ago

user-feedback-rag-score

SaipriyaBudde

❤️35

Lightweight API for evaluating RAG-based LLM outputs with structured scoring & feedback

Python

Updated 7 months ago

workflow-arize-ai-phoenix-llm-evaluation-pipeline

leeroopedia

❤️45

Structured workflow for evaluating LLM outputs using the Arize Phoenix Evals framework

Python

Updated 1 month ago

StructTune-AI

BhaveshMakhija

🧡55

Fine-tuning LLMs with LoRA & DPO for structured JSON output, evaluation metrics, and React-based visualization.

Python

Updated 1 week ago

ad-generator-agents

BirsenYY

❤️45

A LangGraph-based agentic workflow demonstrating structured LLM outputs, controlled state transitions, and iterative content evaluation.

Python

Updated 2 months ago

clinical-prompt-engineering-examples

nateroehrig

❤️35

Applied clinical prompt engineering for developmental and school psychology workflows, including structured evaluation of LLM outputs.

Updated 3 months ago

GitHub Explorer

Search Results

RepoToTextForLLMs

ml-sobench

eval-lens

llm-eval-notes

Language-Model-Quality-Auditor

autonomous-scholarship-sdr-agent

llmpass

llm-redteam-microfuzzer

llmtester

llm-eval-playground

universal_llm_batch_generation_framework

Prompt_Engineering_Portfolio

llm-financial-regulatory-auditor

PROMPTING_APPLICATION

LLM-Prompt-Evaluator

datasetbuilder

LLM-as-Judge

llm-structured-output-evaluation

structured_output_evaluation_of_LLMs

LLM-Structured-Output-Reliability-Evaluation

LLM-Eval

llm-evaluation-suite

llm-evaluation-engine

llm-readiness-platform

genai-essay-evaluation-system

user-feedback-rag-score

workflow-arize-ai-phoenix-llm-evaluation-pipeline

StructTune-AI

ad-generator-agents

clinical-prompt-engineering-examples

RepoToTextForLLMs

ml-sobench

eval-lens

llm-eval-notes

Language-Model-Quality-Auditor

autonomous-scholarship-sdr-agent

llmpass

llm-redteam-microfuzzer

llmtester

llm-eval-playground

universal_llm_batch_generation_framework

Prompt_Engineering_Portfolio

llm-financial-regulatory-auditor

PROMPTING_APPLICATION

LLM-Prompt-Evaluator

datasetbuilder

LLM-as-Judge

llm-structured-output-evaluation

structured_output_evaluation_of_LLMs

LLM-Structured-Output-Reliability-Evaluation

LLM-Eval

llm-evaluation-suite

llm-evaluation-engine

llm-readiness-platform

genai-essay-evaluation-system

user-feedback-rag-score

workflow-arize-ai-phoenix-llm-evaluation-pipeline

StructTune-AI

ad-generator-agents

clinical-prompt-engineering-examples