Found 75 repositories(showing 30)
Doriandarko
Automate the analysis of GitHub repositories for LLMs with RepoToTextForLLMs. Fetch READMEs, structure, and non-binary files efficiently. Outputs include analysis prompts to aid in comprehensive repo evaluation
apple
SO-Bench release for evaluating visual structured output capabilities of multimodal LLMs.
SimonRendonA
Evaluate structured LLM outputs with precision. Compare model outputs against expected schemas and values — row by row.
maxpetrusenko
Public LLM evaluation artifacts: hallucination, brittleness, structured output, and tool-use tests
mohdibrahimaiml
A comprehensive human-in-the-loop evaluation platform for Large Language Models, built for AI alignment and safety research. This Flask-based application enables human evaluators to provide structured feedback on LLM outputs across multiple quality dimensions.
Sama-ndari
An agentic AI system that automatically drafts, evaluates, formats, and sends professional Master’s scholarship application emails to universities using multiple LLMs, structured outputs, input guardrails, and SendGrid email delivery -- all orchestrated from a single Jupyter Notebook.
ikanam-ai
LLM-Profiling is a service for analyzing semantic relationships between words using graph-based methods. It helps validate the quality of large language models (LLMs) by evaluating their ability to capture and reproduce meaningful structures. The tool enables comparison, visualization, and metric-based assessment of LLM outputs.
Thibbeer
A micro-fuzzer for red teaming LLMs. Tests structured-output bypasses, prompt injection, and canary leakage. 10/16 attacks leaked a system-secret during evaluation.
DennisGross
A lightweight toolkit for generating, storing, and analyzing LLM outputs using Ollama. Supports structured response saving, thinking extraction, and customizable test and summary functions for evaluation.
epaunova
This project demonstrates how to evaluate and compare outputs from different Large Language Models (LLMs) using structured scoring methods — including factuality, clarity, and verbosity.
Universal, schema-driven LLM batch generation and validation framework for structured JSON outputs from tabular data. Supports CSV/TSV/XLSX inputs, Pydantic schema enforcement, optional judge passes, grouping, retries, deterministic caching, and reusable outputs across tasks like extraction, alignment, and evaluation.
Hermeticpoet
A curated portfolio of practical prompt engineering experiments focused on LLM evaluation, security testing, and advanced prompting techniques. Includes examples of jailbreak resistance, few‑shot prompting, chain‑of‑thought reasoning, and structured output refinement, documented with goals, prompts, outputs, and analysis
giuliano-t
A structured evaluation pipeline for LLM-generated outputs in financial supervision contexts. Combines PRA-aligned prompts, thread-type detection, and metric-level meta-review to assess relevance, justification, and actionability across 50+ regulatory and conversational metrics.
duarajper4
A simple Hugging Face Space that demonstrates prompt engineering basics using a user-friendly interface. Built to test and explore prompt design strategies with a generative AI model.My projects focus on:✅Foundational LLMs, Prompt Engineering, NLP, Text Generation, Structured Output Evaluation, and Chain-of-Thought Reasoning
geegorbee
A lightweight Python tool for testing and evaluating LLM responses. Includes prompt engineering samples, manual scoring criteria, and output auditing structure for tone, accuracy, and ethical safety. Built for aspiring AI content reviewers and prompt engineers.
hazyy00
A pipeline for generating LLM training datasets by combining semantic search with GPT-3.5/GPT-4. Given a set of questions and source documents, it retrieves relevant paragraphs, generates answers, evaluates answer quality, and optionally produces multi-turn conversations — outputting structured JSON datasets ready for fine-tuning.
syed-waleed-ahmed
A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.
Genie-Experiments
A comprehensive benchmark suite for evaluating language models' ability to generate and validate structured outputs (e.g., JSON) using Pydantic, instructor, and outlines library.
mhardik003
Evaluation of structured outputs, focusing primarily on JSON, and proposes a set of techniques and metrics that better capture the fidelity, accuracy, and usefulness of generated structured data.
gnandeep-2002
Comparative evaluation of LLM APIs (Gemini vs DeepSeek) in a deterministic 3D maze environment. Measures structured JSON compliance, navigation accuracy, and execution reliability under zero-retry constraints. Demonstrates LLM formatting discipline, spatial reasoning, and real-world action consistency.
dvmukul
A structured evaluation framework for LLM outputs.
unzer-0x
Practical LLM evaluation toolkit for prompt testing, scoring, and structured AI output reviews.
solaicoffee
Production-style Python engine for structured evaluation, scoring, and benchmarking of LLM outputs.
themattnash
Evaluation framework for LLM-powered features: structured testing, output scoring, and drift detection.
VISHWAS-dto
LLM-powered essay evaluation system using LangGraph with structured outputs and multi-node analysis
SaipriyaBudde
Lightweight API for evaluating RAG-based LLM outputs with structured scoring & feedback
Structured workflow for evaluating LLM outputs using the Arize Phoenix Evals framework
BhaveshMakhija
Fine-tuning LLMs with LoRA & DPO for structured JSON output, evaluation metrics, and React-based visualization.
BirsenYY
A LangGraph-based agentic workflow demonstrating structured LLM outputs, controlled state transitions, and iterative content evaluation.
nateroehrig
Applied clinical prompt engineering for developmental and school psychology workflows, including structured evaluation of LLM outputs.