Found 21 repositories (showing 21)
stephenleo
Benchmark various LLM structured output frameworks (Instructor, Mirascope, LangChain, LlamaIndex, Fructose, Marvin, Outlines, etc.) on tasks like multi-label classification, named entity recognition, synthetic data generation, etc.
thedataquarry
Structured output benchmarks comparing DSPy and BAML with different LLMs
PrethigahShanmugarajah
A Python-based tool that uses LLMs like LLaMA2 or GPT to generate synthetic test data from JSON schemas. Automates test data creation for QA, benchmarking, and ML training with validated, structured outputs.
bendechrai
Benchmark tool for testing LLM structured JSON response adherence across providers (OpenAI, Anthropic, Google, Groq, OpenRouter). Tests one-shot vs sequential prompting and strict vs non-strict modes with retry handling.
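The strict-mode adherence testing with retry handling described here can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the repo's actual code; the `call_model` stub and the `label`/`confidence` schema are assumptions for the example:

```python
import json

# Assumed example schema: required fields and their expected types.
REQUIRED = {"label": str, "confidence": float}

def validate_strict(raw: str) -> dict:
    """Strict mode: parse JSON and require every expected field with the right type."""
    obj = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(obj.get(field), typ):
            raise ValueError(f"field {field!r} missing or wrong type")
    return obj

def ask_with_retry(call_model, prompt: str, retries: int = 3) -> dict:
    """Re-prompt the model until its reply validates, up to `retries` attempts."""
    for _ in range(retries):
        raw = call_model(prompt)  # call_model is a stand-in for any provider call
        try:
            return validate_strict(raw)
        except (json.JSONDecodeError, ValueError) as err:
            # Feed the validation error back so the model can self-correct.
            prompt += f"\nPrevious reply was invalid ({err}); return only valid JSON."
    raise RuntimeError("model never produced valid JSON")
```

A non-strict mode would typically relax `validate_strict` to accept extra or loosely-typed fields rather than rejecting the reply outright.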
codeboratory
LLM Structured Output Benchmark: JSON extraction from text via OpenRouter with Instructor + Zod
solaicoffee
Production-style Python engine for structured evaluation, scoring, and benchmarking of LLM outputs.
Bae-ChangHyun
Benchmark tool for comparing LLM structured output frameworks (Instructor, LangChain, Marvin, PydanticAI, Mirascope, Guardrails)
samidala
A production-ready system to benchmark local LLM inference performance with structured JSON output.
divyathakran
A framework for benchmarking and evaluating locally running LLMs across latency, throughput, and structured output reliability.
Vaibhavi-Sita
AI benchmarking platform to evaluate and compare multiple LLM outputs using structured pipelines, automated scoring, and human rating workflows.
sharathStack
Benchmarks 5 prompt strategies (zero-shot, CoT, few-shot, role-based, structured output) against a weighted rubric. Produces JSONL annotations for LLM training. Python · NLP
abraromar002
Benchmarking and comparing 3 local LLMs (Llama 3.2, Phi-4 Mini, Mistral 7B) using Ollama — inference speed, structured output validation, temperature variance analysis · FastAPI · Pydantic
PRAISELab-PicusLab
A benchmark for evaluating the structural correctness and environmental efficiency of structured output formats in large language models (LLMs), focusing on token usage, generation latency, and carbon emissions.
TJ-Neary
Comprehensive LLM evaluation framework comparing local and cloud models with hardware-aware benchmarking. Evaluate across code generation, document analysis, and structured output using pass@k, LLM-as-Judge, and RAG metrics. Supports Ollama, Google Gemini, Anthropic, and OpenAI.
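The pass@k metric mentioned above is commonly computed with the unbiased estimator popularized by code-generation benchmarks; a minimal sketch (not this repo's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that passed
    k: hypothetical draw size
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k; success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 2 pass, pass@1 is 0.5.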
lobuem
Shannon is an open-source AI platform for large language model interaction and evaluation. It provides tools for querying, benchmarking, and comparing LLM outputs, enabling developers and researchers to test models, analyze performance, and build applications with structured LLM workflows.
srdesai1
A production-ready LLM-as-a-Judge evaluation framework. Automates AI benchmarking using Gemini 2.5 Flash Lite, structured Pydantic outputs, and resilient rate-limit handling. Built for high-velocity, zero-cost AI quality assurance.
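"Resilient rate-limit handling" in frameworks like this usually means jittered exponential backoff around the provider call. A minimal sketch under that assumption (the `RuntimeError` here stands in for a provider's rate-limit exception, and the injectable `sleep` exists only to make the sketch testable):

```python
import random
import time

def with_backoff(fn, max_tries: int = 5, base: float = 1.0, sleep=time.sleep):
    """Retry fn on rate-limit-style errors, sleeping base, 2*base, 4*base, ...
    seconds (plus a little random jitter) between attempts."""
    for attempt in range(max_tries):
        try:
            return fn()
        except RuntimeError:  # stand-in for e.g. a 429 / rate-limit error class
            if attempt == max_tries - 1:
                raise  # out of retries; surface the error
            sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter spreads retries from concurrent workers apart so they do not all hit the rate limiter again at the same instant.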
thulasiramk-2310
TRX-AI is a CLI-based AI reasoning and code-review assistant that uses hybrid intent detection (rules + local LLM), multi-agent analysis (Debug/Improve/Predict), and structured outputs to quickly diagnose issues, suggest fixes, and benchmark response quality.
jafetsinke
A benchmarking project to compare the structured JSON generation capabilities of LLM libraries like dottxt/outlines and guidance-ai/guidance. The test scenario converts unstructured resume text into a well-defined JSON format, evaluating output quality, consistency, and adherence to schema validation.
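Schema-adherence checking for a task like the resume scenario above is typically a Pydantic model plus a validate step. A minimal sketch, assuming a hypothetical schema with `name`, `skills`, and `experience` fields (not the project's actual schema):

```python
from pydantic import BaseModel, ValidationError

class Experience(BaseModel):
    company: str
    title: str
    years: float

class Resume(BaseModel):
    name: str
    skills: list[str]
    experience: list[Experience]

def adheres(raw: str) -> bool:
    """True if the model's raw output is valid JSON matching the Resume schema."""
    try:
        Resume.model_validate_json(raw)  # Pydantic v2 parses and validates in one step
        return True
    except ValidationError:
        return False
```

Scoring adherence is then just the fraction of generations for which `adheres` returns True; libraries like Outlines or guidance instead constrain decoding so invalid JSON is never produced in the first place.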
ritwikraghav14
Code for the paper "Are LLMs Good for Semantic Role Labeling via Question Answering?: A Preliminary Analysis" (IJCNLP-AACL SRW). Benchmarks Llama, Mistral, Qwen, OpenChat, and Gemini on QA-SRL 2.0 using zero-shot and three-shot prompting, focusing on structured output precision.
An end-to-end local AI assistant running open-source LLMs via Ollama with a FastAPI interface. Benchmarks multiple models (Llama3, Mistral, Phi) using metrics like latency, tokens/sec, and time-to-first-token. Includes Pydantic-validated structured outputs, retry logic, and a model evaluation framework.
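The latency, tokens/sec, and time-to-first-token metrics named in several entries above reduce to simple arithmetic over streamed token arrival times. A minimal sketch of how such a benchmark might compute them (function name and return shape are illustrative assumptions):

```python
def stream_metrics(start: float, token_times: list[float]) -> dict:
    """Compute latency metrics from a request start time and the
    monotonically increasing arrival timestamps of streamed tokens."""
    ttft = token_times[0] - start        # time-to-first-token
    latency = token_times[-1] - start    # end-to-end latency
    window = token_times[-1] - token_times[0]
    # Throughput over the generation window; the first token is excluded
    # because it anchors the window rather than filling it.
    tps = (len(token_times) - 1) / window if window > 0 else float("inf")
    return {"ttft_s": ttft, "latency_s": latency, "tokens_per_sec": tps}
```

In practice the timestamps would come from reading a streaming response (e.g. Ollama's chunked output) and recording `time.monotonic()` per chunk.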