Found 21 repositories (showing 21)
stephenleo
Benchmark various LLM structured output frameworks (Instructor, Mirascope, LangChain, LlamaIndex, Fructose, Marvin, Outlines, etc.) on tasks like multi-label classification, named entity recognition, synthetic data generation, etc.
thedataquarry
Structured output benchmarks comparing DSPy and BAML with different LLMs
PrethigahShanmugarajah
A Python-based tool that uses LLMs like LLaMA2 or GPT to generate synthetic test data from JSON schemas. Automates test data creation for QA, benchmarking, and ML training with validated, structured outputs.
bendechrai
Benchmark tool for testing LLM structured JSON response adherence across providers (OpenAI, Anthropic, Google, Groq, OpenRouter). Tests one-shot vs sequential prompting and strict vs non-strict modes with retry handling.
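The strict-mode adherence testing with retry handling described here can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the repo's actual code; the `call_model` stub and the `label`/`confidence` schema are assumptions for the example:

```python
import json

# Assumed example schema: required fields and their expected types.
REQUIRED = {"label": str, "confidence": float}

def validate_strict(raw: str) -> dict:
    """Strict mode: parse JSON and require every expected field with the right type."""
    obj = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(obj.get(field), typ):
            raise ValueError(f"field {field!r} missing or wrong type")
    return obj

def ask_with_retry(call_model, prompt: str, retries: int = 3) -> dict:
    """Re-prompt the model until its reply validates, up to `retries` attempts."""
    for _ in range(retries):
        raw = call_model(prompt)  # call_model is a stand-in for any provider call
        try:
            return validate_strict(raw)
        except (json.JSONDecodeError, ValueError) as err:
            # Feed the validation error back so the model can self-correct.
            prompt += f"\nPrevious reply was invalid ({err}); return only valid JSON."
    raise RuntimeError("model never produced valid JSON")
```

A non-strict mode would typically relax `validate_strict` to accept extra or loosely-typed fields rather than rejecting the reply outright.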
codeboratory
LLM Structured Output Benchmark: JSON extraction from text via OpenRouter with Instructor + Zod
solaicoffee
Production-style Python engine for structured evaluation, scoring, and benchmarking of LLM outputs.
Bae-ChangHyun
Benchmark tool for comparing LLM structured output frameworks (Instructor, LangChain, Marvin, PydanticAI, Mirascope, Guardrails)
samidala
A production-ready system to benchmark local LLM inference performance with structured JSON output.
divyathakran
A framework for benchmarking and evaluating locally running LLMs across latency, throughput, and structured output reliability.
Vaibhavi-Sita
AI benchmarking platform to evaluate and compare multiple LLM outputs using structured pipelines, automated scoring, and human rating workflows.
sharathStack
Benchmarks 5 prompt strategies (zero-shot, CoT, few-shot, role-based, structured output) against a weighted rubric. Produces JSONL annotations for LLM training. Python · NLP
abraromar002
Benchmarking and comparing 3 local LLMs (Llama 3.2, Phi-4 Mini, Mistral 7B) using Ollama — inference speed, structured output validation, temperature variance analysis · FastAPI · Pydantic
PRAISELab-PicusLab
A benchmark for evaluating the structural correctness and environmental efficiency of structured output formats in large language models (LLMs), focusing on token usage, generation latency, and carbon emissions.
TJ-Neary
Comprehensive LLM evaluation framework comparing local and cloud models with hardware-aware benchmarking. Evaluate across code generation, document analysis, and structured output using pass@k, LLM-as-Judge, and RAG metrics. Supports Ollama, Google Gemini, Anthropic, and OpenAI.
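The pass@k metric mentioned above is commonly computed with the unbiased estimator popularized by code-generation benchmarks; a minimal sketch (not this repo's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that passed
    k: hypothetical draw size
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k; success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 2 pass, pass@1 is 0.5.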
lobuem
Shannon is an open-source AI platform for large language model interaction and evaluation. It provides tools for querying, benchmarking, and comparing LLM outputs, enabling developers and researchers to test models, analyze performance, and build applications with structured LLM workflows.
srdesai1
A production-ready LLM-as-a-Judge evaluation framework. Automates AI benchmarking using Gemini 2.5 Flash Lite, structured Pydantic outputs, and resilient rate-limit handling. Built for high-velocity, zero-cost AI quality assurance.
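"Resilient rate-limit handling" in frameworks like this usually means jittered exponential backoff around the provider call. A minimal sketch under that assumption (the `RuntimeError` here stands in for a provider's rate-limit exception, and the injectable `sleep` exists only to make the sketch testable):

```python
import random
import time

def with_backoff(fn, max_tries: int = 5, base: float = 1.0, sleep=time.sleep):
    """Retry fn on rate-limit-style errors, sleeping base, 2*base, 4*base, ...
    seconds (plus a little random jitter) between attempts."""
    for attempt in range(max_tries):
        try:
            return fn()
        except RuntimeError:  # stand-in for e.g. a 429 / rate-limit error class
            if attempt == max_tries - 1:
                raise  # out of retries; surface the error
            sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter spreads retries from concurrent workers apart so they do not all hit the rate limiter again at the same instant.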
thulasiramk-2310
TRX-AI is a CLI-based AI reasoning and code-review assistant that uses hybrid intent detection (rules + local LLM), multi-agent analysis (Debug/Improve/Predict), and structured outputs to quickly diagnose issues, suggest fixes, and benchmark response quality.
jafetsinke
A benchmarking project to compare the structured JSON generation capabilities of LLM libraries like dottxt/outlines and guidance-ai/guidance. The test scenario converts unstructured resume text into a well-defined JSON format, evaluating output quality, consistency, and adherence to schema validation.
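Schema-adherence checking for a task like the resume scenario above is typically a Pydantic model plus a validate step. A minimal sketch, assuming a hypothetical schema with `name`, `skills`, and `experience` fields (not the project's actual schema):

```python
from pydantic import BaseModel, ValidationError

class Experience(BaseModel):
    company: str
    title: str
    years: float

class Resume(BaseModel):
    name: str
    skills: list[str]
    experience: list[Experience]

def adheres(raw: str) -> bool:
    """True if the model's raw output is valid JSON matching the Resume schema."""
    try:
        Resume.model_validate_json(raw)  # Pydantic v2 parses and validates in one step
        return True
    except ValidationError:
        return False
```

Scoring adherence is then just the fraction of generations for which `adheres` returns True; libraries like Outlines or guidance instead constrain decoding so invalid JSON is never produced in the first place.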
ritwikraghav14
Code for the paper "Are LLMs Good for Semantic Role Labeling via Question Answering?: A Preliminary Analysis" (IJCNLP-AACL SRW). Benchmarks Llama, Mistral, Qwen, OpenChat, and Gemini on QA-SRL 2.0 using zero-shot and three-shot prompting, focusing on structured output precision.
An end-to-end local AI assistant running open-source LLMs via Ollama with a FastAPI interface. Benchmarks multiple models (Llama3, Mistral, Phi) using metrics like latency, tokens/sec, and time-to-first-token. Includes Pydantic-validated structured outputs, retry logic, and a model evaluation framework.
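The latency, tokens/sec, and time-to-first-token metrics named in several entries above reduce to simple arithmetic over streamed token arrival times. A minimal sketch of how such a benchmark might compute them (function name and return shape are illustrative assumptions):

```python
def stream_metrics(start: float, token_times: list[float]) -> dict:
    """Compute latency metrics from a request start time and the
    monotonically increasing arrival timestamps of streamed tokens."""
    ttft = token_times[0] - start        # time-to-first-token
    latency = token_times[-1] - start    # end-to-end latency
    window = token_times[-1] - token_times[0]
    # Throughput over the generation window; the first token is excluded
    # because it anchors the window rather than filling it.
    tps = (len(token_times) - 1) / window if window > 0 else float("inf")
    return {"ttft_s": ttft, "latency_s": latency, "tokens_per_sec": tps}
```

In practice the timestamps would come from reading a streaming response (e.g. Ollama's chunked output) and recording `time.monotonic()` per chunk.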