
from openjudge639
Tools and patterns to build automated evaluation pipelines for LLMs: graders, runners, aggregators, and analysis utilities for comparing model outputs and scori
OpenJudge provides an architecture and tooling for building repeatable evaluation pipelines for LLM applications. It standardises graders (LLM-based or function-based), orchestrates batch runs with a GradingRunner, aggregates the resulting scores, and provides analyzers for win rates and validation metrics.
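The grader, runner, and aggregator stages described above can be illustrated with a small plain-Python sketch. It does not use OpenJudge's actual API: `Sample`, `exact_match`, `length_ok`, `run_graders`, and `aggregate` are hypothetical names chosen for illustration, and the two samples are made-up data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# NOTE: every name below is hypothetical and only mirrors the
# grader -> runner -> aggregator flow; it is not OpenJudge's real API.


@dataclass
class Sample:
    query: str
    response: str
    reference: str


# A function-based grader maps one sample to a numeric score.
Grader = Callable[[Sample], float]


def exact_match(sample: Sample) -> float:
    """Score 1.0 when the response equals the reference (case-insensitive)."""
    same = sample.response.strip().lower() == sample.reference.strip().lower()
    return 1.0 if same else 0.0


def length_ok(sample: Sample, max_words: int = 50) -> float:
    """Score 1.0 when the response stays within a word budget."""
    return 1.0 if len(sample.response.split()) <= max_words else 0.0


def run_graders(samples: List[Sample], graders: Dict[str, Grader]) -> List[Dict[str, float]]:
    """Batch step: apply every grader to every sample, one score row per sample."""
    return [{name: grade(s) for name, grade in graders.items()} for s in samples]


def aggregate(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregation step: average each grader's score across all rows."""
    if not rows:
        return {}
    return {name: mean(row[name] for row in rows) for name in rows[0]}


# Made-up illustrative data.
samples = [
    Sample("Capital of France?", "Paris", "Paris"),
    Sample("2 + 2?", "5", "4"),
]
graders = {"exact_match": exact_match, "length_ok": length_ok}

rows = run_graders(samples, graders)
summary = aggregate(rows)
print(summary)
```

An LLM-based grader would fit the same `Grader` shape: a callable that sends the sample to a model endpoint and parses a score from the reply.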
Use this skill when you need to compare model outputs, validate responses against references, auto‑generate graders from labeled data, or produce statistics for model selection and reward modelling. Ideal for research, alignment evaluations, and production QA.
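As a rough illustration of the statistics involved in model comparison and grader validation, the sketch below computes a pairwise win rate from two models' per-sample scores and the agreement between a grader's labels and human labels. The function names `win_rate` and `agreement` are hypothetical, not taken from OpenJudge, and all numbers are made-up example data.

```python
from typing import Sequence, Tuple


def win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (win, tie, loss) fractions for model A against model B.

    Scores are paired per sample: scores_a[i] and scores_b[i] grade the same query.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return wins / n, ties / n, (n - wins - ties) / n


def agreement(grader_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Fraction of samples where the grader's label matches the human label."""
    if len(grader_labels) != len(human_labels) or not grader_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(g == h for g, h in zip(grader_labels, human_labels))
    return matches / len(grader_labels)


# Made-up per-sample scores for two hypothetical models.
model_a = [0.9, 0.4, 0.7, 0.5]
model_b = [0.6, 0.4, 0.8, 0.2]
wins, ties, losses = win_rate(model_a, model_b)

# Made-up pass/fail labels from a grader and from human annotators.
acc = agreement([1, 0, 1, 1], [1, 0, 0, 1])
print(wins, ties, losses, acc)
```

Reporting ties separately keeps the win rate from being inflated when both models receive identical scores on many samples.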
Best for developer-facing agents and tooling that can run Python code and call model APIs (OpenAI‑compatible endpoints, local LLM endpoints). Works with agents that support programmatic evaluation workflows.
This skill has not been reviewed by our automated audit pipeline yet.