OpenJudge — LLM Evaluation Pipeline

Tools and patterns to build automated evaluation pipelines for LLMs: graders, runners, aggregators, and analysis utilities for comparing model outputs and scori

triggers: evaluate model, grading runner, grader, win rates, automated evaluation, openjudge, grading


What it does

OpenJudge provides an architecture and tooling to create repeatable evaluation pipelines for LLM applications. It standardises graders (LLM-based or function-based), orchestrates batch runs with a GradingRunner, aggregates scores, and provides analyzers for win‑rates and validation metrics.
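The grader, runner, and aggregator flow described above can be sketched in plain Python. This is an illustrative sketch only and does not use OpenJudge's actual API: the names `exact_match_grader`, `non_empty_grader`, `run_graders`, and `aggregate_mean`, as well as the `response`/`reference` sample keys, are assumptions made for this example.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types for illustration; not OpenJudge's real interfaces.
Sample = Dict[str, str]
Grader = Callable[[Sample], float]


def exact_match_grader(sample: Sample) -> float:
    """Function-based grader: 1.0 if the response equals the reference."""
    return 1.0 if sample["response"].strip() == sample["reference"].strip() else 0.0


def non_empty_grader(sample: Sample) -> float:
    """Function-based grader: 1.0 if the response contains any text."""
    return 1.0 if sample["response"].strip() else 0.0


def run_graders(dataset: List[Sample], graders: Dict[str, Grader]) -> List[Dict[str, float]]:
    """Stand-in for a batch runner: apply every grader to every sample."""
    return [{name: grade(sample) for name, grade in graders.items()} for sample in dataset]


def aggregate_mean(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-sample scores into one mean score per grader."""
    if not results:
        return {}
    return {name: mean(row[name] for row in results) for name in results[0]}
```

In a real pipeline an LLM-based grader would slot into the same `Grader` shape by calling a model endpoint and returning a score; the repository's own docs describe the actual grader and runner configuration.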

When to use it

Use this skill when you need to compare model outputs, validate responses against references, auto‑generate graders from labeled data, or produce statistics for model selection and reward modelling. Ideal for research, alignment evaluations, and production QA.
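For the model-comparison use case, a pairwise win rate is one common summary statistic. The sketch below is a minimal illustration under stated assumptions, not OpenJudge's analyzer: `win_rate` is a hypothetical helper, and counting a tie as half a win is a convention chosen for this example.

```python
from typing import Sequence


def win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of samples where model A beats model B.

    Assumption for this sketch: a tie counts as half a win.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not scores_a:
        raise ValueError("score lists must not be empty")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    ties = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return (wins + 0.5 * ties) / len(scores_a)
```

The two score lists are assumed to be aligned, so that index `i` in each refers to the same evaluation sample graded by the same grader.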

What's included

Scripts: none listed at the top level (has_scripts=false), but companion docs exist in the repo (pipeline.md, graders.md, analyzer.md, generator.md).
References: additional docs in sibling files describe grader configuration and analysis patterns.
Instructions: quickstart example demonstrating model config, grader instantiation, and running GradingRunner over a dataset, plus details on result handling and model configuration.

Compatible agents

Best for developer-facing agents and tooling that can run Python code and call model APIs (OpenAI‑compatible endpoints, local LLM endpoints). Works with agents that support programmatic evaluation workflows.

Not yet audited

This skill has not been reviewed by our automated audit pipeline yet.

Information

Repository: openjudge
Stars: 639
