Run Benchmarks

Name: Run Benchmarks
Rating: 87 (1 review)
Author: sourcegraph

Trust Score 87/100

Launch, manage, and rerun CodeScaleBench benchmark suites with safety guardrails, paired baseline+full execution, and orchestration utilities.

triggers: run benchmarks, launch benchmark suite, rerun failures, paired run, quick rerun


What it does

This skill provides end-to-end orchestration for running CodeScaleBench benchmark suites. It outlines canonical commands, launch patterns, and mandatory approval gates before execution. It supports paired baseline+full runs, quick reruns of failures, gap-fill runs, and multi-account parallel execution with tooling to monitor and manage runs.

When to use it

Use it when you need to run benchmark suites, relaunch failed tasks, or perform curated paired experiments (baseline + sourcegraph_full). Use it only after confirming the key parameters: model, suite or selection file, execution config (paired/baseline-only/full-only), parallelism slots, and category (staging/official).
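The parameter check described above can be sketched as a small pre-launch gate. This is an illustrative sketch only, not code from the skill or the CodeScaleBench repository: the names RunRequest, find_problems, and is_approved are hypothetical, and the accepted values are taken from the parameter list in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the parameter list in this listing; the real skill
# may accept others.
VALID_CONFIGS = {"paired", "baseline-only", "full-only"}
VALID_CATEGORIES = {"staging", "official"}


@dataclass
class RunRequest:
    """Hypothetical container for the parameters to confirm before a run."""
    model: Optional[str] = None
    suite: Optional[str] = None  # suite name or selection file path
    execution_config: Optional[str] = None
    parallel_slots: Optional[int] = None
    category: Optional[str] = None


def find_problems(req: RunRequest) -> List[str]:
    """Return a description of every parameter that is missing or invalid."""
    problems = []
    if not req.model:
        problems.append("model is not set")
    if not req.suite:
        problems.append("suite or selection file is not set")
    if req.execution_config not in VALID_CONFIGS:
        problems.append("execution config must be one of: "
                        + ", ".join(sorted(VALID_CONFIGS)))
    if not isinstance(req.parallel_slots, int) or req.parallel_slots < 1:
        problems.append("parallelism slots must be a positive integer")
    if req.category not in VALID_CATEGORIES:
        problems.append("category must be one of: "
                        + ", ".join(sorted(VALID_CATEGORIES)))
    return problems


def is_approved(req: RunRequest, user_reply: str) -> bool:
    """Approve only if all parameters are valid and the user explicitly said yes."""
    if find_problems(req):
        return False
    return user_reply.strip().lower() in {"y", "yes"}
```

An agent following this pattern would show the output of find_problems to the user, and launch nothing until is_approved returns True for a complete request.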

What's included

Scripts: the repository contains orchestration and runner scripts (configs/harnesses/*, scripts/running/control_plane.py, monitor scripts), but none are bundled with the fetched skill (has_scripts=false in this scan).
References: the skill body mentions documentation such as docs/DAYTONA.md, along with the control plane scripts.
Instructions: the SKILL.md defines policies (mandatory approval), canonical commands for paired runs and reruns, and infrastructure checks (account health) to run before launching.

Compatible agents

Best for agents that can execute shell commands and manage long-running jobs (agents with Bash/Run capabilities). It is an operational skill intended for human-in-the-loop execution: the skill explicitly requires user confirmation before any destructive or expensive actions.

Audit Summary

CodeScaleBench run skill for launching and managing benchmark suites with paired baseline+MCP execution guardrails. No bundled scripts to test. SKILL.md is well-structured with a mandatory approval gate before execution, clear canonical command patterns, and run policy enforcement. Niche tooling specific to Sourcegraph's CodeScaleBench framework.

Watch Out

Tied to Sourcegraph's CodeScaleBench ecosystem; requires that repo to be cloned locally.
References external scripts (./configs/harnesses/) that aren't bundled with the skill.

Notes

Clean skill with no security concerns. The mandatory approval gate before running is a good safety pattern. No scripts are included; the skill is a purely instructional SKILL.md. The audience is niche, limited to teams running CodeScaleBench benchmarks.

Information

Repository: codescalebench
Stars: 14

Trust Score

Overall: 87

Security: 95

Code Quality: 78

Architecture: 80

Usefulness: 35

Related Skills

Development Worktree

Create an isolated git worktree for feature work, auto-run project setup, and verify a clean test baseline before development.

ds-fix — data-science mid-analysis fixer

Orchestrates diagnosis and targeted fixes mid-analysis: diagnose root cause, apply fixes with output-first verification, and update project learnings.

Readwise Reader Document Management

Manage Readwise Reader documents: list, save, search, move, tag, highlight, export and bulk-edit via official and custom CLIs.

Bounty Hunter — Atlas

Persona skill: 'Atlas' — a profit-focused developer persona for discovering, evaluating and executing paid bounties or freelance tasks with ROI-aware workflows.

Junshi — Research Advisor

Daily strategic research advisor that scans arXiv/venues, digests papers, and proposes bold, ranked research ideas tailored to the user's profile.

Full Stack Builder

End-to-end builder that scaffolds, implements, tests, and optionally deploys web and API applications from a natural-language specification.

ezBookkeeping API Tools

Command-line API tools for ezBookkeeping: record and query transactions, retrieve accounts/categories/tags, and fetch exchange rates for self-hosted personal fi

Feishu Voice Sender

Convert MP3s and send them as native Feishu voice messages (playable voice clips) to users or groups.
