
from codescalebench14
Launch, manage, and rerun CodeScaleBench benchmark suites with safety guardrails, paired baseline+full execution, and orchestration utilities.
This skill provides end-to-end orchestration for running CodeScaleBench benchmark suites. It outlines canonical commands, launch patterns, and mandatory approval gates before execution. It supports paired baseline+full runs, quick reruns of failures, gap-fill runs, and multi-account parallel execution with tooling to monitor and manage runs.
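To make the rerun and gap-fill ideas concrete, here is a minimal planning sketch. It is not the skill's actual tooling: the `plan_followups` function, the `(task, config)` result keys, and the `"completed"`/`"failed"` status strings are all hypothetical, and it assumes a paired run uses the two configs named `baseline` and `sourcegraph_full`.

```python
from typing import Dict, List, Tuple

# Assumed config names for a paired run (hypothetical constant).
PAIRED_CONFIGS = ("baseline", "sourcegraph_full")

RunKey = Tuple[str, str]  # (task name, config name)


def plan_followups(
    tasks: List[str],
    results: Dict[RunKey, str],
) -> Tuple[List[RunKey], List[RunKey]]:
    """Split follow-up work into reruns and gap-fills.

    reruns: (task, config) pairs that ran but ended with status "failed".
    gaps:   (task, config) pairs with no recorded result at all.
    The status strings are illustrative assumptions.
    """
    reruns: List[RunKey] = []
    gaps: List[RunKey] = []
    for task in tasks:
        for config in PAIRED_CONFIGS:
            status = results.get((task, config))
            if status is None:
                gaps.append((task, config))
            elif status == "failed":
                reruns.append((task, config))
    return reruns, gaps
```

Keeping reruns and gaps in separate lists lets each be presented for approval on its own before anything is launched.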
Use when you need to run benchmarking suites, relaunch failed tasks, or perform curated paired experiments (baseline + sourcegraph_full). It should be used only after confirming key parameters: model, suite or selection file, execution config (paired/baseline-only/full-only), parallelism slots, and category (staging/official).
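The confirmation step described above can be sketched as a small approval gate. This is an illustration under stated assumptions, not the skill's implementation: `RunRequest`, `validation_errors`, and `approve` are hypothetical names, and the accepted values are taken only from the parameter list in this description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Allowed values as listed in the description above.
VALID_CONFIGS = {"paired", "baseline-only", "full-only"}
VALID_CATEGORIES = {"staging", "official"}


@dataclass(frozen=True)
class RunRequest:
    """Hypothetical container for the parameters to confirm before a run."""
    model: str
    suite: Optional[str]
    selection_file: Optional[str]
    execution_config: str
    parallel_slots: int
    category: str


def validation_errors(req: RunRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is complete."""
    errors: List[str] = []
    if not req.model:
        errors.append("model is required")
    if not (req.suite or req.selection_file):
        errors.append("a suite or a selection file is required")
    if req.execution_config not in VALID_CONFIGS:
        errors.append("execution_config must be one of: " + ", ".join(sorted(VALID_CONFIGS)))
    if req.parallel_slots < 1:
        errors.append("parallel_slots must be at least 1")
    if req.category not in VALID_CATEGORIES:
        errors.append("category must be one of: " + ", ".join(sorted(VALID_CATEGORIES)))
    return errors


def approve(req: RunRequest, confirm: Callable[[str], bool]) -> bool:
    """Ask for confirmation only when the request is valid.

    `confirm` receives a one-line summary and returns True to proceed.
    Invalid requests are rejected without prompting.
    """
    if validation_errors(req):
        return False
    target = req.suite or req.selection_file
    summary = (
        f"model={req.model} target={target} config={req.execution_config} "
        f"slots={req.parallel_slots} category={req.category}"
    )
    return confirm(summary)
```

In interactive use, `confirm` could be something like `lambda s: input(s + " Proceed? [y/N] ").strip().lower() == "y"`, so nothing launches without an explicit yes.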
Best for agents with Bash/Run capabilities, that is, agents that can execute shell commands and manage long-running jobs. It is an operational skill intended for human-in-the-loop execution: it explicitly requires user confirmation before any destructive or expensive action.
CodeScaleBench run skill for launching and managing benchmark suites with paired baseline+MCP execution guardrails. No bundled scripts to test. SKILL.md is well-structured with a mandatory approval gate before execution, clear canonical command patterns, and run policy enforcement. Niche tooling specific to Sourcegraph's CodeScaleBench framework.
Clean skill with no security concerns. Mandatory approval gate before running is a good safety pattern. No scripts included — purely instructional SKILL.md. Niche audience limited to teams running CodeScaleBench benchmarks.