Hugging Face Evaluation Manager

Trust Score 69/100

Extract, import, and add structured model evaluation results to Hugging Face model cards; run or import benchmark evaluations and generate model-index YAML for

Triggers: extract evaluation, model card, inspect-tables, create-pr, import benchmark


What it does

This skill adds a full workflow for extracting benchmark tables from README files, importing benchmark scores from external services (Artificial Analysis), and running custom evaluations locally or on Hugging Face Jobs. It produces model-index YAML entries and can create pull requests to update model cards, with validation and checks to avoid duplicate PRs. It supports lighteval/inspect-ai and vLLM backends for GPU-accelerated evaluations.
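To make the output format concrete, here is a minimal sketch of a model-index entry built as a plain Python dictionary. It follows the publicly documented model-index layout (name, results, task, dataset, metrics); the function name, model name, and metric value are hypothetical placeholders, and this is not the skill's own code. The result is printed as JSON, which YAML parsers also accept, so no third-party YAML library is assumed.

```python
import json


def build_model_index(model_name, task_type, dataset_name, dataset_type, metrics):
    """Return a dict in the model-index layout used in model card metadata.

    `metrics` is a list of (metric_type, value) pairs.
    """
    return {
        "model-index": [
            {
                "name": model_name,
                "results": [
                    {
                        "task": {"type": task_type},
                        "dataset": {"name": dataset_name, "type": dataset_type},
                        "metrics": [
                            {"type": metric_type, "value": value}
                            for metric_type, value in metrics
                        ],
                    }
                ],
            }
        ]
    }


# All values below are illustrative placeholders, not real benchmark results.
entry = build_model_index(
    model_name="example-model",
    task_type="text-generation",
    dataset_name="MMLU",
    dataset_type="mmlu",
    metrics=[("accuracy", 0.5)],
)

print(json.dumps(entry, indent=2))
```

Keeping the entry as a dictionary until the last step makes it easy to validate fields before anything is written to a model card.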

When to use it

Use this skill when you need to add or update evaluation results on a Hugging Face model card: extracting existing tables from a README, importing authoritative benchmark scores, or running reproducible evaluation jobs and submitting the results as a PR. It is especially useful for maintainers or contributors who regularly update model-index metadata.

What's included

Scripts: CLI helpers to inspect README tables, extract YAML, and prepare PRs (scripts/ folder).
References: Integration notes for Artificial Analysis, PapersWithCode model-index format, and HF Jobs.
Instructions: Step-by-step CLI flows (inspect-tables → extract-readme → --apply/--create-pr) and safety checks to avoid duplicate PRs.
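Because the referenced scripts were not available for inspection, the following is only a rough sketch of what the table-inspection step might involve: finding Markdown pipe tables in README text and turning each row into a dictionary. The function names and the sample README are assumptions made for illustration, and the scores are placeholders.

```python
def split_row(line):
    """Split a Markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_divider(line):
    """True for a header divider row such as |---|:---:|."""
    cells = split_row(line)
    return line.strip().startswith("|") and all(
        cell and set(cell) <= set("-:") for cell in cells
    )


def parse_markdown_tables(text):
    """Return a list of tables; each table is a list of row dicts keyed by header."""
    tables = []
    lines = text.splitlines()
    i = 0
    while i < len(lines) - 1:
        header = lines[i].strip()
        if header.startswith("|") and is_divider(lines[i + 1]):
            columns = split_row(header)
            rows = []
            i += 2  # skip the header and divider rows
            while i < len(lines) and lines[i].strip().startswith("|"):
                rows.append(dict(zip(columns, split_row(lines[i]))))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


# Hypothetical README content with placeholder scores.
README = """# Model

| Benchmark | Score |
|-----------|-------|
| MMLU | 50.0 |
| GSM8K | 25.0 |

Some text.
"""

tables = parse_markdown_tables(README)
for row in tables[0]:
    print(row["Benchmark"], row["Score"])
```

Listing the detected tables before extracting anything lets a reviewer confirm which table actually holds evaluation results.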

Compatible agents

Likely used by agents that can run shell/CLI commands and manage GitHub PRs (Copilot/Code assistant, CLI-capable agents).

Audit Summary

Hugging Face model evaluation management skill — extracts eval tables from READMEs, imports benchmarks from Artificial Analysis API, and runs custom evaluations via vLLM/lighteval. No bundled scripts were available to test (scripts dict empty despite SKILL.md referencing many). Contradictory best practice: #11 says 'always use --create-pr without checking for existing PRs' which directly opposes the prominent warning to always check first — this undermines the anti-spam guardrails.

Watch Out

Best practice #11 contradicts the anti-spam PR guardrails — could lead to duplicate PR spam
Scripts referenced in SKILL.md were not bundled in the fetched data — cannot verify they work
Requires HF_TOKEN with write access and optionally AA_API_KEY
vLLM evaluation requires GPU hardware — not usable without it
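The duplicate-PR concern above can be illustrated with a small guard that runs before any PR is created. This is a sketch under stated assumptions, not the skill's implementation: it expects records exposing `title`, `status`, and `is_pull_request`, which is the shape that the `huggingface_hub` client's `get_repo_discussions` is understood to yield. `DiscussionStub`, the keyword list, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiscussionStub:
    """Stand-in for a repo discussion record (hypothetical, for illustration)."""
    title: str
    status: str
    is_pull_request: bool


def find_open_eval_prs(discussions, keywords=("eval", "benchmark", "model-index")):
    """Return open pull requests whose title mentions an evaluation keyword."""
    matches = []
    for discussion in discussions:
        if not discussion.is_pull_request or discussion.status != "open":
            continue
        title = discussion.title.lower()
        if any(keyword in title for keyword in keywords):
            matches.append(discussion)
    return matches


def should_create_pr(discussions):
    """Only allow a new PR when no matching open PR already exists."""
    return not find_open_eval_prs(discussions)


existing = [
    DiscussionStub("Add evaluation results", "open", True),
    DiscussionStub("Question about license", "open", False),
    DiscussionStub("Add benchmark table", "closed", True),
]

if should_create_pr(existing):
    print("No open evaluation PR found; creating one is reasonable.")
else:
    print("An open evaluation PR already exists; skip creating another.")
```

Title matching is a crude heuristic; a stricter check could also compare the files a PR touches, but that goes beyond what the source describes.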

Notes

The instruction in best practice #11 ('always use --create-pr without checking for existing PRs') directly undermines the earlier warning to check for existing PRs first. This could be an honest mistake or an attempt to encourage PR spam on HF model repos; it is not clearly malicious, but it is concerning. The skill comes from the 'skillattack' repo, whose name follows an injection-themed convention. No scripts were bundled, so this audit is based on static analysis only.

Information

Repository: skillattack
Stars: 30

Trust Score

Overall: 69
Security: 72
Code Quality: 68
Architecture: 65
Usefulness: 52

More from skillattack

Planning with Files

Manus-style file-based planning pattern: create task_plan.md, findings.md, and progress.md to manage complex multi-step work and session recovery.

Weiyun Management — Tencent Cloud Storage Toolkit

Python toolkit and CLI to automate Tencent Weiyun cloud storage: login (QR/cookies), upload/download, sharing, space and recycle-bin management.

