
from messydata31
Generate realistic synthetic datasets with configurable anomalies (missing values, duplicates, invalid categories, bad dates, outliers) using YAML configs, CLI,
MessyData produces realistic synthetic DataFrames from a declarative YAML config or Python schema. It injects controlled anomalies (missing values, duplicates, invalid categories, date errors, outliers) to emulate messy real-world datasets for testing, QA, and ML robustness checks.
Trigger this skill when you need synthetic dirty data for testing pipelines, validating data-cleaning code, creating edge-case samples for model training or QA, or scheduling daily backfills of generated datasets.
Useful for agents that can run CLI commands or Python (Bash/uv runner, Python-capable agents like Copilot/Cursor/Claude Code).
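To make the anomaly types listed above concrete, here is a minimal standard-library sketch of controlled anomaly injection. It is not MessyData's API or implementation: the function `inject_anomalies`, its parameters, and the field name `amount` are all hypothetical, chosen only for illustration. It covers three of the listed anomalies (missing values, duplicates, outliers) and uses a seeded generator so the output is reproducible.

```python
import random


def inject_anomalies(rows, *, missing_rate=0.1, duplicate_rate=0.05,
                     outlier_rate=0.02, numeric_field="amount",
                     outlier_factor=100, seed=0):
    """Return a new list of dict rows with anomalies injected.

    Hypothetical helper for illustration only; not part of MessyData.
    """
    rng = random.Random(seed)  # seeded so the same inputs give the same output
    messy = []
    for row in rows:
        row = dict(row)  # copy so the clean input is left untouched
        # Missing values: blank out each field independently.
        for key in row:
            if rng.random() < missing_rate:
                row[key] = None
        # Outliers: scale the numeric field far outside its usual range.
        value = row.get(numeric_field)
        if value is not None and rng.random() < outlier_rate:
            row[numeric_field] = value * outlier_factor
        messy.append(row)
        # Duplicates: occasionally emit the same row a second time.
        if rng.random() < duplicate_rate:
            messy.append(dict(row))
    return messy


# Assumed clean input: 1000 rows with an id, a category, and a numeric amount.
clean = [{"id": i, "category": "A" if i % 2 else "B", "amount": 10.0 + i}
         for i in range(1000)]
messy = inject_anomalies(clean, seed=42)
```

Keeping each anomaly behind its own rate parameter is what makes the mess "controlled": a pipeline test can set one rate to 1.0 and the others to 0.0 to exercise a single failure mode in isolation.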
MessyData is a synthetic dirty-data generator driven by YAML configs, with a CLI and a Python API. The SKILL.md is comprehensive and well structured, covering config authoring, validation, generation, and date-aware modes. There are no bundled scripts to test. It uses uv for isolated execution, which is a good security practice. The allowed-tools scope is reasonable, but Bash(uv:*) is somewhat broad.
Clean, well-documented reference skill. No security concerns. Good use of uv for isolated execution. Comprehensive YAML reference with realistic examples. Could benefit from tighter Bash scoping in allowed-tools.