
from maxtext-slurm27
A post-training analysis workflow that uses TGS tagging, TraceLens, and IRLens to diagnose model training performance, GPU utilization, and kernel-level hotspots.
This skill provides a step-by-step performance analysis pipeline for MaxText training runs on Slurm-managed clusters. The dispatcher examines logs and artifacts, runs appropriate tools (tgs_tagger, TraceLens, IRLens), and produces a structured analysis.json and CSV artifacts. The assistant uses these outputs to summarize throughput (TGS/MFU), GPU compute/comm/idle breakdowns, kernel-level time composition, and suggested next steps.
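To make the summarized quantities concrete, here is a minimal sketch of how throughput and time-breakdown figures of this kind are commonly computed. It is not taken from the skill or its tools. It assumes TGS means tokens per GPU per second, uses the common rule-of-thumb of roughly 6 FLOPs per parameter per token for dense transformer training, and every function name and input value is hypothetical.

```python
# Illustrative only: these helpers are NOT part of tgs_tagger, TraceLens, or IRLens.
# Assumption: TGS = tokens processed per GPU per second.

def tokens_per_gpu_per_second(global_batch_tokens, step_time_s, num_gpus):
    """Tokens processed per GPU per second for one training step."""
    return global_batch_tokens / (step_time_s * num_gpus)


def model_flops_utilization(tgs, flops_per_token, peak_flops_per_gpu):
    """Fraction of the GPU's peak FLOP/s spent on useful model FLOPs."""
    return tgs * flops_per_token / peak_flops_per_gpu


def time_breakdown(compute_s, comm_s, idle_s):
    """Normalize compute/communication/idle seconds into fractions of the total."""
    total = compute_s + comm_s + idle_s
    return {
        "compute": compute_s / total,
        "comm": comm_s / total,
        "idle": idle_s / total,
    }


# Hypothetical inputs: 1,048,576 tokens per step, 4 s per step, 64 GPUs.
tgs = tokens_per_gpu_per_second(1_048_576, 4.0, 64)

# Assumption: dense 7B-parameter model, ~6 FLOPs per parameter per token,
# and a placeholder peak of 1e15 FLOP/s per GPU (not a real device spec).
mfu = model_flops_utilization(tgs, 6 * 7e9, 1e15)

breakdown = time_breakdown(compute_s=6.0, comm_s=3.0, idle_s=1.0)
print(f"TGS={tgs:.0f}  MFU={mfu:.1%}  breakdown={breakdown}")
```

The actual field names and formulas in analysis.json may differ; treat the real tool output as the source of truth.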
Use when investigating slow training runs, unexplained throughput regressions, per-kernel hotspots, or when comparing two jobs to find system-level or GPU-side causes. It supports both single-job deep dives and multi-job comparison workflows (the SKILL.md links to complementary skills for multi-job TSDB comparisons and profile-drill per-kernel verification).
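A two-job comparison of this kind can be sketched as a relative-delta report over metrics that both jobs share. This is an illustration under stated assumptions, not the skill's own comparison logic: the metric names (`tgs`, `comm_frac`), the values, and the 5% threshold are all hypothetical.

```python
# Illustrative only: metric names, values, and the threshold are hypothetical
# and do not reflect the real analysis.json schema.

def compare_jobs(baseline, candidate, threshold=0.05):
    """Return the relative change per shared metric and flag large shifts."""
    report = {}
    for key in sorted(baseline.keys() & candidate.keys()):
        base = baseline[key]
        if base == 0:
            # A relative delta is undefined for a zero baseline; skip it.
            continue
        delta = (candidate[key] - base) / base
        report[key] = {"delta": delta, "flagged": abs(delta) > threshold}
    return report


baseline_job = {"tgs": 4000.0, "comm_frac": 0.20}
candidate_job = {"tgs": 3600.0, "comm_frac": 0.30}

report = compare_jobs(baseline_job, candidate_job)
for name, entry in report.items():
    marker = "  <-- investigate" if entry["flagged"] else ""
    print(f"{name}: {entry['delta']:+.1%}{marker}")
```

A drop in throughput paired with a rise in the communication share, as in this made-up example, would point toward a system-level cause rather than a single slow kernel.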
Best for agents that can run shell/python tooling and inspect generated JSON/CSV outputs. Targets ML infra engineers and performance investigators working with TraceLens/IRLens artifacts.
MaxText/Slurm performance analysis skill for diagnosing ML training job performance using TGS tagging, TraceLens, and IRLens. It bundles no scripts; instead, the SKILL.md instructs the agent to run utilities from a separate repo. The 5-step workflow is well structured, with good cross-references to sibling skills and honest warnings about TraceLens per-GPU bias. The audience is very niche: it requires AMD GPU hardware and a specific Slurm cluster setup.
TraceLens, tgs_tagger, xplane protobuf files, AMD GPU cluster with Slurm
Domain-specific infra skill with no scripts bundled. Security is solid: no dangerous patterns, explicitly warns against binding dashboard to 0.0.0.0. The skill is well-documented but only useful to a narrow audience running AMD MaxText on Slurm clusters.