
by maxtext-slurm27
A post-training analysis workflow that uses TGS tagging, TraceLens, and IRLens to diagnose model training performance, GPU utilization, and kernel-level hotspots.
This skill provides a step-by-step performance analysis pipeline for MaxText training runs on Slurm-managed clusters. The dispatcher examines logs and artifacts, runs appropriate tools (tgs_tagger, TraceLens, IRLens), and produces a structured analysis.json and CSV artifacts. The assistant uses these outputs to summarize throughput (TGS/MFU), GPU compute/comm/idle breakdowns, kernel-level time composition, and suggested next steps.
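To make the summary metrics concrete, here is a minimal sketch of how throughput, MFU, and a compute/comm/idle split could be computed by hand. It assumes TGS means tokens per GPU per second (the listing does not expand the acronym) and estimates training FLOPs with the common 6 × parameters per token approximation for dense transformers, which ignores attention FLOPs. All function and parameter names are hypothetical; they are not the interface of tgs_tagger, TraceLens, or IRLens, and they do not reflect the actual analysis.json schema.

```python
def tgs(tokens_per_step: int, step_time_s: float, num_gpus: int) -> float:
    """Tokens per GPU per second (assumed meaning of TGS)."""
    return tokens_per_step / (step_time_s * num_gpus)


def mfu(tgs_value: float, n_params: float, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization as a fraction of peak.

    Assumption: ~6 * n_params FLOPs per trained token (forward + backward)
    for a dense transformer; attention FLOPs are ignored.
    """
    achieved_flops_per_gpu = 6.0 * n_params * tgs_value
    return achieved_flops_per_gpu / peak_flops_per_gpu


def breakdown_pct(compute_ms: float, comm_ms: float, idle_ms: float) -> dict:
    """Convert a compute/comm/idle time split into percentages of the total."""
    total = compute_ms + comm_ms + idle_ms
    if total <= 0:
        raise ValueError("total time must be positive")
    return {
        "compute": 100.0 * compute_ms / total,
        "comm": 100.0 * comm_ms / total,
        "idle": 100.0 * idle_ms / total,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from a real run.
    t = tgs(tokens_per_step=1_048_576, step_time_s=4.0, num_gpus=64)
    print(f"TGS: {t:.1f} tokens/GPU/s")
    print(f"MFU: {mfu(t, n_params=7e9, peak_flops_per_gpu=1e15):.3f}")
    print(breakdown_pct(compute_ms=600, comm_ms=300, idle_ms=100))
```

A low MFU together with a large comm or idle share points toward communication or scheduling overhead, while a low MFU with a high compute share suggests looking at the kernels themselves.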
Use this skill when investigating slow training runs, unexplained throughput regressions, or per-kernel hotspots, or when comparing two jobs to find system-level or GPU-side causes. It supports both single-job deep dives and multi-job comparisons; SKILL.md links to complementary skills for multi-job TSDB comparisons and profile-drill per-kernel verification.
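For the two-job comparison case, a rough sketch of ranking kernels by how much their total time grew between a baseline and a candidate run might look like the following. The CSV column names `kernel` and `total_ms` are assumptions made for illustration; the real CSV artifacts produced by the skill may use different headers, and the helper names are hypothetical.

```python
import csv
import io


def load_kernel_times(csv_text: str) -> dict:
    """Sum total_ms per kernel name from CSV text (hypothetical columns)."""
    times = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["kernel"]
        times[name] = times.get(name, 0.0) + float(row["total_ms"])
    return times


def top_regressions(baseline: dict, candidate: dict, n: int = 5) -> list:
    """Return the n kernels with the largest time increase, biggest first.

    Kernels missing from one job are treated as taking 0 ms there.
    """
    names = set(baseline) | set(candidate)
    deltas = [
        (name, candidate.get(name, 0.0) - baseline.get(name, 0.0))
        for name in names
    ]
    # Sort by descending delta, then by name for a stable order.
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:n]


if __name__ == "__main__":
    # Made-up sample data, not output from a real job.
    base_csv = "kernel,total_ms\ngemm,100\nallreduce,40\n"
    cand_csv = "kernel,total_ms\ngemm,105\nallreduce,90\ncopy,3\n"
    base = load_kernel_times(base_csv)
    cand = load_kernel_times(cand_csv)
    for name, delta in top_regressions(base, cand, n=3):
        print(f"{name}: {delta:+.1f} ms")
```

Sorting by absolute time delta rather than percentage change keeps small, noisy kernels from dominating the list; a percentage view can be added once the largest contributors are known.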
Best for agents that can run shell/python tooling and inspect generated JSON/CSV outputs. Targets ML infra engineers and performance investigators working with TraceLens/IRLens artifacts.
This skill has not yet been reviewed by our automated audit pipeline.