MaxText Performance Analysis

Trust Score 81/100

A post-training analysis workflow that uses TGS tagging, TraceLens, and IRLens to diagnose model training performance, GPU utilization, and kernel-level hotspots.

Triggers: performance analysis, TraceLens, IRLens, tgs_tagger, xplane, training job analysis


What it does

This skill provides a step-by-step performance analysis pipeline for MaxText training runs on Slurm-managed clusters. The dispatcher examines logs and artifacts, runs appropriate tools (tgs_tagger, TraceLens, IRLens), and produces a structured analysis.json and CSV artifacts. The assistant uses these outputs to summarize throughput (TGS/MFU), GPU compute/comm/idle breakdowns, kernel-level time composition, and suggested next steps.
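As a rough illustration of how the dispatcher's analysis.json might be inspected, the sketch below flattens every numeric leaf in a JSON file into dotted paths. The schema of analysis.json is not described here, so the code deliberately assumes nothing about key names; `flatten_numeric` and `summarize_analysis` are hypothetical helper names, not part of the skill or its repository.

```python
import json
from pathlib import Path


def flatten_numeric(node, prefix=""):
    """Yield (dotted_path, value) for every numeric leaf in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_numeric(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_numeric(value, f"{prefix}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        # Booleans are excluded because bool is a subclass of int in Python.
        yield prefix, node


def summarize_analysis(path):
    """Load a JSON file and return {dotted_path: number} for all numeric leaves."""
    data = json.loads(Path(path).read_text())
    return dict(flatten_numeric(data))
```

Calling `summarize_analysis("analysis.json")` on a real output file would list whatever numeric metrics it contains, which can help locate the throughput and breakdown fields before writing schema-specific code.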

When to use it

Use when investigating slow training runs, unexplained throughput regressions, per-kernel hotspots, or when comparing two jobs to find system-level or GPU-side causes. It supports both single-job deep dives and multi-job comparison workflows (the SKILL.md links to complementary skills for multi-job TSDB comparisons and profile-drill per-kernel verification).
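When comparing two jobs, the first check is usually whether the throughput gap is large enough to matter. The sketch below shows one way to express that; the function names and the 2% tolerance are illustrative assumptions, not values taken from the skill.

```python
def relative_change(baseline, candidate):
    """Return the fractional change of candidate relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def is_regression(baseline, candidate, tolerance=0.02):
    """Flag a regression when throughput drops by more than `tolerance`.

    The default tolerance of 0.02 (2%) is an arbitrary placeholder; pick a
    value that reflects the run-to-run noise on your own cluster.
    """
    return relative_change(baseline, candidate) < -tolerance
```

For example, `is_regression(1000.0, 900.0)` flags a 10% drop, while a 0.5% drop stays within the assumed noise band.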

What's included

Scripts: dispatcher utilities (utils/analyze_job.py) and helper scripts to extract key metrics from analysis.json and TraceLens CSVs
References: patch notes and operational guidance for TraceLens; instructions for installing/patching TraceLens when needed
Instructions: runbook for using dispatcher, handling TraceLens install/patch, reading analysis.json, and interpreting CSV outputs (gpu_events_averages.csv, kernel breakdowns)
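Interpreting a breakdown CSV such as gpu_events_averages.csv typically means turning per-category times into shares of the total. The following is a minimal sketch under stated assumptions: the column names `category` and `time_ms` and the sample rows are made up for illustration, since the real header names are not given here. Pass the actual column names from your file.

```python
import csv
import io


def time_shares(csv_text, name_col, time_col):
    """Return {category: percent of total time}, ordered largest first."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Sum repeated categories so duplicated rows do not overwrite each other.
        totals[row[name_col]] = totals.get(row[name_col], 0.0) + float(row[time_col])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {name: 100.0 * value / grand_total for name, value in ranked}


# Made-up sample data; NOT taken from any real TraceLens output.
SAMPLE = """category,time_ms
compute,60
communication,30
idle,10
"""

shares = time_shares(SAMPLE, name_col="category", time_col="time_ms")
for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
```

To use it on a real file, read the file contents into a string first (for example with `open(path).read()`) and substitute the real header names.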

Compatible agents

Best for agents that can run shell/python tooling and inspect generated JSON/CSV outputs. Targets ML infra engineers and performance investigators working with TraceLens/IRLens artifacts.

Audit Summary

MaxText/Slurm performance analysis skill for diagnosing ML training job performance using TGS tagging, TraceLens, and IRLens. No bundled scripts — the SKILL.md instructs the agent to run utilities from a separate repo. Well-structured 5-step workflow with good cross-references to sibling skills and honest warnings about TraceLens per-GPU bias. Very niche audience requiring AMD GPU hardware and specific Slurm cluster setup.

Watch Out

Requires AMD GPU cluster with Slurm and MaxText framework
TraceLens needs manual patching for TF 2.19+ compatibility
kernel_launchers_summary CSV has ~1.5-2x inflation bug on 1-node profiles
pip install from GitHub without checksum verification
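One way to mitigate the last item is to download the package archive yourself and verify its SHA-256 digest before installing. The sketch below uses only the Python standard library; `verify_download` is a hypothetical helper, and the expected digest must come from a source you trust (the skill does not publish one).

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True only if the file's digest matches the expected value."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

pip also offers a hash-checking mode for requirements files (`--require-hashes`), which may be a better fit when the dependency can be pinned to a released archive.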

Missing Dependencies

TraceLens, tgs_tagger, xplane protobuf files, AMD GPU cluster with Slurm

Notes

Domain-specific infra skill with no scripts bundled. Security is solid — no dangerous patterns, explicitly warns against binding dashboard to 0.0.0.0. The skill is well-documented but only useful to a narrow audience running AMD MaxText on Slurm clusters.

Information

Repository: maxtext-slurm
Stars: 27

Trust Score

Overall: 81
Security: 88
Code Quality: 72
Architecture: 78
Usefulness: 42
