MiroMindAI · gali-leilei · Mar 11, 2026 · Mar 11, 2026 · Mar 11, 2026
diff --git a/.claude/skills/hta-cli/SKILL.md b/.claude/skills/hta-cli/SKILL.md
@@ -0,0 +1,81 @@
+---
+name: pytorch-profile
+description: >-
+hollistic trace analysis (hta) gives insight about distributed training with pytorch.
+It should be used when the user asks to "analyse pytorch trace",
+or mentions any subcommand like temporal-breakdown, comm-comp-overlap,
+gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc.
+---
+
+# Pytorch Profile Data
+
+The hta CLI (`python -m hta`) exposes every major trace analysis as a standalone subcommand. It is designed for CI pipelines, shell scripts, and quick interactive analysis without notebooks.
+
+## Two-Step Workflow
+
+All CLI usage follows a **pre-process then analyze** pattern:
+
+```bash
+# Step 1: Parse raw PyTorch Profiler traces into parquet
+python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed
+
+# Step 2: Run any analysis subcommand on the preprocessed directory
+python -m hta temporal-breakdown -i ./preprocessed
+python -m hta idle-time-breakdown -i ./preprocessed --ranks 0,1
+```
+
+**Step 1 (`pre-process`)** reads raw JSON traces from `--trace-dir`, writes one `.parquet` file per rank plus a `metadata.json` into `-o`. This only needs to run once per trace set.
+
+**Step 2 (any analysis subcommand)** reads from the pre-processed directory via `-i` / `--input`. Most subcommands print markdown tables to stdout.
+
+## Subcommand Quick Reference
+
+| Subcommand | Description | Key Args (besides `-i`) | Category |
+|---|---|---|---|
+| `pre-process` | Parse raw traces to parquet | `--trace-dir`, `-o` (both required) | Preprocessing |
+| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview |
+| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview |
+| `profiler-steps` | List profiler step indices | — | Overview |
+| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview |
+| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels |
+| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required) | GPU Kernels |
+| `gpu-user-annotation-breakdown` | GPU/CPU time by user annotations | `--cpu`, `--duration-ratio`, `--num-kernels`, `--allowlist-patterns` | GPU Kernels |
+| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name`, `--output-dir` (both required), `--top-k`, `--rank` | GPU Kernels |
+| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels |
+| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff` | GPU Kernels |
+| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters |
+| `queue-length-summary` | Queue length summary stats per rank | `--ranks` | Counters |
+| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters |
+| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters |
+| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters |
+| `memory-bw-time-series` | Full memory bandwidth time series per rank | `--ranks` | Counters |
+| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats` | Idle Time |
+| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI |
+| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required) | Critical Path |
+
+## Common Patterns
+
+**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed.
+
+**Output format:** Most subcommands print markdown tables to stdout. Pipe to a file or use in scripts:
+```bash
+python -m hta temporal-breakdown -i ./preprocessed > results.md
+```
+
+**Getting help:** Run `python -m hta --help` for all subcommands, or `python -m hta <subcommand> --help` for a specific one.
+
+**Running via uv:** In this project, prefix with `uv run`:
+```bash
+uv run python -m hta pre-process --trace-dir ./traces -o ./preprocessed
+uv run python -m hta temporal-breakdown -i ./preprocessed
+```
+
+## Key Source Files
+
+- `hta/__main__.py` — CLI implementation (argument parsing and subcommand handlers)
+- `hta/trace_analysis.py` — `TraceAnalysis` class that backs every subcommand
+- `docs/cli-guide.md` — Human-facing CLI documentation
+
+## Additional Resources
+
+For full argument tables, types, defaults, and detailed output descriptions for every subcommand, see `references/subcommands.md`.