Merged
81 changes: 81 additions & 0 deletions .claude/skills/hta-cli/SKILL.md
@@ -0,0 +1,81 @@
---
name: pytorch-profile
description: >-
Holistic Trace Analysis (HTA) gives insight into distributed training with PyTorch.
It should be used when the user asks to "analyse pytorch trace",
or mentions any subcommand like temporal-breakdown, comm-comp-overlap,
gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc.
---

# PyTorch Profile Data

The hta CLI (`python -m hta`) exposes every major trace analysis as a standalone subcommand. It is designed for CI pipelines, shell scripts, and quick interactive analysis without notebooks.

## Two-Step Workflow

All CLI usage follows a **pre-process then analyze** pattern:

```bash
# Step 1: Parse raw PyTorch Profiler traces into parquet
python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed

# Step 2: Run any analysis subcommand on the preprocessed directory
python -m hta temporal-breakdown -i ./preprocessed
python -m hta idle-time-breakdown -i ./preprocessed --ranks 0,1
```

**Step 1 (`pre-process`)** reads raw JSON traces from `--trace-dir` and writes one `.parquet` file per rank plus a `metadata.json` into the `-o` directory. This only needs to run once per trace set.

**Step 2 (any analysis subcommand)** reads from the pre-processed directory via `-i` / `--input`. Most subcommands print markdown tables to stdout.
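Because Step 1 only needs to run once per trace set, a script can skip it when the output directory already holds a `metadata.json`. The following is a minimal sketch under that assumption (it treats the presence of `metadata.json` as evidence of a completed earlier run, which the CLI itself does not guarantee). `HTA_CMD` and `preprocess_once` are hypothetical names introduced here for illustration, not part of hta.

```bash
# HTA_CMD is a hypothetical override hook (not an hta feature); it defaults to
# the documented entry point and can be set to e.g. "uv run python -m hta".
HTA_CMD="${HTA_CMD:-python -m hta}"

# preprocess_once RAW_DIR OUT_DIR
# Runs `pre-process` unless OUT_DIR already contains metadata.json.
preprocess_once() {
  local raw_dir="$1" out_dir="$2"
  if [ -f "$out_dir/metadata.json" ]; then
    echo "skip: $out_dir already pre-processed"
    return 0
  fi
  # $HTA_CMD is left unquoted on purpose so "python -m hta" splits into words.
  $HTA_CMD pre-process --trace-dir "$raw_dir" -o "$out_dir"
}
```

A typical call would be `preprocess_once ./raw_traces ./preprocessed`, followed by any analysis subcommand with `-i ./preprocessed`.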

## Subcommand Quick Reference

| Subcommand | Description | Key Args (besides `-i`) | Category |
|---|---|---|---|
| `pre-process` | Parse raw traces to parquet | `--trace-dir`, `-o` (both required) | Preprocessing |
| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview |
| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview |
| `profiler-steps` | List profiler step indices | — | Overview |
| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview |
| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels |
| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required) | GPU Kernels |
| `gpu-user-annotation-breakdown` | GPU/CPU time by user annotations | `--cpu`, `--duration-ratio`, `--num-kernels`, `--allowlist-patterns` | GPU Kernels |
| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name`, `--output-dir` (both required), `--top-k`, `--rank` | GPU Kernels |
| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels |
| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff` | GPU Kernels |
| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters |
| `queue-length-summary` | Queue length summary stats per rank | `--ranks` | Counters |
| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters |
| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters |
| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters |
| `memory-bw-time-series` | Full memory bandwidth time series per rank | `--ranks` | Counters |
| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats` | Idle Time |
| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI |
| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required) | Critical Path |

## Common Patterns

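**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed. A few subcommands (`gpu-kernels-with-annotations`, `frequent-cuda-kernel-sequences`, `critical-path`) take `--rank` (singular) instead; see the quick reference table.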
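When rank numbers come from a script (for example a loop or `seq`), they first need to be joined into the comma-separated form that `--ranks` expects. A small sketch of one way to do that in the shell; `join_ranks` is a hypothetical helper, not something hta provides.

```bash
# join_ranks R1 R2 ...
# Prints the arguments joined by commas, e.g. `join_ranks 0 1 3` prints 0,1,3.
join_ranks() {
  # "$*" joins the positional parameters with the first character of IFS.
  local IFS=,
  echo "$*"
}
```

It could then be used as `python -m hta idle-time-breakdown -i ./preprocessed --ranks "$(join_ranks 0 1 3)"`.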

**Output format:** Most subcommands print markdown tables to stdout. Redirect the output to a file or consume it in scripts:
```bash
python -m hta temporal-breakdown -i ./preprocessed > results.md
```
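For CI pipelines it can be convenient to save several reports in one go. The sketch below writes one markdown file per subcommand and stops at the first failure. It assumes each listed subcommand needs no arguments beyond `-i` (true of the Overview subcommands with no key args in the table above); `HTA_CMD` and `save_reports` are hypothetical names used only for illustration.

```bash
# HTA_CMD is a hypothetical override hook (not an hta feature).
HTA_CMD="${HTA_CMD:-python -m hta}"

# save_reports INPUT_DIR REPORT_DIR SUBCOMMAND...
# Runs each subcommand on INPUT_DIR and writes REPORT_DIR/<subcommand>.md.
save_reports() {
  local in_dir="$1" report_dir="$2"
  shift 2
  mkdir -p "$report_dir" || return 1
  local sub
  for sub in "$@"; do
    # $HTA_CMD is left unquoted on purpose so it splits into words.
    $HTA_CMD "$sub" -i "$in_dir" > "$report_dir/$sub.md" || {
      echo "failed: $sub" >&2
      return 1
    }
  done
}
```

For example, `save_reports ./preprocessed ./reports temporal-breakdown comm-comp-overlap profiler-steps` would produce three files under `./reports`.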

**Getting help:** Run `python -m hta --help` for all subcommands, or `python -m hta <subcommand> --help` for a specific one.

**Running via uv:** In this project, prefix with `uv run`:
```bash
uv run python -m hta pre-process --trace-dir ./traces -o ./preprocessed
uv run python -m hta temporal-breakdown -i ./preprocessed
```

## Key Source Files

- `hta/__main__.py` — CLI implementation (argument parsing and subcommand handlers)
- `hta/trace_analysis.py` — `TraceAnalysis` class that backs every subcommand
- `docs/cli-guide.md` — Human-facing CLI documentation

## Additional Resources

For full argument tables, types, defaults, and detailed output descriptions for every subcommand, see `references/subcommands.md`.