diff --git a/.claude/skills/hta-cli/SKILL.md b/.claude/skills/hta-cli/SKILL.md deleted file mode 100644 index cfed73e..0000000 --- a/.claude/skills/hta-cli/SKILL.md +++ /dev/null @@ -1,81 +0,0 @@ ---- -name: pytorch-profile -description: >- -hollistic trace analysis (hta) gives insight about distributed training with pytorch. -It should be used when the user asks to "analyse pytorch trace", -or mentions any subcommand like temporal-breakdown, comm-comp-overlap, -gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc. ---- - -# Pytorch Profile Data - -The hta CLI (`python -m hta`) exposes every major trace analysis as a standalone subcommand. It is designed for CI pipelines, shell scripts, and quick interactive analysis without notebooks. - -## Two-Step Workflow - -All CLI usage follows a **pre-process then analyze** pattern: - -```bash -# Step 1: Parse raw PyTorch Profiler traces into parquet -python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed - -# Step 2: Run any analysis subcommand on the preprocessed directory -python -m hta temporal-breakdown -i ./preprocessed -python -m hta idle-time-breakdown -i ./preprocessed --ranks 0,1 -``` - -**Step 1 (`pre-process`)** reads raw JSON traces from `--trace-dir`, writes one `.parquet` file per rank plus a `metadata.json` into `-o`. This only needs to run once per trace set. - -**Step 2 (any analysis subcommand)** reads from the pre-processed directory via `-i` / `--input`. Most subcommands print markdown tables to stdout. 
- -## Subcommand Quick Reference - -| Subcommand | Description | Key Args (besides `-i`) | Category | -|---|---|---|---| -| `pre-process` | Parse raw traces to parquet | `--trace-dir`, `-o` (both required) | Preprocessing | -| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview | -| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview | -| `profiler-steps` | List profiler step indices | — | Overview | -| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview | -| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels | -| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required) | GPU Kernels | -| `gpu-user-annotation-breakdown` | GPU/CPU time by user annotations | `--cpu`, `--duration-ratio`, `--num-kernels`, `--allowlist-patterns` | GPU Kernels | -| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name`, `--output-dir` (both required), `--top-k`, `--rank` | GPU Kernels | -| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels | -| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff` | GPU Kernels | -| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters | -| `queue-length-summary` | Queue length summary stats per rank | `--ranks` | Counters | -| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters | -| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters | -| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters | -| `memory-bw-time-series` | Full 
memory bandwidth time series per rank | `--ranks` | Counters | -| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats` | Idle Time | -| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI | -| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required) | Critical Path | - -## Common Patterns - -**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed. - -**Output format:** Most subcommands print markdown tables to stdout. Pipe to a file or use in scripts: -```bash -python -m hta temporal-breakdown -i ./preprocessed > results.md -``` - -**Getting help:** Run `python -m hta --help` for all subcommands, or `python -m hta --help` for a specific one. - -**Running via uv:** In this project, prefix with `uv run`: -```bash -uv run python -m hta pre-process --trace-dir ./traces -o ./preprocessed -uv run python -m hta temporal-breakdown -i ./preprocessed -``` - -## Key Source Files - -- `hta/__main__.py` — CLI implementation (argument parsing and subcommand handlers) -- `hta/trace_analysis.py` — `TraceAnalysis` class that backs every subcommand -- `docs/cli-guide.md` — Human-facing CLI documentation - -## Additional Resources - -For full argument tables, types, defaults, and detailed output descriptions for every subcommand, see `references/subcommands.md`. diff --git a/.claude/skills/hta-cli/references/subcommands.md b/.claude/skills/hta-cli/references/subcommands.md deleted file mode 100644 index a8b9861..0000000 --- a/.claude/skills/hta-cli/references/subcommands.md +++ /dev/null @@ -1,528 +0,0 @@ -# HTA CLI Subcommand Reference - -Full argument tables, examples, and output descriptions for all 20 HTA CLI subcommands. 
- -Source of truth: `hta/__main__.py` (argument definitions), `docs/cli-guide.md` (human-facing docs). - ---- - -## Preprocessing - -### `pre-process` - -Parse raw PyTorch Profiler traces and save as parquet for fast repeated analysis. - -```bash -python -m hta pre-process --trace-dir -o [--include-last-profiler-step] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `--trace-dir` | str | yes | — | Path to directory containing raw trace JSON files | -| `-o` / `--output` | str | yes | — | Output directory for parquet files and metadata.json | -| `--include-last-profiler-step` | flag | no | false | Include the last profiler step (excluded by default) | - -**Example:** -```bash -python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed -``` - -**Output:** One `.parquet` file per rank and a `metadata.json` in the output directory. Prints confirmation message. - ---- - -## Overview Analysis - -### `temporal-breakdown` - -Show how time is spent (compute, communication, idle, etc.) for each rank. - -See: `docs/source/features/temporal_breakdown.rst` - -```bash -python -m hta temporal-breakdown -i -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | - -**Example:** -```bash -python -m hta temporal-breakdown -i ./preprocessed -``` - -**Output:** Markdown table with one row per rank showing time percentages for each category (idle, compute, communication, etc.). - ---- - -### `comm-comp-overlap` - -Show the overlap between communication and computation for each rank. 
- -See: `docs/source/features/comm_comp_overlap.rst` - -```bash -python -m hta comm-comp-overlap -i -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | - -**Example:** -```bash -python -m hta comm-comp-overlap -i ./preprocessed -``` - -**Output:** Markdown table with overlap percentages per rank. - ---- - -### `profiler-steps` - -List the profiler step indices found in the trace. - -```bash -python -m hta profiler-steps -i -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | - -**Example:** -```bash -python -m hta profiler-steps -i ./preprocessed -# 2,3,4,5,6 -``` - -**Output:** Comma-separated list of profiler step integers printed to stdout. - ---- - -### `potential-stragglers` - -Identify ranks that are potential stragglers (slower than peers). - -```bash -python -m hta potential-stragglers -i [--num-candidates N] [--profiler-steps STEPS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--num-candidates` | int | no | None | Maximum number of straggler candidates to return | -| `--profiler-steps` | str | no | None | Comma-separated profiler step indices to analyze | - -**Example:** -```bash -python -m hta potential-stragglers -i ./preprocessed --num-candidates 2 -# 3,7 -``` - -**Output:** Comma-separated list of rank IDs that are potential stragglers. - ---- - -## GPU Kernel Analysis - -### `gpu-kernel-breakdown` - -Break down GPU time by kernel type (computation, communication, memory) and list top kernels. 
- -See: `docs/source/features/kernel_breakdown.rst` - -```bash -python -m hta gpu-kernel-breakdown -i [--duration-ratio R] [--num-kernels N] [--no-memory-kernels] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--duration-ratio` | float | no | None | Minimum fraction of total duration for a kernel to be included | -| `--num-kernels` | int | no | None | Maximum number of top kernels to show | -| `--no-memory-kernels` | flag | no | false | Exclude memory-related kernels from the breakdown | - -**Example:** -```bash -python -m hta gpu-kernel-breakdown -i ./preprocessed --num-kernels 10 -``` - -**Output:** Two markdown tables: -1. **Kernel Type Breakdown** — time per kernel category (compute, communication, memory) -2. **Top Kernels** — individual kernel durations and counts - ---- - -### `gpu-kernels-with-annotations` - -List GPU kernels annotated with their user-defined annotation context (e.g., forward/backward/optimizer). - -See: `docs/source/features/kernel_breakdown.rst` (related) - -```bash -python -m hta gpu-kernels-with-annotations -i --rank R [--no-expand-names] [--no-shorten-names] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--rank` | int | yes | — | Rank to analyze | -| `--no-expand-names` | flag | no | false | Do not expand kernel names | -| `--no-shorten-names` | flag | no | false | Do not shorten kernel names | - -**Example:** -```bash -python -m hta gpu-kernels-with-annotations -i ./preprocessed --rank 0 -``` - -**Output:** Markdown table with one row per GPU kernel, including its user annotation context. - ---- - -### `gpu-user-annotation-breakdown` - -Break down GPU (or CPU) time by user-defined annotations. 
- -See: `docs/source/features/kernel_breakdown.rst` (related) - -```bash -python -m hta gpu-user-annotation-breakdown -i [--cpu] [--duration-ratio R] [--num-kernels N] [--allowlist-patterns PAT ...] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--cpu` | flag | no | false | Use CPU time instead of GPU time | -| `--duration-ratio` | float | no | None | Minimum fraction of total duration for inclusion | -| `--num-kernels` | int | no | None | Maximum number of entries to show | -| `--allowlist-patterns` | str (multiple) | no | None | Annotation patterns to keep distinct (space-separated) | - -**Example:** -```bash -python -m hta gpu-user-annotation-breakdown -i ./preprocessed --duration-ratio 0.05 -``` - -**Output:** Markdown table with time breakdown by user annotation. - ---- - -### `frequent-cuda-kernel-sequences` - -Find frequently occurring sequences of CUDA kernels launched by a given operator. 
- -See: `docs/source/features/frequent_cuda_kernels.rst` - -```bash -python -m hta frequent-cuda-kernel-sequences -i --operator-name NAME --output-dir DIR [--min-pattern-len N] [--rank R] [--top-k K] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--operator-name` | str | yes | — | Name of the CPU operator to analyze | -| `--output-dir` | str | yes | — | Directory for output files | -| `--min-pattern-len` | int | no | None | Minimum length of kernel sequence patterns | -| `--rank` | int | no | None | Specific rank to analyze | -| `--top-k` | int | no | None | Number of top frequent patterns to return | - -**Example:** -```bash -python -m hta frequent-cuda-kernel-sequences -i ./preprocessed \ - --operator-name aten::linear --output-dir ./freq_out --top-k 5 -``` - -**Output:** Markdown table of frequent kernel sequence patterns with their counts. - ---- - -### `aten-op-kernels-and-delay` - -Map ATen operators to their launched GPU kernels, showing launch delay. - -See: `docs/source/features/cuda_kernel_launch_stats.rst` (related) - -```bash -python -m hta aten-op-kernels-and-delay -i [--ranks RANKS] [--sort-by COLS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | -| `--sort-by` | str | no | None | Comma-separated column names to sort by | - -**Example:** -```bash -python -m hta aten-op-kernels-and-delay -i ./preprocessed --ranks 0 --sort-by "duration" -``` - -**Output:** Per-rank markdown tables mapping ATen ops to GPU kernels with delay statistics. - ---- - -### `cuda-kernel-launch-stats` - -Compute statistics about CUDA kernel launches (durations, launch delays, short kernels). 
- -See: `docs/source/features/cuda_kernel_launch_stats.rst` - -```bash -python -m hta cuda-kernel-launch-stats -i [--ranks RANKS] [--runtime-cutoff N] [--launch-delay-cutoff N] [--no-memory-events] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | -| `--runtime-cutoff` | int | no | None | Runtime threshold (microseconds) for flagging short kernels | -| `--launch-delay-cutoff` | int | no | None | Launch delay threshold (microseconds) for flagging slow launches | -| `--no-memory-events` | flag | no | false | Exclude memory events from the analysis | - -**Example:** -```bash -python -m hta cuda-kernel-launch-stats -i ./preprocessed --runtime-cutoff 10 -``` - -**Output:** Per-rank markdown tables with kernel launch statistics. - ---- - -## Augmented Counters (Queue Length & Memory Bandwidth) - -### `generate-trace-with-counters` - -Generate an augmented trace file with queue length and/or memory bandwidth counter time series embedded. - -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta generate-trace-with-counters -i [--ranks RANKS] [--time-series TYPE] [--output-suffix SUFFIX] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | -| `--time-series` | str | no | None | Which counters: `queue_length`, `memcpy_bandwidth`, or `both` | -| `--output-suffix` | str | no | None | Suffix appended to the output trace filename | - -**Example:** -```bash -python -m hta generate-trace-with-counters -i ./preprocessed --time-series both -``` - -**Output:** Augmented trace JSON file(s) in the original trace directory, viewable in `chrome://tracing` or Perfetto. Prints confirmation message. 
- ---- - -### `queue-length-summary` - -Show summary statistics of the CUDA stream queue length (min, max, mean, etc.) per rank. - -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta queue-length-summary -i [--ranks RANKS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | - -**Example:** -```bash -python -m hta queue-length-summary -i ./preprocessed -``` - -**Output:** Markdown table with queue length statistics per rank. - ---- - -### `queue-length-time-series` - -Get the full queue length time series (timestamp, queue_length) per rank. - -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta queue-length-time-series -i [--ranks RANKS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | - -**Example:** -```bash -python -m hta queue-length-time-series -i ./preprocessed --ranks 0 -``` - -**Output:** Per-rank markdown tables of (timestamp, queue_length) data points. - ---- - -### `blocked-on-full-queue` - -Compute time the CPU spent blocked because the GPU launch queue was full. 
- -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta blocked-on-full-queue -i [--ranks RANKS] [--max-queue-length N] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | -| `--max-queue-length` | int | no | None | Queue length considered "full" (default: NVIDIA limit of 1024) | - -**Example:** -```bash -python -m hta blocked-on-full-queue -i ./preprocessed --max-queue-length 1024 -``` - -**Output:** Markdown table with blocking duration per rank. - ---- - -### `memory-bw-summary` - -Show memory bandwidth summary statistics per rank. - -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta memory-bw-summary -i [--ranks RANKS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | - -**Example:** -```bash -python -m hta memory-bw-summary -i ./preprocessed -``` - -**Output:** Markdown table with memory bandwidth statistics per rank. - ---- - -### `memory-bw-time-series` - -Get the full memory bandwidth time series per rank. - -See: `docs/source/features/augmented_counters.rst` - -```bash -python -m hta memory-bw-time-series -i [--ranks RANKS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | - -**Example:** -```bash -python -m hta memory-bw-time-series -i ./preprocessed --ranks 0,1 -``` - -**Output:** Per-rank markdown tables of memory bandwidth data points over time. - ---- - -## Idle Time - -### `idle-time-breakdown` - -Break down GPU idle time by category (host wait, kernel wait, other) per rank and stream. 
- -See: `docs/source/features/idle_time_breakdown.rst` - -```bash -python -m hta idle-time-breakdown -i [--ranks RANKS] [--streams STREAMS] [--show-idle-interval-stats] [--consecutive-kernel-delay N] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | -| `--streams` | str | no | None | Comma-separated CUDA stream IDs | -| `--show-idle-interval-stats` | flag | no | false | Also output statistics about individual idle intervals | -| `--consecutive-kernel-delay` | int | no | None | Threshold (microseconds) for classifying gaps between consecutive kernels | - -**Example:** -```bash -python -m hta idle-time-breakdown -i ./preprocessed --show-idle-interval-stats -``` - -**Output:** Markdown table "Idle Time Breakdown" with idle time categories per rank/stream. If `--show-idle-interval-stats` is set, a second table "Idle Interval Statistics" is also printed. - ---- - -## CUPTI Counters - -### `cupti-counter-data` - -Extract CUPTI hardware performance counter data joined with operator information. - -See: `docs/source/features/cupti_counter_analysis.rst` - -```bash -python -m hta cupti-counter-data -i [--ranks RANKS] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--ranks` | str | no | None | Comma-separated ranks | - -**Example:** -```bash -python -m hta cupti-counter-data -i ./preprocessed --ranks 0 -``` - -**Output:** Indexed markdown tables of CUPTI counter data with associated operator information. - ---- - -## Critical Path - -### `critical-path` - -Run critical path analysis on a specific annotation instance and overlay the result onto a trace file. 
- -See: `docs/source/features/lightweight_critical_path_analysis.rst` - -```bash -python -m hta critical-path -i --rank R --annotation ANN --instance-id ID --output-dir DIR [--data-load-events EVT ...] [--show-all-edges] -``` - -| Argument | Type | Required | Default | Description | -|---|---|---|---|---| -| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory | -| `--rank` | int | yes | — | Rank to analyze | -| `--annotation` | str | yes | — | User annotation name (e.g., `ProfilerStep`) | -| `--instance-id` | str | yes | — | Single int (e.g., `3`) or `start,end` range (e.g., `3,5`) | -| `--output-dir` | str | yes | — | Directory for the overlaid trace output | -| `--data-load-events` | str (multiple) | no | None | Names of data loading events (space-separated) | -| `--show-all-edges` | flag | no | false | Show all edges in the overlaid trace, not just the critical path | - -**Example:** -```bash -python -m hta critical-path -i ./preprocessed \ - --rank 0 --annotation ProfilerStep --instance-id 3 \ - --output-dir ./cp_output -``` - -**Output:** Three sections printed to stdout: -1. **Critical Path Summary** — high-level statistics (total time, breakdown percentages) -2. **Critical Path Breakdown** — per-category time on the critical path -3. **Overlaid trace path** — file path to the generated trace JSON with the critical path overlaid, viewable in `chrome://tracing` or Perfetto diff --git a/.claude/skills/trace-blame b/.claude/skills/trace-blame new file mode 120000 index 0000000..c54fe13 --- /dev/null +++ b/.claude/skills/trace-blame @@ -0,0 +1 @@ +../../cmd/trace-blame/skill \ No newline at end of file diff --git a/README.md b/README.md index e69de29..5f62e6c 100644 --- a/README.md +++ b/README.md @@ -0,0 +1,41 @@ +# trace-blame + +A Go CLI for analyzing PyTorch Profiler traces. + +Reimplements [HolisticTraceAnalysis](https://github.com/facebookresearch/HolisticTraceAnalysis) with the following features: + +1. 
`install-skill` supports agent usage. +2. Single Go binary. +3. Uses a SQLite database to store intermediate state. +4. Markdown output for CLI usage. + +## Quick Start + +```bash +go build -o trace-blame ./cmd/trace-blame/ + +# Install the accompanying Claude skill +trace-blame install-skill + +# Parse raw traces into a SQLite database +trace-blame pre-process --trace-dir ./traces --output trace.db + +# Run analyses +trace-blame temporal-breakdown --db trace.db +trace-blame gpu-kernel-breakdown --db trace.db +trace-blame idle-time-breakdown --db trace.db --ranks 0,1 +``` + +## Subcommands + +| Category | Subcommands | +|---|---| +| Preprocessing | `pre-process` | +| Overview | `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers` | +| GPU Kernels | `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `cuda-kernel-launch-stats`, `aten-op-kernels-and-delay`, `frequent-cuda-kernel-sequences` | +| Counters | `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series`, `generate-trace-with-counters` | +| Idle Time | `idle-time-breakdown` | +| Critical Path | `critical-path` | +| CUPTI | `cupti-counter-data` | + +Run `trace-blame` with no arguments for usage, or `trace-blame -h` for flag details. 
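Note on the `--ranks 0,1` value in the README Quick Start: it is a comma-separated list of rank IDs. The following is a minimal Go sketch of how such a value could be turned into a slice of integers. `parseRanks` is a hypothetical helper written for illustration only; it is not taken from `cmd/trace-blame/main.go`, and the actual CLI may treat whitespace, negative values, or malformed input differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRanks is a hypothetical helper (an assumption, not repository code).
// It converts a value such as "0,1,3" into []int{0, 1, 3}.
// An empty value returns a nil slice, which a caller could read as "all ranks".
func parseRanks(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	parts := strings.Split(s, ",")
	ranks := make([]int, 0, len(parts))
	for _, p := range parts {
		// Tolerate spaces around each entry, e.g. "0, 1".
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("invalid rank %q: %w", p, err)
		}
		if n < 0 {
			return nil, fmt.Errorf("rank must be non-negative, got %d", n)
		}
		ranks = append(ranks, n)
	}
	return ranks, nil
}

func main() {
	ranks, err := parseRanks("0,1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ranks) // prints [0 1]
}
```

Returning a nil slice for an empty value is one way to model a flag that falls back to all ranks when it is omitted; whether the real handlers do this is an assumption.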
diff --git a/cmd/trace-blame/install_skill.go b/cmd/trace-blame/install_skill.go new file mode 100644 index 0000000..5ee6dc5 --- /dev/null +++ b/cmd/trace-blame/install_skill.go @@ -0,0 +1,51 @@ +package main + +import ( + "embed" + "fmt" + "io/fs" + "log" + "os" + "path/filepath" +) + +//go:embed all:skill +var skillFS embed.FS + +func cmdInstallSkill() { + home, err := os.UserHomeDir() + if err != nil { + log.Fatalf("get home dir: %v", err) + } + + destDir := filepath.Join(home, ".claude", "skills", "trace-blame") + + if err := os.MkdirAll(destDir, 0o755); err != nil { + log.Fatalf("create dir: %v", err) + } + + err = fs.WalkDir(skillFS, "skill", func(path string, d fs.DirEntry, err error) error { + if err != nil { + return err + } + + // "skill/SKILL.md" -> "SKILL.md" + rel, _ := filepath.Rel("skill", path) + dest := filepath.Join(destDir, rel) + + if d.IsDir() { + return os.MkdirAll(dest, 0o755) + } + + data, err := skillFS.ReadFile(path) + if err != nil { + return err + } + return os.WriteFile(dest, data, 0o644) + }) + if err != nil { + log.Fatalf("install skill: %v", err) + } + + fmt.Printf("Installed skill to %s\n", destDir) +} diff --git a/cmd/tracepyre/main.go b/cmd/trace-blame/main.go similarity index 98% rename from cmd/tracepyre/main.go rename to cmd/trace-blame/main.go index 27d75e0..bca2dcc 100644 --- a/cmd/tracepyre/main.go +++ b/cmd/trace-blame/main.go @@ -9,14 +9,14 @@ import ( "strconv" "strings" - "hta/pkg/analysis" - "hta/pkg/analysis/criticalpath" - "hta/pkg/analysis/kernel" - "hta/pkg/analysis/resource" - "hta/pkg/analysis/straggler" - "hta/pkg/analysis/temporal" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/analysis/criticalpath" + "trace-blame/pkg/analysis/kernel" + "trace-blame/pkg/analysis/resource" + "trace-blame/pkg/analysis/straggler" + "trace-blame/pkg/analysis/temporal" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func main() { @@ -66,6 +66,8 @@ func main() { 
cmdCriticalPath(os.Args[2:]) case "cupti-counter-data": cmdCUPTICounterData(os.Args[2:]) + case "install-skill": + cmdInstallSkill() default: fmt.Fprintf(os.Stderr, "unknown subcommand: %s\n", os.Args[1]) usage() @@ -74,7 +76,7 @@ func main() { } func usage() { - fmt.Fprintln(os.Stderr, "Usage: hta [flags]") + fmt.Fprintln(os.Stderr, "Usage: trace-blame [flags]") fmt.Fprintln(os.Stderr, " pre-process Parse traces → SQLite DB") fmt.Fprintln(os.Stderr, " temporal-breakdown GPU temporal breakdown from DB") fmt.Fprintln(os.Stderr, " gpu-kernel-breakdown GPU kernel breakdown from DB") @@ -94,6 +96,7 @@ func usage() { fmt.Fprintln(os.Stderr, " frequent-cuda-kernel-sequences Find frequent GPU kernel launch patterns") fmt.Fprintln(os.Stderr, " critical-path Critical path analysis for a single rank") fmt.Fprintln(os.Stderr, " cupti-counter-data CUPTI profiler counter data with operator stacks") + fmt.Fprintln(os.Stderr, " install-skill Install Claude Code skill to ~/.claude/skills/") } func cmdPreProcess(args []string) { diff --git a/cmd/trace-blame/skill/SKILL.md b/cmd/trace-blame/skill/SKILL.md new file mode 100644 index 0000000..0b0bf8b --- /dev/null +++ b/cmd/trace-blame/skill/SKILL.md @@ -0,0 +1,89 @@ +--- +name: pytorch-profile +description: >- + Holistic Trace Analysis (HTA) CLI in Go gives insight about distributed training with PyTorch. + It should be used when the user asks to "analyse pytorch trace", + or mentions any subcommand like temporal-breakdown, comm-comp-overlap, + gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc. +--- + +# Pytorch Profile Data + +The `trace-blame` CLI (built from `cmd/trace-blame/main.go`) exposes every major trace analysis as a standalone subcommand. It is a Go reimplementation designed for CI pipelines, shell scripts, and quick interactive analysis. 
+ +## Two-Step Workflow + +All CLI usage follows a **pre-process then analyze** pattern: + +```bash +# Step 1: Parse raw PyTorch Profiler traces into a SQLite database +trace-blame pre-process --trace-dir ./raw_traces --output trace.db + +# Step 2: Run any analysis subcommand on the database +trace-blame temporal-breakdown --db trace.db +trace-blame idle-time-breakdown --db trace.db --ranks 0,1 +``` + +**Step 1 (`pre-process`)** reads raw JSON/GZ trace files from `--trace-dir`, writes a single SQLite database to `--output` (default: `trace.db`). This only needs to run once per trace set. + +**Step 2 (any analysis subcommand)** reads from the SQLite database via `--db` (default: `trace.db`). Most subcommands print markdown tables to stdout. + +## Subcommand Quick Reference + +| Subcommand | Description | Key Args (besides `--db`) | Category | +|---|---|---|---| +| `pre-process` | Parse raw traces to SQLite DB | `--trace-dir` (required), `--output` | Preprocessing | +| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview | +| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview | +| `profiler-steps` | List profiler step indices | — | Overview | +| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview | +| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels | +| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required), `--no-expand-names`, `--no-shorten-names` | GPU Kernels | +| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name` (required), `--output-dir`, `--top-k`, `--rank`, `--min-pattern-len` | GPU Kernels | +| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels | +| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay 
stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff`, `--no-memory-events` | GPU Kernels | +| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters | +| `queue-length-summary` | Queue length summary stats per rank/stream | `--ranks` | Counters | +| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters | +| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters | +| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters | +| `memory-bw-time-series` | Full memory bandwidth time series per rank | `--ranks` | Counters | +| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats`, `--consecutive-kernel-delay` | Idle Time | +| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI | +| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required), `--data-load-events`, `--show-all-edges` | Critical Path | + +## Common Patterns + +**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed. Some subcommands use `--rank` (singular) for a single required rank. + +**Output format:** Most subcommands print markdown tables to stdout. Pipe to a file or use in scripts: +```bash +trace-blame temporal-breakdown --db trace.db > results.md +``` + +**Getting help:** Run `trace-blame` with no arguments for the subcommand list, or `trace-blame -h` for a specific subcommand's flags. 
+ +**Building:** The binary is built from Go source: +```bash +go build -o trace-blame ./cmd/trace-blame/ +``` + +## Key Source Files + +- `cmd/trace-blame/main.go` — CLI implementation (argument parsing and subcommand handlers) +- `pkg/pipeline/` — Pre-processing pipeline (trace parsing → SQLite) +- `pkg/store/` — SQLite database layer +- `pkg/analysis/` — Analysis implementations (temporal, kernel, resource, straggler, criticalpath) + +## Additional Resources + +For full argument tables, types, defaults, and detailed output descriptions, see the reference files organized by category: + +- `references/subcommands.md` — Index linking to all category files +- `references/preprocessing.md` — `pre-process` +- `references/overview.md` — `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers` +- `references/gpu-kernels.md` — `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `frequent-cuda-kernel-sequences`, `aten-op-kernels-and-delay`, `cuda-kernel-launch-stats` +- `references/counters.md` — `generate-trace-with-counters`, `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series` +- `references/idle-time.md` — `idle-time-breakdown` +- `references/cupti-counters.md` — `cupti-counter-data` +- `references/critical-path.md` — `critical-path` diff --git a/cmd/trace-blame/skill/references/counters.md b/cmd/trace-blame/skill/references/counters.md new file mode 100644 index 0000000..3e2fc9e --- /dev/null +++ b/cmd/trace-blame/skill/references/counters.md @@ -0,0 +1,134 @@ +# Augmented Counters (Queue Length & Memory Bandwidth) + +### `generate-trace-with-counters` + +Generate an augmented trace file with queue length and/or memory bandwidth counter time series embedded. 
+ +```bash +trace-blame generate-trace-with-counters [--db ] [--ranks RANKS] [--time-series TYPE] [--output-suffix SUFFIX] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | +| `--time-series` | string | no | `both` | Which counters: `queue_length`, `memcpy_bandwidth`, or `both` | +| `--output-suffix` | string | no | `_with_counters` | Suffix for output file names | + +**Example:** +```bash +trace-blame generate-trace-with-counters --db trace.db --time-series both +``` + +**Output:** Prints output file paths. Generated trace files are viewable in `chrome://tracing` or Perfetto. + +--- + +### `queue-length-summary` + +Show summary statistics of the CUDA stream queue length per rank and stream. + +```bash +trace-blame queue-length-summary [--db ] [--ranks RANKS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | + +**Example:** +```bash +trace-blame queue-length-summary --db trace.db +``` + +**Output:** Markdown table: `| rank | stream | count | min | max | std | 25% | 50% | 75% |` + +--- + +### `queue-length-time-series` + +Get the full queue length time series per rank. + +```bash +trace-blame queue-length-time-series [--db ] [--ranks RANKS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | + +**Example:** +```bash +trace-blame queue-length-time-series --db trace.db --ranks 0 +``` + +**Output:** Per-rank markdown tables: `| ts | stream | queue_length |` + +--- + +### `blocked-on-full-queue` + +Compute time the CPU spent blocked because the GPU launch queue was full. 
+ +```bash +trace-blame blocked-on-full-queue [--db ] [--ranks RANKS] [--max-queue-length N] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | +| `--max-queue-length` | int | no | 1024 | Max CUDA launch queue length per stream | + +**Example:** +```bash +trace-blame blocked-on-full-queue --db trace.db --max-queue-length 1024 +``` + +**Output:** Markdown table: `| rank | stream | duration_at_max_queue_length | relative_duration |`. Prints a message if no streams reached maximum queue length. + +--- + +### `memory-bw-summary` + +Show memory bandwidth summary statistics per rank. + +```bash +trace-blame memory-bw-summary [--db ] [--ranks RANKS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | + +**Example:** +```bash +trace-blame memory-bw-summary --db trace.db +``` + +**Output:** Markdown table: `| rank | name | count | mean | std | min | 25% | 50% | 75% | max |` + +--- + +### `memory-bw-time-series` + +Get the full memory bandwidth time series per rank. 
+ +```bash +trace-blame memory-bw-time-series [--db ] [--ranks RANKS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | + +**Example:** +```bash +trace-blame memory-bw-time-series --db trace.db --ranks 0,1 +``` + +**Output:** Per-rank markdown tables: `| ts | pid | name | memory_bw_gbps |` diff --git a/cmd/trace-blame/skill/references/critical-path.md b/cmd/trace-blame/skill/references/critical-path.md new file mode 100644 index 0000000..f7acac6 --- /dev/null +++ b/cmd/trace-blame/skill/references/critical-path.md @@ -0,0 +1,26 @@ +# Critical Path + +### `critical-path` + +Run critical path analysis on a specific annotation instance and optionally overlay the result onto a trace file. + +```bash +trace-blame critical-path --rank R --annotation ANN --instance-id ID --output-dir DIR [--db ] [--data-load-events EVT] [--show-all-edges] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--rank` | int | yes | — | Rank to analyze | +| `--annotation` | string | yes | — | Annotation name to match (e.g., `ProfilerStep`) | +| `--instance-id` | string | yes | — | Single int (e.g., `3`) or `start,end` range (e.g., `3,5`) | +| `--output-dir` | string | yes | — | Directory for the overlay trace output | +| `--data-load-events` | string | no | — | Comma-separated regex patterns for data loading ops | +| `--show-all-edges` | flag | no | false | Show all edges in overlay (not just critical path) | + +**Example:** +```bash +trace-blame critical-path --db trace.db --rank 0 --annotation ProfilerStep --instance-id 3 --output-dir ./cp_output +``` + +**Output:** Prints critical path summary (nodes, edges, path length), breakdown by bound type table, and overlay trace file path. 
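
The `--instance-id` argument above accepts either a single integer (`3`) or a `start,end` pair (`3,5`). A minimal sketch of how such a value might be parsed follows. The function name `parseInstanceID` is hypothetical, and treating the range as inclusive with `start <= end` is an assumption; the actual behavior is defined by `cmd/trace-blame/main.go`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseInstanceID is a hypothetical helper (not from the trace-blame source).
// It accepts "N" or "START,END" and returns the pair (start, end).
// For a single integer, start and end are equal.
func parseInstanceID(s string) (start, end int, err error) {
	parts := strings.Split(s, ",")
	if len(parts) > 2 {
		return 0, 0, fmt.Errorf("instance-id %q: want N or START,END", s)
	}
	start, err = strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return 0, 0, fmt.Errorf("instance-id %q: %w", s, err)
	}
	end = start
	if len(parts) == 2 {
		end, err = strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil {
			return 0, 0, fmt.Errorf("instance-id %q: %w", s, err)
		}
	}
	// Assumption: a reversed range is treated as an error.
	if end < start {
		return 0, 0, fmt.Errorf("instance-id %q: end before start", s)
	}
	return start, end, nil
}

func main() {
	for _, arg := range []string{"3", "3,5"} {
		start, end, err := parseInstanceID(arg)
		fmt.Printf("input=%q start=%d end=%d err=%v\n", arg, start, end, err)
	}
}
```

Keeping the flag a string and normalizing it to a pair lets the rest of the analysis handle both forms uniformly.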
diff --git a/cmd/trace-blame/skill/references/cupti-counters.md b/cmd/trace-blame/skill/references/cupti-counters.md new file mode 100644 index 0000000..a753553 --- /dev/null +++ b/cmd/trace-blame/skill/references/cupti-counters.md @@ -0,0 +1,21 @@ +# CUPTI Counters + +### `cupti-counter-data` + +Extract CUPTI hardware performance counter data joined with operator information. + +```bash +trace-blame cupti-counter-data [--db ] [--ranks RANKS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | + +**Example:** +```bash +trace-blame cupti-counter-data --db trace.db --ranks 0 +``` + +**Output:** Per-rank markdown tables with kernel name, operator stack, and dynamic counter columns. diff --git a/cmd/trace-blame/skill/references/gpu-kernels.md b/cmd/trace-blame/skill/references/gpu-kernels.md new file mode 100644 index 0000000..b113e9a --- /dev/null +++ b/cmd/trace-blame/skill/references/gpu-kernels.md @@ -0,0 +1,123 @@ +# GPU Kernel Analysis + +### `gpu-kernel-breakdown` + +Break down GPU time by kernel type (computation, communication, memory) and list top kernels. + +```bash +trace-blame gpu-kernel-breakdown [--db ] [--duration-ratio R] [--num-kernels N] [--no-memory-kernels] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--duration-ratio` | float | no | 0.8 | Cumulative duration ratio cutoff | +| `--num-kernels` | int | no | 10 | Max kernels per type per rank | +| `--no-memory-kernels` | flag | no | false | Exclude MEMORY kernel type | + +**Example:** +```bash +trace-blame gpu-kernel-breakdown --db trace.db --num-kernels 10 +``` + +**Output:** Two markdown tables: +1. **Kernel Type Breakdown** — `| kernel_type | sum(us) | percentage |` +2. 
**Top Kernels** — `| name | sum(us) | max(us) | min(us) | mean(us) | stddev | kernel_type | rank |` + +--- + +### `gpu-kernels-with-annotations` + +List GPU kernels annotated with their user-defined annotation context (e.g., forward/backward/optimizer). + +```bash +trace-blame gpu-kernels-with-annotations --rank R [--db ] [--no-expand-names] [--no-shorten-names] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--rank` | int | yes | — | Rank to analyze | +| `--no-expand-names` | flag | no | false | Skip expanding symbol IDs to names | +| `--no-shorten-names` | flag | no | false | Skip shortening kernel names | + +**Example:** +```bash +trace-blame gpu-kernels-with-annotations --db trace.db --rank 0 +``` + +**Output:** Markdown table: `| started_at | ended_at | kernel | annotation |` + +--- + +### `frequent-cuda-kernel-sequences` + +Find frequently occurring sequences of CUDA kernels launched by a given operator. 
+ +```bash +trace-blame frequent-cuda-kernel-sequences --operator-name NAME [--db ] [--output-dir DIR] [--min-pattern-len N] [--rank R] [--top-k K] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--operator-name` | string | yes | — | CPU operator name substring to match | +| `--output-dir` | string | no | — | Directory for overlay trace output | +| `--min-pattern-len` | int | no | 3 | Minimum pattern length (operator + kernels) | +| `--rank` | int | no | 0 | Rank to analyze | +| `--top-k` | int | no | 5 | Number of top patterns to return | + +**Example:** +```bash +trace-blame frequent-cuda-kernel-sequences --db trace.db --operator-name aten::linear --top-k 5 +``` + +**Output:** Markdown table: `| pattern | count | GPU kernel duration (us) | CPU op duration (us) |` + +--- + +### `aten-op-kernels-and-delay` + +Map ATen operators to their launched GPU kernels, showing launch delay. + +```bash +trace-blame aten-op-kernels-and-delay [--db ] [--ranks RANKS] [--sort-by COLS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | +| `--sort-by` | string | no | `occurrence_count` | Comma-separated column names to sort by | + +**Example:** +```bash +trace-blame aten-op-kernels-and-delay --db trace.db --ranks 0 +``` + +**Output:** Per-rank markdown tables: `| aten_op_name | kernel_sequence | occurrence_count | avg_aten_op_launch_delay | avg_runtime_delay |` + +--- + +### `cuda-kernel-launch-stats` + +Compute statistics about CUDA kernel launches (durations, launch delays). 
+ +```bash +trace-blame cuda-kernel-launch-stats [--db ] [--ranks RANKS] [--runtime-cutoff N] [--launch-delay-cutoff N] [--no-memory-events] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | +| `--runtime-cutoff` | int | no | 50 | Runtime duration cutoff in µs | +| `--launch-delay-cutoff` | int | no | 100 | Launch delay cutoff in µs | +| `--no-memory-events` | flag | no | false | Exclude cudaMemcpyAsync/cudaMemsetAsync | + +**Example:** +```bash +trace-blame cuda-kernel-launch-stats --db trace.db --runtime-cutoff 10 +``` + +**Output:** Per-rank markdown tables: `| correlation | cpu_duration | gpu_duration | launch_delay |` diff --git a/cmd/trace-blame/skill/references/idle-time.md b/cmd/trace-blame/skill/references/idle-time.md new file mode 100644 index 0000000..cee981c --- /dev/null +++ b/cmd/trace-blame/skill/references/idle-time.md @@ -0,0 +1,24 @@ +# Idle Time + +### `idle-time-breakdown` + +Break down GPU idle time by category per rank and stream. + +```bash +trace-blame idle-time-breakdown [--db ] [--ranks RANKS] [--streams STREAMS] [--show-idle-interval-stats] [--consecutive-kernel-delay N] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--ranks` | string | no | all | Comma-separated ranks | +| `--streams` | string | no | all | Comma-separated CUDA stream IDs | +| `--show-idle-interval-stats` | flag | no | false | Also output statistics about individual idle intervals | +| `--consecutive-kernel-delay` | int64 | no | 30 | Threshold (µs) for classifying gaps between consecutive kernels | + +**Example:** +```bash +trace-blame idle-time-breakdown --db trace.db --show-idle-interval-stats +``` + +**Output:** Markdown table: `| rank | stream | idle_category | idle_time(us) | idle_time_ratio |`. 
If `--show-idle-interval-stats` is set, a second table with interval statistics is also printed. diff --git a/cmd/trace-blame/skill/references/overview.md b/cmd/trace-blame/skill/references/overview.md new file mode 100644 index 0000000..ed78983 --- /dev/null +++ b/cmd/trace-blame/skill/references/overview.md @@ -0,0 +1,93 @@ +# Overview Analysis + +### `temporal-breakdown` + +Show how time is spent (compute, communication, idle) for each rank. + +```bash +trace-blame temporal-breakdown [--db ] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | + +**Example:** +```bash +trace-blame temporal-breakdown --db trace.db +``` + +**Output:** Markdown table with one row per rank: +``` +| rank | idle_time(us) | compute_time(us) | non_compute_time(us) | kernel_time(us) | idle_time_pctg | compute_time_pctg | non_compute_time_pctg | +``` + +--- + +### `comm-comp-overlap` + +Show the overlap between communication and computation for each rank. + +```bash +trace-blame comm-comp-overlap [--db ] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | + +**Example:** +```bash +trace-blame comm-comp-overlap --db trace.db +``` + +**Output:** Markdown table with overlap percentages per rank: +``` +| rank | overlap_pctg | +``` + +--- + +### `profiler-steps` + +List the profiler step indices found in the trace. + +```bash +trace-blame profiler-steps [--db ] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | + +**Example:** +```bash +trace-blame profiler-steps --db trace.db +# 15,16,17,18,19 +``` + +**Output:** Comma-separated list of profiler step integers printed to stdout. + +--- + +### `potential-stragglers` + +Identify ranks that are potential stragglers (slower than peers). 
+ +```bash +trace-blame potential-stragglers [--db ] [--num-candidates N] [--profiler-steps STEPS] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--db` | string | no | `trace.db` | SQLite database path | +| `--num-candidates` | int | no | 2 | Top K straggler candidates to return | +| `--profiler-steps` | string | no | all | Comma-separated profiler step indices to analyze | + +**Example:** +```bash +trace-blame potential-stragglers --db trace.db --num-candidates 2 +# 3,7 +``` + +**Output:** Comma-separated list of rank IDs that are potential stragglers. Prints a message if no stragglers are detected. diff --git a/cmd/trace-blame/skill/references/preprocessing.md b/cmd/trace-blame/skill/references/preprocessing.md new file mode 100644 index 0000000..74fe1fc --- /dev/null +++ b/cmd/trace-blame/skill/references/preprocessing.md @@ -0,0 +1,21 @@ +# Preprocessing + +### `pre-process` + +Parse raw PyTorch Profiler traces (JSON/GZ) and store into a SQLite database for fast repeated analysis. + +```bash +trace-blame pre-process --trace-dir [--output ] +``` + +| Argument | Type | Required | Default | Description | +|---|---|---|---|---| +| `--trace-dir` | string | yes | — | Directory containing trace JSON/GZ files | +| `--output` | string | no | `trace.db` | Output SQLite database path | + +**Example:** +```bash +trace-blame pre-process --trace-dir ./raw_traces --output trace.db +``` + +**Output:** Logs per-rank event counts, writes a single SQLite database file. Only needs to run once per trace set. diff --git a/cmd/trace-blame/skill/references/subcommands.md b/cmd/trace-blame/skill/references/subcommands.md new file mode 100644 index 0000000..b796200 --- /dev/null +++ b/cmd/trace-blame/skill/references/subcommands.md @@ -0,0 +1,15 @@ +# HTA CLI Subcommand Reference + +Full argument tables, examples, and output descriptions for all 19 HTA CLI subcommands, organized by analysis category. 
+ +Source of truth: `cmd/trace-blame/main.go` (argument definitions and subcommand handlers). + +| File | Subcommands | Description | +|---|---|---| +| [preprocessing.md](preprocessing.md) | `pre-process` | Parse raw traces to SQLite DB | +| [overview.md](overview.md) | `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers` | High-level training overview | +| [gpu-kernels.md](gpu-kernels.md) | `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `frequent-cuda-kernel-sequences`, `aten-op-kernels-and-delay`, `cuda-kernel-launch-stats` | GPU kernel analysis | +| [counters.md](counters.md) | `generate-trace-with-counters`, `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series` | Queue length & memory bandwidth | +| [idle-time.md](idle-time.md) | `idle-time-breakdown` | GPU idle time classification | +| [cupti-counters.md](cupti-counters.md) | `cupti-counter-data` | CUPTI hardware counter data | +| [critical-path.md](critical-path.md) | `critical-path` | Critical path analysis | diff --git a/go.mod b/go.mod index b85aed1..62ff1ce 100644 --- a/go.mod +++ b/go.mod @@ -1,4 +1,4 @@ -module hta +module trace-blame go 1.24.11 diff --git a/pkg/analysis/criticalpath/critical_path.go b/pkg/analysis/criticalpath/critical_path.go index 79b61ea..0561985 100644 --- a/pkg/analysis/criticalpath/critical_path.go +++ b/pkg/analysis/criticalpath/critical_path.go @@ -12,11 +12,11 @@ import ( "sort" "strings" - "hta/pkg/analysis" - "hta/pkg/analysis/kernel" - "hta/pkg/analysis/resource" - "hta/pkg/store" - "hta/pkg/symbol" + "trace-blame/pkg/analysis" + "trace-blame/pkg/analysis/kernel" + "trace-blame/pkg/analysis/resource" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" ) // --------------------------------------------------------------------------- diff --git a/pkg/analysis/criticalpath/critical_path_test.go b/pkg/analysis/criticalpath/critical_path_test.go index fdd7d29..4403136 
100644 --- a/pkg/analysis/criticalpath/critical_path_test.go +++ b/pkg/analysis/criticalpath/critical_path_test.go @@ -5,9 +5,9 @@ import ( "path/filepath" "testing" - "hta/pkg/analysis" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func TestCriticalPathAlexnet(t *testing.T) { diff --git a/pkg/analysis/kernel/annotation.go b/pkg/analysis/kernel/annotation.go index e3fbba2..16ea1a3 100644 --- a/pkg/analysis/kernel/annotation.go +++ b/pkg/analysis/kernel/annotation.go @@ -4,8 +4,8 @@ import ( "database/sql" "strconv" - "hta/pkg/analysis" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" ) // AnnotationOpts configures the GPUKernelsWithAnnotations analysis. diff --git a/pkg/analysis/kernel/aten_delay.go b/pkg/analysis/kernel/aten_delay.go index 94b4514..d211999 100644 --- a/pkg/analysis/kernel/aten_delay.go +++ b/pkg/analysis/kernel/aten_delay.go @@ -7,8 +7,8 @@ import ( "sort" "strings" - "hta/pkg/store" - "hta/pkg/symbol" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" ) // AtenDelayOpts controls the ATen op kernels and delay analysis. diff --git a/pkg/analysis/kernel/helpers_test.go b/pkg/analysis/kernel/helpers_test.go index 3ef3105..0699970 100644 --- a/pkg/analysis/kernel/helpers_test.go +++ b/pkg/analysis/kernel/helpers_test.go @@ -7,7 +7,7 @@ import ( "runtime" "testing" - "hta/pkg/store" + "trace-blame/pkg/store" ) func testDataDir(t *testing.T) string { diff --git a/pkg/analysis/kernel/kernel_breakdown.go b/pkg/analysis/kernel/kernel_breakdown.go index 25db548..93ece48 100644 --- a/pkg/analysis/kernel/kernel_breakdown.go +++ b/pkg/analysis/kernel/kernel_breakdown.go @@ -6,8 +6,8 @@ import ( "math" "sort" - "hta/pkg/analysis" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" ) // KernelBreakdownOpts configures the GPU kernel breakdown analysis. 
diff --git a/pkg/analysis/kernel/kernel_breakdown_test.go b/pkg/analysis/kernel/kernel_breakdown_test.go index b5c6a83..0b9c6ff 100644 --- a/pkg/analysis/kernel/kernel_breakdown_test.go +++ b/pkg/analysis/kernel/kernel_breakdown_test.go @@ -4,7 +4,7 @@ import ( "math" "testing" - "hta/pkg/analysis" + "trace-blame/pkg/analysis" ) func TestQuantileLinear(t *testing.T) { diff --git a/pkg/analysis/kernel/kernel_sequences.go b/pkg/analysis/kernel/kernel_sequences.go index ad7dd9a..98c8371 100644 --- a/pkg/analysis/kernel/kernel_sequences.go +++ b/pkg/analysis/kernel/kernel_sequences.go @@ -10,7 +10,7 @@ import ( "sort" "strings" - "hta/pkg/store" + "trace-blame/pkg/store" ) // KernelSeqOpts controls the frequent CUDA kernel sequences analysis. diff --git a/pkg/analysis/kernel/kernel_sequences_test.go b/pkg/analysis/kernel/kernel_sequences_test.go index 89ca1ad..4b3884f 100644 --- a/pkg/analysis/kernel/kernel_sequences_test.go +++ b/pkg/analysis/kernel/kernel_sequences_test.go @@ -4,7 +4,7 @@ import ( "os" "testing" - "hta/pkg/store" + "trace-blame/pkg/store" ) func TestFindRootOperators(t *testing.T) { diff --git a/pkg/analysis/kernel/launch_stats.go b/pkg/analysis/kernel/launch_stats.go index 8a30a7b..7925afd 100644 --- a/pkg/analysis/kernel/launch_stats.go +++ b/pkg/analysis/kernel/launch_stats.go @@ -4,7 +4,7 @@ import ( "database/sql" "fmt" - "hta/pkg/store" + "trace-blame/pkg/store" ) // LaunchStatsOpts controls the CUDA kernel launch statistics analysis. 
diff --git a/pkg/analysis/kernel/testmain_test.go b/pkg/analysis/kernel/testmain_test.go index 4a758c2..04c3049 100644 --- a/pkg/analysis/kernel/testmain_test.go +++ b/pkg/analysis/kernel/testmain_test.go @@ -8,8 +8,8 @@ import ( "runtime" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) // sharedVTDBPath and sharedNSDBPath hold paths to pre-built SQLite DBs diff --git a/pkg/analysis/profiler_steps.go b/pkg/analysis/profiler_steps.go index f17186b..d33176b 100644 --- a/pkg/analysis/profiler_steps.go +++ b/pkg/analysis/profiler_steps.go @@ -7,7 +7,7 @@ import ( "sort" "strconv" - "hta/pkg/store" + "trace-blame/pkg/store" ) var ProfilerStepRe = regexp.MustCompile(`ProfilerStep\s*#\s*(\d+)`) diff --git a/pkg/analysis/profiler_steps_test.go b/pkg/analysis/profiler_steps_test.go index dc12b21..246ee85 100644 --- a/pkg/analysis/profiler_steps_test.go +++ b/pkg/analysis/profiler_steps_test.go @@ -4,8 +4,8 @@ import ( "path/filepath" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func TestProfilerStepsRegex(t *testing.T) { diff --git a/pkg/analysis/resource/cupti_counters.go b/pkg/analysis/resource/cupti_counters.go index 0a0caa1..fb367b5 100644 --- a/pkg/analysis/resource/cupti_counters.go +++ b/pkg/analysis/resource/cupti_counters.go @@ -6,7 +6,7 @@ import ( "log" "sort" - "hta/pkg/store" + "trace-blame/pkg/store" ) // CUPTICounterOpts controls the CUPTI counter data analysis. 
diff --git a/pkg/analysis/resource/cupti_counters_test.go b/pkg/analysis/resource/cupti_counters_test.go index be3ee76..0740c3f 100644 --- a/pkg/analysis/resource/cupti_counters_test.go +++ b/pkg/analysis/resource/cupti_counters_test.go @@ -3,7 +3,7 @@ package resource import ( "testing" - "hta/pkg/store" + "trace-blame/pkg/store" ) func TestCUPTICounterDataIntegration(t *testing.T) { diff --git a/pkg/analysis/resource/helpers_test.go b/pkg/analysis/resource/helpers_test.go index d29d1a1..09f3a30 100644 --- a/pkg/analysis/resource/helpers_test.go +++ b/pkg/analysis/resource/helpers_test.go @@ -7,7 +7,7 @@ import ( "runtime" "testing" - "hta/pkg/store" + "trace-blame/pkg/store" ) func testDataDir(t *testing.T) string { diff --git a/pkg/analysis/resource/memory_bw.go b/pkg/analysis/resource/memory_bw.go index 9069e4c..0abcadd 100644 --- a/pkg/analysis/resource/memory_bw.go +++ b/pkg/analysis/resource/memory_bw.go @@ -6,9 +6,9 @@ import ( "math" "sort" - "hta/pkg/analysis" - "hta/pkg/store" - "hta/pkg/symbol" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" ) // MemoryBWPoint is a single point in the memory bandwidth time series. 
diff --git a/pkg/analysis/resource/memory_bw_test.go b/pkg/analysis/resource/memory_bw_test.go index 5585195..ad9ea0d 100644 --- a/pkg/analysis/resource/memory_bw_test.go +++ b/pkg/analysis/resource/memory_bw_test.go @@ -3,7 +3,7 @@ package resource import ( "testing" - "hta/pkg/analysis" + "trace-blame/pkg/analysis" ) func TestMemoryBWSummary(t *testing.T) { diff --git a/pkg/analysis/resource/queue_length.go b/pkg/analysis/resource/queue_length.go index 941ed0f..6a8f67e 100644 --- a/pkg/analysis/resource/queue_length.go +++ b/pkg/analysis/resource/queue_length.go @@ -6,9 +6,9 @@ import ( "math" "sort" - "hta/pkg/analysis" - "hta/pkg/store" - "hta/pkg/symbol" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" ) // QueueLengthPoint is a single point in the queue-length time series. diff --git a/pkg/analysis/resource/testmain_test.go b/pkg/analysis/resource/testmain_test.go index 4d70ce9..3cfb961 100644 --- a/pkg/analysis/resource/testmain_test.go +++ b/pkg/analysis/resource/testmain_test.go @@ -8,8 +8,8 @@ import ( "runtime" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) // sharedVTDBPath and sharedCUPTIDBPath hold paths to pre-built SQLite DBs diff --git a/pkg/analysis/resource/trace_with_counters.go b/pkg/analysis/resource/trace_with_counters.go index cd94fb7..0e22fe3 100644 --- a/pkg/analysis/resource/trace_with_counters.go +++ b/pkg/analysis/resource/trace_with_counters.go @@ -11,7 +11,7 @@ import ( "sort" "strings" - "hta/pkg/store" + "trace-blame/pkg/store" ) // CounterType is a bitmask selecting which counter time series to embed. 
diff --git a/pkg/analysis/straggler/straggler.go b/pkg/analysis/straggler/straggler.go index 5232983..b2586ad 100644 --- a/pkg/analysis/straggler/straggler.go +++ b/pkg/analysis/straggler/straggler.go @@ -8,8 +8,8 @@ import ( "strconv" "strings" - "hta/pkg/analysis" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" ) // StragglerOpts configures the potential stragglers analysis. diff --git a/pkg/analysis/straggler/straggler_test.go b/pkg/analysis/straggler/straggler_test.go index fcf8b94..660d5a4 100644 --- a/pkg/analysis/straggler/straggler_test.go +++ b/pkg/analysis/straggler/straggler_test.go @@ -7,9 +7,9 @@ import ( "sort" "testing" - "hta/pkg/analysis" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func testDataDir(t *testing.T) string { diff --git a/pkg/analysis/temporal/idle_time.go b/pkg/analysis/temporal/idle_time.go index 3515cd6..ae75bbb 100644 --- a/pkg/analysis/temporal/idle_time.go +++ b/pkg/analysis/temporal/idle_time.go @@ -6,8 +6,8 @@ import ( "math" "slices" - "hta/pkg/analysis" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" ) // IdleTimeOpts configures idle-time breakdown analysis. 
diff --git a/pkg/analysis/temporal/idle_time_test.go b/pkg/analysis/temporal/idle_time_test.go index 187fdc7..da163fd 100644 --- a/pkg/analysis/temporal/idle_time_test.go +++ b/pkg/analysis/temporal/idle_time_test.go @@ -5,8 +5,8 @@ import ( "path/filepath" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func TestIdleTimeBreakdownIntegration(t *testing.T) { diff --git a/pkg/analysis/temporal/overlap.go b/pkg/analysis/temporal/overlap.go index 15fa294..a367b01 100644 --- a/pkg/analysis/temporal/overlap.go +++ b/pkg/analysis/temporal/overlap.go @@ -5,8 +5,8 @@ import ( "fmt" "sort" - "hta/pkg/analysis" - "hta/pkg/store" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" ) // OverlapResult holds the comm-comp overlap percentage for a single rank. diff --git a/pkg/analysis/temporal/overlap_test.go b/pkg/analysis/temporal/overlap_test.go index 4fcdec2..2a22e8e 100644 --- a/pkg/analysis/temporal/overlap_test.go +++ b/pkg/analysis/temporal/overlap_test.go @@ -5,8 +5,8 @@ import ( "path/filepath" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func TestCommCompOverlapIntegration(t *testing.T) { diff --git a/pkg/analysis/temporal/temporal.go b/pkg/analysis/temporal/temporal.go index 09b64bb..2b22787 100644 --- a/pkg/analysis/temporal/temporal.go +++ b/pkg/analysis/temporal/temporal.go @@ -5,9 +5,9 @@ import ( "fmt" "sort" - "hta/pkg/analysis" - "hta/pkg/store" - "hta/pkg/symbol" + "trace-blame/pkg/analysis" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" ) // TemporalResult holds the temporal breakdown for a single rank. 
diff --git a/pkg/analysis/temporal/temporal_test.go b/pkg/analysis/temporal/temporal_test.go index 2536a80..fffa919 100644 --- a/pkg/analysis/temporal/temporal_test.go +++ b/pkg/analysis/temporal/temporal_test.go @@ -7,8 +7,8 @@ import ( "runtime" "testing" - "hta/pkg/pipeline" - "hta/pkg/store" + "trace-blame/pkg/pipeline" + "trace-blame/pkg/store" ) func testDataDir(t *testing.T) string { diff --git a/pkg/pipeline/preprocess.go b/pkg/pipeline/preprocess.go index 7fb1040..38712cd 100644 --- a/pkg/pipeline/preprocess.go +++ b/pkg/pipeline/preprocess.go @@ -7,9 +7,9 @@ import ( "math" "regexp" - "hta/pkg/store" - "hta/pkg/symbol" - "hta/pkg/trace" + "trace-blame/pkg/store" + "trace-blame/pkg/symbol" + "trace-blame/pkg/trace" ) var profilerStepRe = regexp.MustCompile(`^ProfilerStep#\d+`) diff --git a/pkg/store/reader.go b/pkg/store/reader.go index 92e4e31..6038af0 100644 --- a/pkg/store/reader.go +++ b/pkg/store/reader.go @@ -5,7 +5,7 @@ import ( "fmt" "strings" - "hta/pkg/symbol" + "trace-blame/pkg/symbol" ) // LoadSymbolTable reads all symbols from the DB into a SymbolTable.