diff --git a/.claude/skills/hta-cli/SKILL.md b/.claude/skills/hta-cli/SKILL.md
deleted file mode 100644
index cfed73e..0000000
--- a/.claude/skills/hta-cli/SKILL.md
+++ /dev/null
@@ -1,81 +0,0 @@
----
-name: pytorch-profile
-description: >-
-hollistic trace analysis (hta) gives insight about distributed training with pytorch.
-It should be used when the user asks to "analyse pytorch trace",
-or mentions any subcommand like temporal-breakdown, comm-comp-overlap,
-gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc.
----
-
-# Pytorch Profile Data
-
-The hta CLI (`python -m hta`) exposes every major trace analysis as a standalone subcommand. It is designed for CI pipelines, shell scripts, and quick interactive analysis without notebooks.
-
-## Two-Step Workflow
-
-All CLI usage follows a **pre-process then analyze** pattern:
-
-```bash
-# Step 1: Parse raw PyTorch Profiler traces into parquet
-python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed
-
-# Step 2: Run any analysis subcommand on the preprocessed directory
-python -m hta temporal-breakdown -i ./preprocessed
-python -m hta idle-time-breakdown -i ./preprocessed --ranks 0,1
-```
-
-**Step 1 (`pre-process`)** reads raw JSON traces from `--trace-dir`, writes one `.parquet` file per rank plus a `metadata.json` into `-o`. This only needs to run once per trace set.
-
-**Step 2 (any analysis subcommand)** reads from the pre-processed directory via `-i` / `--input`. Most subcommands print markdown tables to stdout.
-
-## Subcommand Quick Reference
-
-| Subcommand | Description | Key Args (besides `-i`) | Category |
-|---|---|---|---|
-| `pre-process` | Parse raw traces to parquet | `--trace-dir`, `-o` (both required) | Preprocessing |
-| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview |
-| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview |
-| `profiler-steps` | List profiler step indices | — | Overview |
-| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview |
-| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels |
-| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required) | GPU Kernels |
-| `gpu-user-annotation-breakdown` | GPU/CPU time by user annotations | `--cpu`, `--duration-ratio`, `--num-kernels`, `--allowlist-patterns` | GPU Kernels |
-| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name`, `--output-dir` (both required), `--top-k`, `--rank` | GPU Kernels |
-| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels |
-| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff` | GPU Kernels |
-| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters |
-| `queue-length-summary` | Queue length summary stats per rank | `--ranks` | Counters |
-| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters |
-| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters |
-| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters |
-| `memory-bw-time-series` | Full memory bandwidth time series per rank | `--ranks` | Counters |
-| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats` | Idle Time |
-| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI |
-| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required) | Critical Path |
-
-## Common Patterns
-
-**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed.
-
-**Output format:** Most subcommands print markdown tables to stdout. Pipe to a file or use in scripts:
-```bash
-python -m hta temporal-breakdown -i ./preprocessed > results.md
-```
-
-**Getting help:** Run `python -m hta --help` for all subcommands, or `python -m hta <subcommand> --help` for a specific one.
-
-**Running via uv:** In this project, prefix with `uv run`:
-```bash
-uv run python -m hta pre-process --trace-dir ./traces -o ./preprocessed
-uv run python -m hta temporal-breakdown -i ./preprocessed
-```
-
-## Key Source Files
-
-- `hta/__main__.py` — CLI implementation (argument parsing and subcommand handlers)
-- `hta/trace_analysis.py` — `TraceAnalysis` class that backs every subcommand
-- `docs/cli-guide.md` — Human-facing CLI documentation
-
-## Additional Resources
-
-For full argument tables, types, defaults, and detailed output descriptions for every subcommand, see `references/subcommands.md`.
diff --git a/.claude/skills/hta-cli/references/subcommands.md b/.claude/skills/hta-cli/references/subcommands.md
deleted file mode 100644
index a8b9861..0000000
--- a/.claude/skills/hta-cli/references/subcommands.md
+++ /dev/null
@@ -1,528 +0,0 @@
-# HTA CLI Subcommand Reference
-
-Full argument tables, examples, and output descriptions for all 20 HTA CLI subcommands.
-
-Source of truth: `hta/__main__.py` (argument definitions), `docs/cli-guide.md` (human-facing docs).
-
----
-
-## Preprocessing
-
-### `pre-process`
-
-Parse raw PyTorch Profiler traces and save as parquet for fast repeated analysis.
-
-```bash
-python -m hta pre-process --trace-dir <dir> -o <output> [--include-last-profiler-step]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `--trace-dir` | str | yes | — | Path to directory containing raw trace JSON files |
-| `-o` / `--output` | str | yes | — | Output directory for parquet files and metadata.json |
-| `--include-last-profiler-step` | flag | no | false | Include the last profiler step (excluded by default) |
-
-**Example:**
-```bash
-python -m hta pre-process --trace-dir ./raw_traces -o ./preprocessed
-```
-
-**Output:** One `<rank>.parquet` file per rank and a `metadata.json` in the output directory. Prints confirmation message.
-
----
-
-## Overview Analysis
-
-### `temporal-breakdown`
-
-Show how time is spent (compute, communication, idle, etc.) for each rank.
-
-See: `docs/source/features/temporal_breakdown.rst`
-
-```bash
-python -m hta temporal-breakdown -i <preprocessed-dir>
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-
-**Example:**
-```bash
-python -m hta temporal-breakdown -i ./preprocessed
-```
-
-**Output:** Markdown table with one row per rank showing time percentages for each category (idle, compute, communication, etc.).
-
----
-
-### `comm-comp-overlap`
-
-Show the overlap between communication and computation for each rank.
-
-See: `docs/source/features/comm_comp_overlap.rst`
-
-```bash
-python -m hta comm-comp-overlap -i <preprocessed-dir>
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-
-**Example:**
-```bash
-python -m hta comm-comp-overlap -i ./preprocessed
-```
-
-**Output:** Markdown table with overlap percentages per rank.
-
----
-
-### `profiler-steps`
-
-List the profiler step indices found in the trace.
-
-```bash
-python -m hta profiler-steps -i <preprocessed-dir>
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-
-**Example:**
-```bash
-python -m hta profiler-steps -i ./preprocessed
-# 2,3,4,5,6
-```
-
-**Output:** Comma-separated list of profiler step integers printed to stdout.
-
----
-
-### `potential-stragglers`
-
-Identify ranks that are potential stragglers (slower than peers).
-
-```bash
-python -m hta potential-stragglers -i <preprocessed-dir> [--num-candidates N] [--profiler-steps STEPS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--num-candidates` | int | no | None | Maximum number of straggler candidates to return |
-| `--profiler-steps` | str | no | None | Comma-separated profiler step indices to analyze |
-
-**Example:**
-```bash
-python -m hta potential-stragglers -i ./preprocessed --num-candidates 2
-# 3,7
-```
-
-**Output:** Comma-separated list of rank IDs that are potential stragglers.
-
----
-
-## GPU Kernel Analysis
-
-### `gpu-kernel-breakdown`
-
-Break down GPU time by kernel type (computation, communication, memory) and list top kernels.
-
-See: `docs/source/features/kernel_breakdown.rst`
-
-```bash
-python -m hta gpu-kernel-breakdown -i <preprocessed-dir> [--duration-ratio R] [--num-kernels N] [--no-memory-kernels]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--duration-ratio` | float | no | None | Minimum fraction of total duration for a kernel to be included |
-| `--num-kernels` | int | no | None | Maximum number of top kernels to show |
-| `--no-memory-kernels` | flag | no | false | Exclude memory-related kernels from the breakdown |
-
-**Example:**
-```bash
-python -m hta gpu-kernel-breakdown -i ./preprocessed --num-kernels 10
-```
-
-**Output:** Two markdown tables:
-1. **Kernel Type Breakdown** — time per kernel category (compute, communication, memory)
-2. **Top Kernels** — individual kernel durations and counts
-
----
-
-### `gpu-kernels-with-annotations`
-
-List GPU kernels annotated with their user-defined annotation context (e.g., forward/backward/optimizer).
-
-See: `docs/source/features/kernel_breakdown.rst` (related)
-
-```bash
-python -m hta gpu-kernels-with-annotations -i <preprocessed-dir> --rank R [--no-expand-names] [--no-shorten-names]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--rank` | int | yes | — | Rank to analyze |
-| `--no-expand-names` | flag | no | false | Do not expand kernel names |
-| `--no-shorten-names` | flag | no | false | Do not shorten kernel names |
-
-**Example:**
-```bash
-python -m hta gpu-kernels-with-annotations -i ./preprocessed --rank 0
-```
-
-**Output:** Markdown table with one row per GPU kernel, including its user annotation context.
-
----
-
-### `gpu-user-annotation-breakdown`
-
-Break down GPU (or CPU) time by user-defined annotations.
-
-See: `docs/source/features/kernel_breakdown.rst` (related)
-
-```bash
-python -m hta gpu-user-annotation-breakdown -i <preprocessed-dir> [--cpu] [--duration-ratio R] [--num-kernels N] [--allowlist-patterns PAT ...]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--cpu` | flag | no | false | Use CPU time instead of GPU time |
-| `--duration-ratio` | float | no | None | Minimum fraction of total duration for inclusion |
-| `--num-kernels` | int | no | None | Maximum number of entries to show |
-| `--allowlist-patterns` | str (multiple) | no | None | Annotation patterns to keep distinct (space-separated) |
-
-**Example:**
-```bash
-python -m hta gpu-user-annotation-breakdown -i ./preprocessed --duration-ratio 0.05
-```
-
-**Output:** Markdown table with time breakdown by user annotation.
-
----
-
-### `frequent-cuda-kernel-sequences`
-
-Find frequently occurring sequences of CUDA kernels launched by a given operator.
-
-See: `docs/source/features/frequent_cuda_kernels.rst`
-
-```bash
-python -m hta frequent-cuda-kernel-sequences -i <preprocessed-dir> --operator-name NAME --output-dir DIR [--min-pattern-len N] [--rank R] [--top-k K]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--operator-name` | str | yes | — | Name of the CPU operator to analyze |
-| `--output-dir` | str | yes | — | Directory for output files |
-| `--min-pattern-len` | int | no | None | Minimum length of kernel sequence patterns |
-| `--rank` | int | no | None | Specific rank to analyze |
-| `--top-k` | int | no | None | Number of top frequent patterns to return |
-
-**Example:**
-```bash
-python -m hta frequent-cuda-kernel-sequences -i ./preprocessed \
-    --operator-name aten::linear --output-dir ./freq_out --top-k 5
-```
-
-**Output:** Markdown table of frequent kernel sequence patterns with their counts.
-
----
-
-### `aten-op-kernels-and-delay`
-
-Map ATen operators to their launched GPU kernels, showing launch delay.
-
-See: `docs/source/features/cuda_kernel_launch_stats.rst` (related)
-
-```bash
-python -m hta aten-op-kernels-and-delay -i <preprocessed-dir> [--ranks RANKS] [--sort-by COLS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-| `--sort-by` | str | no | None | Comma-separated column names to sort by |
-
-**Example:**
-```bash
-python -m hta aten-op-kernels-and-delay -i ./preprocessed --ranks 0 --sort-by "duration"
-```
-
-**Output:** Per-rank markdown tables mapping ATen ops to GPU kernels with delay statistics.
-
----
-
-### `cuda-kernel-launch-stats`
-
-Compute statistics about CUDA kernel launches (durations, launch delays, short kernels).
-
-See: `docs/source/features/cuda_kernel_launch_stats.rst`
-
-```bash
-python -m hta cuda-kernel-launch-stats -i <preprocessed-dir> [--ranks RANKS] [--runtime-cutoff N] [--launch-delay-cutoff N] [--no-memory-events]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-| `--runtime-cutoff` | int | no | None | Runtime threshold (microseconds) for flagging short kernels |
-| `--launch-delay-cutoff` | int | no | None | Launch delay threshold (microseconds) for flagging slow launches |
-| `--no-memory-events` | flag | no | false | Exclude memory events from the analysis |
-
-**Example:**
-```bash
-python -m hta cuda-kernel-launch-stats -i ./preprocessed --runtime-cutoff 10
-```
-
-**Output:** Per-rank markdown tables with kernel launch statistics.
-
----
-
-## Augmented Counters (Queue Length & Memory Bandwidth)
-
-### `generate-trace-with-counters`
-
-Generate an augmented trace file with queue length and/or memory bandwidth counter time series embedded.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta generate-trace-with-counters -i <preprocessed-dir> [--ranks RANKS] [--time-series TYPE] [--output-suffix SUFFIX]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-| `--time-series` | str | no | None | Which counters: `queue_length`, `memcpy_bandwidth`, or `both` |
-| `--output-suffix` | str | no | None | Suffix appended to the output trace filename |
-
-**Example:**
-```bash
-python -m hta generate-trace-with-counters -i ./preprocessed --time-series both
-```
-
-**Output:** Augmented trace JSON file(s) in the original trace directory, viewable in `chrome://tracing` or Perfetto. Prints confirmation message.
-
----
-
-### `queue-length-summary`
-
-Show summary statistics of the CUDA stream queue length (min, max, mean, etc.) per rank.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta queue-length-summary -i <preprocessed-dir> [--ranks RANKS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-
-**Example:**
-```bash
-python -m hta queue-length-summary -i ./preprocessed
-```
-
-**Output:** Markdown table with queue length statistics per rank.
-
----
-
-### `queue-length-time-series`
-
-Get the full queue length time series (timestamp, queue_length) per rank.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta queue-length-time-series -i <preprocessed-dir> [--ranks RANKS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-
-**Example:**
-```bash
-python -m hta queue-length-time-series -i ./preprocessed --ranks 0
-```
-
-**Output:** Per-rank markdown tables of (timestamp, queue_length) data points.
-
----
-
-### `blocked-on-full-queue`
-
-Compute time the CPU spent blocked because the GPU launch queue was full.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta blocked-on-full-queue -i <preprocessed-dir> [--ranks RANKS] [--max-queue-length N]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-| `--max-queue-length` | int | no | None | Queue length considered "full" (default: NVIDIA limit of 1024) |
-
-**Example:**
-```bash
-python -m hta blocked-on-full-queue -i ./preprocessed --max-queue-length 1024
-```
-
-**Output:** Markdown table with blocking duration per rank.
-
----
-
-### `memory-bw-summary`
-
-Show memory bandwidth summary statistics per rank.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta memory-bw-summary -i <preprocessed-dir> [--ranks RANKS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-
-**Example:**
-```bash
-python -m hta memory-bw-summary -i ./preprocessed
-```
-
-**Output:** Markdown table with memory bandwidth statistics per rank.
-
----
-
-### `memory-bw-time-series`
-
-Get the full memory bandwidth time series per rank.
-
-See: `docs/source/features/augmented_counters.rst`
-
-```bash
-python -m hta memory-bw-time-series -i <preprocessed-dir> [--ranks RANKS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-
-**Example:**
-```bash
-python -m hta memory-bw-time-series -i ./preprocessed --ranks 0,1
-```
-
-**Output:** Per-rank markdown tables of memory bandwidth data points over time.
-
----
-
-## Idle Time
-
-### `idle-time-breakdown`
-
-Break down GPU idle time by category (host wait, kernel wait, other) per rank and stream.
-
-See: `docs/source/features/idle_time_breakdown.rst`
-
-```bash
-python -m hta idle-time-breakdown -i <preprocessed-dir> [--ranks RANKS] [--streams STREAMS] [--show-idle-interval-stats] [--consecutive-kernel-delay N]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-| `--streams` | str | no | None | Comma-separated CUDA stream IDs |
-| `--show-idle-interval-stats` | flag | no | false | Also output statistics about individual idle intervals |
-| `--consecutive-kernel-delay` | int | no | None | Threshold (microseconds) for classifying gaps between consecutive kernels |
-
-**Example:**
-```bash
-python -m hta idle-time-breakdown -i ./preprocessed --show-idle-interval-stats
-```
-
-**Output:** Markdown table "Idle Time Breakdown" with idle time categories per rank/stream. If `--show-idle-interval-stats` is set, a second table "Idle Interval Statistics" is also printed.
-
----
-
-## CUPTI Counters
-
-### `cupti-counter-data`
-
-Extract CUPTI hardware performance counter data joined with operator information.
-
-See: `docs/source/features/cupti_counter_analysis.rst`
-
-```bash
-python -m hta cupti-counter-data -i <preprocessed-dir> [--ranks RANKS]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--ranks` | str | no | None | Comma-separated ranks |
-
-**Example:**
-```bash
-python -m hta cupti-counter-data -i ./preprocessed --ranks 0
-```
-
-**Output:** Indexed markdown tables of CUPTI counter data with associated operator information.
-
----
-
-## Critical Path
-
-### `critical-path`
-
-Run critical path analysis on a specific annotation instance and overlay the result onto a trace file.
-
-See: `docs/source/features/lightweight_critical_path_analysis.rst`
-
-```bash
-python -m hta critical-path -i <preprocessed-dir> --rank R --annotation ANN --instance-id ID --output-dir DIR [--data-load-events EVT ...] [--show-all-edges]
-```
-
-| Argument | Type | Required | Default | Description |
-|---|---|---|---|---|
-| `-i` / `--input` | str | yes | — | Path to pre-processed trace directory |
-| `--rank` | int | yes | — | Rank to analyze |
-| `--annotation` | str | yes | — | User annotation name (e.g., `ProfilerStep`) |
-| `--instance-id` | str | yes | — | Single int (e.g., `3`) or `start,end` range (e.g., `3,5`) |
-| `--output-dir` | str | yes | — | Directory for the overlaid trace output |
-| `--data-load-events` | str (multiple) | no | None | Names of data loading events (space-separated) |
-| `--show-all-edges` | flag | no | false | Show all edges in the overlaid trace, not just the critical path |
-
-**Example:**
-```bash
-python -m hta critical-path -i ./preprocessed \
-    --rank 0 --annotation ProfilerStep --instance-id 3 \
-    --output-dir ./cp_output
-```
-
-**Output:** Three sections printed to stdout:
-1. **Critical Path Summary** — high-level statistics (total time, breakdown percentages)
-2. **Critical Path Breakdown** — per-category time on the critical path
-3. **Overlaid trace path** — file path to the generated trace JSON with the critical path overlaid, viewable in `chrome://tracing` or Perfetto
diff --git a/.claude/skills/trace-blame b/.claude/skills/trace-blame
new file mode 120000
index 0000000..c54fe13
--- /dev/null
+++ b/.claude/skills/trace-blame
@@ -0,0 +1 @@
+../../cmd/trace-blame/skill
\ No newline at end of file
diff --git a/README.md b/README.md
index e69de29..5f62e6c 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,41 @@
+# trace-blame
+
+A Go CLI for analyzing PyTorch Profiler traces.
+
+Reimplements [HolisticTraceAnalysis](https://github.com/facebookresearch/HolisticTraceAnalysis) with the following features:
+
+1. An `install-skill` subcommand that supports agent usage.
+2. A single Go binary.
+3. A SQLite database that stores intermediate state.
+4. Markdown output for CLI usage.
+
+## Quick Start
+
+```bash
+go build -o trace-blame ./cmd/trace-blame/
+
+# Install the accompanying Claude Code skill
+trace-blame install-skill
+
+# Parse raw traces into a SQLite database
+trace-blame pre-process --trace-dir ./traces --output trace.db
+
+# Run analyses
+trace-blame temporal-breakdown --db trace.db
+trace-blame gpu-kernel-breakdown --db trace.db
+trace-blame idle-time-breakdown --db trace.db --ranks 0,1
+```
+
+## Subcommands
+
+| Category | Subcommands |
+|---|---|
+| Preprocessing | `pre-process` |
+| Overview | `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers` |
+| GPU Kernels | `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `cuda-kernel-launch-stats`, `aten-op-kernels-and-delay`, `frequent-cuda-kernel-sequences` |
+| Counters | `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series`, `generate-trace-with-counters` |
+| Idle Time | `idle-time-breakdown` |
+| Critical Path | `critical-path` |
+| CUPTI | `cupti-counter-data` |
+
+Run `trace-blame` with no arguments for usage, or `trace-blame <subcommand> -h` for flag details.
diff --git a/cmd/trace-blame/install_skill.go b/cmd/trace-blame/install_skill.go
new file mode 100644
index 0000000..5ee6dc5
--- /dev/null
+++ b/cmd/trace-blame/install_skill.go
@@ -0,0 +1,51 @@
+package main
+
+import (
+	"embed"
+	"fmt"
+	"io/fs"
+	"log"
+	"os"
+	"path/filepath"
+)
+
+//go:embed all:skill
+var skillFS embed.FS
+
+func cmdInstallSkill() {
+	home, err := os.UserHomeDir()
+	if err != nil {
+		log.Fatalf("get home dir: %v", err)
+	}
+
+	destDir := filepath.Join(home, ".claude", "skills", "trace-blame")
+
+	if err := os.MkdirAll(destDir, 0o755); err != nil {
+		log.Fatalf("create dir: %v", err)
+	}
+
+	err = fs.WalkDir(skillFS, "skill", func(path string, d fs.DirEntry, err error) error {
+		if err != nil {
+			return err
+		}
+
+		// "skill/SKILL.md" -> "SKILL.md"
+		rel, _ := filepath.Rel("skill", path)
+		dest := filepath.Join(destDir, rel)
+
+		if d.IsDir() {
+			return os.MkdirAll(dest, 0o755)
+		}
+
+		data, err := skillFS.ReadFile(path)
+		if err != nil {
+			return err
+		}
+		return os.WriteFile(dest, data, 0o644)
+	})
+	if err != nil {
+		log.Fatalf("install skill: %v", err)
+	}
+
+	fmt.Printf("Installed skill to %s\n", destDir)
+}
diff --git a/cmd/tracepyre/main.go b/cmd/trace-blame/main.go
similarity index 98%
rename from cmd/tracepyre/main.go
rename to cmd/trace-blame/main.go
index 27d75e0..bca2dcc 100644
--- a/cmd/tracepyre/main.go
+++ b/cmd/trace-blame/main.go
@@ -9,14 +9,14 @@ import (
 	"strconv"
 	"strings"
 
-	"hta/pkg/analysis"
-	"hta/pkg/analysis/criticalpath"
-	"hta/pkg/analysis/kernel"
-	"hta/pkg/analysis/resource"
-	"hta/pkg/analysis/straggler"
-	"hta/pkg/analysis/temporal"
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/analysis/criticalpath"
+	"trace-blame/pkg/analysis/kernel"
+	"trace-blame/pkg/analysis/resource"
+	"trace-blame/pkg/analysis/straggler"
+	"trace-blame/pkg/analysis/temporal"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func main() {
@@ -66,6 +66,8 @@ func main() {
 		cmdCriticalPath(os.Args[2:])
 	case "cupti-counter-data":
 		cmdCUPTICounterData(os.Args[2:])
+	case "install-skill":
+		cmdInstallSkill()
 	default:
 		fmt.Fprintf(os.Stderr, "unknown subcommand: %s\n", os.Args[1])
 		usage()
@@ -74,7 +76,7 @@ func main() {
 }
 
 func usage() {
-	fmt.Fprintln(os.Stderr, "Usage: hta <subcommand> [flags]")
+	fmt.Fprintln(os.Stderr, "Usage: trace-blame <subcommand> [flags]")
 	fmt.Fprintln(os.Stderr, "  pre-process              Parse traces → SQLite DB")
 	fmt.Fprintln(os.Stderr, "  temporal-breakdown       GPU temporal breakdown from DB")
 	fmt.Fprintln(os.Stderr, "  gpu-kernel-breakdown     GPU kernel breakdown from DB")
@@ -94,6 +96,7 @@ func usage() {
 	fmt.Fprintln(os.Stderr, "  frequent-cuda-kernel-sequences  Find frequent GPU kernel launch patterns")
 	fmt.Fprintln(os.Stderr, "  critical-path                   Critical path analysis for a single rank")
 	fmt.Fprintln(os.Stderr, "  cupti-counter-data       CUPTI profiler counter data with operator stacks")
+	fmt.Fprintln(os.Stderr, "  install-skill            Install Claude Code skill to ~/.claude/skills/")
 }
 
 func cmdPreProcess(args []string) {
diff --git a/cmd/trace-blame/skill/SKILL.md b/cmd/trace-blame/skill/SKILL.md
new file mode 100644
index 0000000..0b0bf8b
--- /dev/null
+++ b/cmd/trace-blame/skill/SKILL.md
@@ -0,0 +1,89 @@
+---
+name: pytorch-profile
+description: >-
+  Holistic Trace Analysis (HTA) CLI in Go gives insight into distributed training with PyTorch.
+  It should be used when the user asks to "analyse pytorch trace",
+  or mentions any subcommand like temporal-breakdown, comm-comp-overlap,
+  gpu-kernel-breakdown, idle-time-breakdown, critical-path, queue-length, etc.
+---
+
+# PyTorch Profile Data
+
+The `trace-blame` CLI (built from `cmd/trace-blame/main.go`) exposes every major trace analysis as a standalone subcommand. It is a Go reimplementation designed for CI pipelines, shell scripts, and quick interactive analysis.
+
+## Two-Step Workflow
+
+All CLI usage follows a **pre-process then analyze** pattern:
+
+```bash
+# Step 1: Parse raw PyTorch Profiler traces into a SQLite database
+trace-blame pre-process --trace-dir ./raw_traces --output trace.db
+
+# Step 2: Run any analysis subcommand on the database
+trace-blame temporal-breakdown --db trace.db
+trace-blame idle-time-breakdown --db trace.db --ranks 0,1
+```
+
+**Step 1 (`pre-process`)** reads raw JSON/GZ trace files from `--trace-dir` and writes a single SQLite database to `--output` (default: `trace.db`). This only needs to run once per trace set.
+
+**Step 2 (any analysis subcommand)** reads from the SQLite database via `--db` (default: `trace.db`). Most subcommands print markdown tables to stdout.
+
+## Subcommand Quick Reference
+
+| Subcommand | Description | Key Args (besides `--db`) | Category |
+|---|---|---|---|
+| `pre-process` | Parse raw traces to SQLite DB | `--trace-dir` (required), `--output` | Preprocessing |
+| `temporal-breakdown` | Time breakdown (compute, comm, idle) per rank | — | Overview |
+| `comm-comp-overlap` | Communication/computation overlap per rank | — | Overview |
+| `profiler-steps` | List profiler step indices | — | Overview |
+| `potential-stragglers` | Identify slow ranks | `--num-candidates`, `--profiler-steps` | Overview |
+| `gpu-kernel-breakdown` | GPU time by kernel type + top kernels | `--num-kernels`, `--duration-ratio`, `--no-memory-kernels` | GPU Kernels |
+| `gpu-kernels-with-annotations` | GPU kernels with user annotation context | `--rank` (required), `--no-expand-names`, `--no-shorten-names` | GPU Kernels |
+| `frequent-cuda-kernel-sequences` | Frequent CUDA kernel patterns per operator | `--operator-name` (required), `--output-dir`, `--top-k`, `--rank`, `--min-pattern-len` | GPU Kernels |
+| `aten-op-kernels-and-delay` | ATen op to GPU kernel mapping with launch delay | `--ranks`, `--sort-by` | GPU Kernels |
+| `cuda-kernel-launch-stats` | CUDA kernel launch duration and delay stats | `--ranks`, `--runtime-cutoff`, `--launch-delay-cutoff`, `--no-memory-events` | GPU Kernels |
+| `generate-trace-with-counters` | Augmented trace with queue length / memory BW counters | `--ranks`, `--time-series`, `--output-suffix` | Counters |
+| `queue-length-summary` | Queue length summary stats per rank/stream | `--ranks` | Counters |
+| `queue-length-time-series` | Full queue length time series per rank | `--ranks` | Counters |
+| `blocked-on-full-queue` | Time CPU blocked on full GPU queue | `--ranks`, `--max-queue-length` | Counters |
+| `memory-bw-summary` | Memory bandwidth summary stats per rank | `--ranks` | Counters |
+| `memory-bw-time-series` | Full memory bandwidth time series per rank | `--ranks` | Counters |
+| `idle-time-breakdown` | GPU idle time by category per rank/stream | `--ranks`, `--streams`, `--show-idle-interval-stats`, `--consecutive-kernel-delay` | Idle Time |
+| `cupti-counter-data` | CUPTI hardware counter data with operators | `--ranks` | CUPTI |
+| `critical-path` | Critical path analysis with trace overlay | `--rank`, `--annotation`, `--instance-id`, `--output-dir` (all required), `--data-load-events`, `--show-all-edges` | Critical Path |
+
+## Common Patterns
+
+**Filtering by rank:** Most analysis subcommands accept `--ranks` as a comma-separated list (e.g., `--ranks 0,1,3`). If omitted, all ranks are analyzed. Some subcommands use `--rank` (singular) for a single required rank.
+
+**Output format:** Most subcommands print markdown tables to stdout. Pipe to a file or use in scripts:
+```bash
+trace-blame temporal-breakdown --db trace.db > results.md
+```
+
+**Getting help:** Run `trace-blame` with no arguments for the subcommand list, or `trace-blame <subcommand> -h` for a specific subcommand's flags.
+
+**Building:** The binary is built from Go source:
+```bash
+go build -o trace-blame ./cmd/trace-blame/
+```
+
+## Key Source Files
+
+- `cmd/trace-blame/main.go` — CLI implementation (argument parsing and subcommand handlers)
+- `pkg/pipeline/` — Pre-processing pipeline (trace parsing → SQLite)
+- `pkg/store/` — SQLite database layer
+- `pkg/analysis/` — Analysis implementations (temporal, kernel, resource, straggler, criticalpath)
+
+## Additional Resources
+
+For full argument tables, types, defaults, and detailed output descriptions, see the reference files organized by category:
+
+- `references/subcommands.md` — Index linking to all category files
+- `references/preprocessing.md` — `pre-process`
+- `references/overview.md` — `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers`
+- `references/gpu-kernels.md` — `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `frequent-cuda-kernel-sequences`, `aten-op-kernels-and-delay`, `cuda-kernel-launch-stats`
+- `references/counters.md` — `generate-trace-with-counters`, `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series`
+- `references/idle-time.md` — `idle-time-breakdown`
+- `references/cupti-counters.md` — `cupti-counter-data`
+- `references/critical-path.md` — `critical-path`
diff --git a/cmd/trace-blame/skill/references/counters.md b/cmd/trace-blame/skill/references/counters.md
new file mode 100644
index 0000000..3e2fc9e
--- /dev/null
+++ b/cmd/trace-blame/skill/references/counters.md
@@ -0,0 +1,134 @@
+# Augmented Counters (Queue Length & Memory Bandwidth)
+
+### `generate-trace-with-counters`
+
+Generate an augmented trace file with queue length and/or memory bandwidth counter time series embedded.
+
+```bash
+trace-blame generate-trace-with-counters [--db <path>] [--ranks RANKS] [--time-series TYPE] [--output-suffix SUFFIX]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+| `--time-series` | string | no | `both` | Which counters: `queue_length`, `memcpy_bandwidth`, or `both` |
+| `--output-suffix` | string | no | `_with_counters` | Suffix for output file names |
+
+**Example:**
+```bash
+trace-blame generate-trace-with-counters --db trace.db --time-series both
+```
+
+**Output:** Prints output file paths. Generated trace files are viewable in `chrome://tracing` or Perfetto.
+
+---
+
+### `queue-length-summary`
+
+Show summary statistics of the CUDA stream queue length per rank and stream.
+
+```bash
+trace-blame queue-length-summary [--db <path>] [--ranks RANKS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+
+**Example:**
+```bash
+trace-blame queue-length-summary --db trace.db
+```
+
+**Output:** Markdown table: `| rank | stream | count | min | max | std | 25% | 50% | 75% |`
+
+---
+
+### `queue-length-time-series`
+
+Get the full queue length time series per rank.
+
+```bash
+trace-blame queue-length-time-series [--db <path>] [--ranks RANKS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+
+**Example:**
+```bash
+trace-blame queue-length-time-series --db trace.db --ranks 0
+```
+
+**Output:** Per-rank markdown tables: `| ts | stream | queue_length |`
+
+---
+
+### `blocked-on-full-queue`
+
+Compute time the CPU spent blocked because the GPU launch queue was full.
+
+```bash
+trace-blame blocked-on-full-queue [--db <path>] [--ranks RANKS] [--max-queue-length N]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+| `--max-queue-length` | int | no | 1024 | Max CUDA launch queue length per stream |
+
+**Example:**
+```bash
+trace-blame blocked-on-full-queue --db trace.db --max-queue-length 1024
+```
+
+**Output:** Markdown table: `| rank | stream | duration_at_max_queue_length | relative_duration |`. Prints a message if no streams reached maximum queue length.
+
+---
+
+### `memory-bw-summary`
+
+Show memory bandwidth summary statistics per rank.
+
+```bash
+trace-blame memory-bw-summary [--db <path>] [--ranks RANKS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+
+**Example:**
+```bash
+trace-blame memory-bw-summary --db trace.db
+```
+
+**Output:** Markdown table: `| rank | name | count | mean | std | min | 25% | 50% | 75% | max |`
+
+---
+
+### `memory-bw-time-series`
+
+Get the full memory bandwidth time series per rank.
+
+```bash
+trace-blame memory-bw-time-series [--db <path>] [--ranks RANKS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+
+**Example:**
+```bash
+trace-blame memory-bw-time-series --db trace.db --ranks 0,1
+```
+
+**Output:** Per-rank markdown tables: `| ts | pid | name | memory_bw_gbps |`
diff --git a/cmd/trace-blame/skill/references/critical-path.md b/cmd/trace-blame/skill/references/critical-path.md
new file mode 100644
index 0000000..f7acac6
--- /dev/null
+++ b/cmd/trace-blame/skill/references/critical-path.md
@@ -0,0 +1,26 @@
+# Critical Path
+
+### `critical-path`
+
+Run critical path analysis on a specific annotation instance and write the result as an overlay trace file into `--output-dir`.
+
+```bash
+trace-blame critical-path --rank R --annotation ANN --instance-id ID --output-dir DIR [--db <path>] [--data-load-events EVT] [--show-all-edges]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--rank` | int | yes | — | Rank to analyze |
+| `--annotation` | string | yes | — | Annotation name to match (e.g., `ProfilerStep`) |
+| `--instance-id` | string | yes | — | Single int (e.g., `3`) or `start,end` range (e.g., `3,5`) |
+| `--output-dir` | string | yes | — | Directory for the overlay trace output |
+| `--data-load-events` | string | no | — | Comma-separated regex patterns for data loading ops |
+| `--show-all-edges` | flag | no | false | Show all edges in overlay (not just critical path) |
+
+**Example:**
+```bash
+trace-blame critical-path --db trace.db --rank 0 --annotation ProfilerStep --instance-id 3 --output-dir ./cp_output
+```
+
+**Output:** Prints a critical path summary (nodes, edges, path length), a table breaking the path down by bound type, and the path of the overlay trace file.
diff --git a/cmd/trace-blame/skill/references/cupti-counters.md b/cmd/trace-blame/skill/references/cupti-counters.md
new file mode 100644
index 0000000..a753553
--- /dev/null
+++ b/cmd/trace-blame/skill/references/cupti-counters.md
@@ -0,0 +1,21 @@
+# CUPTI Counters
+
+### `cupti-counter-data`
+
+Extract CUPTI hardware performance counter data joined with operator information.
+
+```bash
+trace-blame cupti-counter-data [--db <path>] [--ranks RANKS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+
+**Example:**
+```bash
+trace-blame cupti-counter-data --db trace.db --ranks 0
+```
+
+**Output:** Per-rank markdown tables with kernel name, operator stack, and dynamic counter columns.
diff --git a/cmd/trace-blame/skill/references/gpu-kernels.md b/cmd/trace-blame/skill/references/gpu-kernels.md
new file mode 100644
index 0000000..b113e9a
--- /dev/null
+++ b/cmd/trace-blame/skill/references/gpu-kernels.md
@@ -0,0 +1,123 @@
+# GPU Kernel Analysis
+
+### `gpu-kernel-breakdown`
+
+Break down GPU time by kernel type (computation, communication, memory) and list top kernels.
+
+```bash
+trace-blame gpu-kernel-breakdown [--db <path>] [--duration-ratio R] [--num-kernels N] [--no-memory-kernels]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--duration-ratio` | float | no | 0.8 | Cumulative duration ratio cutoff |
+| `--num-kernels` | int | no | 10 | Max kernels per type per rank |
+| `--no-memory-kernels` | flag | no | false | Exclude MEMORY kernel type |
+
+**Example:**
+```bash
+trace-blame gpu-kernel-breakdown --db trace.db --num-kernels 10
+```
+
+**Output:** Two markdown tables:
+1. **Kernel Type Breakdown** — `| kernel_type | sum(us) | percentage |`
+2. **Top Kernels** — `| name | sum(us) | max(us) | min(us) | mean(us) | stddev | kernel_type | rank |`
+
+---
+
+### `gpu-kernels-with-annotations`
+
+List GPU kernels together with their enclosing user-defined annotation (e.g., forward/backward/optimizer).
+
+```bash
+trace-blame gpu-kernels-with-annotations --rank R [--db <path>] [--no-expand-names] [--no-shorten-names]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--rank` | int | yes | — | Rank to analyze |
+| `--no-expand-names` | flag | no | false | Skip expanding symbol IDs to names |
+| `--no-shorten-names` | flag | no | false | Skip shortening kernel names |
+
+**Example:**
+```bash
+trace-blame gpu-kernels-with-annotations --db trace.db --rank 0
+```
+
+**Output:** Markdown table: `| started_at | ended_at | kernel | annotation |`
+
+---
+
+### `frequent-cuda-kernel-sequences`
+
+Find frequently occurring sequences of CUDA kernels launched by a given operator.
+
+```bash
+trace-blame frequent-cuda-kernel-sequences --operator-name NAME [--db <path>] [--output-dir DIR] [--min-pattern-len N] [--rank R] [--top-k K]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--operator-name` | string | yes | — | CPU operator name substring to match |
+| `--output-dir` | string | no | — | Directory for overlay trace output |
+| `--min-pattern-len` | int | no | 3 | Minimum pattern length (operator + kernels) |
+| `--rank` | int | no | 0 | Rank to analyze |
+| `--top-k` | int | no | 5 | Number of top patterns to return |
+
+**Example:**
+```bash
+trace-blame frequent-cuda-kernel-sequences --db trace.db --operator-name aten::linear --top-k 5
+```
+
+**Output:** Markdown table: `| pattern | count | GPU kernel duration (us) | CPU op duration (us) |`
+
+---
+
+### `aten-op-kernels-and-delay`
+
+Map ATen operators to their launched GPU kernels, showing launch delay.
+
+```bash
+trace-blame aten-op-kernels-and-delay [--db <path>] [--ranks RANKS] [--sort-by COLS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+| `--sort-by` | string | no | `occurrence_count` | Comma-separated column names to sort by |
+
+**Example:**
+```bash
+trace-blame aten-op-kernels-and-delay --db trace.db --ranks 0
+```
+
+**Output:** Per-rank markdown tables: `| aten_op_name | kernel_sequence | occurrence_count | avg_aten_op_launch_delay | avg_runtime_delay |`
+
+---
+
+### `cuda-kernel-launch-stats`
+
+Compute statistics about CUDA kernel launches (durations, launch delays).
+
+```bash
+trace-blame cuda-kernel-launch-stats [--db <path>] [--ranks RANKS] [--runtime-cutoff N] [--launch-delay-cutoff N] [--no-memory-events]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+| `--runtime-cutoff` | int | no | 50 | Runtime duration cutoff in µs |
+| `--launch-delay-cutoff` | int | no | 100 | Launch delay cutoff in µs |
+| `--no-memory-events` | flag | no | false | Exclude cudaMemcpyAsync/cudaMemsetAsync |
+
+**Example:**
+```bash
+trace-blame cuda-kernel-launch-stats --db trace.db --runtime-cutoff 10
+```
+
+**Output:** Per-rank markdown tables: `| correlation | cpu_duration | gpu_duration | launch_delay |`
diff --git a/cmd/trace-blame/skill/references/idle-time.md b/cmd/trace-blame/skill/references/idle-time.md
new file mode 100644
index 0000000..cee981c
--- /dev/null
+++ b/cmd/trace-blame/skill/references/idle-time.md
@@ -0,0 +1,24 @@
+# Idle Time
+
+### `idle-time-breakdown`
+
+Break down GPU idle time by category per rank and stream.
+
+```bash
+trace-blame idle-time-breakdown [--db <path>] [--ranks RANKS] [--streams STREAMS] [--show-idle-interval-stats] [--consecutive-kernel-delay N]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--ranks` | string | no | all | Comma-separated ranks |
+| `--streams` | string | no | all | Comma-separated CUDA stream IDs |
+| `--show-idle-interval-stats` | flag | no | false | Also output statistics about individual idle intervals |
+| `--consecutive-kernel-delay` | int64 | no | 30 | Threshold (µs) for classifying gaps between consecutive kernels |
+
+**Example:**
+```bash
+trace-blame idle-time-breakdown --db trace.db --show-idle-interval-stats
+```
+
+**Output:** Markdown table: `| rank | stream | idle_category | idle_time(us) | idle_time_ratio |`. If `--show-idle-interval-stats` is set, a second table with interval statistics is also printed.
diff --git a/cmd/trace-blame/skill/references/overview.md b/cmd/trace-blame/skill/references/overview.md
new file mode 100644
index 0000000..ed78983
--- /dev/null
+++ b/cmd/trace-blame/skill/references/overview.md
@@ -0,0 +1,93 @@
+# Overview Analysis
+
+### `temporal-breakdown`
+
+Show how time is spent (compute, communication, idle) for each rank.
+
+```bash
+trace-blame temporal-breakdown [--db <path>]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+
+**Example:**
+```bash
+trace-blame temporal-breakdown --db trace.db
+```
+
+**Output:** Markdown table with one row per rank:
+```
+| rank | idle_time(us) | compute_time(us) | non_compute_time(us) | kernel_time(us) | idle_time_pctg | compute_time_pctg | non_compute_time_pctg |
+```
+
+---
+
+### `comm-comp-overlap`
+
+Show the overlap between communication and computation for each rank.
+
+```bash
+trace-blame comm-comp-overlap [--db <path>]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+
+**Example:**
+```bash
+trace-blame comm-comp-overlap --db trace.db
+```
+
+**Output:** Markdown table with overlap percentages per rank:
+```
+| rank | overlap_pctg |
+```
+
+---
+
+### `profiler-steps`
+
+List the profiler step indices found in the trace.
+
+```bash
+trace-blame profiler-steps [--db <path>]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+
+**Example:**
+```bash
+trace-blame profiler-steps --db trace.db
+# 15,16,17,18,19
+```
+
+**Output:** Comma-separated list of profiler step integers printed to stdout.
+
+---
+
+### `potential-stragglers`
+
+Identify ranks that are potential stragglers (slower than peers).
+
+```bash
+trace-blame potential-stragglers [--db <path>] [--num-candidates N] [--profiler-steps STEPS]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--db` | string | no | `trace.db` | SQLite database path |
+| `--num-candidates` | int | no | 2 | Top K straggler candidates to return |
+| `--profiler-steps` | string | no | all | Comma-separated profiler step indices to analyze |
+
+**Example:**
+```bash
+trace-blame potential-stragglers --db trace.db --num-candidates 2
+# 3,7
+```
+
+**Output:** Comma-separated list of rank IDs that are potential stragglers. Prints a message if no stragglers are detected.
diff --git a/cmd/trace-blame/skill/references/preprocessing.md b/cmd/trace-blame/skill/references/preprocessing.md
new file mode 100644
index 0000000..74fe1fc
--- /dev/null
+++ b/cmd/trace-blame/skill/references/preprocessing.md
@@ -0,0 +1,21 @@
+# Preprocessing
+
+### `pre-process`
+
+Parse raw PyTorch Profiler traces (JSON/GZ) and store them in a SQLite database for fast repeated analysis.
+
+```bash
+trace-blame pre-process --trace-dir <dir> [--output <path>]
+```
+
+| Argument | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `--trace-dir` | string | yes | — | Directory containing trace JSON/GZ files |
+| `--output` | string | no | `trace.db` | Output SQLite database path |
+
+**Example:**
+```bash
+trace-blame pre-process --trace-dir ./raw_traces --output trace.db
+```
+
+**Output:** Logs per-rank event counts and writes a single SQLite database file. Only needs to run once per trace set.
diff --git a/cmd/trace-blame/skill/references/subcommands.md b/cmd/trace-blame/skill/references/subcommands.md
new file mode 100644
index 0000000..b796200
--- /dev/null
+++ b/cmd/trace-blame/skill/references/subcommands.md
@@ -0,0 +1,15 @@
+# trace-blame CLI Subcommand Reference
+
+Full argument tables, examples, and output descriptions for all 19 `trace-blame` subcommands, organized by analysis category.
+
+Source of truth: `cmd/trace-blame/main.go` (argument definitions and subcommand handlers).
+
+| File | Subcommands | Description |
+|---|---|---|
+| [preprocessing.md](preprocessing.md) | `pre-process` | Parse raw traces to SQLite DB |
+| [overview.md](overview.md) | `temporal-breakdown`, `comm-comp-overlap`, `profiler-steps`, `potential-stragglers` | High-level training overview |
+| [gpu-kernels.md](gpu-kernels.md) | `gpu-kernel-breakdown`, `gpu-kernels-with-annotations`, `frequent-cuda-kernel-sequences`, `aten-op-kernels-and-delay`, `cuda-kernel-launch-stats` | GPU kernel analysis |
+| [counters.md](counters.md) | `generate-trace-with-counters`, `queue-length-summary`, `queue-length-time-series`, `blocked-on-full-queue`, `memory-bw-summary`, `memory-bw-time-series` | Queue length & memory bandwidth |
+| [idle-time.md](idle-time.md) | `idle-time-breakdown` | GPU idle time classification |
+| [cupti-counters.md](cupti-counters.md) | `cupti-counter-data` | CUPTI hardware counter data |
+| [critical-path.md](critical-path.md) | `critical-path` | Critical path analysis |
diff --git a/go.mod b/go.mod
index b85aed1..62ff1ce 100644
--- a/go.mod
+++ b/go.mod
@@ -1,4 +1,4 @@
-module hta
+module trace-blame
 
 go 1.24.11
 
diff --git a/pkg/analysis/criticalpath/critical_path.go b/pkg/analysis/criticalpath/critical_path.go
index 79b61ea..0561985 100644
--- a/pkg/analysis/criticalpath/critical_path.go
+++ b/pkg/analysis/criticalpath/critical_path.go
@@ -12,11 +12,11 @@ import (
 	"sort"
 	"strings"
 
-	"hta/pkg/analysis"
-	"hta/pkg/analysis/kernel"
-	"hta/pkg/analysis/resource"
-	"hta/pkg/store"
-	"hta/pkg/symbol"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/analysis/kernel"
+	"trace-blame/pkg/analysis/resource"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
 )
 
 // ---------------------------------------------------------------------------
diff --git a/pkg/analysis/criticalpath/critical_path_test.go b/pkg/analysis/criticalpath/critical_path_test.go
index fdd7d29..4403136 100644
--- a/pkg/analysis/criticalpath/critical_path_test.go
+++ b/pkg/analysis/criticalpath/critical_path_test.go
@@ -5,9 +5,9 @@ import (
 	"path/filepath"
 	"testing"
 
-	"hta/pkg/analysis"
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func TestCriticalPathAlexnet(t *testing.T) {
diff --git a/pkg/analysis/kernel/annotation.go b/pkg/analysis/kernel/annotation.go
index e3fbba2..16ea1a3 100644
--- a/pkg/analysis/kernel/annotation.go
+++ b/pkg/analysis/kernel/annotation.go
@@ -4,8 +4,8 @@ import (
 	"database/sql"
 	"strconv"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
 )
 
 // AnnotationOpts configures the GPUKernelsWithAnnotations analysis.
diff --git a/pkg/analysis/kernel/aten_delay.go b/pkg/analysis/kernel/aten_delay.go
index 94b4514..d211999 100644
--- a/pkg/analysis/kernel/aten_delay.go
+++ b/pkg/analysis/kernel/aten_delay.go
@@ -7,8 +7,8 @@ import (
 	"sort"
 	"strings"
 
-	"hta/pkg/store"
-	"hta/pkg/symbol"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
 )
 
 // AtenDelayOpts controls the ATen op kernels and delay analysis.
diff --git a/pkg/analysis/kernel/helpers_test.go b/pkg/analysis/kernel/helpers_test.go
index 3ef3105..0699970 100644
--- a/pkg/analysis/kernel/helpers_test.go
+++ b/pkg/analysis/kernel/helpers_test.go
@@ -7,7 +7,7 @@ import (
 	"runtime"
 	"testing"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 func testDataDir(t *testing.T) string {
diff --git a/pkg/analysis/kernel/kernel_breakdown.go b/pkg/analysis/kernel/kernel_breakdown.go
index 25db548..93ece48 100644
--- a/pkg/analysis/kernel/kernel_breakdown.go
+++ b/pkg/analysis/kernel/kernel_breakdown.go
@@ -6,8 +6,8 @@ import (
 	"math"
 	"sort"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
 )
 
 // KernelBreakdownOpts configures the GPU kernel breakdown analysis.
diff --git a/pkg/analysis/kernel/kernel_breakdown_test.go b/pkg/analysis/kernel/kernel_breakdown_test.go
index b5c6a83..0b9c6ff 100644
--- a/pkg/analysis/kernel/kernel_breakdown_test.go
+++ b/pkg/analysis/kernel/kernel_breakdown_test.go
@@ -4,7 +4,7 @@ import (
 	"math"
 	"testing"
 
-	"hta/pkg/analysis"
+	"trace-blame/pkg/analysis"
 )
 
 func TestQuantileLinear(t *testing.T) {
diff --git a/pkg/analysis/kernel/kernel_sequences.go b/pkg/analysis/kernel/kernel_sequences.go
index ad7dd9a..98c8371 100644
--- a/pkg/analysis/kernel/kernel_sequences.go
+++ b/pkg/analysis/kernel/kernel_sequences.go
@@ -10,7 +10,7 @@ import (
 	"sort"
 	"strings"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 // KernelSeqOpts controls the frequent CUDA kernel sequences analysis.
diff --git a/pkg/analysis/kernel/kernel_sequences_test.go b/pkg/analysis/kernel/kernel_sequences_test.go
index 89ca1ad..4b3884f 100644
--- a/pkg/analysis/kernel/kernel_sequences_test.go
+++ b/pkg/analysis/kernel/kernel_sequences_test.go
@@ -4,7 +4,7 @@ import (
 	"os"
 	"testing"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 func TestFindRootOperators(t *testing.T) {
diff --git a/pkg/analysis/kernel/launch_stats.go b/pkg/analysis/kernel/launch_stats.go
index 8a30a7b..7925afd 100644
--- a/pkg/analysis/kernel/launch_stats.go
+++ b/pkg/analysis/kernel/launch_stats.go
@@ -4,7 +4,7 @@ import (
 	"database/sql"
 	"fmt"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 // LaunchStatsOpts controls the CUDA kernel launch statistics analysis.
diff --git a/pkg/analysis/kernel/testmain_test.go b/pkg/analysis/kernel/testmain_test.go
index 4a758c2..04c3049 100644
--- a/pkg/analysis/kernel/testmain_test.go
+++ b/pkg/analysis/kernel/testmain_test.go
@@ -8,8 +8,8 @@ import (
 	"runtime"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 // sharedVTDBPath and sharedNSDBPath hold paths to pre-built SQLite DBs
diff --git a/pkg/analysis/profiler_steps.go b/pkg/analysis/profiler_steps.go
index f17186b..d33176b 100644
--- a/pkg/analysis/profiler_steps.go
+++ b/pkg/analysis/profiler_steps.go
@@ -7,7 +7,7 @@ import (
 	"sort"
 	"strconv"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 var ProfilerStepRe = regexp.MustCompile(`ProfilerStep\s*#\s*(\d+)`)
diff --git a/pkg/analysis/profiler_steps_test.go b/pkg/analysis/profiler_steps_test.go
index dc12b21..246ee85 100644
--- a/pkg/analysis/profiler_steps_test.go
+++ b/pkg/analysis/profiler_steps_test.go
@@ -4,8 +4,8 @@ import (
 	"path/filepath"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func TestProfilerStepsRegex(t *testing.T) {
diff --git a/pkg/analysis/resource/cupti_counters.go b/pkg/analysis/resource/cupti_counters.go
index 0a0caa1..fb367b5 100644
--- a/pkg/analysis/resource/cupti_counters.go
+++ b/pkg/analysis/resource/cupti_counters.go
@@ -6,7 +6,7 @@ import (
 	"log"
 	"sort"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 // CUPTICounterOpts controls the CUPTI counter data analysis.
diff --git a/pkg/analysis/resource/cupti_counters_test.go b/pkg/analysis/resource/cupti_counters_test.go
index be3ee76..0740c3f 100644
--- a/pkg/analysis/resource/cupti_counters_test.go
+++ b/pkg/analysis/resource/cupti_counters_test.go
@@ -3,7 +3,7 @@ package resource
 import (
 	"testing"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 func TestCUPTICounterDataIntegration(t *testing.T) {
diff --git a/pkg/analysis/resource/helpers_test.go b/pkg/analysis/resource/helpers_test.go
index d29d1a1..09f3a30 100644
--- a/pkg/analysis/resource/helpers_test.go
+++ b/pkg/analysis/resource/helpers_test.go
@@ -7,7 +7,7 @@ import (
 	"runtime"
 	"testing"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 func testDataDir(t *testing.T) string {
diff --git a/pkg/analysis/resource/memory_bw.go b/pkg/analysis/resource/memory_bw.go
index 9069e4c..0abcadd 100644
--- a/pkg/analysis/resource/memory_bw.go
+++ b/pkg/analysis/resource/memory_bw.go
@@ -6,9 +6,9 @@ import (
 	"math"
 	"sort"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
-	"hta/pkg/symbol"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
 )
 
 // MemoryBWPoint is a single point in the memory bandwidth time series.
diff --git a/pkg/analysis/resource/memory_bw_test.go b/pkg/analysis/resource/memory_bw_test.go
index 5585195..ad9ea0d 100644
--- a/pkg/analysis/resource/memory_bw_test.go
+++ b/pkg/analysis/resource/memory_bw_test.go
@@ -3,7 +3,7 @@ package resource
 import (
 	"testing"
 
-	"hta/pkg/analysis"
+	"trace-blame/pkg/analysis"
 )
 
 func TestMemoryBWSummary(t *testing.T) {
diff --git a/pkg/analysis/resource/queue_length.go b/pkg/analysis/resource/queue_length.go
index 941ed0f..6a8f67e 100644
--- a/pkg/analysis/resource/queue_length.go
+++ b/pkg/analysis/resource/queue_length.go
@@ -6,9 +6,9 @@ import (
 	"math"
 	"sort"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
-	"hta/pkg/symbol"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
 )
 
 // QueueLengthPoint is a single point in the queue-length time series.
diff --git a/pkg/analysis/resource/testmain_test.go b/pkg/analysis/resource/testmain_test.go
index 4d70ce9..3cfb961 100644
--- a/pkg/analysis/resource/testmain_test.go
+++ b/pkg/analysis/resource/testmain_test.go
@@ -8,8 +8,8 @@ import (
 	"runtime"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 // sharedVTDBPath and sharedCUPTIDBPath hold paths to pre-built SQLite DBs
diff --git a/pkg/analysis/resource/trace_with_counters.go b/pkg/analysis/resource/trace_with_counters.go
index cd94fb7..0e22fe3 100644
--- a/pkg/analysis/resource/trace_with_counters.go
+++ b/pkg/analysis/resource/trace_with_counters.go
@@ -11,7 +11,7 @@ import (
 	"sort"
 	"strings"
 
-	"hta/pkg/store"
+	"trace-blame/pkg/store"
 )
 
 // CounterType is a bitmask selecting which counter time series to embed.
diff --git a/pkg/analysis/straggler/straggler.go b/pkg/analysis/straggler/straggler.go
index 5232983..b2586ad 100644
--- a/pkg/analysis/straggler/straggler.go
+++ b/pkg/analysis/straggler/straggler.go
@@ -8,8 +8,8 @@ import (
 	"strconv"
 	"strings"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
 )
 
 // StragglerOpts configures the potential stragglers analysis.
diff --git a/pkg/analysis/straggler/straggler_test.go b/pkg/analysis/straggler/straggler_test.go
index fcf8b94..660d5a4 100644
--- a/pkg/analysis/straggler/straggler_test.go
+++ b/pkg/analysis/straggler/straggler_test.go
@@ -7,9 +7,9 @@ import (
 	"sort"
 	"testing"
 
-	"hta/pkg/analysis"
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func testDataDir(t *testing.T) string {
diff --git a/pkg/analysis/temporal/idle_time.go b/pkg/analysis/temporal/idle_time.go
index 3515cd6..ae75bbb 100644
--- a/pkg/analysis/temporal/idle_time.go
+++ b/pkg/analysis/temporal/idle_time.go
@@ -6,8 +6,8 @@ import (
 	"math"
 	"slices"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
 )
 
 // IdleTimeOpts configures idle-time breakdown analysis.
diff --git a/pkg/analysis/temporal/idle_time_test.go b/pkg/analysis/temporal/idle_time_test.go
index 187fdc7..da163fd 100644
--- a/pkg/analysis/temporal/idle_time_test.go
+++ b/pkg/analysis/temporal/idle_time_test.go
@@ -5,8 +5,8 @@ import (
 	"path/filepath"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func TestIdleTimeBreakdownIntegration(t *testing.T) {
diff --git a/pkg/analysis/temporal/overlap.go b/pkg/analysis/temporal/overlap.go
index 15fa294..a367b01 100644
--- a/pkg/analysis/temporal/overlap.go
+++ b/pkg/analysis/temporal/overlap.go
@@ -5,8 +5,8 @@ import (
 	"fmt"
 	"sort"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
 )
 
 // OverlapResult holds the comm-comp overlap percentage for a single rank.
diff --git a/pkg/analysis/temporal/overlap_test.go b/pkg/analysis/temporal/overlap_test.go
index 4fcdec2..2a22e8e 100644
--- a/pkg/analysis/temporal/overlap_test.go
+++ b/pkg/analysis/temporal/overlap_test.go
@@ -5,8 +5,8 @@ import (
 	"path/filepath"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func TestCommCompOverlapIntegration(t *testing.T) {
diff --git a/pkg/analysis/temporal/temporal.go b/pkg/analysis/temporal/temporal.go
index 09b64bb..2b22787 100644
--- a/pkg/analysis/temporal/temporal.go
+++ b/pkg/analysis/temporal/temporal.go
@@ -5,9 +5,9 @@ import (
 	"fmt"
 	"sort"
 
-	"hta/pkg/analysis"
-	"hta/pkg/store"
-	"hta/pkg/symbol"
+	"trace-blame/pkg/analysis"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
 )
 
 // TemporalResult holds the temporal breakdown for a single rank.
diff --git a/pkg/analysis/temporal/temporal_test.go b/pkg/analysis/temporal/temporal_test.go
index 2536a80..fffa919 100644
--- a/pkg/analysis/temporal/temporal_test.go
+++ b/pkg/analysis/temporal/temporal_test.go
@@ -7,8 +7,8 @@ import (
 	"runtime"
 	"testing"
 
-	"hta/pkg/pipeline"
-	"hta/pkg/store"
+	"trace-blame/pkg/pipeline"
+	"trace-blame/pkg/store"
 )
 
 func testDataDir(t *testing.T) string {
diff --git a/pkg/pipeline/preprocess.go b/pkg/pipeline/preprocess.go
index 7fb1040..38712cd 100644
--- a/pkg/pipeline/preprocess.go
+++ b/pkg/pipeline/preprocess.go
@@ -7,9 +7,9 @@ import (
 	"math"
 	"regexp"
 
-	"hta/pkg/store"
-	"hta/pkg/symbol"
-	"hta/pkg/trace"
+	"trace-blame/pkg/store"
+	"trace-blame/pkg/symbol"
+	"trace-blame/pkg/trace"
 )
 
 var profilerStepRe = regexp.MustCompile(`^ProfilerStep#\d+`)
diff --git a/pkg/store/reader.go b/pkg/store/reader.go
index 92e4e31..6038af0 100644
--- a/pkg/store/reader.go
+++ b/pkg/store/reader.go
@@ -5,7 +5,7 @@ import (
 	"fmt"
 	"strings"
 
-	"hta/pkg/symbol"
+	"trace-blame/pkg/symbol"
 )
 
 // LoadSymbolTable reads all symbols from the DB into a SymbolTable.