
Fix/issue 633 offpolicy wait metrics (issue #633) #636

Open
LeeLeno wants to merge 4 commits into main from fix/issue-633-offpolicy-wait-metrics


@LeeLeno LeeLeno commented Jun 23, 2026

Collaborator

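Background

The learner wait_time for off-policy training (shown as Wait in the terminal and logged as timing/learner_wait_ms) mixes "waiting for data" with barriers at sync points, polling, and logger refreshes, so it does not reflect how long the learner is actually blocked waiting for the collector to produce data. This PR redefines and implements the metric as specified in #633.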

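Computation changes at a glance (by panel)

The computation logic of the Learner panel metrics changes. The Collector panel metric computations are untouched; this PR only adds documentation for them.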

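Learner panel

| Metric | Change type | Computation logic |
|---|---|---|
| Wait → Collector Wait | Renamed + purified | Single-GPU: excludes the warmup log_buffer_fill refresh and the in-loop sync handshake put. Multi-GPU: measured before the initial dist.barrier(). APPO/HORA: measured before the iter-1 logger.start() / available() |
| Replay Batch Wait | New | Time spent polling for replay pack / H2D batch-ready (double-buffer / multi-GPU); ≈ 0 on a prefetch hit |
| Rank Barrier | New (split out of Wait + Train) | Sum of the multi-GPU dist.barrier() durations (initial + final) |
| Sync Coordination | New (split out of Wait) | Time spent in the trainer_done sync handshake (inside the warmup loop + the final release) |
| Train | Redefined for multi-GPU (single-GPU unchanged) | Now pure SGD compute; no longer includes param sync or the final barrier |
| Param Sync | Attribution + label change (value unchanged) | Algorithm unchanged; no longer double-counted in Train; label Param Sync (in Train) → Param Sync |
| H2D Copy / Weight Sync / Iter Wall | Unchanged | Computation unchanged (Iter Wall is still the raw iteration_time, not the sum of the components) |
| Derived perf/learner_pipeline_ms | Redefined | = H2D + Train + Param Sync + Weight Sync (previously excluded Param Sync) |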
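The split described in the Learner panel table can be sketched as a per-iteration accumulator that times each component into its own bucket. This is a minimal illustration, not the PR's code: the `LearnerTimings` class, its `measure` helper, and the short bucket names are hypothetical; only the component list and the derived sum H2D + Train + Param Sync + Weight Sync come from the table.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator

# Bucket names are illustrative; they mirror the components listed in the table.
_BUCKETS = (
    "collector_wait", "replay_batch_wait", "rank_barrier", "sync_coordination",
    "h2d", "train", "param_sync", "weight_sync",
)


@dataclass
class LearnerTimings:
    """Hypothetical per-iteration accumulator, one bucket per component (ms)."""

    ms: Dict[str, float] = field(
        default_factory=lambda: {name: 0.0 for name in _BUCKETS}
    )

    @contextmanager
    def measure(self, name: str) -> Iterator[None]:
        """Add the wall time of the enclosed block to exactly one bucket."""
        if name not in self.ms:
            raise KeyError(f"unknown timing bucket: {name}")
        start = time.perf_counter()
        try:
            yield
        finally:
            self.ms[name] += (time.perf_counter() - start) * 1000.0

    def pipeline_ms(self) -> float:
        """Derived metric: H2D + Train + Param Sync + Weight Sync."""
        return (
            self.ms["h2d"] + self.ms["train"]
            + self.ms["param_sync"] + self.ms["weight_sync"]
        )
```

A caller would wrap each phase separately, for example `with timings.measure("collector_wait"): ...` followed by `with timings.measure("rank_barrier"): ...`, so time spent in one phase can never leak into another bucket.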

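Collector panel

| Metric | Change type |
|---|---|
| weight_sync_ms / action_select_ms / env_step_ms / replay_ms / sync_coordination_ms (SAC/TD3) | Computation unchanged; documentation added only |
| env_step_total_ms / mlp_infer_ms (APPO) | Computation unchanged; documentation added only (still a single-step EMA) |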
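For readers unfamiliar with the "single-step EMA" wording on the APPO fields, the following sketch shows what an exponential moving average over per-step samples looks like. The function name and the smoothing factor `alpha` are assumptions for illustration; the PR does not state the actual smoothing constant or how the first sample is seeded.

```python
from typing import Optional


def ema_update(previous_ms: Optional[float], sample_ms: float, alpha: float = 0.1) -> float:
    """Blend one new per-step sample into a running average.

    alpha is a hypothetical smoothing factor, not a value taken from the PR.
    """
    if previous_ms is None:
        # First sample: nothing to blend with yet (an assumed seeding rule).
        return sample_ms
    return (1.0 - alpha) * previous_ms + alpha * sample_ms
```

Because each update blends a single step's duration, the reported value tracks recent per-step cost rather than the total time spent in an iteration.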

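Changes (mapped to the acceptance criteria)

① Document every off-policy timing field: docs/.../1-training/3-logging.md gains a new "Off-Policy Timing Fields" section covering 9 learner fields plus the collector fields (5 for SAC/TD3, 2 for APPO), each listed with its terminal field, TensorBoard/W&B key, and meaning.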

② Collector wait does not absorb barrier / pack / H2D / logger refresh time: see the Collector Wait row in the table above. The 4 components are each reported independently and are never merged.

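③ Single-GPU and multi-GPU semantics are consistent, with multi-GPU additionally splitting out rank barrier / param sync: single- and multi-GPU runs emit the same set of 4 components; multi-GPU adds rank_barrier + param_sync; train is pure compute in both; timings on non-rank-0 processes are always 0.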
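One way to read criterion ③ is that every run reports the same fixed set of keys, with unmeasured components left at zero and non-rank-0 processes reporting zeros throughout. The sketch below assumes exactly that interpretation; `reported_wait_components` and its arguments are hypothetical names, not the PR's API.

```python
from typing import Dict, Mapping

# The four wait components named in the PR's first commit message.
WAIT_COMPONENTS = (
    "collector_wait", "replay_batch_wait", "rank_barrier", "sync_coordination",
)


def reported_wait_components(rank: int, measured_ms: Mapping[str, float]) -> Dict[str, float]:
    """Return the same four keys for every run (assumed reporting rule).

    - Non-rank-0 processes report 0.0 for every component.
    - Components a runner never measured (for example rank_barrier on a
      single GPU) default to 0.0 instead of being omitted.
    """
    if rank != 0:
        return {name: 0.0 for name in WAIT_COMPONENTS}
    return {name: float(measured_ms.get(name, 0.0)) for name in WAIT_COMPONENTS}
```

Keeping the key set fixed means dashboards and queries can use one schema for both single-GPU and multi-GPU runs.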

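④ Tests cover three scenarios: collector not producing in time (single-GPU async), batch already ready (multi-GPU spawn, replay_batch_wait == 0), and multi-GPU barrier (rank_barrier reported separately and not included in collector_wait); plus a logger contract test that does not depend on torch.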
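The barrier scenario above can be illustrated with a torch-free test that uses a fake clock instead of real sleeps or a real dist.barrier(). This is a sketch under stated assumptions: `FakeClock` and `run_fake_iteration` are hypothetical stand-ins, not the PR's runners or test fixtures, and the ordering (collector wait measured before the barrier) follows the multi-GPU row of the Learner panel table.

```python
import math


class FakeClock:
    """Deterministic stand-in for time.perf_counter; no real sleeping needed."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds

    def __call__(self) -> float:
        return self.now


def run_fake_iteration(clock: FakeClock, collector_delay_s: float, barrier_delay_s: float) -> dict:
    """Hypothetical learner iteration: wait for data first, then hit the barrier."""
    timings = {"collector_wait": 0.0, "rank_barrier": 0.0}

    start = clock()
    clock.advance(collector_delay_s)  # stands in for blocking on the collector
    timings["collector_wait"] += (clock() - start) * 1000.0

    start = clock()
    clock.advance(barrier_delay_s)  # stands in for dist.barrier()
    timings["rank_barrier"] += (clock() - start) * 1000.0
    return timings


def test_rank_barrier_is_not_counted_as_collector_wait() -> None:
    timings = run_fake_iteration(FakeClock(), collector_delay_s=0.050, barrier_delay_s=0.020)
    assert math.isclose(timings["collector_wait"], 50.0)
    assert math.isclose(timings["rank_barrier"], 20.0)


def test_ready_data_gives_zero_collector_wait() -> None:
    timings = run_fake_iteration(FakeClock(), collector_delay_s=0.0, barrier_delay_s=0.020)
    assert timings["collector_wait"] == 0.0
```

Injecting the clock keeps the assertions exact and fast, which is the usual reason to avoid wall-clock sleeps in timing tests.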

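Breaking changes

  • timing/learner_wait_ms → timing/learner_collector_wait_ms; existing dashboards and queries need updating.
  • Multi-GPU timing/learner_train_ms now measures pure compute (no longer includes param sync / barrier), so it is not directly comparable across historical runs.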
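For dashboards or scripts that read logged metrics as key/value rows, the rename can be handled with a small mapping. This is a hedged sketch: `migrate_metric_keys` is a hypothetical helper, and only the old and new key names come from this PR. It relabels keys only and does not make pre-change values comparable, since the old metric also included barrier, polling, and logger-refresh time.

```python
from typing import Dict, Mapping

# Old key -> new key, as stated in the breaking-changes list.
RENAMED_KEYS = {
    "timing/learner_wait_ms": "timing/learner_collector_wait_ms",
}


def migrate_metric_keys(row: Mapping[str, float]) -> Dict[str, float]:
    """Return a copy of one logged row with renamed keys updated.

    Keys that are not in RENAMED_KEYS pass through unchanged.
    """
    return {RENAMED_KEYS.get(key, key): value for key, value in row.items()}
```

Applying it to `{"timing/learner_wait_ms": 3.0}` yields a row keyed by `timing/learner_collector_wait_ms` with the same value.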

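Impact

  • Timing/logging-only change; loss, optimization, sampling, and V-trace are untouched, so training results are unchanged.
  • All log_step callers (5 runners + the experiment-tracking tests) have been updated accordingly.
  • Purifying collector_wait only affects the warmup phase (steady state was already pure); collector_wait ≥ 0 is guaranteed, and no new synchronization is introduced.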

Verification

  • make test-all passes (ruff / ruff format / mypy / pyright / pytest).

LeeLeno and others added 4 commits June 23, 2026 14:33
Redefine the off-policy learner wait metric (issue #633) into four
independent components so real blocking is no longer hidden by merging:
collector_wait, replay_batch_wait, rank_barrier, sync_coordination.
Multi-GPU measures the initial/final dist.barrier() outside
collector_wait/train, and train_time becomes pure SGD compute. Renames
timing/learner_wait_ms -> timing/learner_collector_wait_ms (breaking).
Updates SAC/TD3 single + double-buffer + multi-GPU runners, APPO and
HORA APPO callers, the OffPolicyLogger contract, tests, and the timing
field docs.

Refs #633. Pending: run make test-all on a torch device before opening a PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Exclude sync-coordination handshakes and warmup buffer-fill refreshes from
collector_wait across the SAC/TD3 single, double-buffer and multi-GPU runners,
and move the APPO/HORA collector_wait measurement ahead of the iteration-1
logger init so it no longer absorbs one-off display setup. Document the
collector-side timing fields (timing/collector_*) in the off-policy timing
reference.

Refs #633.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>