mori360 (Contributor) commented Oct 21, 2025

Remove CP and async TP from the known issues.
CP (context parallel):
TEST_BACKEND=nccl TRAIN_FILE=torchtitan.experiments.torchcomms.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --parallelism.context_parallel_degree 2

[rank0]:[titan] 2025-10-21 11:50:24,918 - root - INFO - step: 1 loss: 8.2808 grad_norm: 1.4344 memory: 0.60GiB(0.63%) tps: 884 tflops: 0.06 mfu: 0.01%
[rank0]:[titan] 2025-10-21 11:50:24,918 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-21 11:50:25,010 - root - INFO - step: 2 loss: 7.9722 grad_norm: 1.5075 memory: 0.65GiB(0.68%) tps: 89,241 tflops: 6.39 mfu: 0.65%
[rank0]:[titan] 2025-10-21 11:50:25,133 - root - INFO - step: 3 loss: 7.2171 grad_norm: 1.9729 memory: 0.67GiB(0.70%) tps: 66,834 tflops: 4.78 mfu: 0.48%
[rank0]:[titan] 2025-10-21 11:50:25,261 - root - INFO - step: 4 loss: 6.3402 grad_norm: 2.3603 memory: 0.67GiB(0.70%) tps: 64,239 tflops: 4.60 mfu: 0.46%
[rank0]:[titan] 2025-10-21 11:50:25,355 - root - INFO - step: 5 loss: 5.3055 grad_norm: 2.6067 memory: 0.67GiB(0.70%) tps: 87,354 tflops: 6.25 mfu: 0.63%
[rank0]:[titan] 2025-10-21 11:50:25,447 - root - INFO - step: 6 loss: 4.7225 grad_norm: 2.6398 memory: 0.67GiB(0.70%) tps: 89,556 tflops: 6.41 mfu: 0.65%
[rank0]:[titan] 2025-10-21 11:50:25,888 - root - INFO - step: 7 loss: 4.3229 grad_norm: 2.1676 memory: 0.67GiB(0.70%) tps: 18,607 tflops: 1.33 mfu: 0.13%
[rank0]:[titan] 2025-10-21 11:50:25,971 - root - INFO - step: 8 loss: 4.0035 grad_norm: 1.6869 memory: 0.67GiB(0.70%) tps: 98,424 tflops: 7.05 mfu: 0.71%
[rank0]:[titan] 2025-10-21 11:50:26,059 - root - INFO - step: 9 loss: 3.9520 grad_norm: 1.3777 memory: 0.67GiB(0.70%) tps: 93,815 tflops: 6.72 mfu: 0.68%
[rank0]:[titan] 2025-10-21 11:50:26,173 - root - INFO - step: 10 loss: 3.6494 grad_norm: 1.3774 memory: 0.67GiB(0.70%) tps: 72,228 tflops: 5.17 mfu: 0.52%
[rank0]:[titan] 2025-10-21 11:50:26,336 - root - INFO - Dumping profiler traces at step 10
[rank0]:[titan] 2025-10-21 11:50:26,372 - root - INFO - Finished dumping profiler traces in 0.04 seconds
[rank0]:[titan] 2025-10-21 11:50:26,373 - root - INFO - Dumping memory snapshot at step 10
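For checking a run like the one above without reading every line, here is a minimal sketch (not part of this PR) that pulls the per-step metrics out of the log text and checks that the loss is decreasing. It assumes the `step: N loss: X grad_norm: Y ... tps: Z` line format shown above; `STEP_RE` and `parse_steps` are hypothetical names used only for illustration.

```python
import re

# Matches the per-step metric lines; other INFO lines (timeouts, profiler dumps) are skipped.
# The field layout is assumed from the log excerpt above, not from a documented format.
STEP_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+)\s+grad_norm:\s*(?P<grad_norm>[\d.]+)"
    r".*?tps:\s*(?P<tps>[\d,]+)"
)


def parse_steps(log_text):
    """Return one dict per training step found in log_text."""
    rows = []
    for line in log_text.splitlines():
        m = STEP_RE.search(line)
        if m is None:
            continue
        rows.append({
            "step": int(m["step"]),
            "loss": float(m["loss"]),
            "grad_norm": float(m["grad_norm"]),
            "tps": int(m["tps"].replace(",", "")),  # tps is printed with thousands separators
        })
    return rows


# Two step lines and one non-step line copied from the log above.
SAMPLE = """\
[rank0]:[titan] 2025-10-21 11:50:24,918 - root - INFO - step: 1 loss: 8.2808 grad_norm: 1.4344 memory: 0.60GiB(0.63%) tps: 884 tflops: 0.06 mfu: 0.01%
[rank0]:[titan] 2025-10-21 11:50:24,918 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-10-21 11:50:25,010 - root - INFO - step: 2 loss: 7.9722 grad_norm: 1.5075 memory: 0.65GiB(0.68%) tps: 89,241 tflops: 6.39 mfu: 0.65%
"""

rows = parse_steps(SAMPLE)
losses = [r["loss"] for r in rows]
# True when every step's loss is lower than the previous step's.
loss_decreasing = all(a > b for a, b in zip(losses, losses[1:]))
print(rows, loss_decreasing)
```

The same function can be pointed at a full captured log; a strictly decreasing loss over the first ten steps is only a quick sanity signal, not a correctness proof for the CP path.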

Async TP:
[Image: Screenshot 2025-10-21 at 11 48 11 AM]

meta-cla bot added the CLA Signed label on Oct 21, 2025
mori360 marked this pull request as ready for review on October 21, 2025, 18:51
mori360 requested a review from fduwjj on October 21, 2025, 18:51