This repository was archived by the owner on Mar 14, 2024. It is now read-only.

Commit 53ec1b1

lw authored and facebook-github-bot committed
Fix race condition in writing config to checkpoint
Summary: We used to have _all_ trainers write the config to the checkpoint, at the same time. This is already problematic, but what's worse is that only trainer 0 was creating the checkpoint directory. Thus, if it didn't exist and a non-0 trainer was the first to reach that point, the write would fail. I'm fixing it in the same way we fixed all other similar issues: have only the rank-0 trainer write this.

Reviewed By: adamlerer

Differential Revision: D17787303

fbshipit-source-id: c3464dd9929ff95d54865ed03f041388d85c6f0d
1 parent 3ee2838 commit 53ec1b1

File tree

1 file changed: +2 additions, -1 deletion


torchbiggraph/train.py (2 additions, 1 deletion)

@@ -470,7 +470,8 @@ def make_optimizer(params: Iterable[torch.nn.Parameter], is_emb: bool) -> Optimizer:
         subprocess_init=subprocess_init,
     )
     checkpoint_manager.register_metadata_provider(ConfigMetadataProvider(config))
-    checkpoint_manager.write_config(config)
+    if rank == 0:
+        checkpoint_manager.write_config(config)
 
     if config.num_edge_chunks is not None:
         num_edge_chunks = config.num_edge_chunks
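
The rank-0-only guard can be illustrated with a minimal, self-contained sketch (hypothetical names, not the actual PBG checkpoint manager API): only one designated trainer creates the checkpoint directory and writes the config, so a non-0 trainer can never race against a directory that does not yet exist.

```python
import json
import os
import tempfile

def maybe_write_config(rank: int, checkpoint_dir: str, config: dict) -> None:
    """Write the config to the checkpoint, but only from the rank-0 trainer.

    Mirrors the idea of the fix: previously every trainer wrote the config
    concurrently, yet only trainer 0 created the directory, so a non-0
    trainer arriving first could fail on the missing directory.
    """
    if rank != 0:
        return
    # Rank 0 is the sole writer, so it can safely ensure the directory exists.
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "config.json"), "w") as f:
        json.dump(config, f)

# Simulate four trainers all reaching the write point; only rank 0 writes.
ckpt_dir = os.path.join(tempfile.mkdtemp(), "model")
for rank in range(4):
    maybe_write_config(rank, ckpt_dir, {"num_epochs": 10})
print(os.listdir(ckpt_dir))
```

In a real multi-process setup each trainer would run in its own process and compare its distributed rank, but the guard itself is the same single `if rank == 0` check used in the diff above.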

0 commit comments