Fix race condition in writing config to checkpoint
Summary:
We used to have _all_ trainers write the config to the checkpoint, at the same time. This is already problematic in itself, but what's worse is that only trainer 0 was creating the checkpoint directory. Thus, if the directory didn't exist and a non-0 trainer was the first to reach that point, the write would fail.

I'm fixing it in the same way we fixed all other similar issues: have only the rank-0 trainer write this.
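For context, this is the standard pattern for avoiding such races in distributed training: funnel filesystem setup through a single rank, then synchronize before any other worker depends on the result. A minimal sketch of that pattern, assuming torch.distributed is already initialized (the helper name, directory argument, and "config.json" filename are illustrative, not the actual torchbiggraph API):

import json
import os

import torch.distributed as dist

def write_config_safely(config: dict, checkpoint_dir: str) -> None:
    # Only rank 0 creates the directory and writes the file, so
    # concurrent trainers cannot race on the same path.
    if dist.get_rank() == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        with open(os.path.join(checkpoint_dir, "config.json"), "w") as f:
            json.dump(config, f)
    # All trainers wait here until rank 0 has finished writing, so
    # nobody proceeds assuming the config exists before it does.
    dist.barrier()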

Reviewed By: adamlerer

Differential Revision: D17787303

fbshipit-source-id: c3464dd9929ff95d54865ed03f041388d85c6f0d
lw authored and facebook-github-bot committed Oct 7, 2019
1 parent 3ee2838 commit 53ec1b1
Showing 1 changed file with 2 additions and 1 deletion.
torchbiggraph/train.py
@@ -470,7 +470,8 @@ def make_optimizer(params: Iterable[torch.nn.Parameter], is_emb: bool) -> Optimizer:
         subprocess_init=subprocess_init,
     )
     checkpoint_manager.register_metadata_provider(ConfigMetadataProvider(config))
-    checkpoint_manager.write_config(config)
+    if rank == 0:
+        checkpoint_manager.write_config(config)
 
     if config.num_edge_chunks is not None:
         num_edge_chunks = config.num_edge_chunks
