Fix race condition in writing config to checkpoint
Summary:
We used to have _all_ trainers write the config to the checkpoint, at the same time. This is already problematic in itself, but what's worse is that only trainer 0 was creating the checkpoint directory. Thus, if the directory didn't exist and a non-0 trainer was the first to reach that point, the write would fail.

I'm fixing it in the same way we fixed all other similar issues: have only the rank-0 trainer write this.
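For context, this is the standard pattern for avoiding such races in distributed training: funnel filesystem setup through a single rank, then synchronize before any other worker depends on the result. A minimal sketch of that pattern, assuming torch.distributed is already initialized (the helper name, directory argument, and "config.json" filename are illustrative, not the actual torchbiggraph API):

import json
import os

import torch.distributed as dist

def write_config_safely(config: dict, checkpoint_dir: str) -> None:
    # Only rank 0 creates the directory and writes the file, so
    # concurrent trainers cannot race on the same path.
    if dist.get_rank() == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        with open(os.path.join(checkpoint_dir, "config.json"), "w") as f:
            json.dump(config, f)
    # All trainers wait here until rank 0 has finished writing, so
    # nobody proceeds assuming the config exists before it does.
    dist.barrier()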

Reviewed By: adamlerer

Differential Revision: D17787303

fbshipit-source-id: c3464dd9929ff95d54865ed03f041388d85c6f0d
lw authored and facebook-github-bot committed Oct 7, 2019
1 parent 3ee2838 commit 53ec1b1
Showing 1 changed file with 2 additions and 1 deletion.
torchbiggraph/train.py
@@ -470,7 +470,8 @@ def make_optimizer(params: Iterable[torch.nn.Parameter], is_emb: bool) -> Optimizer:
         subprocess_init=subprocess_init,
     )
     checkpoint_manager.register_metadata_provider(ConfigMetadataProvider(config))
-    checkpoint_manager.write_config(config)
+    if rank == 0:
+        checkpoint_manager.write_config(config)
 
     if config.num_edge_chunks is not None:
         num_edge_chunks = config.num_edge_chunks
