Models feature/keep batch sharded #121

japols · 2025-02-06T13:14:53Z

[Draft] This PR improves scalability at large input grid sizes by keeping the model's input batch and output sharded, so the full input/output grid is never materialized in GPU memory. The loss and validation metrics are computed locally on each shard, and the global loss is then obtained via an all-reduce across shards.
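To illustrate the idea of a sharded loss, here is a minimal pure-Python sketch, not the code in this PR. It assumes an MSE loss and simulates the shards as plain lists in one process. The names `local_sse`, `all_reduce_sum` and `sharded_mse` are hypothetical. In a real multi-GPU run the summation step would presumably be a collective such as `torch.distributed.all_reduce` with a SUM op.

```python
from typing import List, Sequence, Tuple


def local_sse(pred: Sequence[float], target: Sequence[float]) -> Tuple[float, int]:
    """Sum of squared errors and point count for one shard of the grid."""
    sse = sum((p - t) ** 2 for p, t in zip(pred, target))
    return sse, len(pred)


def all_reduce_sum(partials: List[Tuple[float, int]]) -> Tuple[float, int]:
    """Hypothetical stand-in for an all-reduce (SUM) across ranks.

    Here all shards live in one process, so we simply add the partial results.
    """
    total_sse = sum(p[0] for p in partials)
    total_n = sum(p[1] for p in partials)
    return total_sse, total_n


def sharded_mse(
    pred_shards: Sequence[Sequence[float]],
    target_shards: Sequence[Sequence[float]],
) -> float:
    """Global MSE computed from per-shard partial sums only."""
    partials = [local_sse(p, t) for p, t in zip(pred_shards, target_shards)]
    total_sse, total_n = all_reduce_sum(partials)
    return total_sse / total_n


# Two shards of unequal size; the full grid is never concatenated.
pred_shards = [[1.0, 2.0, 3.0], [4.0]]
target_shards = [[1.0, 0.0, 3.0], [2.0]]
print(sharded_mse(pred_shards, target_shards))
```

Reducing the squared-error sum and the point count separately, rather than averaging per-shard means, keeps the result equal to the full-grid MSE even when shards differ in size.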

Preview at 9km:

japols added 7 commits January 22, 2025 10:57

fix: only load shards of grid into cpu mem if possible

2bb9677

fix: validation in_place to avoid copies of batch

7bc6cb4

Merge remote-tracking branch 'origin' into 9km-patch

b1bf749

Merge branch 'main' into 9km-patch

a75f657

feat: keep batch sharded v0

4c2338f

sharded loss v0

4946407

mse sharded loss, keep_batch_sharded configurable

3a4ed63

japols added the enhancement New feature or request label Feb 6, 2025

japols self-assigned this Feb 6, 2025

github-actions bot added training models labels Feb 6, 2025
