GraphCast improvements - Part I #510
Conversation
@mnabian
Thanks @stadlmax, I'll add your comments to my epic and consider them all.
Note to myself: the API updates break the GraphCast tests. Need to update them all.
@stadlmax, as far as I remember we were using the fused layernorm, and that gave us a nice speedup: https://github.com/NVIDIA/modulus/blob/main/modulus/models/gnn_layers/mesh_graph_mlp.py#L157. Did you also compare against that?
Yes, for AIFS I found TE > APEX > PyTorch throughout a bunch of the usual sizes AIFS had in their RFI benchmark. Especially the backward kernels in TE are much better for our cases. (Reported numbers are runtimes, lower is better.)
[Benchmark tables for num_channels = 256, 384, and 512]
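For context, a minimal sketch (not from the PR) of how such a forward+backward timing comparison can be run. The helper `bench_ms`, the batch size, and the channel count are illustrative; the TE and Apex imports are guarded since they are optional dependencies:

```python
import time

import torch


def bench_ms(layer: torch.nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Mean forward+backward runtime in milliseconds."""
    for _ in range(10):  # warm-up iterations
        layer(x).sum().backward()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


num_channels = 512  # one of the sizes discussed above
x = torch.randn(8192, num_channels, device="cuda", requires_grad=True)

layers = {"pytorch": torch.nn.LayerNorm(num_channels).cuda()}
try:
    import transformer_engine.pytorch as te

    layers["te"] = te.LayerNorm(num_channels).cuda()
except ImportError:
    pass
try:
    from apex.normalization import FusedLayerNorm

    layers["apex"] = FusedLayerNorm(num_channels).cuda()
except ImportError:
    pass

for name, layer in layers.items():
    print(f"{name}: {bench_ms(layer, x):.3f} ms")
```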
This is a great comparison, thanks! I'll switch to TE then. Do we have any reason to keep the fused layernorm from Apex, or should we just remove it?
I guess no, not really. TE should also be well covered when it comes to development specifically for Blackwell and beyond; I know of a few POCs that try to optimize the LayerNorm in TE even further.
@stadlmax added support for TE layernorm.
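A hedged sketch of what the added support might look like. The actual change lives in `mesh_graph_mlp.py`; the helper name `get_norm_layer` and the `"TELayerNorm"` string are illustrative, with the TE import kept lazy so models still work without transformer_engine installed:

```python
import torch.nn as nn


def get_norm_layer(norm_type: str, hidden_dim: int) -> nn.Module:
    """Select a LayerNorm implementation by name."""
    if norm_type == "TELayerNorm":
        # Lazy import: transformer_engine is an optional dependency.
        import transformer_engine.pytorch as te

        return te.LayerNorm(hidden_dim)
    if norm_type == "LayerNorm":
        return nn.LayerNorm(hidden_dim)
    raise ValueError(f"Unsupported norm_type: {norm_type}")
```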
Done
Thanks for addressing the feedback, looks good to me.
Modulus Pull Request
Description
Closes #506, #505, #486, #508, #509, #511, #516, #517
Checklist
Dependencies