Skip to content

Latest commit

 

History

History
80 lines (59 loc) · 2.36 KB

File metadata and controls

80 lines (59 loc) · 2.36 KB

CliMA GPU profiling

Profiling GPU performance for CliMA.

To run this project, first install Calkit on the clima machine:

curl -LsSf https://github.com/calkit/calkit/raw/refs/heads/main/scripts/install.sh | sh

Next, configure a token for interacting with calkit.io (where we store version-controlled Nsight reports).

If you don't already have an SSH key added to GitHub, either follow their documentation or run:

calkit config github-ssh

Then clone the project:

calkit clone --ssh petebachant/clima-gpu-profiling

Lastly, call:

calkit run

This will run all pipeline stages in the order they're defined in calkit.yaml. If you'd like to run a single stage (in a reproducible way), you can use its name as the first positional argument to calkit run. For example:

calkit run amip-clima-nsys

However, by default, only stages whose Nsight reports are now invalid (since their inputs have changed since the last run) will run.

Running a Jupyter server on clima with srun

srun --gpus=1 --mpi=none --pty bash
calkit jupyter lab --ip=0.0.0.0 --no-browser

Then, copy the server URL, which starts with http://127.0.0.1, and in VS Code, use that when selecting a kernel for the notebook.

Submodule branches

  • ClimaCore.jl --> pb/rm-nvtx
  • ClimaCore.jl-mod --> pb/perf
  • ClimaCoupler.jl --> pb/rm-nvtx
  • ClimaCoupler.jl-mod --> pb/perf
  • ClimaAtmos.jl --> pb/rm-nvtx
  • ClimaAtmos.jl-mod --> pb/perf

Experiment results

Commit (super-repo) Change summary Result
130baab Occupancy for run_field_matrix_solver increased and reduced registers per thread but slowed down overall.
e5845b7 Similar as above, but not quite as slow.
ff26f4b1 Use PCR for tri-diagonal matrix solve. Seems to be 3% faster, but higher error. May not have isolated changed properly though.
e6099c2 Try capping all threads to 256 1% slower on flagship.
23c9104 Attempt to coalesce memory access in solvers. 5% slowdown.
7614ca6 Thread block restructuring and LocalGeometry caching. No significant change.
f9eb67a Tr/mem access patterns 9% speedup.