Add NVTX support via DrHook #23

pmarguinaud · 2024-05-31T19:23:23Z

This PR adds NVTX instrumentation through DrHook (PGI compiler only). It is enabled by setting the following two environment variables:

export DR_HOOK=1
export DR_NVTX=1
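As an illustration of how a runtime gate on these two variables could look, here is a minimal C++ sketch. The helper names `env_flag_is_one` and `nvtx_requested` are hypothetical, and the rule that only the exact value "1" counts as enabled is an assumption; the actual DrHook parsing may accept other values.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not part of fiat): true when the variable is set
// to exactly "1". Assumption: the real parsing may be more permissive.
bool env_flag_is_one(const char *name) {
    const char *value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// NVTX ranges are only requested when DrHook itself is switched on too,
// mirroring the two exports shown in the PR description.
bool nvtx_requested() {
    return env_flag_is_one("DR_HOOK") && env_flag_is_one("DR_NVTX");
}
```

Reading the variables once at start-up and caching the result would keep the check out of the hot path of every instrumented region.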

This code has been provided courtesy of Louis Stuber (NVIDIA) and Lukas Mosimann (NVIDIA).

Example of an ARPEGE nsys profile with NVTX enabled :

wdeconinck · 2024-06-07T10:13:02Z

Hi Philippe, thanks for pushing this, it looks like a great feature!

One suggestion is to rename the modules and subroutines (C and Fortran) that carry the "my..." prefix to something less ambiguous, such as "fiat_" or, better, "drhook_".

Another suggestion is to keep this feature optional even when using the NVIDIA compiler:

ecbuild_add_option( DR_HOOK_NVTX  ... )
# or instead : ecbuild_add_option( NVTX  ... )

The DEFAULT could be ON for this option when NVIDIA compiler is detected if you wish. The nvtx files should then be compiled only when this feature is ON (HAVE_DR_HOOK_NVTX or HAVE_NVTX).
Then add the definition:

if( HAVE_DR_HOOK_NVTX OR HAVE_NVTX )
   list( APPEND FIAT_DEFINITIONS DR_HOOK_NVTX )
endif()

Finally, guard the code that is currently guarded by __PGI with DR_HOOK_NVTX instead.
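To make the guarding suggestion concrete, here is a minimal C++ sketch of wrappers that compile to no-ops unless DR_HOOK_NVTX is defined. The wrapper names `range_push` and `range_pop` and the constant `nvtx_compiled_in` are hypothetical, and the header path `nvtx3/nvToolsExt.h` is an assumption about the installed toolkit. `nvtxRangePushA` and `nvtxRangePop` are the NVTX C API calls for nested ranges.

```cpp
#ifdef DR_HOOK_NVTX
#include <nvtx3/nvToolsExt.h>  // assumption: NVTX v3 header location
constexpr bool nvtx_compiled_in = true;
#else
constexpr bool nvtx_compiled_in = false;
#endif

// Hypothetical wrapper: opens a named NVTX range when the feature is
// compiled in, otherwise does nothing. Returns whether a range was opened.
inline bool range_push(const char *name) {
#ifdef DR_HOOK_NVTX
    nvtxRangePushA(name);
    return true;
#else
    (void)name;  // silence the unused-parameter warning in no-op builds
    return false;
#endif
}

// Hypothetical wrapper: closes the innermost NVTX range when compiled in.
inline bool range_pop() {
#ifdef DR_HOOK_NVTX
    nvtxRangePop();
    return true;
#else
    return false;
#endif
}
```

With this shape, call sites stay free of preprocessor conditionals and builds without the option neither include nor link against NVTX.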

A slightly larger but useful refactoring follows from the fact that drhook is also meant to be used from C code.
This means the NVTX calls should be made from within drhook.c rather than from dr_hook_util.F90.
We should then enable this via the environment variable DR_HOOK_OPT=NVTX,...,... which is currently used to enable/disable DR_HOOK options.
The Fortran drivers for nvtx might then not even be required, unless you want to expose this outside of DR_HOOK as well.
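As a rough sketch of what checking a comma-separated option list for NVTX could look like, the following C++ helpers split the string on commas and compare tokens. The function names are hypothetical, and the exact, case-sensitive matching without whitespace trimming is an assumption; this is not how drhook.c actually parses DR_HOOK_OPT.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: true when `wanted` appears as one complete token
// in a comma-separated list. Assumption: exact, case-sensitive matching.
bool option_list_contains(const std::string &list, const std::string &wanted) {
    std::istringstream stream(list);
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token == wanted) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: looks for the NVTX token in DR_HOOK_OPT.
bool nvtx_enabled_from_env() {
    const char *raw = std::getenv("DR_HOOK_OPT");
    return raw != nullptr && option_list_contains(raw, "NVTX");
}
```

Matching whole tokens rather than substrings avoids a false positive when another option name merely contains the letters "NVTX".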

@ioanhadade , @marsdeno could you also review please?

pmarguinaud · 2024-06-07T14:18:07Z

Hello Willem,

I followed your advice and renamed my* to dr_nvtx, then added the DR_NVTX option and I now use the HAVE_DR_NVTX macro everywhere instead of __PGI.

All of this does not work with nvhpc/21.9 (I tried it on our cluster), but it works with nvhpc/24.3; I tried changing that in ./.github/tools/install-nvhpc.sh, but github tests fail with "No space left on device".

What should I do ?

reuterbal · 2024-06-07T14:31:00Z

I have seen a similar issue with nvhpc in Github actions. For CLOUDSC we use this "hack":
https://github.com/ecmwf-ifs/dwarf-p-cloudsc/blob/95125c267e5baed113ef7671c9f346979bd84029/.github/workflows/build.yml#L110-L127

pmarguinaud · 2024-06-11T05:30:34Z

I followed Balthasar's advice, but it did not work. So CI is failing for nvhpc builds and I do not know what to do about it.

wdeconinck · 2024-06-11T07:03:28Z

> I followed Balthasar advice, but it did not work. So CI is failing for nvhpc builds and I do not know what to do about it.

@pmarguinaud I will have a look at making this work. In the meantime, could you please look into getting nvtx to work via drhook.c? We could then, for example, use DR_HOOK_OPT=NVTX for C/C++ programs using DR_HOOK as well.
A unit test would also be great, at least to catch catastrophic errors.

wdeconinck · 2024-06-11T12:05:15Z

Hi @pmarguinaud, in the meantime I have updated the develop branch to use nvhpc-24.3 in GitHub Actions. It also sets the "NVHPC_ROOT" environment variable, which is needed for find_package(NVHPC) to work.

You can now rebase this branch, minus the added GitHub actions changes, on develop.


pmarguinaud · 2024-06-12T09:04:40Z

I have merged develop and added a test case for NVTX.

Instrumenting C code is not a priority for me, but you can do it yourself if it is important for you.

wdeconinck · 2024-08-26T08:28:44Z

Hi @Andrew-Beggs-ECMWF, in an offline discussion @ioanhadade suggested that you look into the possibility of integrating this within drhook itself.

Andrew-Beggs-ECMWF · 2024-09-03T15:42:33Z

src/fiat/drnvtx/dr_nvtx_map.cc

+  ilast++;
+  stack[ilast] = elem;
+  elem->calls ++;
+  if (elem->calls >= 11 && elem->elapsed < 0.0001) 


Hi @pmarguinaud, I've been doing some work on this patch to integrate it into the core of drhook so that it's faster, works with all supported drhook languages, and is easier for us to maintain.

This condition confuses me a bit, though. My understanding is that it's meant to eliminate drhook regions with a high call count but low cumulative runtime from the nvtx trace, so as to reduce noise in the trace. However, the way it's implemented (particularly lines 72-75) means that once a region goes over 10 calls, it will always be skipped. This also creates a fun situation where the following recursive code won't close any of the nvtx regions it opens:

void foo(void) {
  static int calls = 0;
  calls++;
  int skipstart = dr_hook_nvtx_map_start("foo");
  std::cout << "In call " << calls << " Skipped? " << !skipstart << std::endl;
  if (calls < 11) {
    foo();
  }
  int skipend = dr_hook_nvtx_map_stop();
  std::cout << "Out call " << calls << " Skipped? " << !skipend << std::endl;
  calls--;
  return;
}

This gave the following output:

In call 1 Skipped? 0
In call 2 Skipped? 0
In call 3 Skipped? 0
In call 4 Skipped? 0
In call 5 Skipped? 0
In call 6 Skipped? 0
In call 7 Skipped? 0
In call 8 Skipped? 0
In call 9 Skipped? 0
In call 10 Skipped? 0
In call 11 Skipped? 1
Out call 11 Skipped? 1
Out call 10 Skipped? 1
Out call 9 Skipped? 1
Out call 8 Skipped? 1
Out call 7 Skipped? 1
Out call 6 Skipped? 1
Out call 5 Skipped? 1
Out call 4 Skipped? 1
Out call 3 Skipped? 1
Out call 2 Skipped? 1
Out call 1 Skipped? 1

Am I correct in thinking this is not the intended behaviour and I should attempt to fix it in my patch?
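One possible way to keep start and stop balanced is to record, for every start, whether a range was actually opened, and let the matching stop consult that record instead of re-evaluating the skip condition. The C++ sketch below illustrates the idea under stated assumptions: the `RangeTracker` class and its members are hypothetical, a counter stands in for the real NVTX push/pop calls, timing is omitted so `elapsed` stays at zero, and this is not the patch that was eventually merged. The thresholds 11 and 0.0001 are taken from the reviewed snippet.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Per-region statistics, loosely modelled on the reviewed snippet.
struct RegionStats {
    long calls = 0;
    double elapsed = 0.0;  // never updated here: timing is omitted
};

// Hypothetical tracker: remembers per call whether a range was opened.
class RangeTracker {
public:
    // Returns true when a range was opened for this call.
    bool start(const std::string &name) {
        RegionStats &stats = regions_[name];
        stats.calls++;
        const bool skip = stats.calls >= 11 && stats.elapsed < 0.0001;
        if (!skip) {
            open_ranges_++;  // stand-in for nvtxRangePushA(name.c_str())
        }
        opened_stack_.push_back(!skip);
        return !skip;
    }

    // Closes a range only if the matching start opened one.
    bool stop() {
        assert(!opened_stack_.empty());
        const bool opened = opened_stack_.back();
        opened_stack_.pop_back();
        if (opened) {
            open_ranges_--;  // stand-in for nvtxRangePop()
        }
        return opened;
    }

    int open_ranges() const { return open_ranges_; }

private:
    std::unordered_map<std::string, RegionStats> regions_;
    std::vector<bool> opened_stack_;
    int open_ranges_ = 0;
};
```

Replaying the recursion example above against this sketch, the ten ranges opened by calls 1 to 10 are all closed on the way out, because each stop looks at its own start's decision. The sketch only addresses the pairing problem; it does not change the fact that a region is permanently skipped once it passes ten calls with a low elapsed time.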

Andrew-Beggs-ECMWF · 2024-09-13T12:43:17Z

I've now finished my patch and have it as a branch on my own fork of fiat. Do we want to have it merged into pmarguinaud's branch (which should then make it part of this pr once it's merged) or start a new pr directly to ecmwf-ifs:develop?

The outcome is the same, just a matter of how it's merged : )

wdeconinck · 2024-09-16T09:42:55Z

Let's make a new PR for it please.

wdeconinck · 2024-11-05T16:14:40Z

@pmarguinaud this functionality has been merged with PR #28 . I will close this PR now. If you have more feedback please let us know.

Add NVTX support

3c879e7

pmarguinaud added 3 commits June 7, 2024 12:31

Try nvhpc-24.3 instead of nvhpc-21.9

ab4c080

Rename mynvtx -> dr_nvtx + cleaning

dfad60a

Make DR_NVTX optional, use HAVE_DR_NVTX instead of __PGI in Fortran/C

453b148

Try freeing some disk space

6a9e9c1

wdeconinck changed the base branch from main to develop June 11, 2024 07:51

pmarguinaud added 3 commits June 12, 2024 07:19

Merge branch 'develop' of https://github.com/ecmwf-ifs/fiat into drhooknvtx

8b0d4eb

Add test for DR_NVTX

5a22f69

Add test for DR_NVTX

4a0656f

Andrew-Beggs-ECMWF reviewed Sep 3, 2024


Andrew-Beggs-ECMWF mentioned this pull request Sep 16, 2024

Add NVTX support via DrHook (Refactored) #28

Merged

wdeconinck closed this Nov 5, 2024
