Register observer & correct clock setting #242


Merged: 26 commits, Apr 22, 2024
Conversation

@fjwillemsen (Collaborator) commented Feb 8, 2024

This pull request adds a built-in Register Observer. This observer works for the PyCUDA, CuPy, and CUDA-Python backends; on unsupported backends, it raises a NotImplementedError.
In addition, the pull request improves the efficiency with which clocks are set, and excludes the time spent doing so from the benchmark time.

@fjwillemsen changed the title from "Register observer" to "Register observer & correct clock setting" on Feb 16, 2024
@csbnw (Collaborator) left a comment:

I happened to stumble upon this PR and out of curiosity had a look at the changes and couldn't help adding some comments. My main 'complaint' is that some scope creep seems to have occurred regarding the L2 flushing. In my opinion, it should really be moved into a separate PR.

@@ -46,6 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
self.devprops = dev.attributes
self.cc = dev.compute_capability
self.max_threads = self.devprops["MaxThreadsPerBlock"]
self.cache_size_L2 = self.devprops["L2CacheSize"]
Collaborator:

After reading the PR text, it comes as a surprise to me that L2 related things have been changed.

@@ -124,6 +126,7 @@ def compile(self, kernel_instance):
compiler_options = self.compiler_options
if not any(["-std=" in opt for opt in self.compiler_options]):
compiler_options = ["--std=c++11"] + self.compiler_options
# CuPy already sets the --gpu-architecture by itself, as per https://github.com/cupy/cupy/blob/20ccd63c0acc40969c851b1917dedeb032209e8b/cupy/cuda/compiler.py#L145
Collaborator:

Is there a recommended version of CuPy for KernelTuner? If so, consider replacing the link with a version-pinned one, e.g. https://github.com/cupy/cupy/blob/v13/cupy/cuda/compiler.py#L145 for v13 of CuPy. Something like https://pypi.org/project/bump2version/ may be helpful to keep such version numbers up to date going forward.

Collaborator (Author):

Good point, it should indeed be version independent. I'll change it to main instead for now.
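The option handling in the hunk above can be sketched as a small standalone helper. This is a minimal illustration only; the name `ensure_std_flag` is mine, not part of KernelTuner's API:

```python
def ensure_std_flag(compiler_options, default="--std=c++11"):
    """Prepend a C++ standard flag unless the caller already supplied one.

    Mirrors the check in compile(): any option containing '-std='
    (which also matches the '--std=' spelling) counts as explicit.
    """
    if not any("-std=" in opt for opt in compiler_options):
        return [default] + list(compiler_options)
    return list(compiler_options)
```

For example, `ensure_std_flag(["-O3"])` yields `["--std=c++11", "-O3"]`, while an explicit `--std=c++14` passes through untouched.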

Comment on lines 48 to 51
# TODO the L2 cache size request fails
# self.cache_size_L2 = self.ctx.devices[0].get_info(
# cl.device_affinity_domain.L2_CACHE
# )
Collaborator:

Open a new issue to keep track of this?

@@ -100,7 +100,7 @@ def run(self, parameter_space, tuning_options):
params = process_metrics(params, tuning_options.metrics)

# get the framework time by estimating based on other times
total_time = 1000 * (perf_counter() - self.start_time) - warmup_time
total_time = 1000 * ((perf_counter() - self.start_time) - warmup_time) # TODO is it valid that we deduct the warmup time here?
Collaborator:

It depends on when self.start_time is set (before or after the warm-up?) and on what exactly perf_counter() returns. I don't know enough about these KernelTuner internals to give a definitive answer, though.
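The unit question behind this diff can be made concrete with fixed example numbers (hypothetical values, not taken from KernelTuner): perf_counter() returns seconds, so it matters whether warmup_time is subtracted before or after the conversion to milliseconds.

```python
# Hypothetical example values: 0.070 s elapsed in total, 0.050 s of warm-up
elapsed_s = 0.070
warmup_s = 0.050

# New formula: warm-up subtracted in seconds, then scaled to milliseconds
total_ms_new = 1000 * (elapsed_s - warmup_s)   # ~20.0 ms

# Old formula: scaling happens first, so the subtrahend must already be in ms
warmup_ms = 1000 * warmup_s
total_ms_old = 1000 * elapsed_s - warmup_ms    # also ~20.0 ms

# Mixing units (a warm-up in seconds with the old formula) silently misreports
total_ms_bug = 1000 * elapsed_s - warmup_s     # ~69.95 ms instead of ~20 ms
```

The two formulas agree only when warmup_time is kept in the same unit as the term it is subtracted from, which is exactly the detail the comment above asks about.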

@@ -221,7 +221,7 @@ def check_block_size_names(block_size_names):
if not isinstance(block_size_names, list):
raise ValueError("block_size_names should be a list of strings!")
if len(block_size_names) > 3:
raise ValueError("block_size_names should not contain more than 3 names!")
raise ValueError(f"block_size_names should not contain more than 3 names! ({block_size_names=})")
Collaborator:

Useful for debugging, but should it be included in this PR?

@@ -570,6 +570,24 @@ def get_total_timings(results, env, overhead_time):
return env


def to_valid_nvrtc_gpu_arch_cc(compute_capability: str) -> str:
"""Returns a valid Compute Capability for NVRTC `--gpu-architecture=`, as per https://docs.nvidia.com/cuda/nvrtc/index.html#group__options."""
valid_cc = ['50', '52', '53', '60', '61', '62', '70', '72', '75', '80', '87', '89', '90', '90a'] # must be in ascending order, when updating also update test_to_valid_nvrtc_gpu_arch_cc
Collaborator:

Here you do have the Pascal, Volta and Turing architectures! Can't you put these in some global list and use them in the tests as well to avoid having this list duplicated?

Comment on lines 577 to 586
if len(compute_capability) < 2:
    raise ValueError(f"Compute capability '{compute_capability}' must be at least of length 2, is {len(compute_capability)}")
if compute_capability in valid_cc:
    return compute_capability
# if the compute capability does not match, scale down to the nearest matching
subset_cc = [cc for cc in valid_cc if compute_capability[0] == cc[0]]
if len(subset_cc) > 0:
    # get the next-highest valid CC
    highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
    return subset_cc[highest_cc_index]
Collaborator:

This seems like a rather complicated way of trying to match the input compute_capability to something in valid_cc. I am sure there must be a way to make it simpler and more readable.

Collaborator (Author):

I agree, the problem is we can't use integer comparison because it's also possible to have e.g. 90a, and we can only match within the major number. Any suggestions?

Collaborator:

How about converting 90a to 901 and 90 to 900? That is: multiply the integer part of the number by 10 and add the ordinal value of the trailing letter (if any).
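The encoding proposed above can be sketched as follows. This is a hypothetical implementation, not the one merged in the PR: `cc_to_ordinal` and `match_cc` are names I chose, and `VALID_CC` mirrors the list from the diff so that the implementation and the tests could share a single constant, as suggested earlier in this thread:

```python
# must be in ascending order; shared between implementation and tests
VALID_CC = ["50", "52", "53", "60", "61", "62", "70", "72",
            "75", "80", "87", "89", "90", "90a"]

def cc_to_ordinal(cc: str) -> int:
    """Encode '90' as 900 and '90a' as 901: the numeric part times 10,
    plus the 1-based alphabet position of the trailing letter, if any."""
    digits = "".join(ch for ch in cc if ch.isdigit())
    suffix = cc[len(digits):]
    value = int(digits) * 10
    if suffix:
        value += ord(suffix[0].lower()) - ord("a") + 1
    return value

def match_cc(cc: str) -> str:
    """Return the highest valid CC with the same major version that
    does not exceed the requested compute capability."""
    target = cc_to_ordinal(cc)
    candidates = [v for v in VALID_CC
                  if v[0] == cc[0] and cc_to_ordinal(v) <= target]
    if not candidates:
        raise ValueError(f"no valid --gpu-architecture for compute capability '{cc}'")
    return max(candidates, key=cc_to_ordinal)
```

For example, `match_cc("86")` scales down to `"80"`, while exact matches such as `"90a"` pass through unchanged.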

@fjwillemsen (Collaborator, Author) commented:

> I happened to stumble upon this PR and out of curiosity had a look at the changes and couldn't help adding some comments. My main 'complaint' is that some scope creep seems to have occurred regarding the L2 flushing. In my opinion, it should really be moved into a separate PR.

@csbnw good comments! There is indeed some feature creep in this PR due to ongoing research 😅 The L2 stuff was just in here so Ben could read along, but I'll create a separate PR for it. Converting back to draft for now.

@fjwillemsen fjwillemsen marked this pull request as draft March 1, 2024 15:58
@fjwillemsen fjwillemsen marked this pull request as ready for review March 1, 2024 16:51
sonarqubecloud bot commented Mar 1, 2024

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@benvanwerkhoven benvanwerkhoven merged commit d792304 into master Apr 22, 2024
3 checks passed