Register observer & correct clock setting #242
Conversation
…ar one where the memory clock would always be seen as not-equal due to a rounding error
…goes to framework time instead of benchmark time
…ck on gpu-architecture compiler option, added gpu-architecture auto-adding to CuPy
…d tests for this function, removed setting --gpu-architecture for CuPy as it is already set internally
I happened to stumble upon this PR and out of curiosity had a look at the changes, and couldn't help adding some comments. My main 'complaint' is that some scope creep seems to have occurred regarding the L2 flushing. In my opinion, it should really be moved into a separate PR.
kernel_tuner/backends/cupy.py
Outdated
@@ -46,6 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
         self.devprops = dev.attributes
         self.cc = dev.compute_capability
         self.max_threads = self.devprops["MaxThreadsPerBlock"]
+        self.cache_size_L2 = self.devprops["L2CacheSize"]
After reading the PR text, it comes as a surprise to me that L2-related things have been changed.
kernel_tuner/backends/cupy.py
Outdated
@@ -124,6 +126,7 @@ def compile(self, kernel_instance):
         compiler_options = self.compiler_options
         if not any(["-std=" in opt for opt in self.compiler_options]):
             compiler_options = ["--std=c++11"] + self.compiler_options
+        # CuPy already sets the --gpu-architecture by itself, as per https://github.com/cupy/cupy/blob/20ccd63c0acc40969c851b1917dedeb032209e8b/cupy/cuda/compiler.py#L145
Is there a recommended version of CuPy for KernelTuner? If so, consider replacing the link with e.g. https://github.com/cupy/cupy/blob/v13/cupy/cuda/compiler.py#L145 for v13 of CuPy. Something like https://pypi.org/project/bump2version/ may be helpful to keep such version numbers up to date going forward.
Good point, it should indeed be version independent. I'll change it to `main` instead for now.
kernel_tuner/backends/opencl.py
Outdated
        # TODO the L2 cache size request fails
        # self.cache_size_L2 = self.ctx.devices[0].get_info(
        #     cl.device_affinity_domain.L2_CACHE
        # )
Open a new issue to keep track of this?
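For reference, a possible explanation: `cl.device_affinity_domain.L2_CACHE` is a device-partitioning enum, not a `device_info` query parameter, which would explain why the request fails. A minimal sketch of a workaround, assuming the standard global-memory-cache query is an acceptable stand-in for the L2 size (on most GPUs this cache is the L2):

```python
import pyopencl as cl

def get_cache_size_l2(device: cl.Device) -> int:
    # CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is the closest standard OpenCL property;
    # the OpenCL spec has no dedicated L2-cache-size query.
    return device.get_info(cl.device_info.GLOBAL_MEM_CACHE_SIZE)
```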
kernel_tuner/runners/sequential.py
Outdated
@@ -100,7 +100,7 @@ def run(self, parameter_space, tuning_options):
                 params = process_metrics(params, tuning_options.metrics)

         # get the framework time by estimating based on other times
-        total_time = 1000 * (perf_counter() - self.start_time) - warmup_time
+        total_time = 1000 * ((perf_counter() - self.start_time) - warmup_time)  # TODO is it valid that we deduct the warmup time here?
It depends on when `self.start_time` is set (before or after the warm-up?) and on what `perf_counter()` returns. I don't know enough about these KernelTuner details to give a definitive answer, though.
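One way to read the diff: `perf_counter()` returns seconds, so moving `warmup_time` inside the `1000 *` factor is only correct if `warmup_time` is itself measured in seconds. A minimal sketch of that reading, with hypothetical variable names:

```python
from time import perf_counter

start_time = perf_counter()                  # seconds, set before the warm-up
# ... warm-up runs ...
warmup_time = perf_counter() - start_time    # seconds spent warming up
# ... benchmarking and framework work ...

# old: mixes units, subtracting seconds from milliseconds
total_time_old = 1000 * (perf_counter() - start_time) - warmup_time
# new: subtracts in seconds first, then converts the result to milliseconds
total_time_new = 1000 * ((perf_counter() - start_time) - warmup_time)
```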
kernel_tuner/util.py
Outdated
@@ -221,7 +221,7 @@ def check_block_size_names(block_size_names):
    if not isinstance(block_size_names, list):
        raise ValueError("block_size_names should be a list of strings!")
    if len(block_size_names) > 3:
-       raise ValueError("block_size_names should not contain more than 3 names!")
+       raise ValueError(f"block_size_names should not contain more than 3 names! ({block_size_names=})")
Useful for debugging, but should it be included in this PR?
kernel_tuner/util.py
Outdated
@@ -570,6 +570,24 @@ def get_total_timings(results, env, overhead_time):
    return env


def to_valid_nvrtc_gpu_arch_cc(compute_capability: str) -> str:
    """Returns a valid Compute Capability for NVRTC `--gpu-architecture=`, as per https://docs.nvidia.com/cuda/nvrtc/index.html#group__options."""
    valid_cc = ['50', '52', '53', '60', '61', '62', '70', '72', '75', '80', '87', '89', '90', '90a']  # must be in ascending order, when updating also update test_to_valid_nvrtc_gpu_arch_cc
Here you do have the Pascal, Volta and Turing architectures! Can't you put these in some global list and use them in the tests as well to avoid having this list duplicated?
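A sketch of that suggestion, assuming a module-level constant in `kernel_tuner/util.py` (the constant name is illustrative) that the tests import instead of duplicating the list:

```python
# kernel_tuner/util.py
# Valid NVRTC --gpu-architecture values, in ascending order; shared with the tests.
NVRTC_VALID_CC = ['50', '52', '53', '60', '61', '62', '70', '72', '75', '80', '87', '89', '90', '90a']

def to_valid_nvrtc_gpu_arch_cc(compute_capability: str) -> str:
    """Returns a valid Compute Capability for NVRTC --gpu-architecture=."""
    valid_cc = NVRTC_VALID_CC
    ...

# test/test_util.py would then use:
# from kernel_tuner.util import NVRTC_VALID_CC
```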
kernel_tuner/util.py
Outdated
    if len(compute_capability) < 2:
        raise ValueError(f"Compute capability '{compute_capability}' must be at least of length 2, is {len(compute_capability)}")
    if compute_capability in valid_cc:
        return compute_capability
    # if the compute capability does not match, scale down to the nearest matching
    subset_cc = [cc for cc in valid_cc if compute_capability[0] == cc[0]]
    if len(subset_cc) > 0:
        # get the next-highest valid CC
        highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
        return subset_cc[highest_cc_index]
This seems like a rather complicated way of trying to match the input `compute_capability` to something in `valid_cc`. I am sure there must be a way to make it simpler and more readable.
I agree, the problem is we can't use integer comparison because it's also possible to have e.g. `90a`, and we can only match within the major number. Any suggestions?
How about converting `90a` to `901` and `90` to `900`? I.e.: multiply the integer part of the number by 10 and add the ordinal value of the following letter (if any) to it.
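A minimal sketch of that scheme (hypothetical helper name; "ordinal value" read as the letter's position in the alphabet, so `a` maps to 1):

```python
def cc_key(cc: str) -> int:
    """Map a compute capability such as '90' to 900 and '90a' to 901."""
    digits = ''.join(ch for ch in cc if ch.isdigit())
    suffix = cc[len(digits):]
    key = int(digits) * 10
    if suffix:
        key += ord(suffix[0]) - ord('a') + 1  # 'a' -> 1, 'b' -> 2, ...
    return key

# Picking the nearest valid CC not above the input, within the same major
# version, then becomes a single expression:
# max((cc for cc in valid_cc
#      if cc[0] == compute_capability[0] and cc_key(cc) <= cc_key(compute_capability)),
#     key=cc_key, default=None)
```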
@csbnw good comments! There is indeed some feature creep in this PR due to ongoing research 😅 The L2 stuff was just in here so Ben could read along, but I'll indeed create a separate PR for it. Converting back to draft for now.
|
|
This pull request adds a built-in Register Observer. This observer works for the PyCUDA, CuPy, and CUDA-Python backends; on unsupported backends it raises a `NotImplementedError`. In addition, the pull request improves the efficiency with which clocks are set, and no longer counts the time spent doing so towards the benchmark time.
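A hypothetical usage sketch (the import path and observer name are assumptions based on the PR description, not confirmed by the diff):

```python
import numpy as np
from kernel_tuner import tune_kernel
from kernel_tuner.observers.register import RegisterObserver  # assumed location

kernel_string = """
__global__ void vector_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""

size = 10_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

# The observer reports per-thread register usage for each tuned configuration;
# on an unsupported backend it would raise NotImplementedError instead.
results, env = tune_kernel(
    "vector_add", kernel_string, size, [c, a, b, n],
    {"block_size_x": [128, 256, 512]},
    observers=[RegisterObserver()],
    lang="CUDA",  # PyCUDA backend, one of the supported ones
)
```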