
Conversation

kkovary commented Mar 30, 2025

closes #356

This is still a WIP! As you can tell, device handling is currently done in a couple of different ways; much of this has been guess-and-check, chasing down issues as they pop up (lots of whack-a-mole). It turns out my initial attempt enabled CUDA support but broke CPU support, and lots of tests were failing. At least now both CUDA and CPU are supported, and tests appear to be passing.

Very open to suggestions, and I'll continue to iterate on a consistent way to handle devices.

Scienfitz added the enhancement (Expand / change existing functionality) and new feature (New functionality) labels and removed the enhancement label on Apr 9, 2025
AVHopp (Collaborator) commented Apr 9, 2025

Hey @kkovary, thanks for the contribution. We currently have a lot on our plates, so it might take a while until we can have a detailed look, but thank you already!

kkovary (Author) commented Apr 9, 2025

no worries, no rush on my end

AdrianSosic marked this pull request as ready for review April 28, 2025 08:59
AdrianSosic marked this pull request as draft April 28, 2025 09:00
AdrianSosic (Collaborator) left a comment

Hi @kkovary. Great to see that you're working on this 👏🏼 Now that PyCon is over, I finally have some time to look at the PR. More comments/questions will follow, but here are some general questions first.


Putting this as a general comment for the PR here:

My biggest concern at the moment is all these getattr and hasattr expressions, which impact type safety and which I'd like to keep to an absolute minimum. Part of solving this brings us to a central question to be answered: How and where do we want to control the device type?

I see three layers of control, with increasing level of precedence:

  1. Via environment variable
    🟢 You've basically already implemented this via the BAYBE_USE_GPU flag
  2. Via a "settings variable" defined inside the running python session
    🟡 This will be added at a later point in time and is not directly related to this PR. Once it's there, we can let your get_default_device function read from it
  3. Via corresponding class attributes
    🔴 I'm not yet sure what the right place to set this attribute is. At the moment, you're placing it on BotorchRecommender, but I could also imagine it becoming an attribute of the base class, or even of the surrogate/acquisition classes.

Don't want to bias you too much, hence I'll hold off on sharing more thoughts until I've heard yours 🙃
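For concreteness, a rough sketch of how the three layers could compose, highest precedence first. Note that resolve_device is an illustrative name, not something from the PR, and the exact parsing of BAYBE_USE_GPU is a guess:

import os

import torch


def resolve_device(explicit_device: torch.device | None = None) -> torch.device:
    # Layer 3: an explicit class attribute (e.g. on the recommender) wins.
    if explicit_device is not None:
        return explicit_device
    # Layer 2: a session-level settings variable would be consulted here once it exists.
    # Layer 1: the BAYBE_USE_GPU environment variable as the global default.
    use_gpu = os.environ.get("BAYBE_USE_GPU", "").lower() in ("1", "true")
    if use_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")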

Comment on lines +104 to +107
if self.device is None:
self.device = getattr(self.surrogate, "device", None)
if self.device is None and torch.cuda.is_available():
self.device = get_default_device()

Since this is purely related to the device attribute, I think this belongs in a dedicated attrs default method (i.e., using @device.default)
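Roughly what I have in mind, mirroring the snippet above (just a sketch; get_default_device is stubbed in here since the import location depends on the PR):

import torch
from attrs import define, field


def get_default_device() -> torch.device:
    # Stand-in for the PR's helper of the same name.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@define
class BotorchRecommender:
    surrogate: object = field(default=None)
    device: torch.device | None = field()

    @device.default
    def _default_device(self) -> torch.device | None:
        # Same logic as the reviewed snippet, moved into an attrs default.
        device = getattr(self.surrogate, "device", None)
        if device is None and torch.cuda.is_available():
            device = get_default_device()
        return device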

Comment on lines +13 to +14
class _SingleDeviceMode:
"""Internal context manager that forces operations to happen on a single device."""

Notice that the way this class is written does not depend at all on the fact that it's used to handle devices, i.e. its implementation is that of a generic settings handler --> rename?
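E.g. something as generic as this (illustrative only):

class _FlagContext:
    """Generic on/off setting usable as a context manager -- nothing device-specific."""

    _state = False

    def __init__(self, state: bool = True):
        self._desired = state

    def __enter__(self):
        cls = type(self)
        self._previous = cls._state
        cls._state = self._desired
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        type(self)._state = self._previous
        return False

    @classmethod
    def on(cls) -> bool:
        return cls._state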

Args:
    device: The device to use. If None, uses the default device.
    manage_memory: If True, clears GPU memory before and after operations.

Can you comment on when to set this to True/False?
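For reference, my current understanding of what the flag would guard, in case it helps frame the answer (this is a guess, not the PR's actual implementation):

import torch


def _maybe_clear_gpu_memory(manage_memory: bool) -> None:
    # Presumed meaning of "clears GPU memory": release cached allocator blocks
    # back to the driver. Useful when other processes need the memory, but it
    # slows down subsequent allocations, so it shouldn't be unconditional.
    if manage_memory and torch.cuda.is_available():
        torch.cuda.empty_cache()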

Comment on lines +108 to +111
if enforce_single_device:
managers.extend(
[_SingleDeviceMode(True), debug(True), fast_computations(solves=False)]
)

Two comments:

  1. Can you elaborate on the purpose of this flag?
  2. I don't yet understand the reasons for the gpytorch-specific flags, and it's definitely suboptimal that gpytorch-specific logic is coupled to this general-purpose function. Maybe you can explain the motivation for it?
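One way to decouple it would be to keep the gpytorch-specific flags in their own helper and have the general-purpose function consume an opaque list of context managers. A sketch (the function names are made up):

from contextlib import ExitStack

from gpytorch.settings import debug, fast_computations


def _gpytorch_single_device_flags() -> list:
    # All gpytorch-specific settings live here, out of the generic function.
    return [debug(True), fast_computations(solves=False)]


def run_with_flags(flags, fn, *args, **kwargs):
    # Generic: enter whatever context managers it's given, run, and clean up.
    with ExitStack() as stack:
        for flag in flags:
            stack.enter_context(flag)
        return fn(*args, **kwargs)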

uv pip compile --universal --python-version 3.10 pyproject.toml --extra dev -o {env:DOCS_LOCKFILE_PATH:{env:DEFAULT_DOCS_LOCKFILE_PATH}} {posargs}

[testenv:gpu,gpu-py{310,311,312}]
description = Run PyTest with GPU support

My other big concern at the moment is that I don't know yet how we can properly test GPU usage in CI. I guess we need some special runners or something? Would be glad to hear your thoughts about it 🙃
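Until we have GPU runners, one common stopgap (not from this PR) is marking GPU tests so they're skipped on CPU-only machines but still run locally on GPU boxes:

import pytest
import torch

requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires a CUDA-capable GPU"
)


@requires_cuda
def test_tensor_lands_on_gpu():
    assert torch.ones(1, device="cuda").is_cuda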


To add to this:
ideally, we'd have a test that fails quickly when GPU computation is not working as expected. Can you elaborate on how you tested this while developing the extension? I.e., what were you looking for when executing code/tests? Did you look out for speed, or just for non-failure, etc.?
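As one possibility for such a quick-failing test: assert the device placement of the model's parameters directly, instead of relying on speed or mere non-failure (a sketch; the helper name is made up):

import torch


def assert_on_device(module: torch.nn.Module, expected: str) -> None:
    # Fails fast if any parameter silently fell back to another device.
    for name, param in module.named_parameters():
        assert param.device.type == expected, f"{name} is on {param.device}"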


Labels

new feature New functionality


Development

Successfully merging this pull request may close these issues.

GPU support
