Add GPU support to BotorchRecommender
#520
base: main
Conversation
Hey @kkovary, thanks for the contribution. We currently have a lot of work on our plates, so it might take a while until we get to have a detailed look, but thank you already!

No worries, no rush on my end.
Hi @kkovary. Great to see that you're working on this 👏🏼 Now that PyCon is over, I finally have some time to look at the PR. More comments/questions will follow, but here are some general questions first.
Putting this as a general comment for the PR here:

My biggest concern at the moment is all these `getattr` and `hasattr` expressions, which impact type safety and which I'd like to keep to an absolute minimum. One part of solving this brings us to a central question that needs answering: how and where do we want to control the device type?
I see three layers of control, with increasing level of precedence:

1. **Via environment variable** 🟢 You've basically already implemented this via the `BAYBE_USE_GPU` flag.
2. **Via a "settings variable" defined inside the running python session** 🟡 This will be added at a later point in time and is not directly related to the PR. Once it's there, we can let your `get_default_device` function read from it.
3. **Via corresponding class attributes** 🔴 I'm not yet sure what the right place to set this attribute is. At the moment, you're placing it on `BotorchRecommender`, but I could also imagine that it should instead become an attribute of the base class, or even of the surrogate/acquisition classes.

I don't want to bias you too much, hence I'll hold off sharing more thoughts until I've heard yours 🙃
```python
if self.device is None:
    self.device = getattr(self.surrogate, "device", None)
if self.device is None and torch.cuda.is_available():
    self.device = get_default_device()
```
Since this is purely related to the device attribute, I think it better belongs in a dedicated attrs default method (i.e., using `@device.default`).
```python
class _SingleDeviceMode:
    """Internal context manager that forces operations to happen on a single device."""
```
Notice that the way this class is written does not depend at all on the fact that it's used to handle devices, i.e., its implementation is that of a generic settings handler → rename?
```python
Args:
    device: The device to use. If None, uses the default device.
    manage_memory: If True, clears GPU memory before and after operations.
```
Can you comment on when to set this to true/false?
```python
if enforce_single_device:
    managers.extend(
        [_SingleDeviceMode(True), debug(True), fast_computations(solves=False)]
    )
```
Two comments:

- Can you elaborate on the purpose of this flag?
- I don't yet understand the reasons for the gpytorch-specific flags, but it's definitely suboptimal that gpytorch-specific logic is coupled to this general-purpose function. Maybe you can explain the motivation for it?
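One hedged way to address the coupling (a sketch, not a request for this exact design) is to let the caller inject backend-specific settings as context managers, so the general-purpose function never imports gpytorch; all names here are invented:

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def run_on_device(extra_managers=()):
    """Hypothetical general-purpose wrapper: device handling would live
    here, while backend-specific settings (e.g. gpytorch's debug or
    fast_computations) are passed in by the caller, not hard-coded."""
    with ExitStack() as stack:
        # Enter the injected managers in order; ExitStack unwinds them
        # in reverse order on exit, even if the body raises.
        for manager in extra_managers:
            stack.enter_context(manager)
        yield stack
```

The Botorch-facing code would then pass its gpytorch flags at the call site, keeping this function backend-agnostic.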
```ini
uv pip compile --universal --python-version 3.10 pyproject.toml --extra dev -o {env:DOCS_LOCKFILE_PATH:{env:DEFAULT_DOCS_LOCKFILE_PATH}} {posargs}

[testenv:gpu,gpu-py{310,311,312}]
description = Run PyTest with GPU support
```
My other big concern at the moment is that I don't know yet how we can properly test GPU usage in CI. I guess we need some special runners or something? Would be glad to hear your thoughts about it 🙃
To add to this: ideally, we have a test that fails quickly when GPU computation is not working as expected. Can you elaborate on how you tested this while developing this extension? I.e., what were you looking for when executing code/tests? Did you look for speed, or just for non-failure, etc.?
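To make the "fails quickly" idea concrete, here is a minimal, hypothetical helper (not from the PR) that a GPU test could call right after fitting; it only duck-types on the `.device` attribute that torch tensors expose, so it stays cheap and framework-light:

```python
def assert_on_device(tensors, expected: str) -> None:
    """Hypothetical fail-fast guard: raise immediately if any tensor-like
    object (anything exposing a .device attribute, as torch tensors do)
    is not on the expected device type, e.g. 'cuda'."""
    for t in tensors:
        actual = str(getattr(t, "device", "<no device attribute>"))
        if not actual.startswith(expected):
            raise AssertionError(
                f"expected device '{expected}', but found '{actual}'"
            )
```

A GPU test could then assert placement directly instead of inferring it from speed or from the absence of errors.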
closes #356
This is still a WIP! As you can tell, device handling is currently done in a couple of different ways; much of it has been guess-and-check, chasing down issues as they pop up (lots of whack-a-mole). It turns out my initial attempt enabled CUDA support but broke CPU support, and lots of tests were failing. At least now both CUDA and CPU are supported, and the tests appear to be passing.
Very open to suggestions, and I'll continue to iterate on a consistent way to handle devices.