
Conversation

kkovary commented Mar 30, 2025

closes #356

This is still a WIP! As you can tell, device handling is currently done in a couple of different ways; much of this has been guess-and-check, chasing down issues as they pop up (lots of whack-a-mole). It turns out my initial attempt enabled CUDA support but broke CPU support, and lots of tests were failing. At least now both CUDA and CPU are supported, and tests appear to be passing.

Very open to suggestions, and I'll continue to iterate on a consistent way to handle devices.

Scienfitz added the enhancement (Expand / change existing functionality) and new feature (New functionality) labels and removed the enhancement label on Apr 9, 2025
AVHopp (Collaborator) commented Apr 9, 2025

Hey @kkovary, thanks for the contribution. We currently have a lot on our plates, so it might take a while until we can have a detailed look, but thank you already!

kkovary (Author) commented Apr 9, 2025

no worries, no rush on my end

AdrianSosic marked this pull request as ready for review April 28, 2025 08:59
AdrianSosic marked this pull request as draft April 28, 2025 09:00
AdrianSosic (Collaborator) left a comment

Hi @kkovary. Great to see that you're working on this 👏🏼 Now that PyCon is over, I finally have some time to look at the PR. More comments/questions will follow, but here are some general questions first.


Putting this as a general comment for the PR here:

My biggest concern at the moment is all these getattr and hasattr expressions, which impact type safety and which I'd like to keep to an absolute minimum. Part of solving this brings us to a central question to be answered: How and where do we want to control the device type?

I see three layers of control, with increasing level of precedence:

  1. Via environment variable
    🟢 You've basically already implemented this via the BAYBE_USE_GPU flag
  2. Via a "settings variable" defined inside the running python session
    🟡 This will be added at a later point in time and is not directly related to this PR. Once it's there, we can let your get_default_device function read from it
  3. Via corresponding class attributes
    🔴 I'm not yet sure what the right place to set this attribute is. At the moment, you're placing it on BotorchRecommender, but I could also imagine it becoming an attribute of the base class, or even of the surrogate/acquisition classes.

Don't want to bias you too much, hence I'll hold off on sharing more thoughts until I've heard yours 🙃
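For concreteness, a rough sketch of how the three layers could compose, highest precedence first. Note that resolve_device is an illustrative name, not something from the PR, and the exact parsing of BAYBE_USE_GPU is a guess:

import os

import torch


def resolve_device(explicit_device: torch.device | None = None) -> torch.device:
    # Layer 3: an explicit class attribute (e.g. on the recommender) wins.
    if explicit_device is not None:
        return explicit_device
    # Layer 2: a session-level settings variable would be consulted here once it exists.
    # Layer 1: the BAYBE_USE_GPU environment variable as the global default.
    use_gpu = os.environ.get("BAYBE_USE_GPU", "").lower() in ("1", "true")
    if use_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")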

Comment on lines +104 to +107
if self.device is None:
self.device = getattr(self.surrogate, "device", None)
if self.device is None and torch.cuda.is_available():
self.device = get_default_device()

Since this is purely related to the device attribute, I think this belongs in a dedicated attrs default method (i.e., using @device.default)
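Roughly what I have in mind, mirroring the snippet above (just a sketch; get_default_device is stubbed in here since the import location depends on the PR):

import torch
from attrs import define, field


def get_default_device() -> torch.device:
    # Stand-in for the PR's helper of the same name.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@define
class BotorchRecommender:
    surrogate: object = field(default=None)
    device: torch.device | None = field()

    @device.default
    def _default_device(self) -> torch.device | None:
        # Same logic as the reviewed snippet, moved into an attrs default.
        device = getattr(self.surrogate, "device", None)
        if device is None and torch.cuda.is_available():
            device = get_default_device()
        return device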

Comment on lines +13 to +14
class _SingleDeviceMode:
"""Internal context manager that forces operations to happen on a single device."""

Notice that the way this class is written does not depend at all on the fact that it's used to handle devices, i.e. its implementation is that of a generic settings handler --> rename?
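E.g. something as generic as this (illustrative only):

class _FlagContext:
    """Generic on/off setting usable as a context manager -- nothing device-specific."""

    _state = False

    def __init__(self, state: bool = True):
        self._desired = state

    def __enter__(self):
        cls = type(self)
        self._previous = cls._state
        cls._state = self._desired
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        type(self)._state = self._previous
        return False

    @classmethod
    def on(cls) -> bool:
        return cls._state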

Args:
    device: The device to use. If None, uses the default device.
    manage_memory: If True, clears GPU memory before and after operations.

Can you comment on when to set this to True/False?
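For reference, my current understanding of what the flag would guard, in case it helps frame the answer (this is a guess, not the PR's actual implementation):

import torch


def _maybe_clear_gpu_memory(manage_memory: bool) -> None:
    # Presumed meaning of "clears GPU memory": release cached allocator blocks
    # back to the driver. Useful when other processes need the memory, but it
    # slows down subsequent allocations, so it shouldn't be unconditional.
    if manage_memory and torch.cuda.is_available():
        torch.cuda.empty_cache()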

Comment on lines +108 to +111
if enforce_single_device:
managers.extend(
[_SingleDeviceMode(True), debug(True), fast_computations(solves=False)]
)

Two comments:

  1. Can you elaborate on the purpose of this flag?
  2. I don't yet understand the reasons for the gpytorch-specific flags, and it's definitely suboptimal that gpytorch-specific logic is coupled to this general-purpose function. Maybe you can explain the motivation for it?
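One way to decouple it would be to keep the gpytorch-specific flags in their own helper and have the general-purpose function consume an opaque list of context managers. A sketch (the function names are made up):

from contextlib import ExitStack

from gpytorch.settings import debug, fast_computations


def _gpytorch_single_device_flags() -> list:
    # All gpytorch-specific settings live here, out of the generic function.
    return [debug(True), fast_computations(solves=False)]


def run_with_flags(flags, fn, *args, **kwargs):
    # Generic: enter whatever context managers it's given, run, and clean up.
    with ExitStack() as stack:
        for flag in flags:
            stack.enter_context(flag)
        return fn(*args, **kwargs)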

uv pip compile --universal --python-version 3.10 pyproject.toml --extra dev -o {env:DOCS_LOCKFILE_PATH:{env:DEFAULT_DOCS_LOCKFILE_PATH}} {posargs}

[testenv:gpu,gpu-py{310,311,312}]
description = Run PyTest with GPU support

My other big concern at the moment is that I don't know yet how we can properly test GPU usage in CI. I guess we need some special runners or something? Would be glad to hear your thoughts about it 🙃
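Until we have GPU runners, one common stopgap (not from this PR) is marking GPU tests so they're skipped on CPU-only machines but still run locally on GPU boxes:

import pytest
import torch

requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires a CUDA-capable GPU"
)


@requires_cuda
def test_tensor_lands_on_gpu():
    assert torch.ones(1, device="cuda").is_cuda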


To add to this:
ideally, we'd have a test that fails quickly when GPU computation is not working as expected. Can you elaborate on how you tested this while developing the extension? I.e., what were you looking for when executing code/tests? Did you look out for speed, or just for non-failure, etc.?
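As one possibility for such a quick-failing test: assert the device placement of the model's parameters directly, instead of relying on speed or mere non-failure (a sketch; the helper name is made up):

import torch


def assert_on_device(module: torch.nn.Module, expected: str) -> None:
    # Fails fast if any parameter silently fell back to another device.
    for name, param in module.named_parameters():
        assert param.device.type == expected, f"{name} is on {param.device}"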


Labels

new feature New functionality


Development

Successfully merging this pull request may close these issues.

GPU support
