
Conversation

@AVHopp AVHopp commented May 7, 2025

Replaces the custom IndexKernel construction with BoTorch's MultiTaskGP (which became possible due to the added all_tasks argument).
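
For reference, here is a minimal standalone sketch (not the actual code of this PR) of what the all_tasks argument enables: constructing a MultiTaskGP whose task covariance also covers tasks that have no observations in the training data.

import torch
from botorch.models import MultiTaskGP

# Toy data: two features plus a task-index column (the last column).
# Tasks 0 and 1 are observed; task 2 has no data yet, which is what
# the all_tasks argument makes expressible.
X = torch.rand(8, 2, dtype=torch.double)
tasks = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.double).unsqueeze(-1)
train_X = torch.cat([X, tasks], dim=-1)
train_Y = torch.rand(8, 1, dtype=torch.double)

model = MultiTaskGP(
    train_X=train_X,
    train_Y=train_Y,
    task_feature=-1,      # the task index sits in the last column
    all_tasks=[0, 1, 2],  # declare tasks beyond those present in train_X
)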

@AVHopp AVHopp marked this pull request as draft May 7, 2025 07:40
@AVHopp AVHopp changed the title from "Tl benchmarking investigation" to "Use Botorch MultiTaskGP for transfer learning" May 7, 2025
@Hrovatin Hrovatin force-pushed the tl_benchmarking_investigation branch 2 times, most recently from 8fee382 to 88e1dfe June 4, 2025 11:18
@Hrovatin Hrovatin marked this pull request as ready for review June 5, 2025 10:39
@AdrianSosic AdrianSosic left a comment

Hi @Hrovatin, here is the first batch of comments.

@Hrovatin Hrovatin requested a review from AdrianSosic June 6, 2025 14:40
Scienfitz commented Aug 15, 2025

@Hrovatin would you consider abandoning this PR? I think if this topic is picked up again, it's better to start afresh (and only open a PR after investigations have concluded).

@Hrovatin
@Scienfitz I would keep it open, as the main blocker for this was randomness in the benchmarks. Since that may be solved now, I would suggest running the benchmarks again on the new HPC (we need to confirm it is also reproducible there).

@Scienfitz
@Hrovatin any update?

Hrovatin commented Sep 9, 2025

No, I first need to set up testing on oneHPC to benchmark reproducibly, as that seems to be the only option to make the benchmarks fully reproducible. I will post an update here once I have the results. @Scienfitz

@Copilot Copilot AI review requested due to automatic review settings September 12, 2025 11:22
@Hrovatin Hrovatin force-pushed the tl_benchmarking_investigation branch from 8ce5fba to bee32aa September 12, 2025 11:22
@Copilot Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Hrovatin commented Sep 17, 2025

@AdrianSosic @Scienfitz @AVHopp Update on the comparison of BoTorch's MultiTaskGP and the current kernel:

  • The results are not identical, but very close, except for Michalewicz (though the variation there is likely not significant either).
  • A concern: when using the BoTorch MultiTaskGP, the Hartmann TL benchmark always fails due to OOM (when using 0.05, but not 0.01, of the source data). I have not yet figured out why. Before investigating this, we should probably decide whether we are OK with accepting some deviation from the current main (named benchmarks-reproducibility-beforeBug in the plot). If we decide we need 100% reproducibility anyway, it also does not make sense to investigate any other issues further.
[image: benchmark comparison plot]

@AVHopp AVHopp left a comment

First round of comments, but we should discuss some of the points (in particular the one regarding multiple active values) internally first.

@Hrovatin Hrovatin force-pushed the tl_benchmarking_investigation branch from de81707 to 68a9c24 September 25, 2025 07:13
@AVHopp AVHopp left a comment

I would be willing to approve; however, since this is technically my PR, I can't.

Hrovatin commented Oct 2, 2025

Results after rebase:
Note:

  • The Hartmann TL benchmark did not run for the new branch due to issues in my local setup (the plot shows only the main branch), but I verified that it runs successfully in the CI actions.
  • Reproducibility is in general not 100% (even when not using the TL code that was changed).
[image: benchmark results after rebase]

@AdrianSosic AdrianSosic force-pushed the tl_benchmarking_investigation branch 4 times, most recently from 5cfb366 to 7bb49d9 October 6, 2025 08:50
@Scienfitz Scienfitz left a comment

Let's do the final check referenced here and merge if everything is alright.

Hrovatin and others added 23 commits October 9, 2025 11:26
@AdrianSosic AdrianSosic force-pushed the tl_benchmarking_investigation branch from 953b609 to 6c3dd93 October 9, 2025 09:26
@Hrovatin
@AdrianSosic even after the new rebase, this branch and main still differ, even in the naive case (when no TaskParam/kernel is used).
Here is the info about the branches I used (both from 13.10.2025):
Main last commit:

commit 823558afa173b5cfaecb23c6cb5431ef7e28088b (HEAD -> main, origin/main, origin/HEAD)
Merge: 48046272c 4206a24da
Author: Martin Fitzner <[email protected]>
Date:   Thu Oct 9 15:54:14 2025 +0200

This PR branch's last commit:

commit 075c33322bb36051cdf5efd6d7ee672621edc5a7 (HEAD -> tl_benchmarking_investigation, origin/tl_benchmarking_investigation)
Author: AdrianSosic <[email protected]>
Date:   Thu Oct 9 13:16:08 2025 +0200

[image: benchmark comparison after rebase]

Hrovatin commented Oct 15, 2025

@AVHopp and @AdrianSosic the issue is indeed the handling of active dimensions in the base kernel (as speculated here).

Changing the active dims as in this branch ensures that in the naive case the outcome exactly matches the main branch, while in the non-naive case the outcome stays as it was on this branch.

These are the equality analyses for Michalewicz (as that one was usually the most obviously discrepant); an output of True means the results are equal:

tl-benchmarking-investigation-activeDims and main in naive True:
True
tl-benchmarking-investigation-activeDims and main in naive False:
False
tl-benchmarking-investigation-activeDims and tl-benchmarking-investigation in naive True:
False
tl-benchmarking-investigation-activeDims and tl-benchmarking-investigation in naive False:
True
main and tl-benchmarking-investigation in naive True:
False
main and tl-benchmarking-investigation in naive False:
False
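
For reference, a minimal sketch of how such an equality check could look, assuming the benchmark results are stored as CSVs with a CumBest column (file names and layout are hypothetical):

import pandas as pd

# True iff the two result files have bit-identical CumBest traces.
def cumbest_equal(path_a: str, path_b: str) -> bool:
    a = pd.read_csv(path_a)
    b = pd.read_csv(path_b)
    return a["CumBest"].equals(b["CumBest"])

# E.g., comparing the naive runs of two branches:
print(cumbest_equal("michalewicz_main_naive.csv",
                    "michalewicz_activeDims_naive.csv"))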

covar_module = kernel.to_gpytorch(
    ard_num_dims=kernel_num_dims,
    batch_shape=batch_shape,
    active_dims=tuple(range(kernel_num_dims)),
)
Interesting 🤔 Can you explain how exactly I have to understand these prints?

Branches

  • tl-benchmarking-investigation-activeDims - the branch where I specify the active dims directly as tuple(range(train_x.shape[-1] - context.n_task_dimensions)).
  • tl-benchmarking-investigation - the version of this branch before the last commit (i.e., before I added active_dims). To my understanding of GPyTorch, the two should be equivalent, as inputting None just uses all dimensions (see the sketch at the end of this comment). Note that in BoTorch's MultiTaskGP the data is split into non-task and task parts, so both in the TL case and in the naive case (with SingleTaskGP and no task kernel) one would just use the integers [0:n-1] for the base_kernel.
  • main - the main branch with the last commit from 9.10.

Naive

  • naive=True uses no task, i.e., no TL
  • naive=False uses a task, i.e., TL

The True/False printed below each line indicates whether the two branches had exactly the same CumBest result in the naive or TL setting, respectively.

All comparisons were done on the full Michalewicz domain benchmark.

You can see that tl-benchmarking-investigation-activeDims retains the performance of main in the naive setting while matching the current tl-benchmarking-investigation branch in the TL setting. I am not sure why setting the active_dims makes a difference only in the naive setting (and not in TL), but at least we now have full reproducibility for the naive setting. The TL setting is still not 100% reproducible against main (in general, due to using MultiTaskGP), but I think we already decided this is acceptable.
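
To illustrate the equivalence claim about active_dims, here is a minimal standalone GPyTorch sketch (assuming a recent GPyTorch; this is not the code of this PR):

import torch
from gpytorch.kernels import MaternKernel

x = torch.randn(10, 3, dtype=torch.double)

# active_dims=None lets the kernel act on all input dimensions ...
k_none = MaternKernel(nu=2.5, ard_num_dims=3)
# ... which should be equivalent to listing every dimension explicitly.
k_expl = MaternKernel(nu=2.5, ard_num_dims=3, active_dims=(0, 1, 2))

with torch.no_grad():
    same = torch.equal(k_none(x).to_dense(), k_expl(x).to_dense())
print(same)  # expected: True, i.e., the covariance matrices match

If this prints True, the remaining differences presumably enter elsewhere in the model construction rather than in the kernel evaluation itself.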
