[BUG]: Fix TimeXer autograd graph detachment and deepcopy crash#2179
Siddhazntx wants to merge 4 commits into sktime:main from
Conversation
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
##           main    #2179   +/-  ##
=======================================
  Coverage      ?   86.62%
=======================================
  Files         ?      165
  Lines         ?     9737
  Branches      ?        0
=======================================
  Hits          ?     8435
  Misses        ?     1302
  Partials      ?        0
=======================================
```
Flags with carried forward coverage won't be shown.
```diff
-            loss = MultiLoss([MAE()] * len(self.target_positions))
-        else:
-            loss = MAE()
+        loss = MAE()
```
why are you removing the MultiLoss from here?
Originally took it out because the old way ([MAE()] * len(...)) shared the exact same metric object in memory, which was actually what caused the autograd graph to detach. But you're totally right. I'll add it back, but I'll use a list comprehension ([MAE() for _ in range(...)]) so each target gets its own independent metric. That should fix the bug and keep the graph stable!
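The aliasing pitfall described above is easy to reproduce without any of the forecasting machinery. A minimal sketch, using a hypothetical `Metric` stand-in rather than the real `MAE` class:

```python
class Metric:
    """Stand-in for a stateful metric such as MAE (hypothetical, for illustration)."""
    def __init__(self):
        self.state = 0.0

# List multiplication copies the *reference*, not the object:
# all three slots point at one shared Metric instance.
shared = [Metric()] * 3
shared[0].state = 42.0
print(shared[1].state)       # 42.0 -- mutated through another slot

# A list comprehension constructs one independent instance per slot.
independent = [Metric() for _ in range(3)]
independent[0].state = 42.0
print(independent[1].state)  # 0.0 -- untouched
```

With shared instances, the forward pass over several targets overwrites the same internal state, which is exactly how the autograd graph ended up detached.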
```python
if len(target_positions) == 1:
    prediction = prediction[..., 0, :]
if len(target_positions) > 1:
    prediction = torch.stack(
```
why are you using torch.stack? I think for multi-target, it should be a list of tensors?
My bad on that one! I was trying to format the 3D tensor output and just defaulted to torch.stack without thinking. A list is definitely the right way to go here. I'll update it!
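The difference between the two shapes is quickly visible in a small sketch. NumPy is used here as a stand-in (shapes and the array contents are hypothetical); `torch.stack` behaves analogously to `np.stack`:

```python
import numpy as np

# Two per-target predictions, each of shape (batch, horizon).
t1 = np.zeros((4, 3))
t2 = np.ones((4, 3))

# Stacking fuses them into a single array with a new leading target axis ...
stacked = np.stack([t1, t2])
print(stacked.shape)                  # (2, 4, 3)

# ... whereas a plain list keeps one separate tensor per target, which is
# the structure a multi-target loss like MultiLoss iterates over.
per_target = [t1, t2]
print([a.shape for a in per_target])  # [(4, 3), (4, 3)]
```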
… a list per reviewer request
…tiLoss initialization
Hey @phoeenniixx, just pushed a quick follow-up commit. The last update tripped a unit test because of how I tried to check …; I swapped it to safely check … instead.
Reference Issues/PRs
Fixes #2110
What does this implement/fix? Explain your changes.
This PR resolves the `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` that caused the automated CI to fail during TimeXer integration tests (specifically `[TimeXer-base_params-3-RMSE]`).
Root Cause & Fixes:
Autograd Graph Detachment (Shared State Bug): When initializing TimeXer for multivariate forecasting (features="M"), MultiLoss was being initialized using list multiplication ([MAE()] * len(...)). This created multiple pointers to the exact same metric instance. During the forward pass for multiple targets, the internal state was overwritten, breaking the autograd graph and detaching the final loss tensor.
Change: Updated the initialization to use a list comprehension ([MAE() for _ in range(...)]), ensuring every target receives an independent metric instance.
Deepcopy Crash on Hyperparameter Save: self.save_hyperparameters() was attempting to deepcopy initialized, complex metric objects and live tensors, which PyTorch natively blocks, causing local test crashes.
Change: Added ignore=["loss", "logging_metrics"] to save_hyperparameters() and ensured it executes before super().__init__() wraps them.
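The `ignore` pattern can be sketched without Lightning at all. The `save_hyperparameters` function and the `Unpicklable` class below are hypothetical stand-ins, a minimal sketch of why excluding live objects from the deepcopy avoids the crash:

```python
import copy

class Unpicklable:
    """Stand-in for a live metric or tensor that refuses deep-copying (hypothetical)."""
    def __deepcopy__(self, memo):
        raise RuntimeError("cannot deepcopy a live object")

def save_hyperparameters(hparams, ignore=()):
    """Sketch of the ignore pattern: deepcopy everything except the ignored keys."""
    return {k: copy.deepcopy(v) for k, v in hparams.items() if k not in ignore}

hparams = {"hidden_size": 64, "loss": Unpicklable(), "logging_metrics": Unpicklable()}

# copy.deepcopy(hparams) would raise RuntimeError here; skipping the live
# objects via ignore lets the plain hyperparameters be saved cleanly.
saved = save_hyperparameters(hparams, ignore=["loss", "logging_metrics"])
print(sorted(saved))  # ['hidden_size']
```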
What should a reviewer concentrate their feedback on?
The updated initialization order in `TimeXer.__init__`.
The list comprehension logic for MultiLoss to ensure it aligns with the expected metric tracking structure in PyTorch Lightning.
Did you add any tests for the change?
No new tests were added, as this PR fixes the existing, failing CI integration tests. Validated the changes locally by running the full test suite in headless mode (pytest pytorch_forecasting/tests/test_all_estimators.py -k "TimeXer"), resulting in a clean pass for all 42 variants.
PR checklist
`pre-commit install`. To run hooks independent of commit, execute `pre-commit run --all-files`.