[BUG] Fix logging_metrics not registered as submodules in v2 BaseModel (#2197)#2203
Gitanaskhan26 wants to merge 1 commit into sktime:main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #2203   +/-   ##
=======================================
  Coverage        ?   86.58%
=======================================
  Files           ?      165
  Lines           ?     9736
  Branches        ?        0
=======================================
  Hits            ?     8430
  Misses          ?     1306
  Partials        ?        0
=======================================
I request that you close this PR, as the scope and nature of the referenced issue has not been ascertained yet and I am already working on it.
Hi @StrikerEureka34, thanks for reporting the issue. The reproduction script was very helpful in tracking this down. I didn't see an open PR or an assignment on the issue, so I went ahead with a fix. I'm happy to let the maintainers decide how they'd like to proceed. If there's anything you'd like me to adjust in this implementation, just let me know!

I get your point, but the issue was still being investigated, and it's premature to raise a PR.

Understood, and I apologize if it felt like I was stepping on your toes! Since the issue was unassigned and didn't have a comment claiming a WIP, I just jumped in when I saw the fix. I will definitely make sure to ask beforehand on future issues. I'll leave this open for the maintainers to evaluate whenever they have bandwidth. If your investigation led to a different architectural approach, I'm more than happy to collaborate on this or adjust my implementation!

No worries, my only concern was that it leads to duplication of effort and wastes a lot of time.
I'll raise a draft PR soon if this bug checks out. Looking forward to your thoughts there, thanks!
logging_metrics was stored as a plain Python list, so PyTorch did not register metrics as submodules. When the model was moved to GPU, metric internal state buffers (losses, lengths) remained on CPU, causing a RuntimeError device mismatch during training/validation/test steps. Wrap logging_metrics in nn.ModuleList to match the v1 implementation. Add regression tests verifying submodule registration. Fixes sktime#2197.
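The failure mode described above can be reproduced without torchmetrics: any `nn.Module` holding state buffers behaves like a metric for this purpose. A minimal sketch (the `FakeMetric`, `BuggyModel`, and `FixedModel` names are illustrative stand-ins, not sktime code; `"meta"` is used as the target device so the move is observable on CPU-only machines):

```python
import torch
import torch.nn as nn


# Stand-in for a torchmetrics metric: an nn.Module holding state
# buffers (analogous to state registered via add_state()).
class FakeMetric(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("losses", torch.zeros(1))


class BuggyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # BUG: a plain Python list is invisible to PyTorch's module system
        self.logging_metrics = [FakeMetric()]


class FixedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # FIX: nn.ModuleList registers each metric as a submodule
        self.logging_metrics = nn.ModuleList([FakeMetric()])


buggy, fixed = BuggyModel(), FixedModel()
print(any(isinstance(m, FakeMetric) for m in buggy.modules()))  # False
print(any(isinstance(m, FakeMetric) for m in fixed.modules()))  # True

# Device moves only propagate to registered submodules:
fixed.to("meta")
print(fixed.logging_metrics[0].losses.device.type)  # meta
buggy.to("meta")
print(buggy.logging_metrics[0].losses.device.type)  # cpu (left behind)
```

The same mismatch surfaces on a real GPU move (`.to("cuda")`): the metric's state tensors stay on CPU while the model's parameters move, and the first metric update raises the device-mismatch `RuntimeError`.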
Reference Issues/PRs
Fixes #2197.
What does this implement/fix? Explain your changes.
`BaseModel.__init__()` in `_base_model_v2.py` stored `logging_metrics` as a plain Python list instead of `nn.ModuleList`. PyTorch only tracks submodules assigned as `nn.Module`, `nn.ModuleList`, or `nn.ModuleDict` attributes. Because metrics were in a plain list, they were not registered as submodules, so when `model.to("cuda")` was called, metric internal state buffers (`losses`, `lengths` registered via `add_state()`) stayed on CPU. This caused a `RuntimeError: Expected all tensors to be on the same device` during `training_step`/`validation_step`/`test_step`. All 5 existing v2 models are affected (DLinear, TimeXer, SAMformer, TFT, TiDE).
The fix wraps `logging_metrics` in `nn.ModuleList`, matching the v1 implementation at `_base_model.py:553-558`.

What should a reviewer concentrate their feedback on?

- The change in `_base_model_v2.py`
- Whether accessing `metric._defaults` to reach torchmetrics state tensors is acceptable, or if there is a cleaner public API

Did you add any tests for the change?
Yes, three tests appended to `tests/test_models/test_dlinear_v2.py`:

- `test_logging_metrics_is_module_list` — asserts the container type and that metrics appear in `model.modules()`
- `test_empty_logging_metrics_is_module_list` — asserts the empty case also uses `nn.ModuleList`
- `test_logging_metrics_device_propagation` — moves the model to `torch.device("meta")` and checks that metric state tensors follow. This catches the bug on CPU-only CI, since `.to("meta")` is a real device move, unlike `.to("cpu")`, which is a no-op when already on CPU.

Any other comments?
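For reviewers skimming the conversation, the device-propagation check can be sketched in isolation. This is a hypothetical simplification: the real tests build a DLinear v2 model from the `sample_dataset` fixture, while this uses a toy module (`ToyMetric`, `make_model` are invented names) with a registered buffer as a stand-in for a metric:

```python
import torch
import torch.nn as nn


class ToyMetric(nn.Module):
    """Stand-in for a torchmetrics metric with a registered state buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("total", torch.zeros(1))


def make_model():
    model = nn.Linear(4, 1)
    # after the fix, metrics live in an nn.ModuleList, so they register
    model.logging_metrics = nn.ModuleList([ToyMetric()])
    return model


def test_logging_metrics_is_module_list():
    model = make_model()
    assert isinstance(model.logging_metrics, nn.ModuleList)
    assert any(isinstance(m, ToyMetric) for m in model.modules())


def test_logging_metrics_device_propagation():
    # .to("meta") is a real device move even on CPU-only CI
    model = make_model().to(torch.device("meta"))
    assert model.logging_metrics[0].total.device.type == "meta"


test_logging_metrics_is_module_list()
test_logging_metrics_device_propagation()
```

With the old plain-list assignment, the second test would fail: the buffer would still report `cpu` after the move.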
Tests reuse the existing `sample_dataset` fixture in `test_dlinear_v2.py`.

PR checklist