[BUG] Fix logging_metrics not registered as submodules in v2 BaseModel (#2197)#2203

Open
Gitanaskhan26 wants to merge 1 commit into sktime:main from Gitanaskhan26:fix/logging-metrics-module-list

Conversation


@Gitanaskhan26 Gitanaskhan26 commented Mar 17, 2026

Reference Issues/PRs

Fixes #2197.

What does this implement/fix? Explain your changes.

BaseModel.__init__() in _base_model_v2.py stored logging_metrics as a plain Python list instead of nn.ModuleList. PyTorch only tracks submodules assigned as nn.Module, nn.ModuleList, or nn.ModuleDict attributes. Because metrics were in a plain list, they were not registered as submodules, so when model.to("cuda") was called, metric internal state buffers (losses, lengths registered via add_state()) stayed on CPU. This caused a RuntimeError: Expected all tensors to be on the same device during training_step / validation_step / test_step.
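The registration behavior described above can be reproduced in a few lines. This is a minimal sketch using built-in nn loss modules as stand-ins for torchmetrics metrics; the class names are illustrative, not sktime code:

```python
from torch import nn

class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: nn.Module.__setattr__ does not intercept it,
        # so the metrics are invisible to .modules(), .to(), .state_dict().
        self.logging_metrics = [nn.L1Loss(), nn.MSELoss()]

class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList is itself an nn.Module, so its children are
        # registered and follow the parent through .to(device).
        self.logging_metrics = nn.ModuleList([nn.L1Loss(), nn.MSELoss()])

print(len(list(PlainListModel().modules())))   # 1: only the model itself
print(len(list(ModuleListModel().modules())))  # 4: model + ModuleList + 2 metrics
```

Because the plain-list model reports only itself as a module, any call that recurses over submodules (`.to()`, `.cuda()`, `.state_dict()`) silently skips the metrics.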

All 5 existing v2 models are affected (DLinear, TimeXer, SAMformer, TFT, TiDE).

The fix wraps logging_metrics in nn.ModuleList, matching the v1 implementation at _base_model.py:553-558.
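In sketch form (an illustrative stand-in, not the actual sktime source), the change amounts to:

```python
from torch import nn

class BaseModel(nn.Module):
    """Illustrative stand-in for the v2 BaseModel, not the sktime code."""

    def __init__(self, logging_metrics=None):
        super().__init__()
        # Before: self.logging_metrics = logging_metrics  (plain list)
        # After: wrap in nn.ModuleList so PyTorch registers each metric
        # as a submodule, matching the v1 implementation.
        self.logging_metrics = nn.ModuleList(logging_metrics or [])

model = BaseModel([nn.L1Loss()])
assert isinstance(model.logging_metrics, nn.ModuleList)
assert any(isinstance(m, nn.L1Loss) for m in model.modules())
```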

What should a reviewer concentrate their feedback on?

  • The one-line production fix in _base_model_v2.py
  • The three regression tests, in particular whether iterating metric._defaults to reach torchmetrics state tensors is acceptable or if there is a cleaner public API

Did you add any tests for the change?

Yes, three tests appended to tests/test_models/test_dlinear_v2.py:

  • test_logging_metrics_is_module_list — asserts the container type and that metrics appear in model.modules()
  • test_empty_logging_metrics_is_module_list — asserts empty case also uses nn.ModuleList
  • test_logging_metrics_device_propagation — moves the model to torch.device("meta") and checks that metric state tensors follow. This catches the bug on CPU-only CI, since .to("meta") is a real device move, unlike .to("cpu"), which is a no-op when the model is already on CPU.
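The meta-device check can be sketched as follows. A toy metric with a registered buffer stands in for a torchmetrics metric, and the class names are illustrative, not the actual test code:

```python
import torch
from torch import nn

class ToyMetric(nn.Module):
    # Stand-in for a torchmetrics metric: running state lives in a
    # registered buffer, so it should follow the parent through .to().
    def __init__(self):
        super().__init__()
        self.register_buffer("total", torch.zeros(1))

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(4, 2)
        self.logging_metrics = nn.ModuleList([ToyMetric()])

model = Model().to(torch.device("meta"))
# "meta" is a real device move even on CPU-only machines, so it catches
# any submodule whose state was left behind on the original device.
assert model.logging_metrics[0].total.device.type == "meta"
```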

Any other comments?

Tests reuse the existing sample_dataset fixture in test_dlinear_v2.py.

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. → [BUG]
  • Added/modified tests
  • Used pre-commit hooks when committing


codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@edbdeb4). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2203   +/-   ##
=======================================
  Coverage        ?   86.58%           
=======================================
  Files           ?      165           
  Lines           ?     9736           
  Branches        ?        0           
=======================================
  Hits            ?     8430           
  Misses          ?     1306           
  Partials        ?        0           
Flag Coverage Δ
cpu 86.58% <100.00%> (?)
pytest 86.58% <100.00%> (?)


StrikerEureka34 commented Mar 17, 2026

I request that you close this PR, as the scope and nature of the referenced issue have not been ascertained yet and I am already working on it.
Appreciate the help though, thanks!

@Gitanaskhan26
Author

Hi @StrikerEureka34, thanks for reporting the issue. The reproduction script was very helpful in tracking this down. I didn't see an open PR or assignment on the issue, so I went ahead with a fix. I'm happy to let the maintainers decide how they'd like to proceed. If there's anything you'd like me to adjust in this implementation, just let me know!

@StrikerEureka34

> I didn't see an open PR or assignment on the issue, so I went ahead with a fix. I'm happy to let the maintainers decide how they'd like to proceed. If there's anything you'd like me to adjust in this implementation, just let me know!

I get your point, but the issue was still being investigated and it was premature to raise a PR.
Also, before raising a PR, please ask whether the author of the issue is working on it; in this case I was.

@Gitanaskhan26
Author

> please ask if the author of the issue was working on it or not, which in this case I was.

Understood, and I apologize if it felt like I was stepping on your toes! Since the issue was unassigned and didn't have a comment claiming a WIP, I just jumped in when I saw the fix. I will definitely make sure to ask beforehand on future issues.

I'll leave this open for the maintainers to evaluate whenever they have bandwidth. If your investigation led to a different architectural approach, I’m more than happy to collaborate on this or adjust my implementation!


StrikerEureka34 commented Mar 17, 2026

> Understood, and I apologize if it felt like I was stepping on your toes! Since the issue was unassigned and didn't have a comment claiming a WIP, I just jumped in when I saw the fix. I will definitely make sure to ask beforehand on future issues.

No worries, my only concern was that it leads to duplication of efforts and wastes a lot of time.

> I’m more than happy to collaborate on this

I'll raise a draft PR soon if this bug checks out; looking forward to your thoughts there, thanks!

Commit message:

logging_metrics was stored as a plain Python list, so PyTorch did not
register metrics as submodules. When the model was moved to GPU, metric
internal state buffers (losses, lengths) remained on CPU, causing a
RuntimeError device mismatch during training/validation/test steps.

Wrap logging_metrics in nn.ModuleList to match the v1 implementation.
Add regression tests verifying submodule registration.

Fixes sktime#2197.
Gitanaskhan26 force-pushed the fix/logging-metrics-module-list branch from e4c37f5 to 5019d3f on March 17, 2026 at 09:12.


Development

Successfully merging this pull request may close these issues.

[BUG] v2 BaseModel stores logging_metrics as plain list instead of nn.ModuleList (breaks GPU training)

2 participants