
Bump PyTorch pin to nightly dev20260628 #20590

Open
pytorchupdatebot wants to merge 4 commits into main from automated/pytorch-pin-bump-dev20260628

Conversation

@pytorchupdatebot

Collaborator

Summary

Automated weekly PyTorch pin bump.

  • Updates NIGHTLY_VERSION in torch_pin.py to dev20260628
  • Updates .ci/docker/ci_commit_pins/pytorch.txt to the corresponding nightly commit hash
  • Syncs c10 headers from PyTorch into runtime/core/portable_type/c10/

This PR was created automatically. If CI fails, Claude will attempt to fix issues (up to 3 attempts). If CI still fails, human review will be requested.

cc @jakeszwe

@pytorch-bot

pytorch-bot Bot commented Jun 29, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20590

Note: Links to docs will display an error until the docs builds have been completed.

❌ 50 New Failures, 130 Cancelled Jobs, 1 Pending, 1 Unrelated Failure, 11 Unclassified Failures

As of commit 1a75242 with merge base db5f1ba:

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 29, 2026
@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@pytorchupdatebot

Collaborator Author

@claude [ci-fix-attempt 1/3]

The trunk CI workflow has failed on this automated PyTorch pin bump PR.
Failed run: https://github.com/pytorch/executorch/actions/runs/28365950466

Please:

  1. Read the Dr. CI comment on this PR for a summary of which jobs failed and whether they are flaky. Ignore failures marked as FLAKY.
  2. Use your CI tools to download the failure logs for the non-flaky failing jobs
  3. Identify the root cause of the failure
  4. If this is a build or test failure caused by PyTorch API changes, fix the ExecuTorch code to be compatible with the new PyTorch version
  5. If this is a c10 header sync issue, the headers have already been synced by the pin bump script — the issue is likely in ExecuTorch code that uses those headers
  6. Run lintrunner -a on any files you change
  7. Push your fix as a new commit to this PR branch

Important constraints:

  • Do NOT modify torch_pin.py or .ci/docker/ci_commit_pins/pytorch.txt — the pin itself is correct
  • Do NOT modify files under runtime/core/portable_type/c10/ unless the sync introduced a new API that ExecuTorch code needs to adapt to
  • Focus on fixing ExecuTorch code to be compatible with the new PyTorch APIs
  • If this is a major BC-breaking change that requires architectural discussion, say so clearly and stop — do not attempt a fix

@claude

claude Bot commented Jun 29, 2026


Claude finished @pytorchupdatebot's task in 11m 54s —— View job


CI Fix — Attempt 1/3

  • Read Dr. CI summary, identify non-flaky failures
  • Download failure logs for root cause
  • Identify root cause
  • Assess whether this is fixable in ExecuTorch code

Root cause

All 20 non-flaky NEW FAILURES share a single root cause. Every one of the test-models-* quantization-delegation jobs dies with:

File ".../torchao/quantization/pt2e/quantizer/port_metadata_pass.py", line 181, in _port_metadata_for_input_quant_nodes
    torch.ops.aten.transpose.Dimname,
File ".../torch/_ops.py", line 1243, in __getattr__
    raise AttributeError(...)
AttributeError: The underlying op of 'aten.transpose' has no overload name 'Dimname'
Exception: An error occurred when running the 'PortNodeMetaForQDQ' pass

The new PyTorch nightly (dev20260628) removed the aten.transpose.Dimname overload. The code that still references it lives in the installed torchao package (torchao/quantization/pt2e/quantizer/port_metadata_pass.py), invoked from convert_pt2e → PortNodeMetaForQDQ. The RuntimeError ... exit code 139 (segfault) macOS failures are downstream of the same broken export.

This is not fixable in ExecuTorch code

I grepped the entire ExecuTorch tree for .Dimname / transpose.Dimname: zero references. ExecuTorch does not use this overload anywhere. The failing code is entirely inside torchao, which ExecuTorch consumes via the third-party/ao submodule pin (installed by install_requirements.py, line 74). There is no ExecuTorch-side shim that can make torch.ops.aten.transpose.Dimname resolve again, because the attribute lookup happens inside torchao's pass before any ExecuTorch code runs.
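The tree-wide search described above can be approximated with a short script. This is a minimal sketch, not the command that was actually run; the function name `find_references` and the file-suffix filter are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches `.Dimname` used as an attribute, e.g. `aten.transpose.Dimname`.
PATTERN = re.compile(r"\.Dimname\b")


def find_references(root, suffixes=(".py", ".cpp", ".h")):
    """Return (path, line_number, line) for every match under `root`.

    `suffixes` is an assumed filter; widen it to cover other file types.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="ignore" keeps the scan going past non-UTF-8 bytes.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty result from `find_references(".")` at the repository root would correspond to the "zero references" finding, provided the suffix filter covers every file type of interest.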

Per the stop condition in the request ("If this is a major BC-breaking change that requires architectural discussion, say so clearly and stop"), I'm stopping rather than forcing a change into ExecuTorch source or into runtime/core/portable_type/c10/.

Recommended fix (one of)

  1. Bump the third-party/ao submodule to a torchao commit that no longer references aten.transpose.Dimname (i.e. one that has adapted to the overload removal). This is the most likely correct fix and belongs in this PR or a companion. Current pin: 02105d46c61dc80a8c9d39d5836e827ba3af8439.
  2. If torchao upstream hasn't yet adapted, the torchao fix must land first (drop/guard the aten.transpose.Dimname entry in port_metadata_pass.py), then bump the submodule here.
  3. Alternatively, hold this pin bump until torchao is compatible with the nightly that dropped the overload.
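The "drop/guard" idea in option 2 can be illustrated without PyTorch installed. The sketch below uses a hypothetical stand-in class, `FakeOpPacket`, that mimics the behaviour seen in the traceback (an `AttributeError` for an unknown overload name). `existing_overloads` is likewise a hypothetical helper and is not torchao code. It shows one way to build an op list that tolerates overloads removed upstream.

```python
class FakeOpPacket:
    """Hypothetical stand-in for an op packet such as torch.ops.aten.transpose."""

    def __init__(self, name, overloads):
        self._name = name
        self._overloads = set(overloads)

    def __getattr__(self, overload):
        # __getattr__ only runs when normal attribute lookup fails.
        if overload.startswith("_"):
            raise AttributeError(overload)
        if overload not in self._overloads:
            raise AttributeError(
                f"The underlying op of '{self._name}' has no overload name '{overload}'"
            )
        return f"{self._name}.{overload}"


def existing_overloads(op_packet, names):
    """Return the overloads in `names` that `op_packet` still exposes."""
    found = []
    for name in names:
        # getattr with a default swallows the AttributeError for removed overloads.
        overload = getattr(op_packet, name, None)
        if overload is not None:
            found.append(overload)
    return found


# Simulate a nightly where only the `int` overload of transpose remains.
transpose = FakeOpPacket("aten.transpose", ["int"])
ops = existing_overloads(transpose, ["int", "Dimname"])
```

Because the traceback shows the real lookup raising `AttributeError`, the same `getattr(..., None)` pattern should in principle apply to a real op packet, but that has not been verified here against torchao or the new nightly.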

I did not push a commit because the only viable fix is a torchao submodule bump whose target commit I can't verify in this environment (the third-party/ao submodule isn't checked out here and I don't have network access to query the torchao repo). A blind submodule bump risks pulling in unrelated changes. This needs a maintainer to pick the correct torchao commit.

cc @jakeszwe
· branch automated/pytorch-pin-bump-dev20260628

The dev20260628 PyTorch nightly removes the aten.transpose.Dimname
overload, which the pinned torchao (v0.17.0) still references in
port_metadata_pass.py, breaking convert_pt2e in the quantization
delegation CI jobs. This advances third-party/ao past pytorch/ao#4462,
which drops that reference.

Authored with assistance from Claude.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pytorchupdatebot

Collaborator Author

@claude [ci-fix-attempt 2/3]

The trunk CI workflow has failed on this automated PyTorch pin bump PR.
Failed run: https://github.com/pytorch/executorch/actions/runs/28414460583

Please:

  1. Read the Dr. CI comment on this PR for a summary of which jobs failed and whether they are flaky. Ignore failures marked as FLAKY.
  2. Use your CI tools to download the failure logs for the non-flaky failing jobs
  3. Identify the root cause of the failure
  4. If this is a build or test failure caused by PyTorch API changes, fix the ExecuTorch code to be compatible with the new PyTorch version
  5. If this is a c10 header sync issue, the headers have already been synced by the pin bump script — the issue is likely in ExecuTorch code that uses those headers
  6. Run lintrunner -a on any files you change
  7. Push your fix as a new commit to this PR branch

Important constraints:

  • Do NOT modify torch_pin.py or .ci/docker/ci_commit_pins/pytorch.txt — the pin itself is correct
  • Do NOT modify files under runtime/core/portable_type/c10/ unless the sync introduced a new API that ExecuTorch code needs to adapt to
  • Focus on fixing ExecuTorch code to be compatible with the new PyTorch APIs
  • If this is a major BC-breaking change that requires architectural discussion, say so clearly and stop — do not attempt a fix

@claude

claude Bot commented Jun 30, 2026


Claude finished @pytorchupdatebot's task in 9m 48s —— View job


CI Fix — Attempt 2/3

  • Read Dr. CI summary, identify non-flaky failures
  • Download failure logs for the failed trunk run
  • Identify root cause(s)
  • Assess whether this is fixable in ExecuTorch code

What changed since attempt 1

Attempt 1's aten.transpose.Dimname failure was addressed by bumping the third-party/ao submodule to cb76f2943 (commit a46a60b). That advanced the export far enough to surface new, deeper failures. The remaining non-flaky trunk failures fall into three distinct root causes — only the first is broadly shared.

Root cause 1 (dominant) — torchtune ↔ torchao incompatibility

llama3_2_vision_encoder, phi_4_mini, qwen2_5_1_5b, and test-torchao-huggingface-checkpoints (phi_4_mini) all die the same way:

File ".../torchtune/modules/common_utils.py", line 19, in <module>
    from torchao.dtypes.nf4tensor import NF4Tensor
ModuleNotFoundError: No module named 'torchao.dtypes.nf4tensor'

The newer torchao (cb76f2943) that we pulled in to fix attempt 1 removed/relocated torchao.dtypes.nf4tensor, but the pinned torchtune still imports it:

requirements-examples.txt:6
torchtune @ git+https://github.com/pytorch/torchtune.git@6f2aa7254458145f99d7004cbd6ebc8e53a06404

The import fires from torchtune.__init__ (any import torchtune), so ExecuTorch cannot shim around it. The correct fix is to bump the torchtune pin in requirements-examples.txt to a commit compatible with torchao cb76f2943 (i.e. one that imports NF4Tensor from its new location). This file is ExecuTorch-owned (not in the forbidden set), so the fix belongs here.
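The pin quoted above has the form `name @ git+URL@commit`. When preparing a pin bump, it can help to split such a line into its parts. The following is a minimal sketch that assumes exactly that shape; `parse_git_pin` and `PIN_RE` are hypothetical names, not part of ExecuTorch.

```python
import re

# Assumed shape: "<name> @ git+<url>@<hex commit>"
PIN_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9._-]+)\s*@\s*git\+(?P<url>\S+?)@(?P<commit>[0-9a-f]{7,40})$"
)


def parse_git_pin(line):
    """Split a git direct-reference requirement into name, url and commit."""
    match = PIN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a git pin: {line!r}")
    return match.groupdict()


pin = parse_git_pin(
    "torchtune @ git+https://github.com/pytorch/torchtune.git"
    "@6f2aa7254458145f99d7004cbd6ebc8e53a06404"
)
```

Here `pin["commit"]` holds the value a torchtune pin bump would replace; the name and URL would stay the same unless the repository location changes.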

Root cause 2 (separate) — XNNPACK squeeze regression — ic3

test-models-linux-aarch64 (ic3, xnnpack-quantization-delegation) fails at runtime, not import:

[XNNCompiler.cpp:1276] Failed to create squeeze node 344 with code: xnn_status_invalid_parameter
[XNNPACKBackend.cpp:143] XNNCompiler::compileModel failed: 0x1
RuntimeError: Failed to execute method forward, error: 0x1

Root cause 3 (separate) — heap corruption — emformer_transcribe

test-models-linux-aarch64 (emformer_transcribe, xnnpack-quantization-delegation) core-dumps (exit 134):

double free or corruption (!prev)
... examples.xnnpack.aot_compiler --model_name=emformer_transcribe --delegate --quantize ... Aborted (core dumped)

Root causes 2 and 3 are in the XNNPACK delegate / quantized-graph path and are most likely downstream of the torchao quantizer changes (different QDQ graph shapes feeding XNNPACK) or the new PyTorch nightly. They are not c10-header issues and need separate investigation; they are not addressable by a Python-side compatibility shim.

Why I'm not pushing a commit

The primary fix is a torchtune pin bump in requirements-examples.txt. I can't pick or verify the correct commit here: this sandbox has no network access (every gh api / curl / WebFetch to query torchao/torchtune was blocked), and third-party/ao isn't checked out, so I can't confirm where NF4Tensor moved or which torchtune commit re-aligns with torchao cb76f2943. A blind bump to torchtune main risks pulling in unrelated breakage — the same verification blocker as attempt 1.

Recommended fix (for a maintainer / next attempt with network)

  1. Bump the torchtune pin in requirements-examples.txt from 6f2aa72… to a commit whose torchtune/modules/common_utils.py imports NF4Tensor from the location it now lives in torchao cb76f2943 (likely from torchao.dtypes import NF4Tensor or torchao.prototype.dtypes.nf4tensor). Verify import torchtune succeeds against the pinned torchao.
  2. Alternatively, re-select the third-party/ao pin to a torchao commit that drops the aten.transpose.Dimname reference but still exposes torchao.dtypes.nf4tensor, so the existing torchtune pin keeps working.
  3. Separately investigate the XNNPACK ic3 (squeeze xnn_status_invalid_parameter) and emformer_transcribe (double-free) regressions — these are independent of the torchtune issue and won't be resolved by the pin bump.
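The verification step in item 1 (confirming where a symbol is importable from in the installed packages) can be sketched with the standard library alone. `first_importable` is a hypothetical helper. The candidate module paths for `NF4Tensor` mentioned in this thread are unverified guesses, so the sketch takes them as input rather than hard-coding them.

```python
import importlib


def first_importable(symbol, candidates):
    """Return the first module name in `candidates` that imports cleanly
    and exposes `symbol`, or None if no candidate does."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Covers ModuleNotFoundError for removed or relocated modules.
            continue
        if hasattr(module, symbol):
            return module_name
    return None
```

In an environment with the pinned torchao installed, a call such as `first_importable("NF4Tensor", ["torchao.dtypes.nf4tensor", "torchao.dtypes", "torchao.prototype.dtypes.nf4tensor"])` would report which of those locations, if any, still provides the class. A result of `None` means none of the guessed paths is right.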

cc @jakeszwe
· branch automated/pytorch-pin-bump-dev20260628

The torchao bump to cb76f2943 moved NF4Tensor out of
torchao.dtypes.nf4tensor (into torchao.quantization) and removed
TensorCoreTiledLayout, breaking 'import torchtune' for the model export
and torchao-huggingface-checkpoint CI jobs. This advances the torchtune
pin past meta-pytorch/torchtune#2960 and #2958, which adapt those imports to
the new torchao locations.

Authored with assistance from Claude.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

ci/pytorch-pin-bump ci-no-td CLA Signed

3 participants