Skip to content

[MLX] Isolate submodule build with ExternalProject#20585

Merged
metascroy merged 2 commits into
pytorch:mainfrom
SuryanshSS1011:fix/20556-mlx-build-isolation
Jun 29, 2026
Merged

[MLX] Isolate submodule build with ExternalProject#20585
metascroy merged 2 commits into
pytorch:mainfrom
SuryanshSS1011:fix/20556-mlx-build-isolation

Conversation

@SuryanshSS1011

@SuryanshSS1011 SuryanshSS1011 commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #20556.

Right now MLX gets pulled in with add_subdirectory, so MLX's entire CMake project lands in ExecuTorch's target/option namespace. This is a problem because MLX fetches nlohmann_json, which collides with the copy ExecuTorch already provides. The current workaround patches MLX's CMakeLists.txt at configure time to guard the fetch. That patch is brittle as it's pinned to specific lines, meaning an MLX bump that touches them makes it silently stop applying, and the collision comes back with no obvious signal. It also leaves the submodule dirty and lets MLX's MLX_BUILD_* options bleed into ExecuTorch's cache.

This builds MLX as an ExternalProject in its own CMake scope and consumes it through an imported mlx target, so MLX runs its FetchContent in its own namespace and the patch is no longer needed.

  • The imported mlx target re-adds the Metal/Foundation/QuartzCore frameworks, since a static libmlx.a doesn't carry them. This mirrors the imported mlx target already in tools/cmake/executorch-config.cmake. mlxdelegate depends on mlx_external explicitly, because add_dependencies on an imported target doesn't order the build on its own.
  • install(TARGETS mlx) is changed to install(FILES ${_mlx_static_lib}); an imported target cannot be installed with install(TARGETS).
  • The metallib install and MLX_METALLIB_PATH now point at the ExternalProject output.
  • Removed backends/mlx/patches/mlx_json.patch and the patch loop.

The BINARY_DIR is kept where add_subdirectory put it, so libmlx.a and mlx.metallib land at the same paths as before. The package config, the metallib copy helper in Utils.cmake, and the wheel path in setup.py are unchanged.

Test plan

Verified on Apple Silicon (macOS, Metal toolchain installed):

cmake --preset mlx-release configures, builds, and installs; libmlx.a, libmlxdelegate.a, and mlx.metallib are installed to cmake-out/lib/, and the MLX submodule stays clean (git -C backends/mlx/third-party/mlx status).

With -DEXECUTORCH_BUILD_TESTS=ON, op_test_runner, multi_thread_test_runner, and mlx_mutable_state_test build and link mlx; mlx_mutable_state_test passes.

I don't have the model checkpoints to run the qwen3_5_moe / gemma4_31b MLX runners or the wheel end-to-end, so I'm relying on CI for those. Since the artifact paths are unchanged, I'd expect them to behave the same as before.

AI assistance disclosure

I used an AI coding assistant for parts of this change, including drafting the CMake and investigating the build. I reviewed and tested everything in the test plan myself and take complete responsibility for the submission.

MLX was pulled into ExecuTorch's build via add_subdirectory, dropping MLX's
whole CMake project into ET's target/option namespace. That shared scope
collides with deps MLX fetches (today nlohmann_json), worked around by patching
MLX's own CMakeLists.txt at configure time. Patching a submodule is fragile: the
patch is pinned to specific lines and silently stops applying when an MLX bump
touches that region, at which point the json collision returns with no clear
signal. It also dirties the submodule and leaks MLX's MLX_BUILD_* options into
ET's cache.

Build MLX in its own isolated CMake scope via ExternalProject and consume it as
a prebuilt static lib + metallib through an imported `mlx` target. MLX then runs
its FetchContent in its own namespace, so the json collision cannot happen and
the patch is deleted. The imported target re-adds the Metal/Foundation/
QuartzCore frameworks a static libmlx.a does not carry (matching the imported
mlx target already in tools/cmake/executorch-config.cmake), and mlxdelegate
depends on mlx_external directly since add_dependencies on an imported target
does not order the build.

The ExternalProject BINARY_DIR is kept at the same location add_subdirectory
used, so libmlx.a and mlx.metallib land at their existing paths; downstream
consumers (package config, metallib copy helper, the pybindings wheel in
setup.py) need no changes.

- Replace MLX_BUILD_* options + patch loop + add_subdirectory with
  ExternalProject_Add(mlx_external) and an imported `mlx` target.
- install(TARGETS mlx) -> install(FILES ${_mlx_static_lib}), since an imported
  target cannot be installed via install(TARGETS).
- Point the metallib install and MLX_METALLIB_PATH at ${_mlx_metallib}.
- Delete backends/mlx/patches/mlx_json.patch and the patch-apply loop.

Verified: cmake --preset mlx-release configures, builds, and installs;
libmlx.a, libmlxdelegate.a, and mlx.metallib land in cmake-out/lib/; the MLX
submodule stays pristine through configure/build/install; the delegate tests
(op_test_runner, multi_thread_test_runner, mlx_mutable_state_test) build and
link mlx directly, and mlx_mutable_state_test passes.

Fixes pytorch#20556.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pytorch-bot

pytorch-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20585

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 8 Pending, 1 Unrelated Failure

As of commit 84d2a1d with merge base 55a71e6 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla

meta-cla Bot commented Jun 28, 2026

Copy link
Copy Markdown

Hi @SuryanshSS1011!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 28, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ Unknown (351ee54)
  • ✅ login: SuryanshSS1011 / name: Suryansh Sijwali (351ee54)

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@meta-cla

meta-cla Bot commented Jun 28, 2026

Copy link
Copy Markdown

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
@SuryanshSS1011

Copy link
Copy Markdown
Contributor Author

@pytorchbot label "release notes: build"

@pytorch-bot pytorch-bot Bot added the release notes: build Changes related to build, including dependency upgrades, build flags, optimizations, etc. label Jun 28, 2026
@rascani rascani requested a review from metascroy June 29, 2026 14:53
@metascroy

Copy link
Copy Markdown
Contributor

@claude review this code

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 4m 55s —— View job


Code Review

  • Read backends/mlx/CMakeLists.txt
  • Verify ExternalProject wiring (imported target, dependencies, install)
  • Check related files (test/CMakeLists.txt, executorch-config.cmake, setup.py)
  • Post review feedback

Reviewed backends/mlx/CMakeLists.txt against origin/main, plus the downstream consumers (backends/mlx/test/CMakeLists.txt, tools/cmake/executorch-config.cmake, setup.py). This is a clean, well-motivated change — moving MLX into an ExternalProject to escape the shared target/option namespace is the right call, and dropping the line-pinned submodule patch removes a real silent-failure footgun. The comments explaining the add_dependencies-on-imported-target subtlety are excellent. The binary-dir layout is preserved, so setup.py:1081 (backends/mlx/mlx/mlx/backend/metal/kernels/) and the metallib copy helper continue to resolve.

A few things worth addressing:

1. QuartzCore divergence between the two imported mlx targets (medium)

The new in-tree imported target adds three frameworks (CMakeLists.txt:222-229):

find_library(QUARTZ_FRAMEWORK QuartzCore)
... INTERFACE_LINK_LIBRARIES "${METAL_FRAMEWORK};${FOUNDATION_FRAMEWORK};${QUARTZ_FRAMEWORK}"

But the imported mlx target in the installed package config (tools/cmake/executorch-config.cmake:138-146) only re-adds Metal;Foundation — no QuartzCore. The PR description says this "mirrors the imported mlx target already in executorch-config.cmake", but it actually diverges from it.

This matters because downstream consumers that link MLX through the installed package config (rather than the in-tree target) get a different link line. If QuartzCore is genuinely required to resolve libmlx.a symbols, those consumers will hit link errors; if it isn't required, it's dead weight here. Either way the two should agree. Please confirm whether QuartzCore is needed and bring them into parity. Fix this →

2. ExternalProject won't rebuild MLX on a submodule bump (medium)

ExternalProject_Add configures/builds once and stamps; with no BUILD_ALWAYS and no source-dir dependency tracking, updating the MLX submodule (git submodule update) won't invalidate the configure/build stamp. A developer who bumps MLX gets a stale libmlx.a with no signal — a quieter version of the exact failure mode this PR is fixing.

This is fine for clean CI builds, and the trade-off may well be acceptable, but it's worth either adding BUILD_ALWAYS ON (cheap with Ninja when nothing changed) or noting the "reconfigure after an MLX bump" expectation in the comment so it isn't a surprise.

3. Imported-target include dir is source-only (low — verify)

target_include_directories(mlx INTERFACE ${MLX_SOURCE_DIR}) (:221) exposes only the source tree. With add_subdirectory, MLX's own mlx target also carried its build-tree include dir, which is where MLX generates headers (e.g. build/version config). Your tested targets compile, so the headers mlxdelegate and strict_compile_test pull in are evidently all source-resident — but if a future MLX rev moves a public header into the build tree, consumers will fail to find it. Not a blocker; just flagging the narrowed include surface.

4. Minor: redundant add_dependencies(mlx mlx_external) (nit)

Line :230 adds a dependency on the imported mlx target, which (as your own comment at :258-261 explains) doesn't order anything — the real ordering is the add_dependencies(mlxdelegate mlx_external) at :261. The :230 call is a harmless no-op and could be dropped to avoid implying it does something. Optional.

Things I verified look correct

  • install(FILES ${_mlx_static_lib}) instead of install(TARGETS mlx) — correct, imported targets can't be installed via install(TARGETS).
  • BUILD_BYPRODUCTS correctly declares both libmlx.a and the metallib so Ninja tracks them.
  • $<BUILD_INTERFACE:mlx> link + tests/package-config linking mlx separately — consistent with the "consumers link mlx separately" model, unchanged from before.
  • Binary paths (_mlx_binary_dir = ${CMAKE_CURRENT_BINARY_DIR}/mlx) preserve the add_subdirectory layout, so setup.py and Utils.cmake need no changes, as claimed.
  • Toolchain/platform args (CMAKE_TOOLCHAIN_FILE, PLATFORM, DEPLOYMENT_TARGET, CMAKE_OSX_DEPLOYMENT_TARGET) are forwarded so the iOS/macOS deployment-target validation upstream stays meaningful.

Nothing here is blocking; item #1 is the one I'd most want resolved (or explained) before merge since it affects downstream linking through the installed config.
• branch fix/20556-mlx-build-isolation

# For the installed library, we put metallib in lib/ alongside libmlx.a. The
# metallib is produced in the mlx_external build tree (MLX_METAL_JIT=ON does not
# install it); _mlx_metallib points there.
install(FILES ${_mlx_metallib} DESTINATION ${CMAKE_INSTALL_LIBDIR})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIth this change, do we still need the function(executorch_target_copy_mlx_metallib target) utility in executorch/tools/cmake/Utils.cmake?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, from what I have seen, it's still needed and still works. It copies the metallib next to a target binary using MLX_METALLIB_PATH, which this PR keeps. I just repointed it at the ExternalProject output (same location as before, since I left BINARY_DIR where add_subdirectory had it). The runners and pybindings still rely on it to colocate mlx.metallib with the binary that statically links MLX, so removing it might break that.

@metascroy

Copy link
Copy Markdown
Contributor

Thanks @SuryanshSS1011! How did the mlxdelegate.a size change with this?

@SuryanshSS1011

Copy link
Copy Markdown
Contributor Author

Thanks for the review, @metascroy!
To answer your question: libmlxdelegate.a size is unchanged (3,893,456 bytes both before and after) as libmlxdelegate.a is built from the same three source files (MLXLoader.cpp, MLXBackend.cpp, mlx_mutable_state.cpp) with the same flags. The PR only affects how MLX itself is pulled in, and not the delegate's own compilation.

@metascroy

Copy link
Copy Markdown
Contributor

Looks great! Thanks for the contribution.

I pushed an update to address some of the nits from Claude. If CI passes, I'll merge.

@metascroy metascroy merged commit 0cef6de into pytorch:main Jun 29, 2026
225 of 227 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/mlx CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. release notes: build Changes related to build, including dependency upgrades, build flags, optimizations, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Good First Issue: isolate the MLX submodule build (MLX backend)

3 participants