Update rocm recipes to 7.2.1 and add rocwmma, triton, libmagma and pytorch#26
|
rocm-device-libs fails with: |
|
It seems to be due to https://github.com/ROCm/llvm-project/blob/rocm-7.2.1/amd/device-libs/ockl/src/cluster.cl and ROCm/llvm-project#553 . It is probably a modification to support the newer gfx1250 and gfx1251 targets (which we do not support anyway) that requires a rocm-specific llvm change that is not present in clang 20.1.8 . I think the easiest option is to revert the device-libs specific bits of ROCm/llvm-project#553 , but let's check what gentoo does (as the only distro that packages recent rocm with older upstream clang/llvm). |
|
Now we have a rocr-runtime error: |
Ok, the missing symbols were introduced in glibc 2.27, so for this package we need to raise the c_stdlib_version. For now we just do it in conda_build_config.yaml for simplicity, but once this is ported upstream it should be done on a per-feedstock basis. |
Updated the C standard library version for Linux from 2.17 to 2.27.
Done in 0716640 . |
This resulted in: the closest available version is 2.28 . |
Update c_stdlib_version to 2.28 for Linux.
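To make the 2.27 to 2.28 step explicit: the missing symbols need glibc >= 2.27, but only some versions are packaged, so the pin becomes the smallest available version that satisfies the requirement. A minimal sketch of that selection, assuming a hypothetical list of available versions (only 2.17 and 2.28 are mentioned in this thread; 2.34 is a placeholder):

```python
def parse_version(text):
    """Turn '2.27' into (2, 27) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def closest_available(required, available):
    """Return the smallest available version that is >= required, or None."""
    candidates = [v for v in available if parse_version(v) >= parse_version(required)]
    return min(candidates, key=parse_version) if candidates else None


# Hypothetical list of packaged versions: 2.17 and 2.28 appear in this thread,
# 2.34 is only an illustrative placeholder.
AVAILABLE = ["2.17", "2.28", "2.34"]

print(closest_available("2.27", AVAILABLE))
```

Comparing parsed tuples rather than raw strings matters here, since a plain string comparison would order "2.9" after "2.17".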
Removed 'pciutils-devel' from build dependencies.
Removed patches for backporting changes from ROCm repository.
|
We have a real failure in the newly added rocminfo tests: the good news is that the tests work as intended, and they fail the generation if an amdgpu is detected but rocminfo does not behave as expected. |
Fixed by fac8b8d . |
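For readers who have not looked at the rocminfo tests, the intent described above can be sketched as a small decision function. This is not the actual test from the recipe: the function name, the inputs, and the "output mentions a gfx agent" criterion are all assumptions made for illustration.

```python
def rocminfo_check(gpu_present, returncode, output):
    """Classify a rocminfo run as 'skip', 'pass' or 'fail'.

    Assumed policy (not taken from the real recipe):
    - no amdgpu detected on the build machine: nothing to verify, so skip;
    - amdgpu detected: rocminfo must exit with 0 and list at least one gfx agent.
    """
    if not gpu_present:
        return "skip"
    if returncode != 0 or "gfx" not in output:
        return "fail"
    return "pass"


# Hypothetical inputs: a GPU is present but rocminfo exits with an error.
print(rocminfo_check(True, 1, ""))
```

The point of the "skip" branch is that CI machines without a GPU still build the package, while machines with a GPU catch a broken rocminfo.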
|
I was trying to get rocm-comgr 7.2.1 to build with llvm 20.1.8, but looking in more detail it seems that rocm 7.2.1 uses llvm 22, see https://github.com/ROCm/llvm-project/blob/rocm-7.2.1/cmake/Modules/LLVMVersion.cmake . So we should jump to llvm 22 here as well; indeed, gentoo for rocm 7.2 also only had patches for llvm 22 compat, not llvm 20 compat. |
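This check could be automated so that a mismatch is caught before a build is attempted. A rough sketch, assuming the file uses `set(LLVM_VERSION_<COMPONENT> <number>)` lines; the excerpt below is illustrative, with only the major version (22) taken from this thread and the minor value a placeholder:

```python
import re

# Illustrative excerpt in the style of cmake/Modules/LLVMVersion.cmake.
# Only the major version (22) comes from this thread; the minor value is a placeholder.
SAMPLE = """
if(NOT DEFINED LLVM_VERSION_MAJOR)
  set(LLVM_VERSION_MAJOR 22)
endif()
if(NOT DEFINED LLVM_VERSION_MINOR)
  set(LLVM_VERSION_MINOR 0)
endif()
"""

_SET_RE = re.compile(r"set\(\s*LLVM_VERSION_(MAJOR|MINOR|PATCH)\s+(\d+)\s*\)")


def llvm_version_components(cmake_text):
    """Return the version components found in the text, e.g. {'major': 22}."""
    return {name.lower(): int(value) for name, value in _SET_RE.findall(cmake_text)}


def needs_llvm_bump(cmake_text, pinned_major):
    """True when the major version in the file differs from the pinned one
    (also True when no major version can be found at all)."""
    return llvm_version_components(cmake_text).get("major") != pinned_major


print(needs_llvm_bump(SAMPLE, pinned_major=20))
```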
|
xref: gentoo/gentoo#45826 |
|
The error is now: so this is now blocked by conda-forge/ctng-compiler-activation-feedstock#184 . |
As a workaround, I tried to build the package locally, but I get an error that is similar to the upstream error: |
|
I backported here the fix proposed in conda-forge/ctng-compiler-activation-feedstock#184 (comment), and reduced the build matrix to speedup compilation of the clang_impl_* packages. |
Co-authored-by: traversaro <1857049+traversaro@users.noreply.github.com>
Done. The only conflict was in |
|
Ok, we finally reach the tests, where the fun part will begin: |
|
This failure is probably due to the missing run dependency on the rocm variant of magma. |
|
Ok, a bit of recap. After fixing the dependency on the magma variant of libmagma, the loading of the library failed with the error: This seems to be the upstream error tracked in pytorch/pytorch#173707 . In a nutshell, it seems that llvm as C/C++ compiler does something unexpected w.r.t. symbol visibility, which results in that error. Looking at other distros, it seems that even if LLVM is used as hip compiler, gcc/g++ is still used as C/C++ compiler, so I guess we can do the same, and I tried to do this in 1bb0c9a . However, for some reason switching to GCC/G++ as C/C++ compiler results in the following error: |
|
First actual tests running, the output is: and then a lot of failures and errors. As noted by ChatGPT (see https://chatgpt.com/share/69e68fed-6278-8395-a132-f54368822d27) we should skip int4 tests as they are only supported in CDNA >= 2, and use 1 worker. However, given ROCm/TheRock#2151 I think it is probably expected that the full pytorch test suite fails, so we could probably just skip the remaining failing tests and call it a day. |
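The skip rule suggested above can be sketched as a filter over test names. The gfx-to-CDNA mapping, the test names, and the "name contains int4" heuristic are assumptions for illustration, not what the recipe does:

```python
# Assumption: mapping from gfx target to CDNA generation (not from this thread).
CDNA_GENERATION = {"gfx908": 1, "gfx90a": 2, "gfx942": 3, "gfx950": 4}

# Single worker, as suggested in the comment above.
NUM_WORKERS = 1


def supports_int4(arch):
    """True when the target is CDNA generation 2 or newer; unknown targets count as unsupported."""
    return CDNA_GENERATION.get(arch, 0) >= 2


def select_tests(test_names, arch):
    """Drop tests whose name mentions int4 when the target does not support it."""
    if supports_int4(arch):
        return list(test_names)
    return [name for name in test_names if "int4" not in name.lower()]


# Hypothetical test names and a non-CDNA target.
print(select_tests(["test_int4_mm", "test_add"], "gfx1100"))
```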
|
Ok, modulo the tests (which we will probably skip anyhow due to ROCm/TheRock#2151) this is close to being merged. @flferretti how do you prefer to proceed? In this branch warp and jax are disabled. Perhaps we can create a |
|
Ok, the tests are probably hanging, as the build has been running for ~8 hours; we finally reached the behavior in ROCm/TheRock#2151 . |
Exactly (notice the timestamps): |
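One way to keep a hang like this from consuming the whole CI budget is to wrap the test command in a wall-clock timeout. A minimal sketch, assuming the tests are launched as a subprocess; the helper name and the commands are hypothetical:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and report 'passed', 'failed' or 'timeout'.

    subprocess.run kills the child and raises TimeoutExpired when the
    timeout elapses, so a hanging test cannot block the job forever.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if completed.returncode == 0 else "failed"


# Hypothetical stand-in for a hanging test: sleeps longer than the timeout.
hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(hanging, timeout_s=0.5))
```

A real timeout would be much larger than in this toy call, and should be chosen from the duration of a known-good run.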
Removed step to delete outdated packages from cache.
Yes, we could keep the branches separated as clangdev does in conda-forge. Then I'll work on updating JAX and warp with rocm7.2 |
|
The last failure is: what I missed in the first patch is that the |
Ok, let me know when we can create |
Merged. Thank you |
Done: https://github.com/gbionics/rock-the-conda/tree/v7.0 . |
Remove cached PyTorch test files during cleanup.
|
This is finally green! |
Removed outdated activation file cleanup step from workflow.
|
The packages have been uploaded, let's merge! |
Almost all initial changes done by Opus 4.6, with the initial prompt: