Description
I am documenting this as an issue for now, although the root cause remains unclear: it could lie in Oceananigans, ClimaSeaIce, ClimaOcean, or specifically in the CUDA-aware MPI setup. Apologies that this report is relatively opaque; I have not been able to narrow it down further, and any help would be greatly appreciated!
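Since the UCX error handler (`ucs_handle_error`) sits at the top of every backtrace in the log, one possible first step for narrowing this down is to record the environment settings each rank sees. The following is a minimal sketch, not a fix: the helper name `environment_report` and the choice of variables are illustrative assumptions (`UCX_ERROR_SIGNALS`, `UCX_MEMTYPE_CACHE`, and `UCX_TLS` are UCX settings, `JULIA_CUDA_MEMORY_POOL` is a CUDA.jl setting, and `CUDA_VISIBLE_DEVICES` is a CUDA setting), and none of them is confirmed to be involved in this failure.

```julia
# Hypothetical helper (not part of MPI.jl, CUDA.jl, or Oceananigans):
# list environment variables that may be worth attaching to a report
# about UCX / Open MPI / CUDA.jl interactions. The list is an assumption.
const SUSPECT_VARIABLES = ["UCX_ERROR_SIGNALS", "UCX_MEMTYPE_CACHE", "UCX_TLS",
                           "JULIA_CUDA_MEMORY_POOL", "CUDA_VISIBLE_DEVICES"]

# Return `name => value` pairs, with "<unset>" for variables that are not defined.
# `env` defaults to the process environment, but any dictionary-like object works.
function environment_report(names = SUSPECT_VARIABLES; env = ENV)
    return [name => get(env, name, "<unset>") for name in names]
end

# Print one aligned line per variable.
for (name, value) in environment_report()
    println(rpad(name, 24), " = ", value)
end
```

Running this at the top of the script under `mpiexec` would print the settings once per rank, which makes it easy to see whether all ranks share the same configuration.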
I am trying to run a high-resolution global simulation on distributed GPUs, using the script at https://github.com/CliMA/ClimaOcean.jl/blob/79a61c0b4f86cba948829de030e01e262c04db4b/experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl
For the record, the script is reproduced below. On distributed GPUs it did not make it past the ImmersedBoundaryGrid instantiation. On distributed CPUs, the ImmersedBoundaryGrid can be set up, but the run then hangs indefinitely at ocean = ocean_simulation(...).
using ClimaOcean
using ClimaSeaIce
using Oceananigans
using Oceananigans.Grids
using Oceananigans.Units
using Oceananigans.OrthogonalSphericalShellGrids
using Oceananigans.Architectures: on_architecture
using ClimaOcean.OceanSimulations
using ClimaOcean.JRA55
using ClimaOcean.DataWrangling
using ClimaOcean.DataWrangling: NearestNeighborInpainting
using ClimaSeaIce.SeaIceThermodynamics: IceWaterThermalEquilibrium
using Printf
using Dates
using CUDA
using PythonCall
using Oceananigans.BuoyancyFormulations: buoyancy, buoyancy_frequency
import Oceananigans.OutputWriters: checkpointer_address
# arch = GPU()
arch = Distributed(GPU(), partition=Partition(1, 4), synchronized_communication=true)
# arch = Distributed(CPU(), partition=Partition(1, 2), synchronized_communication=true)
@info "Architecture $(arch)"
Nx = 2880 # longitudinal direction
Ny = 1440 # meridional direction
Nz = 100
z_faces = ExponentialCoordinate(Nz, -6000, 0)
const z_surf = z_faces(Nz)
@info "Building grid..."
grid = TripolarGrid(arch;
                    size = (Nx, Ny, Nz),
                    z = z_faces,
                    halo = (7, 7, 7))
@info "Regridding bathymetry..."
bottom_height = regrid_bathymetry(grid; minimum_depth=15, major_basins=1, interpolation_passes=10)
fitted_bottom = GridFittedBottom(bottom_height)
@info "Building immersed boundary grid..."
grid = ImmersedBoundaryGrid(grid, fitted_bottom; active_cells_map=true)
@info grid
@info "Created ImmersedBoundaryGrid"
#####
##### A Prognostic Ocean model
#####
using Oceananigans.TurbulenceClosures: ExplicitTimeDiscretization
using Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities: CATKEVerticalDiffusivity, CATKEMixingLength, CATKEEquation
using Oceananigans.TurbulenceClosures: RiBasedVerticalDiffusivity
momentum_advection = WENOVectorInvariant()
tracer_advection = WENO(order=7)
free_surface = SplitExplicitFreeSurface(grid; cfl=0.8, fixed_Δt=12minutes)
@info "Free surface", free_surface
obl_closure = ClimaOcean.OceanSimulations.default_ocean_closure() # CATKE
closure = (obl_closure, VerticalScalarDiffusivity(κ=1e-5, ν=1e-4))
glorys_dir = joinpath(homedir(), "GLORYS_data")
mkpath(glorys_dir)
glorys_dataset = GLORYSMonthly()
@info "Building ocean component..."
ocean = ocean_simulation(grid; Δt=1minutes,
                         momentum_advection,
                         tracer_advection,
                         timestepper = :SplitRungeKutta3,
                         free_surface,
                         closure)
start_date = DateTime(1993, 1, 1)
end_date = DateTime(2003, 4, 1)
simulation_period = Dates.value(Second(end_date - start_date))
inpainting = NearestNeighborInpainting(50)
@info "Setting initial conditions..."
set!(ocean.model, T=Metadatum(:temperature; dataset=glorys_dataset, date=start_date, dir=glorys_dir),
                  S=Metadatum(:salinity; dataset=glorys_dataset, date=start_date, dir=glorys_dir); inpainting)
@info ocean.model.clock
#####
##### A Prognostic Sea-ice model
#####
@info "Building sea-ice component..."
sea_ice = sea_ice_simulation(grid, ocean; dynamics=nothing)
@info "Setting sea-ice initial conditions..."
set!(sea_ice.model, h=Metadatum(:sea_ice_thickness; dataset=glorys_dataset, dir=glorys_dir),
                    ℵ=Metadatum(:sea_ice_concentration; dataset=glorys_dataset, dir=glorys_dir), inpainting = nothing)
#####
##### A Prescribed Atmosphere model
#####
jra55_dir = joinpath(homedir(), "JRA55_data")
mkpath(jra55_dir)
dataset = MultiYearJRA55()
jra55_backend = JRA55NetCDFBackend(10)
@info "Building atmospheric forcing..."
# Pass the JRA55 dataset defined above as `dataset`; the backend goes in `backend`.
atmosphere = JRA55PrescribedAtmosphere(arch; dir=jra55_dir, dataset, backend=jra55_backend, include_rivers_and_icebergs=true, start_date)
radiation = Radiation()
#####
##### An ocean-sea ice coupled model
#####
@info "Building coupled ocean-sea ice model..."
omip = OceanSeaIceModel(ocean, sea_ice; atmosphere, radiation)
omip = Simulation(omip, Δt=10minutes, stop_time=60days)
wall_time = Ref(time_ns())
using Statistics
function progress(sim)
    sea_ice = sim.model.sea_ice
    ocean = sim.model.ocean
    hmax = maximum(sea_ice.model.ice_thickness)
    ℵmax = maximum(sea_ice.model.ice_concentration)
    Tmax = maximum(sim.model.interfaces.atmosphere_sea_ice_interface.temperature)
    Tmin = minimum(sim.model.interfaces.atmosphere_sea_ice_interface.temperature)
    umax = maximum(ocean.model.velocities.u)
    vmax = maximum(ocean.model.velocities.v)
    wmax = maximum(ocean.model.velocities.w)

    # Wall-clock time elapsed since the previous callback, in seconds
    step_time = 1e-9 * (time_ns() - wall_time[])

    msg1 = @sprintf("time: %s, iteration: %d, Δt: %s, ", prettytime(sim), iteration(sim), prettytime(sim.Δt))
    msg2 = @sprintf("max(h): %.2e m, max(ℵ): %.2e ", hmax, ℵmax)
    # Print the extrema in (minimum, maximum) order
    msg4 = @sprintf("extrema(T): (%.2f, %.2f) ᵒC, ", Tmin, Tmax)
    msg5 = @sprintf("maximum(u): (%.2f, %.2f, %.2f) m/s, ", umax, vmax, wmax)
    msg6 = @sprintf("wall time: %s \n", prettytime(step_time))

    @info msg1 * msg2 * msg4 * msg5 * msg6

    wall_time[] = time_ns()

    return nothing
end
add_callback!(omip, progress, IterationInterval(1))
@info "Starting simulation..."
run!(omip)
omip.Δt = 10minutes
omip.stop_time = simulation_period
run!(omip)
For completeness, I am copying the entire error log here:
Resolving package versions...
No Changes to `~/.julia/environments/v1.11/Project.toml`
No Changes to `~/.julia/environments/v1.11/Manifest.toml`
[ Info: Configure the active project to use the default CUDA from the local system; please re-start Julia for this to take effect.
Resolving package versions...
No Changes to `~/.julia/environments/v1.11/Project.toml`
No Changes to `~/.julia/environments/v1.11/Manifest.toml`
┌ Info: MPI implementation identified
│ libmpi = "/sw/openmpi-5.0.5/lib/libmpi"
│ version_string = "Open MPI v5.0.5, package: Open MPI ext_yifanchen_google_com@hpc12-slurm-login-001 Distribution, ident: 5.0.5, repo rev: v5.0.5, Jul 22, 2024\0"
│ impl = "OpenMPI"
│ version = v"5.0.5"
└ abi = "OpenMPI"
┌ Info: MPIPreferences unchanged
│ binary = "system"
│ libmpi = "/sw/openmpi-5.0.5/lib/libmpi"
│ abi = "OpenMPI"
│ mpiexec = "/sw/openmpi-5.0.5/bin/mpiexec"
│ preloads = Any[]
└ preloads_env_switch = nothing
┌ Warning: You are using CUDA 12.4.0 with a driver for CUDA 13.x.
│ It is recommended to upgrade your driver, or switch to automatic installation of CUDA.
└ @ CUDA ~/.julia/packages/CUDA/OnIOF/src/initialization.jl:137
┌ Warning: You are using Julia v1.11 or later!"
│ Oceananigans is currently tested on Julia v1.10."
│ If you find issues with Julia v1.11 or later,"
│ please report at https://github.com/CliMA/Oceananigans.jl/issues/new
└ @ Oceananigans ~/.julia/packages/Oceananigans/yZVc9/src/Oceananigans.jl:125
┌ Warning: You are using Julia v1.11 or later!"
│ Oceananigans is currently tested on Julia v1.10."
│ If you find issues with Julia v1.11 or later,"
│ please report at https://github.com/CliMA/Oceananigans.jl/issues/new
└ @ Oceananigans ~/.julia/packages/Oceananigans/yZVc9/src/Oceananigans.jl:125
┌ Warning: You are using Julia v1.11 or later!"
│ Oceananigans is currently tested on Julia v1.10."
│ If you find issues with Julia v1.11 or later,"
│ please report at https://github.com/CliMA/Oceananigans.jl/issues/new
└ @ Oceananigans ~/.julia/packages/Oceananigans/yZVc9/src/Oceananigans.jl:125
┌ Warning: You are using Julia v1.11 or later!"
│ Oceananigans is currently tested on Julia v1.10."
│ If you find issues with Julia v1.11 or later,"
│ please report at https://github.com/CliMA/Oceananigans.jl/issues/new
└ @ Oceananigans ~/.julia/packages/Oceananigans/yZVc9/src/Oceananigans.jl:125
┌ Info: CondaPkg: Waiting for lock to be freed. You may delete this file if no other process is resolving.
└ lock_file = "/home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/lock"
┌ Info: CondaPkg: Waiting for lock to be freed. You may delete this file if no other process is resolving.
└ lock_file = "/home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/lock"
┌ Info: CondaPkg: Waiting for lock to be freed. You may delete this file if no other process is resolving.
└ lock_file = "/home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/lock"
CondaPkg Found dependencies: /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/CondaPkg.toml
CondaPkg Found dependencies: /home/ext_xinkai_caltech_edu/.julia/packages/PythonCall/IOKTD/CondaPkg.toml
CondaPkg Initialising pixi
│ /home/ext_xinkai_caltech_edu/.julia/artifacts/cefba4912c2b400756d043a2563ef77a0088866b/bin/pixi
│ init
│ --format pixi
└ /home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg
✔ Created /home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/pixi.toml
CondaPkg Wrote /home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/pixi.toml
│ [dependencies]
│ openssl = ">=3, <3.6"
│ libstdcxx = ">=3.4,<14.0"
│ uv = ">=0.4"
│ libstdcxx-ng = ">=3.4,<14.0"
│
│ [dependencies.python]
│ channel = "conda-forge"
│ build = "*cp*"
│ version = ">=3.9,<4"
│
│ [project]
│ name = ".CondaPkg"
│ platforms = ["linux-64"]
│ channels = ["conda-forge"]
│ channel-priority = "strict"
│ description = "automatically generated by CondaPkg.jl"
│
│ [pypi-dependencies]
│ jax = ">=0.6"
│ xarray = ">=2024.7.0"
│ copernicusmarine = ">=2.0.0"
│ numpy = ">=2.0.0"
└ tensorflow = ">=2.17"
CondaPkg Installing packages
│ /home/ext_xinkai_caltech_edu/.julia/artifacts/cefba4912c2b400756d043a2563ef77a0088866b/bin/pixi
│ install
└ --manifest-path /home/ext_xinkai_caltech_edu/.julia/environments/v1.11/.CondaPkg/pixi.toml
✔ The default environment has been installed.
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
┌ Info: Architecture Distributed{GPU{CUDABackend}} across 4 = 1×4×1 ranks:
│ ├── local_rank: 3 of 0-3
│ ├── local_index: [1, 4, 1]
└ └── connectivity: north=0 south=2
[ Info: Building grid...
┌ Info: Architecture Distributed{GPU{CUDABackend}} across 4 = 1×4×1 ranks:
│ ├── local_rank: 0 of 0-3
│ ├── local_index: [1, 1, 1]
└ └── connectivity: north=1 south=3
┌ Info: Architecture Distributed{GPU{CUDABackend}} across 4 = 1×4×1 ranks:
│ ├── local_rank: 1 of 0-3
│ ├── local_index: [1, 2, 1]
└ └── connectivity: north=2 south=0
┌ Info: Architecture Distributed{GPU{CUDABackend}} across 4 = 1×4×1 ranks:
│ ├── local_rank: 2 of 0-3
│ ├── local_index: [1, 3, 1]
└ └── connectivity: north=3 south=1
[ Info: Building grid...
[ Info: Building grid...
[ Info: Building grid...
[ Info: Regridding bathymetry...
[ Info: Regridding bathymetry...
[ Info: Regridding bathymetry...
[ Info: Regridding bathymetry...
[ Info: Interpolation passes of bathymetry size (21600, 10800, 1) onto a TripolarGrid target grid of size (2880, 1440, 100):
[ Info: pass 1 to size (19728, 9864, 1)
[ Info: pass 2 to size (17856, 8928, 1)
[ Info: pass 3 to size (15984, 7992, 1)
[ Info: pass 4 to size (14112, 7056, 1)
[ Info: pass 5 to size (12240, 6120, 1)
[ Info: pass 6 to size (10368, 5184, 1)
[ Info: pass 7 to size (8496, 4248, 1)
[ Info: pass 8 to size (6624, 3312, 1)
[ Info: pass 9 to size (4752, 2376, 1)
[ Info: pass 10 to size (2880, 1440, 1)
[hpc12-a3mega8gnodese-0:5654 :0:5654] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x159f)
==== backtrace (tid: 5654) ====
0 /sw/ucx-1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f31c75da764]
1 /sw/ucx-1.17.0/lib/libucs.so.0(+0x3591f) [0x7f31c75da91f]
2 /sw/ucx-1.17.0/lib/libucs.so.0(+0x35be6) [0x7f31c75dabe6]
3 /lib/x86_64-linux-gnu/libcuda.so.1(+0x23f29e) [0x7f31c58fe29e]
4 /lib/x86_64-linux-gnu/libcuda.so.1(+0x1541e5) [0x7f31c58131e5]
5 /lib/x86_64-linux-gnu/libcuda.so.1(+0x28a839) [0x7f31c5949839]
6 /home/ext_xinkai_caltech_edu/.julia/compiled/v1.11/CUDA/oWw5k_5TPCE.so(+0xd8d16) [0x7f3130ef1d16]
=================================
[5654] signal 11 (-6): Segmentation fault
in expression starting at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl:42
[hpc12-a3mega8gnodese-0:5656 :0:5656] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15a1)
==== backtrace (tid: 5656) ====
0 /sw/ucx-1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fad7a9cb764]
1 /sw/ucx-1.17.0/lib/libucs.so.0(+0x3591f) [0x7fad7a9cb91f]
2 /sw/ucx-1.17.0/lib/libucs.so.0(+0x35be6) [0x7fad7a9cbbe6]
3 /lib/x86_64-linux-gnu/libcuda.so.1(+0x23f29e) [0x7fad78cef29e]
4 /lib/x86_64-linux-gnu/libcuda.so.1(+0x1541e5) [0x7fad78c041e5]
5 /lib/x86_64-linux-gnu/libcuda.so.1(+0x28a839) [0x7fad78d3a839]
6 /home/ext_xinkai_caltech_edu/.julia/compiled/v1.11/CUDA/oWw5k_5TPCE.so(+0xd8d16) [0x7face42f8d16]
=================================
[5656] signal 11 (-6): Segmentation fault
in expression starting at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl:42
unknown function (ip: 0x7f31c58fe29e)
unknown function (ip: 0x7f31c58131e4)
unknown function (ip: 0x7fad78cef29e)
unknown function (ip: 0x7fad78c041e4)
unknown function (ip: 0x7fad78d3a838)
unknown function (ip: 0x7f31c5949838)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/GPUToolbox/XaIIx/src/ccalls.jl:143 [inlined]
unchecked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/libcuda.jl:4076 [inlined]
#991 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:25
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/GPUToolbox/XaIIx/src/ccalls.jl:143 [inlined]
unchecked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/libcuda.jl:4076 [inlined]
#991 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:25
retry_reclaim at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:24
retry_reclaim at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:24
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:60
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:60
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:49 [inlined]
link at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/compilation.jl:409
unknown function (ip: 0x7f337c50691a)
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:49 [inlined]
link at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/compilation.jl:409
unknown function (ip: 0x7faf273697da)
actual_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:270
unknown function (ip: 0x7f337c44feb5)
actual_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:270
unknown function (ip: 0x7faaad97bf45)
cached_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:159
cached_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:159
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:373 [inlined]
macro expansion at ./lock.jl:273 [inlined]
#cufunction#1210 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:368
cufunction at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:365
unknown function (ip: 0x7f337c42e862)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:373 [inlined]
macro expansion at ./lock.jl:273 [inlined]
#cufunction#1210 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:368
cufunction at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:365
unknown function (ip: 0x7faaad95a8d2)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:112 [inlined]
#_#7 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:124
Kernel at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:110 [inlined]
fill! at /home/ext_xinkai_caltech_edu/.julia/packages/GPUArrays/ZRk7Q/src/host/construction.jl:22
unknown function (ip: 0x7f337c4215dd)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:112 [inlined]
#_#7 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:124
#zeros#3 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
unknown function (ip: 0x7f337c420676)
Kernel at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:110 [inlined]
fill! at /home/ext_xinkai_caltech_edu/.julia/packages/GPUArrays/ZRk7Q/src/host/construction.jl:22
unknown function (ip: 0x7faaad94d65d)
#zeros#5 at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564
#zeros#3 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/zeros_and_ones.jl:9 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/DistributedComputations/distributed_architectures.jl:319 [inlined]
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:70
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
unknown function (ip: 0x7faaad94c6f6)
#zeros#5 at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:75 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:189
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/zeros_and_ones.jl:9 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/DistributedComputations/distributed_architectures.jl:319 [inlined]
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:70
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:75 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:189
#_#12 at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:186 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
#regrid_bathymetry#4 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:185
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:153 [inlined]
#regrid_bathymetry#3 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:148 [inlined]
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:146
unknown function (ip: 0x7f31e1c79126)
#_#12 at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:186 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
#regrid_bathymetry#4 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:185
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:153 [inlined]
#regrid_bathymetry#3 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:148 [inlined]
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:146
unknown function (ip: 0x7faaad938df6)
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:666
eval_stmt_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_interpret_toplevel_thunk at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
ijl_toplevel_eval_in at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
[hpc12-a3mega8gnodese-0:5655 :0:5655] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15a0)
==== backtrace (tid: 5655) ====
0 /sw/ucx-1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f37f83ee764]
1 /sw/ucx-1.17.0/lib/libucs.so.0(+0x3591f) [0x7f37f83ee91f]
2 /sw/ucx-1.17.0/lib/libucs.so.0(+0x35be6) [0x7f37f83eebe6]
3 /lib/x86_64-linux-gnu/libcuda.so.1(+0x23f29e) [0x7f37f671229e]
4 /lib/x86_64-linux-gnu/libcuda.so.1(+0x1541e5) [0x7f37f66271e5]
5 /lib/x86_64-linux-gnu/libcuda.so.1(+0x28a839) [0x7f37f675d839]
6 /home/ext_xinkai_caltech_edu/.julia/compiled/v1.11/CUDA/oWw5k_5TPCE.so(+0xd8d16) [0x7f3761d14d16]
=================================
[5655] signal 11 (-6): Segmentation fault
in expression starting at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl:42
unknown function (ip: 0x7f37f671229e)
unknown function (ip: 0x7f37f66271e4)
unknown function (ip: 0x7f37f675d838)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/GPUToolbox/XaIIx/src/ccalls.jl:143 [inlined]
unchecked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/libcuda.jl:4076 [inlined]
#991 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:25
retry_reclaim at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:24
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:60
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:49 [inlined]
link at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/compilation.jl:409
unknown function (ip: 0x7f39ad11998a)
actual_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:270
unknown function (ip: 0x7f39ad062eb5)
cached_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:159
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:373 [inlined]
macro expansion at ./lock.jl:273 [inlined]
#cufunction#1210 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:368
cufunction at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:365
unknown function (ip: 0x7f39ad041862)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:112 [inlined]
#_#7 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:124
Kernel at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:110 [inlined]
fill! at /home/ext_xinkai_caltech_edu/.julia/packages/GPUArrays/ZRk7Q/src/host/construction.jl:22
unknown function (ip: 0x7f39ad0345dd)
#zeros#3 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
unknown function (ip: 0x7f39ad033676)
#zeros#5 at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/zeros_and_ones.jl:9 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/DistributedComputations/distributed_architectures.jl:319 [inlined]
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:70
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:75 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:189
#_#12 at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:186 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
#regrid_bathymetry#4 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:185
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:153 [inlined]
#regrid_bathymetry#3 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:148 [inlined]
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:146
unknown function (ip: 0x7f382288c626)
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
_include at ./loading.jl:2794
_include at ./loading.jl:2794
_include at ./loading.jl:2794
include at ./Base.jl:562
include at ./Base.jl:562
include at ./Base.jl:562
jfptr_include_46943.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jfptr_include_46943.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jfptr_include_46943.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:323
exec_options at ./client.jl:323
exec_options at ./client.jl:323
_start at ./client.jl:531
_start at ./client.jl:531
_start at ./client.jl:531
jfptr__start_73597.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jfptr__start_73597.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jfptr__start_73597.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 735986519 (Pool: 735983304; Big: 3215); GC: 302
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 726523022 (Pool: 726521715; Big: 1307); GC: 327
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 725579818 (Pool: 725578540; Big: 1278); GC: 326
[hpc12-a3mega8gnodese-0:5653 :0:5653] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x159e)
==== backtrace (tid: 5653) ====
0 /sw/ucx-1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7feee2436764]
1 /sw/ucx-1.17.0/lib/libucs.so.0(+0x3591f) [0x7feee243691f]
2 /sw/ucx-1.17.0/lib/libucs.so.0(+0x35be6) [0x7feee2436be6]
3 /lib/x86_64-linux-gnu/libcuda.so.1(+0x23f29e) [0x7feee075a29e]
4 /lib/x86_64-linux-gnu/libcuda.so.1(+0x1541e5) [0x7feee066f1e5]
5 /lib/x86_64-linux-gnu/libcuda.so.1(+0x28a839) [0x7feee07a5839]
6 /home/ext_xinkai_caltech_edu/.julia/compiled/v1.11/CUDA/oWw5k_5TPCE.so(+0xd8d16) [0x7fee4bc62d16]
=================================
[5653] signal 11 (-6): Segmentation fault
in expression starting at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl:42
unknown function (ip: 0x7feee075a29e)
unknown function (ip: 0x7feee066f1e4)
unknown function (ip: 0x7feee07a5838)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/GPUToolbox/XaIIx/src/ccalls.jl:143 [inlined]
unchecked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/libcuda.jl:4076 [inlined]
#991 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:25
retry_reclaim at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:24
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:60
CuModule at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/lib/cudadrv/module.jl:49 [inlined]
link at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/compilation.jl:409
unknown function (ip: 0x7ff096ed866a)
actual_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:270
unknown function (ip: 0x7ff09701a805)
cached_compilation at /home/ext_xinkai_caltech_edu/.julia/packages/GPUCompiler/bTNLD/src/execution.jl:159
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:373 [inlined]
macro expansion at ./lock.jl:273 [inlined]
#cufunction#1210 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:368
cufunction at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:365
unknown function (ip: 0x7ff096ff9122)
macro expansion at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/compiler/execution.jl:112 [inlined]
#_#7 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:124
Kernel at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:110 [inlined]
fill! at /home/ext_xinkai_caltech_edu/.julia/packages/GPUArrays/ZRk7Q/src/host/construction.jl:22
unknown function (ip: 0x7ff096febe7d)
#zeros#3 at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/CUDA/OnIOF/src/CUDAKernels.jl:25
unknown function (ip: 0x7ff096feaf16)
#zeros#5 at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:564 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/zeros_and_ones.jl:9 [inlined]
zeros at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/DistributedComputations/distributed_architectures.jl:319 [inlined]
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:70
new_data at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Grids/new_data.jl:75 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:189
#_#12 at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:186 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
Field at /home/ext_xinkai_caltech_edu/.julia/packages/Oceananigans/yZVc9/src/Fields/field.jl:182 [inlined]
#regrid_bathymetry#4 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:185
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:153 [inlined]
#regrid_bathymetry#3 at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:148 [inlined]
regrid_bathymetry at /home/ext_xinkai_caltech_edu/1deg_simulation/ClimaOcean.jl/src/Bathymetry.jl:146
unknown function (ip: 0x7feefc8d4536)
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
_include at ./loading.jl:2794
include at ./Base.jl:562
jfptr_include_46943.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:323
_start at ./client.jl:531
jfptr__start_73597.1 at /home/ext_xinkai_caltech_edu/julia-1.11.6/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 767108252 (Pool: 767105920; Big: 2332); GC: 351
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 5654 on node hpc12-a3mega8gnodese-0 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
A potentially important note: I am running the script with the Manifest from this branch, https://github.com/CliMA/ClimaOcean.jl/tree/xk/oneeighth-degree-simulation, which uses Oceananigans v0.99.2 and ClimaSeaIce v0.3.7.
Here is my job script for reference, since CUDA-aware MPI could be the problem:
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --partition=a3mega
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1
#SBATCH --exclusive
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
export JULIA_CUDA_MEMORY_POOL=none
export OPAL_PREFIX="/sw/openmpi-5.0.5"
export PATH="/sw/openmpi-5.0.5/bin:$PATH"
export LD_LIBRARY_PATH="/sw/openmpi-5.0.5/lib:$LD_LIBRARY_PATH"
export JULIA_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
cd ~/1deg_simulation/ClimaOcean.jl
~/julia-1.11.6/bin/julia -e 'using Pkg; Pkg.add("CUDA"); using CUDA; CUDA.set_runtime_version!(local_toolkit=true)'
~/julia-1.11.6/bin/julia -e 'using Pkg; Pkg.add("MPIPreferences"); using MPIPreferences; use_system_binary(library_names="/sw/openmpi-5.0.5/lib/libmpi", mpiexec="/sw/openmpi-5.0.5/bin/mpiexec", force=true)'
~/julia-1.11.6/bin/julia --project -e 'using CUDA; CUDA.precompile_runtime()'
~/julia-1.11.6/bin/julia --project -e 'using Pkg; Pkg.status()'
/sw/openmpi-5.0.5/bin/mpiexec -n 4 ~/julia-1.11.6/bin/julia --project ./experiments/omip_prototype/oneeighth_degree_simulation_minimal.jl
Here's what I get when I run CUDA.versioninfo() over MPI:
CUDA toolchain:
- runtime 12.4, local installation
- driver 550.90.7 for 13.0
- compiler 13.0
CUDA libraries:
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 2024.1.1 (API 12.4.0)
- NVML: 12.0.0+550.90.7
Julia packages:
- CUDA: 5.9.0
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0
Toolchain:
- Julia: 1.11.6
- LLVM: 16.0.6
Environment:
- JULIA_CUDA_MEMORY_POOL: none
Preferences:
- CUDA_Runtime_jll.local: true
4 devices:
0: NVIDIA H100 80GB HBM3 (sm_90, 79.093 GiB / 79.647 GiB available)
1: NVIDIA H100 80GB HBM3 (sm_90, 79.093 GiB / 79.647 GiB available)
2: NVIDIA H100 80GB HBM3 (sm_90, 79.093 GiB / 79.647 GiB available)
3: NVIDIA H100 80GB HBM3 (sm_90, 79.093 GiB / 79.647 GiB available)
The other three ranks print the same information (each rank sees all 4 devices); their output was interleaved on stdout and is omitted here.
@simone-silvestri @navidcy @taimoorsohail
Also cc'ing @akshaysridhar!
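Since CUDA-aware MPI is one of the suspects, a pre-flight check along the following lines could be added to the job script before the mpiexec line. This is only a sketch: it assumes the `ompi_info` on PATH belongs to the same Open MPI install the job uses, it relies on the `mpi_built_with_cuda_support` parameter that Open MPI documents for this purpose, and `parse_cuda_support` is a hypothetical helper name introduced here for illustration. It reports what the MPI build claims, not whether the UCX layer underneath actually handles GPU buffers correctly.

```bash
#!/bin/bash
# Pre-flight sketch: report whether the Open MPI on PATH claims CUDA support.

# Hypothetical helper: read `ompi_info --parsable --all` output on stdin and
# print the value of mpi_built_with_cuda_support ("true"/"false"),
# or "unknown" if the parameter is not present in the output.
parse_cuda_support() {
    awk -F: '/mpi_built_with_cuda_support:value/ { print $NF; found = 1 }
             END { if (!found) print "unknown" }'
}

if command -v ompi_info >/dev/null 2>&1; then
    echo "Open MPI CUDA support: $(ompi_info --parsable --all | parse_cuda_support)"
else
    echo "ompi_info not found on PATH; cannot check CUDA support"
fi
```

If this prints `false` or `unknown`, the segfault may come from passing device buffers to an MPI build that cannot handle them; if it prints `true`, the MPI build itself is a less likely culprit and the CUDA module-loading path in the backtrace deserves a closer look.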