Communication between ranks failing when using DaCe orchestration. #52

Closed
fmalatino opened this issue Jun 7, 2024 · 0 comments · Fixed by #53

Comments

@fmalatino
Contributor

Describe the bug
When running in parallel, the test fails with the error: TypeError: Expected Node, got SerializableObject. The error was observed when running the orchestrated dace:cpu Acoustics test in the pyFV3 GitHub workflow.

Failure output:

FAILED tests/savepoint/test_translate.py::test_parallel_savepoint[DynCore-rank=5-call=0] - TypeError: Expected Node, got SerializableObject (<dace.serialize.SerializableObject object at 0x7efbed4f0690>)
================================================== 1 failed, 1 deselected, 232 warnings in 29.09s ===================================================
FAILED

===================================================================== FAILURES ======================================================================
__________________________________________________ test_parallel_savepoint[DynCore-rank=3-call=0] ___________________________________________________

self = <gt4py.cartesian.backend.dace_backend.SDFGManager object at 0x7fbf8d36a710>

    def _frozen_sdfg(self, *, origin: Dict[str, Tuple[int, ...]], domain: Tuple[int, ...]):
        frozen_hash = shash(origin, domain)
        # check if same sdfg already cached on disk
        path = self.builder.module_path
        basename = os.path.splitext(path)[0]
        path = basename + "_" + str(frozen_hash) + ".sdfg"
        if path not in self._loaded_sdfgs:
            try:
>               sdfg = dace.SDFG.from_file(path)

../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_backend.py:398: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

filename = '/home/Frank.Malatino/pyfv3_fork/.gt_cache_FV3_A/py311_1013/dacecpu/pyFV3/stencils/c_sw/zero_delpc_ptc/m_zero_delpc_ptc__dacecpu_a004db179b_a38a10fd19fc8a4ef448d512c7b50a6ed8c336c75b461b0e75f5279eabc1c222.sdfg'

    @staticmethod
    def from_file(filename: str) -> 'SDFG':
        """ Constructs an SDFG from a file.
    
            :param filename: File name to load SDFG from.
            :return: An SDFG.
        """
        # Try compressed first. If fails, try uncompressed
        try:
            with gzip.open(filename, 'rb') as fp:
                return SDFG._from_file(fp)
        except OSError:
            pass
>       with open(filename, "rb") as fp:
E       FileNotFoundError: [Errno 2] No such file or directory: '/home/Frank.Malatino/pyfv3_fork/.gt_cache_FV3_A/py311_1013/dacecpu/pyFV3/stencils/c_sw/zero_delpc_ptc/m_zero_delpc_ptc__dacecpu_a004db179b_a38a10fd19fc8a4ef448d512c7b50a6ed8c336c75b461b0e75f5279eabc1c222.sdfg'

../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/sdfg.py:1589: FileNotFoundError

During handling of the above exception, another exception occurred:

case = SavepointCase(savepoint_name='DynCore', data_dir='./test_data/8.1.3/c12_6ranks_standard/dycore', rank=3, i_call=0, tes...te_dyncore.TranslateDynCore object at 0x7fbf90fc7050>, grid=<ndsl.stencils.testing.grid.Grid object at 0x7fbf90ebfe50>)
backend = 'dace:cpu', print_failures = False, failure_stride = 1
subtests = SubTests(ihook=<_pytest.config.compat.PathAwareHookProxy object at 0x7fbfd7ae0f90>, suspend_capture_ctx=<bound method ..._capture_fixture=None>>, request=<SubRequest 'subtests' for <Function test_parallel_savepoint[DynCore-rank=3-call=0]>>)
caplog = <_pytest.logging.LogCaptureFixture object at 0x7fbf8fe57bd0>
threshold_overrides = {'A2B_Ord4': [{'backend': 'dace:cpu', 'max_error': '1e-13'}], 'CS_Profile_2d': [{'backend': 'gt:gpu', 'max_error': 2.5...'cuda', 'max_error': '3e-8', 'near_zero': 1.5e-14}], 'Del2Cubed': [{'backend': 'dace:cpu', 'max_error': '8e-12'}], ...}
grid = 'file', xy_indices = True

    @pytest.mark.parallel
    @pytest.mark.skipif(
        MPI is None or MPI.COMM_WORLD.Get_size() == 1,
        reason="Not running in parallel with mpi",
    )
    def test_parallel_savepoint(
        case: SavepointCase,
        backend,
        print_failures,
        failure_stride,
        subtests,
        caplog,
        threshold_overrides,
        grid,
        xy_indices=True,
    ):
        if MPI.COMM_WORLD.Get_size() % 6 != 0:
            layout = (
                int(MPI.COMM_WORLD.Get_size() ** 0.5),
                int(MPI.COMM_WORLD.Get_size() ** 0.5),
            )
            communicator = get_tile_communicator(MPI.COMM_WORLD, layout)
        else:
            layout = (
                int((MPI.COMM_WORLD.Get_size() // 6) ** 0.5),
                int((MPI.COMM_WORLD.Get_size() // 6) ** 0.5),
            )
            communicator = get_communicator(MPI.COMM_WORLD, layout)
        if case.testobj is None:
            pytest.xfail(
                f"no translate object available for savepoint {case.savepoint_name}"
            )
        stencil_config = StencilConfig(
            compilation_config=CompilationConfig(backend=backend),
            dace_config=DaceConfig(
                communicator=communicator,
                backend=backend,
            ),
        )
        # Increase minimum error threshold for GPU
        if stencil_config.is_gpu_backend:
            case.testobj.max_error = max(case.testobj.max_error, GPU_MAX_ERR)
            case.testobj.near_zero = max(case.testobj.near_zero, GPU_NEAR_ZERO)
        if threshold_overrides is not None:
            process_override(
                threshold_overrides, case.testobj, case.savepoint_name, backend
            )
        if case.testobj.skip_test:
            return
        if (grid == "compute") and not case.testobj.compute_grid_option:
            pytest.xfail(f"Grid compute option not used for test {case.savepoint_name}")
        input_data = dataset_to_dict(case.ds_in)
        # run python version of functionality
>       output = case.testobj.compute_parallel(input_data, communicator)

../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/stencils/testing/test_translate.py:396: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyFV3/testing/translate_dyncore.py:177: in compute_parallel
    acoustic_dynamics(state, timestep=inputs["mdt"], n_map=state.n_map)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:498: in __call__
    return wrapped(*arg, **kwarg)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:399: in __call__
    sdfg = _parse_sdfg(
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:295: in _parse_sdfg
    sdfg = daceprog.to_sdfg(
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:276: in to_sdfg
    sdfg = self._parse(args, kwargs, simplify=simplify, save=save, validate=validate)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:492: in _parse
    sdfg, cached = self._generate_pdp(args, kwargs, simplify=simplify)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:900: in _generate_pdp
    sdfg = newast.parse_dace_program(self.name,
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:238: in parse_dace_program
    sdfg, _, _, _ = visitor.parse_program(preprocessed_ast.preprocessed_ast.body[0])
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:1230: in parse_program
    self.visit_TopLevel(stmt)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/astutils.py:489: in visit_TopLevel
    return self.visit(node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:1206: in visit
    return super().visit(node)
../.conda/envs/pyftest/lib/python3.11/ast.py:418: in visit
    return visitor(node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:2327: in visit_For
    laststate, first_loop_state, last_loop_state, _ = self._recursive_visit(node.body,
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:2192: in _recursive_visit
    self.visit_TopLevel(stmt)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/astutils.py:487: in visit_TopLevel
    return visitor(node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:4625: in visit_TopLevelExpr
    self.visit_Call(node.value)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:4451: in visit_Call
    return self._parse_sdfg_call(funcname, func, node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:3769: in _parse_sdfg_call
    raise ex
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:3746: in _parse_sdfg_call
    sdfg = fcopy.__sdfg__(*fargs, **fkwargs)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:501: in __sdfg__
    return wrapped.__sdfg__(*args, **kwargs)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:414: in __sdfg__
    return _parse_sdfg(self.daceprog, self.lazy_method.config, *args, **kwargs)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/dace/orchestration.py:295: in _parse_sdfg
    sdfg = daceprog.to_sdfg(
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:276: in to_sdfg
    sdfg = self._parse(args, kwargs, simplify=simplify, save=save, validate=validate)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:492: in _parse
    sdfg, cached = self._generate_pdp(args, kwargs, simplify=simplify)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/parser.py:900: in _generate_pdp
    sdfg = newast.parse_dace_program(self.name,
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:238: in parse_dace_program
    sdfg, _, _, _ = visitor.parse_program(preprocessed_ast.preprocessed_ast.body[0])
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:1230: in parse_program
    self.visit_TopLevel(stmt)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/astutils.py:487: in visit_TopLevel
    return visitor(node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:4625: in visit_TopLevelExpr
    self.visit_Call(node.value)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:4451: in visit_Call
    return self._parse_sdfg_call(funcname, func, node)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:3769: in _parse_sdfg_call
    raise ex
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/frontend/python/newast.py:3746: in _parse_sdfg_call
    sdfg = fcopy.__sdfg__(*fargs, **fkwargs)
../.conda/envs/pyftest/lib/python3.11/site-packages/ndsl/dsl/stencil.py:506: in __sdfg__
    return self.stencil_object.__sdfg__(
../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_lazy_stencil.py:66: in __sdfg__
    sdfg = sdfg_manager.frozen_sdfg(origin=norm_kwargs["origin"], domain=norm_kwargs["domain"])
../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_backend.py:415: in frozen_sdfg
    return copy.deepcopy(self._frozen_sdfg(origin=origin, domain=domain))
../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_backend.py:401: in _frozen_sdfg
    inner_sdfg = self.unexpanded_sdfg()
../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_backend.py:379: in unexpanded_sdfg
    return copy.deepcopy(self._unexpanded_sdfg())
../.conda/envs/pyftest/lib/python3.11/site-packages/gt4py/cartesian/backend/dace_backend.py:358: in _unexpanded_sdfg
    sdfg = dace.SDFG.from_file(path)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/sdfg.py:1590: in from_file
    return SDFG._from_file(fp)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/sdfg.py:1568: in _from_file
    sdfg = SDFG.from_json(sdfg_json)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/sdfg.py:608: in from_json
    state = SDFGState.from_json(n, context=nci)
../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/state.py:1380: in from_json
    ret.add_node(nret)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = SDFGState (state_0), node = <dace.serialize.SerializableObject object at 0x7fbf8d117210>

    def add_node(self, node):
        if not isinstance(node, nd.Node):
>           raise TypeError("Expected Node, got " + type(node).__name__ + " (" + str(node) + ")")
E           TypeError: Expected Node, got SerializableObject (<dace.serialize.SerializableObject object at 0x7fbf8d117210>)

../.conda/envs/pyftest/lib/python3.11/site-packages/dace/sdfg/state.py:1280: TypeError
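
For readers following the first half of the traceback, here is a minimal sketch of the "try compressed first, then uncompressed" pattern that SDFG.from_file uses there. It shows why a missing cache file surfaces as a FileNotFoundError from the second, plain open() call: FileNotFoundError is a subclass of OSError, so the first failure inside the gzip attempt is silently swallowed. The names read_maybe_gzipped and demo_missing_file are hypothetical and exist only for this illustration; this is not DaCe code and it does not reproduce the later TypeError.

```python
import gzip
import os
import tempfile


def read_maybe_gzipped(filename: str) -> bytes:
    """Hypothetical helper mirroring the fallback pattern seen in the traceback."""
    # Try compressed first. If that fails, fall through to the plain read.
    try:
        with gzip.open(filename, "rb") as fp:
            return fp.read()
    except OSError:
        # Both FileNotFoundError and gzip.BadGzipFile derive from OSError,
        # so a missing file and a non-gzip file are both swallowed here.
        pass
    with open(filename, "rb") as fp:
        return fp.read()


def demo_missing_file() -> str:
    """Return the name of the exception raised for a path that does not exist."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "missing.sdfg")
        try:
            read_maybe_gzipped(missing)
        except FileNotFoundError as exc:
            return type(exc).__name__
    return "no error"


if __name__ == "__main__":
    print(demo_missing_file())
```

In the traceback above, that FileNotFoundError is then handled by the caller in gt4py, which falls back to loading the unexpanded SDFG, and it is that second load that raises the TypeError.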

To Reproduce

cd test_data
wget https://portal.nccs.nasa.gov/datashare/astg/smt/pace-regression-data/8.1.3_c12_6ranks_standard.DynCore.tar.gz
tar -xzvf 8.1.3_c12_6ranks_standard.DynCore.tar.gz
cd -
export FV3_DACEMODE=BuildAndRun
export PACE_FLOAT_PRECISION=64
export PACE_TEST_N_THRESHOLD_SAMPLES=0
export OMP_NUM_THREADS=1
export PACE_LOGLEVEL=Debug
mpirun -mca orte_abort_on_non_zero_status 1 -np 6 --oversubscribe pytest \
-v -s --data_path=./test_data/8.1.3/c12_6ranks_standard/dycore \
--backend=dace:cpu \
-m parallel \
--which_rank=0 \
--which_modules=DynCore \
--threshold_overrides_file=./tests/savepoint/translate/overrides/standard.yaml \
./tests/savepoint
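
As a reading aid for the command above, the following sketch restates the layout arithmetic from test_parallel_savepoint in the traceback as a standalone function, so the effect of -np 6 can be checked without MPI. The function name layout_for and the returned label strings are assumptions made for this illustration; only the arithmetic and the two communicator function names are taken from the test source shown earlier.

```python
from typing import Tuple


def layout_for(n_ranks: int) -> Tuple[Tuple[int, int], str]:
    """Hypothetical restatement of the layout branch in test_parallel_savepoint.

    Returns the (x, y) layout and the name of the communicator factory
    that the test would call for this number of ranks.
    """
    if n_ranks % 6 != 0:
        # Rank count is not a multiple of six: single-tile communicator.
        side = int(n_ranks ** 0.5)
        return (side, side), "get_tile_communicator"
    # Rank count is a multiple of six: ranks are split evenly over six tiles.
    side = int((n_ranks // 6) ** 0.5)
    return (side, side), "get_communicator"


if __name__ == "__main__":
    for n in (4, 6, 24):
        print(n, layout_for(n))
```

With six ranks the layout is (1, 1) and the test takes the get_communicator branch, which is the configuration in which the failure was reported.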

Expected behavior

System Environment
Describe the system environment, include:

  • OS: Red Hat Enterprise Linux 8.10 (Ootpa)
  • Backend used: dace:cpu
  • Environment variables set:
      FV3_DACEMODE=BuildAndRun
      PACE_FLOAT_PRECISION=64
      PACE_TEST_N_THRESHOLD_SAMPLES=0
      OMP_NUM_THREADS=1
      PACE_LOGLEVEL=Debug
  • Compiler(s):
      gcc/12.3.0
      openucx/1.15.0
      python/3.11.7
  • MPI type, and version: openmpi-ucx/5.0.0
  • netCDF Version: netcdf/4.9.2