Comm optimization tests #1168

Merged
merged 7 commits into from
Apr 12, 2025

wsmoses
Member

@wsmoses wsmoses commented Apr 11, 2025

No description provided.

wsmoses and others added 6 commits April 12, 2025 02:05
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@glwagner
Collaborator

#1169 might have another test of interest

@glwagner
Collaborator

As an example, it could be a good idea to check that trivial kernels involve no communication at all (including no collective-permute).
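
A minimal sketch of what such a check could look like, assuming the compiled module can be dumped as text (HLO or StableHLO MLIR). It is written in Python purely for illustration and does not use any Reactant API; `COMM_OPS`, `find_communication_ops`, and the two sample IR strings are hypothetical names and inputs made up for this example.

```python
# Communication op names as they appear in HLO text (hyphenated).
# StableHLO MLIR spells the same ops with underscores, so the input is
# normalized before matching. This list is an assumption, not exhaustive.
COMM_OPS = (
    "all-reduce",
    "all-gather",
    "all-to-all",
    "reduce-scatter",
    "collective-permute",
    "collective-broadcast",
)


def find_communication_ops(ir_text: str) -> list[str]:
    """Return the sorted names of communication ops mentioned in the IR text."""
    normalized = ir_text.replace("_", "-")
    return sorted(op for op in COMM_OPS if op in normalized)


# Hypothetical IR snippets, for illustration only.
trivial_kernel_ir = "%0 = stablehlo.add %arg0, %arg1 : tensor<8xf32>"
halo_kernel_ir = '%1 = "stablehlo.collective_permute"(%0) : (tensor<8xf32>) -> tensor<8xf32>'

print(find_communication_ops(trivial_kernel_ir))
print(find_communication_ops(halo_kernel_ir))
```

A test for a trivial kernel would then assert that the returned list is empty. Plain substring matching is deliberately coarse: it also catches async variants such as `collective-permute-start`, at the cost of possible false positives on unrelated identifiers.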

@glwagner
Collaborator

PS: I am getting an error on this branch for a simple kernel (not so different from the one in #1169). The error does not occur on the latest tagged version:

E0000 00:00:1744467527.241539  701066 status_macros.cc:57] INTERNAL: RET_CHECK failure (external/xla/xla/hlo/ir/hlo_instruction.cc:361) absl::c_all_of(proto.operand_ids(), [&](int64_t id) { return instruction_map.contains(id); }) compare.5758 instruction contains invalid operand id(s)
*** Begin stack trace ***
        tsl::CurrentStackTrace()
        xla::status_macros::MakeError(char const*, int, absl::lts_20230802::StatusCode, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool, absl::lts_20230802::LogSeverity, bool)
        xla::status_macros::MakeErrorStream::Impl::GetStatus()
        xla::status_macros::MakeErrorStream::MakeErrorStreamWithOutput::operator absl::lts_20230802::StatusOr<std::__1::unique_ptr<xla::HloInstruction, std::__1::default_delete<xla::HloInstruction>>><std::__1::unique_ptr<xla::HloInstruction, std::__1::default_delete<xla::HloInstruction>>>()
        xla::HloInstruction::CreateFromProto(xla::HloInstructionProto const&, absl::lts_20230802::flat_hash_map<long long, xla::HloInstruction*, absl::lts_20230802::hash_internal::Hash<long long>, std::__1::equal_to<long long>, std::__1::allocator<std::__1::pair<long long const, xla::HloInstruction*>>> const&, absl::lts_20230802::flat_hash_map<long long, xla::HloComputation*, absl::lts_20230802::hash_internal::Hash<long long>, std::__1::equal_to<long long>, std::__1::allocator<std::__1::pair<long long const, xla::HloComputation*>>> const&, bool)
        xla::HloComputation::CreateFromProto(xla::HloComputationProto const&, absl::lts_20230802::flat_hash_map<long long, xla::HloComputation*, absl::lts_20230802::hash_internal::Hash<long long>, std::__1::equal_to<long long>, std::__1::allocator<std::__1::pair<long long const, xla::HloComputation*>>> const&, bool)
        xla::HloModule::CreateFromProto(xla::HloModuleProto const&, xla::HloModuleConfig const&, bool, std::__1::unique_ptr<xla::CompilationEnvironments, std::__1::default_delete<xla::CompilationEnvironments>>)
        xla::TfrtCpuClient::CompileInternal(xla::XlaComputation const&, std::__1::vector<xla::Shape const*, std::__1::allocator<xla::Shape const*>> const&, std::__1::function<absl::lts_20230802::StatusOr<std::__1::pair<std::__1::vector<xla::Shape, std::__1::allocator<xla::Shape>>, xla::Shape>> (xla::HloModule const&)>, xla::CompileOptions, xla::AotCompilationOptions const*)
        xla::TfrtCpuClient::CompileAndLoad(mlir::ModuleOp, xla::CompileOptions)
        xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::__1::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::__1::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>)
        xla::ifrt::PjRtCompiler::Compile(std::__1::unique_ptr<xla::ifrt::Program, std::__1::default_delete<xla::ifrt::Program>>, std::__1::unique_ptr<xla::ifrt::CompileOptions, std::__1::default_delete<xla::ifrt::CompileOptions>>)
        ifrt_compile
        julia_YY.33_25522
        eval_stmt_value
        eval_body
        jl_interpret_toplevel_thunk
        jl_toplevel_eval_flex
        jl_toplevel_eval_flex
        ijl_toplevel_eval_in
        japi1_include_string_72472.1
        jl_repl_entrypoint
        main
        start
*** End stack trace ***

┌ Error: Compilation failed, MLIR module written to /var/folders/pv/2k_ry3f951jghlpbnn_hcqg80000gn/T/reactant_dVLnLk/module_000_reactant_first_t..._post_xla_compile.mlir
└ @ Reactant.MLIR.IR ~/.julia/packages/Reactant/vrlvN/src/mlir/IR/Pass.jl:116
ERROR: LoadError: INTERNAL: RET_CHECK failure (external/xla/xla/hlo/ir/hlo_instruction.cc:361) absl::c_all_of(proto.operand_ids(), [&](int64_t id) { return instruction_map.contains(id); }) compare.5758 instruction contains invalid operand id(s)

Stacktrace:
 [1] reactant_err(msg::Cstring)

@glwagner
Collaborator

^^ We determined that this comes from applying @jit to something complicated, so there may be more to investigate. (Tests are passing on GB-25 for a similar problem, though possibly with fewer grid points.)

giordano added a commit to PRONTOLab/GB-25 that referenced this pull request Apr 12, 2025
@giordano
Member

https://github.com/PRONTOLab/GB-25/actions/runs/14420845002/job/40443210506?pr=179#step:10:677

ERROR: LoadError: DivideError: integer division error
Stacktrace:
  [1] mlirPassManagerRunOnOp
    @ ~/.julia/packages/Reactant/vrlvN/src/mlir/libMLIR_h.jl:8584 [inlined]
  [2] run!(pm::Reactant.MLIR.IR.PassManager, mod::Reactant.MLIR.IR.Module, key::String)
    @ Reactant.MLIR.IR ~/.julia/packages/Reactant/vrlvN/src/mlir/IR/Pass.jl:151
  [3] run_pass_pipeline!(mod::Reactant.MLIR.IR.Module, pass_pipeline::String, key::String; enable_verifier::Bool)
    @ Reactant.Compiler ~/.julia/packages/Reactant/vrlvN/src/Compiler.jl:876
  [4] run_pass_pipeline!
    @ ~/.julia/packages/Reactant/vrlvN/src/Compiler.jl:871 [inlined]
  [5] compile_mlir!(mod::Reactant.MLIR.IR.Module, f::typeof(first_time_step!), 

@wsmoses wsmoses merged commit 963d1d1 into main Apr 12, 2025
56 checks passed
@wsmoses wsmoses deleted the copt branch April 12, 2025 16:00
giordano added a commit to PRONTOLab/GB-25 that referenced this pull request Apr 12, 2025
giordano added a commit to PRONTOLab/GB-25 that referenced this pull request Apr 13, 2025
3 participants