Open
Labels
Performance optimization, backend: openmp, component: diagnostics, tracking: particles
Description
The diagnostics code in reduced_beam_characteristics(pc) is too slow. In 1-MPI-rank simulations like the HTU beamline, setting sim.particle_container().store_beam_moments = True makes it dominate the runtime: it takes about 1.5x as long as the next most costly element of the actual simulation.
TinyProfiler total time across processes [min...avg...max]: 0.02604 ... 0.02604 ... 0.02604
-------------------------------------------------------------------------------------------------------
Name                                                    NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
-------------------------------------------------------------------------------------------------------
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push::ChrQuad                                      34   0.007997   0.007997   0.007997  30.71%
impactx::Push::ExactDrift                                   33   0.001654   0.001654   0.001654   6.35%
impactx::Push::ExactSbend                                    5  0.0004234  0.0004234  0.0004234   1.63%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
ImpactX::evolve::slice_step                                 91  0.0003815  0.0003815  0.0003815   1.47%
ImpactX::add_particles                                       1  0.0003395  0.0003395  0.0003395   1.30%
impactx::Push::Kicker                                        8  0.0002024  0.0002024  0.0002024   0.78%
ImpactXParticleContainer::record_beam_moments               91  0.0001794  0.0001794  0.0001794   0.69%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
ImpactX::track_particles                                     1   3.08e-05   3.08e-05   3.08e-05   0.12%
impactx::Push                                               91  1.807e-05  1.807e-05  1.807e-05   0.07%
AmrMesh::MakeDistributionMap()                               1  7.808e-06  7.808e-06  7.808e-06   0.03%
DistributionMapping::SFCProcessorMapDoIt()                   1  2.937e-06  2.937e-06  2.937e-06   0.01%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------
Name                                                    NCalls  Incl. Min  Incl. Avg  Incl. Max   Max %
-------------------------------------------------------------------------------------------------------
ImpactX::track_particles                                     1    0.02335    0.02335    0.02335  89.69%
ImpactX::evolve::slice_step                                 91    0.02331    0.02331    0.02331  89.52%
ImpactXParticleContainer::record_beam_moments               91    0.01215    0.01215    0.01215  46.65%
impactx::diagnostics::reduced_beam_characteristics(pc)      91    0.01197    0.01197    0.01197  45.96%
impactx::Push                                               91     0.0103     0.0103     0.0103  39.56%
impactx::Push::ChrQuad                                      34   0.007999   0.007999   0.007999  30.72%
impactx::Push::ExactDrift                                   33   0.001656   0.001656   0.001656   6.36%
impactx::Push::ExactSbend                                    5  0.0004239  0.0004239  0.0004239   1.63%
ImpactX::add_particles                                       1  0.0003912  0.0003912  0.0003912   1.50%
impactX::collect_lost_particles                             91  0.0003877  0.0003877  0.0003877   1.49%
impactx::Push::Kicker                                        8   0.000203   0.000203   0.000203   0.78%
AmrMesh::MakeDistributionMap()                               1  0.0001608  0.0001608  0.0001608   0.62%
DistributionMapping::SFCProcessorMapDoIt()                   1   0.000153   0.000153   0.000153   0.59%
DistributionMapping::LeastUsedCPUs()                         1  0.0001495  0.0001495  0.0001495   0.57%
Other                                                      357  0.0001655  0.0001655  0.0001655   0.64%
-------------------------------------------------------------------------------------------------------
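As a quick check of the "~1.5x" figure, the ratio can be recomputed from the exclusive times in the profile. This is only a sketch of the arithmetic: the dictionary and variable names are illustrative, and the values are copied by hand from the table.

```python
# Exclusive times in seconds, copied from the TinyProfiler output above.
excl_times = {
    "impactx::diagnostics::reduced_beam_characteristics(pc)": 0.01197,
    "impactx::Push::ChrQuad": 0.007997,
    "impactx::Push::ExactDrift": 0.001654,
    "impactx::Push::ExactSbend": 0.0004234,
}
total_time = 0.02604  # TinyProfiler total time across processes

# Rank regions by exclusive time, most expensive first.
ranked = sorted(excl_times.items(), key=lambda kv: kv[1], reverse=True)
(top_name, top_time), (next_name, next_time) = ranked[0], ranked[1]

ratio = top_time / next_time   # diagnostics vs. next most costly region
share = top_time / total_time  # fraction of the total runtime

print(f"{top_name}: {ratio:.2f}x of {next_name}, {share:.1%} of total")
```

The share computed this way can differ from the table's "Max %" column in the last digit, because the printed times are already rounded.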
I think that amrex::ParticleReduce is OpenMP-parallelized over particle tiles, but maybe it is not working or can be optimized?
Additionally, can some operations that are not auto-vectorized be vectorized on CPU?
Or do we just calculate/reduce way too many variables (currently: two full-Np reductions, the second one over 22 variables) and need to introduce a more fine-tuned approach, as we do for optionally calculating the (costly) eigenemittances?
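One possible direction for the "two full-Np reductions" question, sketched here under assumptions: means and centered second moments can be accumulated in a single pass per tile and then merged across tiles with the standard pairwise update formulas. This is not how amrex::ParticleReduce or ImpactX is implemented. The `Moments` class, `reduce_tiles`, and the two-variable (x, p) setup are hypothetical names for illustration, and whether a single pass is actually faster for the real 22-variable reduction would need to be measured.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Mergeable running moments of (x, p) for one tile (hypothetical helper)."""
    n: int = 0
    mean_x: float = 0.0
    mean_p: float = 0.0
    cxx: float = 0.0  # sum of (x - mean_x)^2
    cpp: float = 0.0  # sum of (p - mean_p)^2
    cxp: float = 0.0  # sum of (x - mean_x) * (p - mean_p)

    def add(self, x: float, p: float) -> None:
        """Welford-style update with one particle."""
        self.n += 1
        dx = x - self.mean_x  # deltas against the old means
        dp = p - self.mean_p
        self.mean_x += dx / self.n
        self.mean_p += dp / self.n
        # old delta times the delta against the updated mean
        self.cxx += dx * (x - self.mean_x)
        self.cpp += dp * (p - self.mean_p)
        self.cxp += dx * (p - self.mean_p)

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial results (pairwise update)."""
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        n = self.n + other.n
        dx = other.mean_x - self.mean_x
        dp = other.mean_p - self.mean_p
        w = self.n * other.n / n
        return Moments(
            n,
            self.mean_x + dx * other.n / n,
            self.mean_p + dp * other.n / n,
            self.cxx + other.cxx + dx * dx * w,
            self.cpp + other.cpp + dp * dp * w,
            self.cxp + other.cxp + dx * dp * w,
        )


def reduce_tiles(tiles):
    """One pass over every tile; partial results are merged at the end."""
    total = Moments()
    for tile in tiles:  # in a threaded code, each tile could go to one thread
        part = Moments()
        for x, p in tile:
            part.add(x, p)
        total = total.merge(part)
    return total


# Two made-up tiles of (x, p) pairs.
tiles = [[(1.0, 0.1), (2.0, 0.3)], [(3.0, 0.2), (4.0, 0.6)]]
m = reduce_tiles(tiles)
sigma_xx, sigma_pp, sigma_xp = m.cxx / m.n, m.cpp / m.n, m.cxp / m.n
print(m.mean_x, m.mean_p, sigma_xx, sigma_pp, sigma_xp)
```

The merge step avoids the cancellation problems of the naive E[xy] - E[x]E[y] form, at the cost of a few extra operations per particle.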