
Occasional dramatic performance degradation of RESetMapReduce with a large number of processes #41115

@maxale

Description


Steps To Reproduce

With the number of processes going above 100 or so, RESetMapReduce shows occasional dramatic performance degradation, as illustrated by the following benchmarks of RESetMPExample on a 384-core machine.

sage: from sage.parallel.map_reduce import RESetMPExample

sage: %timeit RESetMPExample(11).run(max_proc=40);
1.41 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit RESetMPExample(11).run(max_proc=80);
1.41 s ± 337 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit RESetMPExample(11).run(max_proc=100);
The slowest run took 5.90 times longer than the fastest. This could mean that an intermediate result is being cached.
2.23 s ± 1.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit RESetMPExample(11).run(max_proc=120);
The slowest run took 11.28 times longer than the fastest. This could mean that an intermediate result is being cached.
6.67 s ± 4.66 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit RESetMPExample(11).run(max_proc=160);
11.5 s ± 5.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: %timeit RESetMPExample(11).run(max_proc=200);
The slowest run took 68.26 times longer than the fastest. This could mean that an intermediate result is being cached.
1min 28s ± 2min 5s per loop (mean ± std. dev. of 7 runs, 1 loop each)
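The sweep above can also be scripted outside of `%timeit`. A minimal, self-contained sketch of such a harness — using a hypothetical stand-in workload, since the real call, `RESetMPExample(11).run(max_proc=...)`, requires a Sage session — that times repeated runs across a grid of `max_proc` values and reports mean ± standard deviation:

```python
import statistics
import time

def workload(max_proc):
    # Stand-in for RESetMPExample(11).run(max_proc=max_proc);
    # replace with the real call inside a Sage session.
    return sum(i * i for i in range(10_000))

def benchmark(fn, arg, runs=7):
    """Return (mean, stdev) of wall-clock times over `runs` invocations."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

for max_proc in (40, 80, 100, 120, 160, 200):
    mean, stdev = benchmark(workload, max_proc)
    print(f"max_proc={max_proc}: {mean:.4f} s ± {stdev:.4f} s")
```

A large standard deviation relative to the mean at high `max_proc` values would reproduce the inconsistency reported here without relying on `%timeit`'s caching heuristic.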

Expected Behavior

While the running time may not necessarily improve much as the number of processes grows (Amdahl's law), we expect it not to degrade either, and certainly not many-fold. Furthermore, we expect the running time to be more or less consistent between multiple runs of the same code.

Actual Behavior

With the number of processes below 100 or so, benchmarks show consistent running time measurements. However, when the number of processes goes above that threshold, two issues appear:

  • running time varies significantly between runs of the same code (some runs may be very fast, while others may be very slow);
  • in the worst case, running time can be hundreds of times as large as that for a small number of processes.

Additional Information

As the following experiment shows, "an intermediate result is being cached" does not explain the inconsistent running times between runs:

sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 15 ms, sys: 362 ms, total: 377 ms
Wall time: 3.04 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 67.7 ms, sys: 397 ms, total: 465 ms
Wall time: 3.65 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 11 ms, sys: 373 ms, total: 384 ms
Wall time: 4.4 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 8.05 ms, sys: 359 ms, total: 367 ms
Wall time: 1.04 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 13.7 ms, sys: 424 ms, total: 438 ms
Wall time: 2.13 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 13.8 ms, sys: 365 ms, total: 379 ms
Wall time: 3.14 s

Here the running time randomly jumps up and down between the runs, which is not what we'd expect from caching.
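One way to make this argument quantitative: with caching, the first run should be slow and subsequent runs consistently faster (a monotone decrease), whereas with random jitter the coefficient of variation stays high with no downward trend. A hedged sketch of such a check — again with a hypothetical stand-in workload, since the real timed call would be `RESetMPExample(11).run(max_proc=100)` inside Sage:

```python
import random
import statistics
import time

def timed(fn):
    """Wall-clock time of a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def jitter_profile(fn, runs=6):
    """Collect wall times and report whether they trend monotonically
    downward (caching-like) or merely fluctuate (jitter-like)."""
    times = [timed(fn) for _ in range(runs)]
    cv = statistics.stdev(times) / statistics.mean(times)
    monotone_down = all(a >= b for a, b in zip(times, times[1:]))
    return times, cv, monotone_down

def noisy_workload():
    # Stand-in: a task whose duration varies from run to run.
    time.sleep(random.uniform(0.001, 0.005))

times, cv, monotone_down = jitter_profile(noisy_workload)
print(f"coefficient of variation = {cv:.2f}, monotone decrease = {monotone_down}")
```

Applied to the `%time` measurements above (3.04 s, 3.65 s, 4.4 s, 1.04 s, 2.13 s, 3.14 s), the sequence is clearly not monotonically decreasing, consistent with jitter rather than caching.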

Environment

  • OS: Rocky Linux 8.10 (Green Obsidian)
  • Sage Version: 10.7

Checklist

  • I have searched the existing issues for a bug report that matches the one I want to file, without success.
  • I have read the documentation and troubleshooting guide
