Description
Steps To Reproduce
When the number of processes exceeds roughly 100, RESetMapReduce shows occasional dramatic performance degradation, as illustrated by the following benchmarks of RESetMPExample on a 384-core machine:
sage: from sage.parallel.map_reduce import RESetMPExample
sage: %timeit RESetMPExample(11).run(max_proc=40);
1.41 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit RESetMPExample(11).run(max_proc=80);
1.41 s ± 337 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit RESetMPExample(11).run(max_proc=100);
The slowest run took 5.90 times longer than the fastest. This could mean that an intermediate result is being cached.
2.23 s ± 1.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit RESetMPExample(11).run(max_proc=120);
The slowest run took 11.28 times longer than the fastest. This could mean that an intermediate result is being cached.
6.67 s ± 4.66 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit RESetMPExample(11).run(max_proc=160);
11.5 s ± 5.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit RESetMPExample(11).run(max_proc=200);
The slowest run took 68.26 times longer than the fastest. This could mean that an intermediate result is being cached.
1min 28s ± 2min 5s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Expected Behavior
While the running time may not necessarily improve much as the number of processes grows (Amdahl's law), we expect it not to degrade either, and certainly not by a factor of several. Furthermore, we expect the running time to be more or less consistent between multiple runs of the same code.
Actual Behavior
With the number of processes below 100 or so, benchmarks show consistent running time measurements. However, when the number of processes goes above that threshold, two issues appear:
- running time varies significantly between runs of the same code (some runs may be very fast, while others may be very slow);
- in the worst case, running time can be hundreds of times as large as that for a small number of processes.
Additional Information
As the following experiment shows, "an intermediate result is being cached" is not an explanation for inconsistent running time between the runs:
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 15 ms, sys: 362 ms, total: 377 ms
Wall time: 3.04 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 67.7 ms, sys: 397 ms, total: 465 ms
Wall time: 3.65 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 11 ms, sys: 373 ms, total: 384 ms
Wall time: 4.4 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 8.05 ms, sys: 359 ms, total: 367 ms
Wall time: 1.04 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 13.7 ms, sys: 424 ms, total: 438 ms
Wall time: 2.13 s
sage: %time RESetMPExample(11).run(max_proc=100);
CPU times: user 13.8 ms, sys: 365 ms, total: 379 ms
Wall time: 3.14 s
Here the running time randomly jumps up and down between the runs, which is not what we'd expect from caching.
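To make the run-to-run spread easier to report without relying on the %timeit heuristics, one could time each run individually. The following is only a minimal sketch: the helper names wall_times and summarize are hypothetical (they are not part of Sage), and the job is passed in as a plain callable so that the helpers themselves do not depend on Sage.

```python
import statistics
import time


def wall_times(job, repeats=7):
    """Run job() `repeats` times and return the wall-clock time of each run."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return min, max, mean, and the slowest/fastest ratio of the runs."""
    return {
        "min": min(times),
        "max": max(times),
        "mean": statistics.mean(times),
        "slowest_over_fastest": max(times) / min(times),
    }
```

In a Sage session this could be called as summarize(wall_times(lambda: RESetMPExample(11).run(max_proc=100))), and repeated for other values of max_proc to compare the spread across process counts.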
Environment
- OS: Rocky Linux 8.10 (Green Obsidian)
- Sage Version: 10.7
Checklist
- I have searched the existing issues for a bug report that matches the one I want to file, without success.
- I have read the documentation and troubleshooting guide.