throughput of jobs slows down over time - part deux #6419
Can you retry without the job history plugin?
Yeah, I think the history plugin is most of the issue here. Without it, I'm seeing more of a gradual, linear slowdown. That makes sense, since the history plugin adds jobs to a list without bound.
And actually, if I restart the test, the throughput is (mostly) restored. So the slowdown is probably 100% attributable to the history plugin.
Good guess, and I'm thinking ... Also, I'm wondering if that ...
I probably wouldn't spend too much time on it, since in a high-throughput scenario you can just unload the plugin.
Yeah, I think these insert flags are backwards: the lists are sorted with bigger t_submit values at the front, but inserts search from the back. Trying this out.
Problem: Internal job history lists are sorted with larger t_submit values at the front of the list. When new jobs are inserted (i.e. jobs with bigger t_submit values), the insertion search starts at the back of the list. This is the opposite of what should be done and leads to a slowdown in job throughput over time.

Solution: Insert into the job history list starting at the front.

Fixes flux-framework#6419
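For illustration, here is a minimal, self-contained sketch of that failure mode. This is not flux-core source: it assumes the history list behaves like czmq's zlistx (sorted with larger t_submit first), and the comparator, sizes, and variable names are mine.

```c
/* Sketch of the described bug, assuming a czmq zlistx sorted so that
 * larger t_submit values come first. Jobs arrive with monotonically
 * increasing t_submit, so every new entry belongs at the FRONT of the
 * list. Telling zlistx_insert() to start its search from the back
 * (low_value = false) then scans the whole list on every insert:
 * O(n) per job, O(n^2) overall. Searching from the front
 * (low_value = true) finds the spot immediately.
 *
 * Build: gcc demo.c -o demo $(pkg-config --cflags --libs libczmq)
 */
#include <czmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Comparator placing larger "t_submit" values toward the list front. */
static int cmp (const void *a, const void *b)
{
    double ta = *(const double *)a;
    double tb = *(const double *)b;
    return (ta > tb) ? -1 : (ta < tb) ? 1 : 0;
}

int main (void)
{
    const int n = 20000;
    double *t_submit = malloc (n * sizeof (double));
    for (int i = 0; i < n; i++)
        t_submit[i] = (double)i;        /* jobs arrive in t_submit order */

    for (int pass = 0; pass < 2; pass++) {
        bool low_value = (pass == 1);   /* false: search from the back */
        zlistx_t *list = zlistx_new ();
        zlistx_set_comparator (list, cmp);

        clock_t start = clock ();
        for (int i = 0; i < n; i++)
            zlistx_insert (list, &t_submit[i], low_value);
        double secs = (double)(clock () - start) / CLOCKS_PER_SEC;

        printf ("low_value=%-5s (search from %s): %.2fs\n",
                low_value ? "true" : "false",
                low_value ? "front" : "back",
                secs);
        zlistx_destroy (&list);
    }
    free (t_submit);
    return 0;
}
```

Under these assumptions, the search-from-back variant pays O(n) comparisons per insert (quadratic in total jobs), while the search-from-front variant finds the insertion point in one step, which is consistent with both the observed per-iteration slowdown and the fix of inserting starting at the front.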
While working on #6414, I noticed that job throughput degrades over time (similar to the old #3583).
Running a throughput test of 8192 jobs over and over again in an instance via this old script:
(I broke out of the test after 12 iterations, not wanting to wait any longer.)
Some degradation is to be expected as memory gets eaten up and whatnot, but this is a larger decrease than I'd expect. We're still under 100K total jobs in the results above, so it shouldn't be hogging too much memory. Perhaps a round of perf analysis would be good; we may have a bottleneck somewhere.