Run importers in parallel within a single importer instance #1278
base: main
Conversation
I think we should have a concept of what parallel means.
I think it makes sense to:
- Allow running multiple instances of the importer
- Allow running multiple importer runs in parallel
But I don't think it makes sense to run a single importer in parallel mode.
Sure
I'm thinking we'll need another column in the
Can you clarify this? Are you referring to the
You're specifically referring to the
Also ... I wonder at what point we would consider bringing in a job queue crate versus building one ourselves ... I have no experience with any Rust crates for this, so maybe there are none fit for purpose?
I'm thinking that threshold is very high, because if we're requiring a distributed job queue, then I expect any crate would just wrap redis/rabbit/etc.
I think there are a bunch of ways we could deal with this. One way could be a Postgres-based locking mechanism. Downside: it only works with Postgres, which might be ok. Another way could be a hash approach: have a max number of importers and an individual instance number, and then only process the jobs assigned to that instance. That should work with a StatefulSet. We could also hold state in the database, but then we'd need to define such things as timeouts, stale detection, etc. A sketch of the locking option follows below.
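For illustration, here's a minimal sketch of the Postgres-based locking option. This is an assumption, not code from this PR: it assumes sqlx with the "postgres" feature, and the `try_claim` name and key derivation are hypothetical.

```rust
use sqlx::postgres::PgPool;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a 64-bit lock key from the importer name. A real implementation
/// would want a hash that is stable across builds and instances.
fn lock_key(importer: &str) -> i64 {
    let mut h = DefaultHasher::new();
    importer.hash(&mut h);
    h.finish() as i64
}

/// Try to claim an importer via a session-level advisory lock. On success,
/// returns the connection holding the lock; the caller must run
/// `SELECT pg_advisory_unlock($1)` on that same connection when the run
/// finishes, since returning a pooled connection does not close the session.
async fn try_claim(
    pool: &PgPool,
    importer: &str,
) -> Result<Option<sqlx::pool::PoolConnection<sqlx::Postgres>>, sqlx::Error> {
    // Advisory locks are tied to the session, so use a dedicated connection.
    let mut conn = pool.acquire().await?;
    let locked: bool = sqlx::query_scalar("SELECT pg_try_advisory_lock($1)")
        .bind(lock_key(importer))
        .fetch_one(&mut *conn)
        .await?;
    Ok(if locked { Some(conn) } else { None })
}
```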
If we think StatefulSet is the best option, then I'll re-assign this to you. :)
StatefulSet is just a way of ensuring we have X number of importers running and of assigning ascending numbers to those instances. The input to the importer would be "x of y", which could be transported via env vars, even when using Ansible, with a default of "1 of 1", which means "process them all". A rough sketch of that scheme is below.
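As a rough sketch of that "x of y" scheme (an assumption, not code from this PR): the env var names and importer names below are made up, and the hash is only illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Read "x of y" from the environment, defaulting to instance 0 of 1,
/// i.e. "process them all". Env var names are hypothetical.
fn instance_config() -> (u64, u64) {
    let index = std::env::var("IMPORTER_INSTANCE_INDEX")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0);
    let total = std::env::var("IMPORTER_INSTANCE_COUNT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1);
    (index, total.max(1))
}

/// Decide whether this instance is responsible for a given importer.
/// DefaultHasher is fine for a sketch; a production version would want a
/// hash that is stable across builds.
fn is_assigned(importer_name: &str, index: u64, total: u64) -> bool {
    let mut h = DefaultHasher::new();
    importer_name.hash(&mut h);
    h.finish() % total == index
}

fn main() {
    let (index, total) = instance_config();
    // Example importer names, for illustration only.
    for name in ["csaf", "sbom", "osv"] {
        if is_assigned(name, index, total) {
            println!("instance {index}/{total} handles importer {name}");
        }
    }
}
```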
To my understanding, this PR simply checks all importers in parallel, with no limitation, waiting for the joint outcome of a single check run. I think this would result in the case where one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes; but afterwards, no. I think we'd need a strategy first, with the goal mentioned above, before having an implementation.
True. So it's no worse than what we have, but it does address the primary motivation for this PR: QE will only have to wait as long as the longest enabled importer takes for them all to be finished. Plus, better logging.
Agree. Still thinking about that.
I would not want to merge something which only suits some specific QE task, potentially causing issues for the overall system. Like the case where most of the time it runs sequentially, while sometimes it runs in parallel, without any way of controlling that behavior.
It's not a QE task. They just happen to be the first users who care about how long it takes the app to import data. We already have a means to disable importers. I think there's a cancellation mechanism in there, no? I don't love that those git-based importers are sync and block threads, but it's not like we expect to have hundreds of them. Is your reluctance so high that you'd like me to close this PR?
I added a limitation. Defaults to the old behavior: concurrency=1. |
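For context, a minimal sketch of what bounding the parallelism could look like. This is an assumption about the shape of the change, not the PR's actual code; it assumes the futures and tokio crates, and `run_importer` and the importer names are stand-ins.

```rust
use futures::stream::{self, StreamExt};

async fn run_importer(name: &str) {
    // Placeholder for a real importer run.
    println!("running importer {name}");
}

/// Run importers with at most `concurrency` in flight;
/// concurrency=1 reproduces the old sequential behavior.
async fn run_all(importers: Vec<String>, concurrency: usize) {
    stream::iter(importers)
        .for_each_concurrent(concurrency.max(1), |name| async move {
            run_importer(&name).await;
        })
        .await;
}

#[tokio::main]
async fn main() {
    let importers = vec!["csaf".into(), "sbom".into(), "osv".into()];
    run_all(importers, 1).await; // concurrency=1 keeps the old behavior
}
```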
Signed-off-by: Jim Crossley <[email protected]>
Partial impl of #1207