Run importers in parallel within a single importer instance #1278

Open
wants to merge 4 commits into
base: main

Conversation

jcrossley3
Contributor

@jcrossley3 jcrossley3 commented Feb 10, 2025

Partial impl of #1207

Contributor

@ctron ctron left a comment

I think we should have a concept of what parallel means.

I think it makes sense to:

  • Allow running multiple instances of the importer instance
  • Allow running multiple importer runs in parallel

But I don't think it makes sense to run a single importer in a parallel mode.

modules/importer/src/runner/csaf/mod.rs (outdated, resolved)
@jcrossley3
Contributor Author

I think we should have a concept of what parallel means.

Sure

I think it makes sense to:

  • Allow running multiple instances of the importer instance

I'm thinking we'll need another column in the importer table to identify which instance is executing a "run" and some means of identifying a crashed/killed instance. Can you recommend a duration for a window to update a timestamp (last_run or last_success -- what's the difference?) denoting "I'm still alive" after which we can assume another instance needs to take over? Any thoughts on a unique identifier for each instance?

  • Allow running multiple importer runs in parallel

Can you clarify this? Are you referring to the join_all use in this PR or something else? I don't think we'd want multiple instances to run the same importer simultaneously, do we?

But I don't think it makes sense to run a single importer in a parallel mode.

You're specifically referring to the .walk_parallel calls, right? Just to avoid being rate-limited?
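
For illustration only, the "I'm still alive" timestamp idea above could be sketched roughly like this. Everything here is hypothetical: the `Lease` struct, its field names, and the 60 second window are made up for the sketch and are not the importer table's actual columns or a recommended timeout. The state lives in memory; with a real database the check and the update would have to happen atomically, for example in a single conditional UPDATE.

```rust
use std::time::Duration;

/// Hypothetical per-importer state; the field names are illustrative,
/// not the actual importer table columns.
#[derive(Debug, Clone, PartialEq)]
struct Lease {
    /// Unique id of the instance currently executing the run, if any.
    owner: Option<String>,
    /// Time of the owner's last "I'm still alive" update, in seconds.
    heartbeat_secs: u64,
}

/// Try to claim the importer for `instance` at time `now_secs`.
/// The claim succeeds if nobody owns the lease, if `instance` already
/// owns it, or if the owner's heartbeat is older than `timeout`.
fn try_claim(lease: &mut Lease, instance: &str, now_secs: u64, timeout: Duration) -> bool {
    let stale = now_secs.saturating_sub(lease.heartbeat_secs) > timeout.as_secs();
    let free = match &lease.owner {
        None => true,
        Some(owner) => owner == instance || stale,
    };
    if free {
        // Claiming (or re-claiming) also refreshes the heartbeat.
        lease.owner = Some(instance.to_string());
        lease.heartbeat_secs = now_secs;
    }
    free
}

fn main() {
    let timeout = Duration::from_secs(60); // assumed window, not a recommendation
    let mut lease = Lease { owner: None, heartbeat_secs: 0 };

    // "a" claims the free lease at t=100.
    println!("{}", try_claim(&mut lease, "a", 100, timeout));
    // At t=130 "a" was seen 30s ago, so "b" has to wait.
    println!("{}", try_claim(&mut lease, "b", 130, timeout));
    // At t=161 "a" has been silent for 61s, so "b" takes over.
    println!("{}", try_claim(&mut lease, "b", 161, timeout));
}
```

The owner would call `try_claim` periodically with its own id to refresh the heartbeat. How long the window should be, and what identifies an instance, are still the open questions from the comment above.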

@JimFuller-RedHat
Collaborator

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose ?

@jcrossley3
Contributor Author

jcrossley3 commented Feb 11, 2025

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose ?

I'm thinking that threshold is very high. Because if we're requiring a distributed job queue, then I expect any crate would wrap redis/rabbit/etc.

@ctron
Contributor

ctron commented Feb 11, 2025

I think there's a bunch of ways we could deal with this. One way could be a Postgres-based locking mechanism, i.e. SELECT ... FOR UPDATE SKIP LOCKED.

Downside, it only works with Postgres. Which might be ok.

Another way could be to have a hash approach: have a max number of importers, and an individual instance number, and then only process those jobs. Should work with a StatefulSet.

We could also hold a state in the database. But then we'd need to define such things as timeouts, stale detection, etc.

@jcrossley3
Contributor Author

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

@ctron
Contributor

ctron commented Feb 11, 2025

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

StatefulSet is just a way of ensuring we have X number of importers running and are able to assign ascending numbers to those instances. The input to the importer would be "x of y", which could be transported via env-vars, even when using ansible. Having a default of "1 of 1", which means "process them all".
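
A rough sketch of what the "x of y" input could look like in code, under a few assumptions: the environment variable name `IMPORTER_SHARD`, the `"x of y"` string format, the helper names, and the choice of FNV-1a as the hash are all made up for illustration. A hand-rolled hash is used because it gives every instance the same answer regardless of build or platform.

```rust
/// Parse an "x of y" setting such as "2 of 3". Missing or malformed
/// input falls back to (1, 1), i.e. "process them all".
fn parse_shard(value: Option<&str>) -> (u64, u64) {
    let parsed = value.and_then(|v| {
        let (x, y) = v.split_once(" of ")?;
        let x: u64 = x.trim().parse().ok()?;
        let y: u64 = y.trim().parse().ok()?;
        // Require 1 <= x <= y, which also rules out y == 0.
        (x >= 1 && x <= y).then_some((x, y))
    });
    parsed.unwrap_or((1, 1))
}

/// 64-bit FNV-1a over the importer name.
fn fnv1a(name: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in name.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// True if the importer called `name` belongs to instance `x` of `y`.
fn is_mine(name: &str, (x, y): (u64, u64)) -> bool {
    fnv1a(name) % y == x - 1
}

fn main() {
    // IMPORTER_SHARD is a hypothetical variable name for this sketch.
    let setting = std::env::var("IMPORTER_SHARD").ok();
    let shard = parse_shard(setting.as_deref());
    // The importer names below are placeholders.
    for name in ["importer-a", "importer-b", "importer-c"] {
        println!("{name}: {}", is_mine(name, shard));
    }
}
```

With the default of "1 of 1" every importer matches, so a single instance behaves as before. Silently falling back on malformed input keeps the sketch short; a real implementation would probably want to reject it, since two instances both defaulting to "1 of 1" would process everything twice.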

@jcrossley3 jcrossley3 marked this pull request as ready for review February 12, 2025 22:01
@jcrossley3 jcrossley3 changed the title from "Make the importer execution more concurrent, i.e. faster" to "Run importers in parallel within a single importer instance" Feb 12, 2025
@jcrossley3 jcrossley3 requested a review from ctron February 12, 2025 22:03
@ctron
Contributor

ctron commented Feb 13, 2025

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

@jcrossley3
Contributor Author

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

True. So no worse than what we have, but does address the primary motivation for this PR: that QE will only have to wait as long as it takes the longest enabled importer to run for them all to be finished.

Plus, better logging.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

Agree. Still thinking about that.

@ctron
Contributor

ctron commented Feb 13, 2025

I would not want to merge something which only suits some specific QE task. Potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it does run in parallel. Without any way of controlling that behavior.

@jcrossley3
Contributor Author

I would not want to merge something which only suits some specific QE task. Potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it does run in parallel. Without any way of controlling that behavior.

It's not a QE task. They just happen to be the first users who care about how long it takes the app to import data. We already have a means to disable importers. I think there's a cancellation mechanism in there, no?

I don't love that those git-based importers are sync and block threads, but it's not like we expect to have hundreds of them.

Is your reluctance so high that you'd like me to close this PR?

@jcrossley3
Contributor Author

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I added a limitation. Defaults to the old behavior: concurrency=1.
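
To make the trade-off in this thread concrete, here is a minimal sketch of a bounded check run. It is not the PR's code: the PR uses async `join_all`, while this sketch swaps in plain threads and `std::thread::scope` so it runs with the standard library alone. The `run_check` function, the job names, and the sleep durations are all invented for illustration.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Run every job with at most `concurrency` jobs in flight at once.
/// Returns only after all jobs have finished, like one "check run",
/// and reports the jobs in the order they completed.
fn run_check(jobs: Vec<(&'static str, Duration)>, concurrency: usize) -> Vec<&'static str> {
    let queue = Mutex::new(VecDeque::from(jobs));
    let finished = Mutex::new(Vec::new());

    thread::scope(|scope| {
        for _ in 0..concurrency.max(1) {
            scope.spawn(|| loop {
                // Take the next pending job; the lock is released
                // at the end of this statement, before the job runs.
                let next = queue.lock().unwrap().pop_front();
                let Some((name, duration)) = next else { break };
                thread::sleep(duration); // stand-in for a blocking import run
                finished.lock().unwrap().push(name);
            });
        }
    });

    finished.into_inner().unwrap()
}

fn main() {
    let jobs = vec![
        ("slow", Duration::from_millis(300)),
        ("quick-1", Duration::from_millis(100)),
        ("quick-2", Duration::from_millis(100)),
    ];
    for concurrency in [1, 3] {
        let start = Instant::now();
        let done = run_check(jobs.clone(), concurrency);
        println!("concurrency={concurrency}: {done:?} in {:?}", start.elapsed());
    }
}
```

With `concurrency` set to 1 the jobs run one after another, so the run takes roughly the sum of the durations. With enough workers it takes roughly as long as the slowest job. Either way the call returns only when everything is done, which is the "one long importer blocks the next check" concern raised earlier.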


3 participants