
Run importers in parallel within a single importer instance #1278


Merged
merged 7 commits into trustification:main from the 1207 branch on Feb 18, 2025

Conversation

jcrossley3
Contributor

@jcrossley3 jcrossley3 commented Feb 10, 2025

Partial impl of #1207

Contributor

@ctron ctron left a comment

I think we should have a concept of what parallel means.

I think it makes sense to:

  • Allow running multiple instances of the importer
  • Allow running multiple importer runs in parallel

But I don't think it makes sense to run a single importer in a parallel mode.

@jcrossley3
Contributor Author

I think we should have a concept of what parallel means.

Sure

I think it makes sense to:

  • Allow running multiple instances of the importer

I'm thinking we'll need another column in the importer table to identify which instance is executing a "run" and some means of identifying a crashed/killed instance. Can you recommend a duration for a window to update a timestamp (last_run or last_success -- what's the difference?) denoting "I'm still alive" after which we can assume another instance needs to take over? Any thoughts on a unique identifier for each instance?

  • Allow running multiple importer runs in parallel

Can you clarify this? Are you referring to the join_all use in this PR or something else? I don't think we'd want multiple instances to run the same importer simultaneously, do we?

But I don't think it makes sense to run a single importer in a parallel mode.

You're specifically referring to the .walk_parallel calls, right? Just to avoid being rate-limited?

@JimFuller-RedHat
Collaborator

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose?

@jcrossley3
Contributor Author

jcrossley3 commented Feb 11, 2025

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose?

I'm thinking that threshold is very high. Because if we're requiring a distributed job queue, then I expect any crate would wrap redis/rabbit/etc.

@ctron
Contributor

ctron commented Feb 11, 2025

I think there's a bunch of ways we could deal with this. One way could be a Postgres-based locking mechanism: SKIP LOCKED, or whatever that was called.

Downside: it only works with Postgres, which might be ok.

Another way could be to have a hash approach: have a max number of importers, and an individual instance number, and then only process those jobs. Should work with a StatefulSet.

We could also hold state in the database. But then we'd need to define such things as timeouts, stale detection, etc.
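
For illustration, a minimal sketch of the SKIP LOCKED idea (not part of this PR; the table, columns, and use of sqlx are assumptions made for the example, not trustify's actual schema or data layer): each instance tries to claim one due job, and rows already locked by another instance are skipped rather than waited on.

```rust
// Hypothetical sketch only: claims at most one due importer job per call.
use sqlx::{PgPool, Row};

async fn claim_next_job(pool: &PgPool) -> Result<Option<String>, sqlx::Error> {
    let mut tx = pool.begin().await?;

    // Whoever acquires the row lock owns the job; other instances skip the
    // row instead of blocking on the lock.
    let row = sqlx::query(
        r#"
        SELECT name FROM importer
        WHERE enabled AND next_run <= now()
        ORDER BY next_run
        LIMIT 1
        FOR UPDATE SKIP LOCKED
        "#,
    )
    .fetch_optional(&mut *tx)
    .await?;

    let claimed = row.map(|r| r.get::<String, _>("name"));
    // Either run the import while the transaction (and lock) is held, or mark
    // the row as "running" before committing so the claim survives the commit.
    tx.commit().await?;
    Ok(claimed)
}
```

As noted above, the trade-off is that this ties the scheduling mechanism to Postgres.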

@jcrossley3
Contributor Author

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

@ctron
Contributor

ctron commented Feb 11, 2025

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

StatefulSet is just a way of ensuring we have X importers running and that we are able to assign ascending numbers to those instances. The input to the importer would be "x of y", which could be transported via env vars, even when using Ansible, with a default of "1 of 1" meaning "process them all".
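
A rough sketch of that "x of y" idea (the env var names are invented for illustration; the defaults reproduce the "1 of 1", process-everything behavior): every instance sees all importers but only processes the ones whose name hashes into its slot.

```rust
// Hypothetical sketch of "x of y" partitioning across importer instances.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn instance_slot() -> (u64, u64) {
    let index = std::env::var("IMPORTER_INSTANCE_INDEX")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0); // default: instance 0 ...
    let count = std::env::var("IMPORTER_INSTANCE_COUNT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1); // ... of 1, i.e. "process them all"
    (index, count.max(1))
}

fn is_mine(importer_name: &str, index: u64, count: u64) -> bool {
    // Stable hash of the importer name decides which instance owns it.
    let mut hasher = DefaultHasher::new();
    importer_name.hash(&mut hasher);
    hasher.finish() % count == index
}
```

A StatefulSet's ordinal index maps naturally onto the instance index here.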

@jcrossley3 jcrossley3 marked this pull request as ready for review February 12, 2025 22:01
@jcrossley3 jcrossley3 changed the title from "Make the importer execution more concurrent, i.e. faster" to "Run importers in parallel within a single importer instance" Feb 12, 2025
@jcrossley3 jcrossley3 requested a review from ctron February 12, 2025 22:03
@ctron
Contributor

ctron commented Feb 13, 2025

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

@jcrossley3
Contributor Author

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

True. So no worse than what we have, but it does address the primary motivation for this PR: QE will only have to wait as long as the longest enabled importer takes for them all to be finished.

Plus, better logging.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

Agree. Still thinking about that.

@ctron
Contributor

ctron commented Feb 13, 2025

I would not want to merge something which only suits some specific QE task, potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it runs in parallel, without any way of controlling that behavior.

@jcrossley3
Contributor Author

I would not want to merge something which only suits some specific QE task, potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it runs in parallel, without any way of controlling that behavior.

It's not a QE task. They just happen to be the first users who care about how long it takes the app to import data. We already have a means to disable importers. I think there's a cancellation mechanism in there, no?

I don't love that those git-based importers are sync and block threads, but it's not like we expect to have hundreds of them.

Is your reluctance so high that you'd like me to close this PR?

@jcrossley3
Contributor Author

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I added a limitation. Defaults to the old behavior: concurrency=1.
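
A minimal sketch of what such a limit can look like (names are illustrative only, not the actual change in the PR): a bounded concurrent stream over the enabled importers, where a limit of 1 degenerates to the old sequential loop.

```rust
// Hypothetical sketch: run up to `concurrency` importer checks at once.
// With concurrency = 1 this behaves like the previous sequential behavior.
use futures::stream::{self, StreamExt};

async fn run_checks(importers: Vec<String>, concurrency: usize) {
    stream::iter(importers)
        .for_each_concurrent(concurrency.max(1), |importer| async move {
            // One importer failing should not abort the others, so errors
            // are only logged here.
            if let Err(err) = check_one(&importer).await {
                eprintln!("importer {importer} failed: {err}");
            }
        })
        .await;
}

// Placeholder for a real import run.
async fn check_one(_importer: &str) -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}
```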

@jcrossley3 jcrossley3 force-pushed the 1207 branch 2 times, most recently from 55ad1bb to eaa5c2c on February 16, 2025 23:24
```diff
@@ -294,7 +282,7 @@ where
     }
 }

-#[instrument(skip(self), err)]
+#[instrument(skip(self))]
```
Contributor

I'd like to keep errors. It's ok to reduce the level to e.g. INFO

Contributor

@ctron ctron left a comment

I personally think that's better. And it should also be possible to easily extend that implementation to a distributed version (running X instances at the same time).

I'd just prefer keeping the err instrumentation. Also converting the ret to err instead. And maybe lowering that level to Info, as it's not an application level error. But that's not a blocker to me.
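
For reference, one way to keep the error capture while lowering its level, assuming a tracing release recent enough to support the err(level = ...) form (the surrounding types are illustrative only):

```rust
use tracing::instrument;

struct Importer;

impl Importer {
    // Errors returned here are still recorded on the span, but emitted at
    // INFO rather than ERROR.
    #[instrument(skip(self), err(level = "info"))]
    async fn check(&self) -> Result<(), std::io::Error> {
        Ok(())
    }
}
```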

ctron
ctron previously requested changes Feb 17, 2025
Contributor

@ctron ctron left a comment

I am ok with this in general. But it also feels incomplete due to ongoing discussions on the ADR. Let's finish that discussion first, and then merge implementations.

@jcrossley3
Contributor Author

I am ok with this in general. But it also feels incomplete due to ongoing discussions on the ADR. Let's finish that discussion first, and then merge implementations.

Come on, man! They're independent. This PR simply adds the ability for a single instance to run multiple import jobs concurrently. Heck, this PR alone might mitigate the need for distributing work at all. It certainly satisfies the need that motivated the creation of the original downstream issue.

@chirino
Contributor

chirino commented Feb 17, 2025

Before we go distributed, we should make sure the DB is not bottlenecking us.

@jcrossley3
Contributor Author

Before we go distributed, we should make sure the DB is not bottlenecking us.

Amen. 100%. I think it's also worth looking at the logic within individual importers. The git-based ones are flawed since they're sync, but they only take a few minutes to run. The sbom importer's rate varies widely during a single run, and the CSAF importer just seems broken -- I've never seen it run to completion.

@ctron
Contributor

ctron commented Feb 17, 2025

The git-based ones are flawed since they're sync,

IIRC they run on the tokio worker thread pool. Not blocking anything.

CSAF importer just seems broken

I have. Multiple times: https://console-openshift-console.apps.cluster.trustification.rocks/search/ns/trustify-dumps?kind=core~v1~Pod&q=app.kubernetes.io%2Fcomponent%3Dcreate-dump

@jcrossley3
Contributor Author

IIRC they run on the tokio worker thread pool. Not blocking anything.

Blocks a clean shutdown when those sync git clone functions are running. Only takes a minute the first time, but 30 minutes the second, so it's a pain to have to manually kill -9 the trustd process.

@ctron
Contributor

ctron commented Feb 17, 2025

IIRC they run on the tokio worker thread pool. Not blocking anything.

Blocks a clean shutdown when those sync git clone functions are running. Only takes a minute the first time, but 30 minutes the second, so it's a pain to have to manually kill -9 the trustd process.

Maybe that's a local problem then. It looks ok on our OCP instance:

[screenshot: importer run timestamps from the OCP instance]

@jcrossley3
Contributor Author

[screenshot: importer run timestamps from the OCP instance]

Using the timestamps in your image, if you had attempted to Ctrl+C to break out of a cargo run trustd process at 2/13 15:30:00, the process would hang, I assume waiting for that 30 minute git clone to complete. If instead you hit Ctrl+C at 2/13 19:00, no problem, no hang. It'd even be fine at 18:00, still processing, but post git clone.

Try it for yourself.
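
To illustrate the behavior being described (a hypothetical sketch; function names are invented and this is not trustify's importer code): a synchronous clone moved onto a blocking thread cannot be interrupted, so a shutdown signal can stop the async side from waiting, but the clone itself keeps running until it finishes.

```rust
// Hypothetical illustration: tokio can stop *waiting* on the blocking task,
// but it cannot cancel the synchronous clone running inside it.
use tokio::task;

async fn run_git_importer() -> Result<(), Box<dyn std::error::Error>> {
    let clone = task::spawn_blocking(|| {
        // Long, synchronous git clone; nothing can interrupt this closure.
        sync_git_clone("https://example.com/repo.git")
    });

    tokio::select! {
        res = clone => res??,
        _ = tokio::signal::ctrl_c() => {
            // Shutdown requested: we stop waiting, but the blocking thread
            // keeps cloning until it finishes on its own.
            eprintln!("shutdown requested while a clone is still running");
        }
    }
    Ok(())
}

// Stand-in for the real synchronous clone.
fn sync_git_clone(_url: &str) -> Result<(), std::io::Error> {
    Ok(())
}
```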

@ctron ctron self-requested a review February 17, 2025 16:33
@chirino
Contributor

chirino commented Feb 17, 2025

you might want to include an update to https://github.com/trustification/trustify/blob/main/docs/env-vars.md too.

Contributor

@chirino chirino left a comment

LGTM.

Relates trustification#1207

Allows enabled/disabled state of jobs to be reflected immediately when manipulated by the UX.

Signed-off-by: Jim Crossley <[email protected]>
Signed-off-by: Jim Crossley <[email protected]>
@jcrossley3
Contributor Author

you might want to include an update to https://github.com/trustification/trustify/blob/main/docs/env-vars.md too.

Done, thx

@JimFuller-RedHat
Collaborator

JimFuller-RedHat commented Feb 18, 2025

Suggestion:
This change is a 'rubicon' of sorts, e.g. going from predictable serial processing to something more async was always something we had to cross to get better scalability. So while the changes seem straightforward, it would be useful to use a branch and test a bit more at scale to assess impact (maybe this has been done and I am ignorant!). Implementation-level details (like considering pg SKIP LOCKED) could be guided by those tests.

@jcrossley3
Contributor Author

Suggestion: So while the changes seem straightforward, it would be useful to use a branch and test a bit more at scale to assess impact (maybe this has been done and I am ignorant!).

The default value of the new concurrency option is 1, meaning the default "new" behavior should match the "old" sequential behavior. My local tests bear that out, but I've done nothing at scale.

My goal for the PR was to match current behavior by default after merging, so would the concurrency option serve the same purpose as a branch?

@JimFuller-RedHat
Collaborator

My goal for the PR was to match current behavior by default after merging, so would the concurrency option serve the same purpose as a branch?

ya I think so!

Collaborator

@JimFuller-RedHat JimFuller-RedHat left a comment

LGTM, with the caveat that testing might reveal a need for more work in this area. I think this should go in.

@jcrossley3 jcrossley3 added this pull request to the merge queue Feb 18, 2025
Merged via the queue into trustification:main with commit 7db1976 Feb 18, 2025
2 checks passed
@jcrossley3 jcrossley3 deleted the 1207 branch February 18, 2025 18:10